# DocArray

> Represent, send, store and search multimodal data
> **Note:** The README you're currently viewing is for DocArray >0.30, which introduces some significant changes from DocArray 0.21. If you wish to continue using the older DocArray <=0.21, ensure you install it via `pip install docarray==0.21`. Refer to its codebase, documentation, and its hot-fixes branch for more information.
DocArray is a Python library expertly crafted for the representation, transmission, storage, and retrieval of multimodal data. Tailored for the development of multimodal AI applications, its design guarantees seamless integration with the extensive Python and machine learning ecosystems. As of January 2022, DocArray is openly distributed under the Apache License 2.0 and currently enjoys the status of a sandbox project within the LF AI & Data Foundation.
- :fire: Offers native support for NumPy, PyTorch, TensorFlow, and JAX, catering specifically to model training scenarios.
- :zap: Based on Pydantic, and instantly compatible with web and microservice frameworks like FastAPI and Jina.
- :package: Provides support for vector databases such as **Weaviate, Qdrant, ElasticSearch, Redis, Mongo Atlas, and HNSWLib**.
- :chains: Allows data transmission as JSON over HTTP or as Protobuf over gRPC.
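The JSON transport mentioned in the last bullet can be illustrated with a plain-Python sketch using only the standard library. `ImageDocLike` below is a hypothetical stand-in for a document class, not DocArray's API (DocArray itself handles this serialization via Pydantic):

```python
import json
from dataclasses import dataclass, asdict


# Hypothetical stand-in for a DocArray document; this only illustrates
# the serialize-to-JSON-for-HTTP idea, not DocArray's actual API.
@dataclass
class ImageDocLike:
    url: str
    caption: str


doc = ImageDocLike(url="https://example.com/cat.jpg", caption="A cat")

# Serialize to JSON for transmission over HTTP...
payload = json.dumps(asdict(doc))

# ...and reconstruct the document on the receiving side.
received = ImageDocLike(**json.loads(payload))
assert received == doc
```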
## Installation
To install DocArray from the CLI, run the following command:

```shell
pip install -U docarray
```
> **Note:** To use DocArray <=0.21, make sure you install it via `pip install docarray==0.21` and check out its codebase, docs, and hot-fixes branch.
## Get Started
New to DocArray? Depending on your use case and background, there are multiple ways to learn about DocArray:
- Coming from pure PyTorch or TensorFlow
- Coming from Pydantic
- Coming from FastAPI
- Coming from Jina
- Coming from a vector database
- Coming from Langchain
## Represent
DocArray empowers you to represent your data in a manner that is inherently attuned to machine learning.
This is particularly beneficial for various scenarios:
- :running: You are training a model: You're dealing with tensors of varying shapes and sizes, each signifying different elements. You desire a method to logically organize them.
- :cloud: You are serving a model: Let's say through FastAPI, and you wish to define your API endpoints precisely.
- :card_index_dividers: You are parsing data: Perhaps for future deployment in your machine learning or data science projects.
:bulb: Familiar with Pydantic? You'll be pleased to learn that DocArray is not only constructed atop Pydantic but also maintains complete compatibility with it! Furthermore, we have a specific section dedicated to your needs!
In essence, DocArray facilitates data representation in a way that mirrors Python dataclasses, with machine learning being an integral component:
```python
from docarray import BaseDoc, DocVec
from docarray.typing import TorchTensor, ImageUrl
import torch


# Define your data model
class MyDocument(BaseDoc):
    description: str
    image_url: ImageUrl  # could also be VideoUrl, AudioUrl, etc.
    image_tensor: TorchTensor[1704, 2272, 3]  # you can express tensor shapes!


# Stack multiple documents in a Document Vector
vec = DocVec[MyDocument](
    [
        MyDocument(
            description="A cat",
            image_url="https://example.com/cat.jpg",
            image_tensor=torch.rand(1704, 2272, 3),
        ),
    ]
    * 10
)

print(vec.image_tensor.shape)  # (10, 1704, 2272, 3)
```
<details markdown="1">
<summary>Click for more details</summary>
Let's take a closer look at how you can represent your data with DocArray:
```python
from docarray import BaseDoc
from docarray.typing import TorchTensor, ImageUrl
from typing import Optional
import torch


# Define your data model
class MyDocument(BaseDoc):
    description: str
    image_url: ImageUrl  # could also be VideoUrl, AudioUrl, etc.
    image_tensor: Optional[
        TorchTensor[1704, 2272, 3]
    ] = None  # could also be NdArray or TensorFlowTensor
    embedding: Optional[TorchTensor] = None
```
So not only can you define the types of your data, you can even specify the shape of your tensors!
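The kind of shape check that `TorchTensor[1704, 2272, 3]` performs can be sketched in plain Python. This is a toy illustration of parametrized-tensor validation, not DocArray's implementation:

```python
# Toy sketch of the idea behind TorchTensor[1704, 2272, 3]:
# reject data whose shape does not match the declared one.
def validate_shape(shape, expected):
    if tuple(shape) != tuple(expected):
        raise ValueError(f"expected shape {tuple(expected)}, got {tuple(shape)}")
    return True


# A matching shape passes validation:
assert validate_shape((1704, 2272, 3), (1704, 2272, 3))

# A mismatched shape is rejected:
try:
    validate_shape((224, 224, 3), (1704, 2272, 3))
except ValueError as e:
    print(e)
```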
```python
# Create a document
doc = MyDocument(
    description="This is a photo of a mountain",
    image_url="https://upload.wikimedia.org/wikipedia/commons/2/2f/Alpamayo.jpg",
)

# Load image tensor from URL
doc.image_tensor = doc.image_url.load()


# Compute embedding with any model of your choice
def clip_image_encoder(image_tensor: TorchTensor) -> TorchTensor:  # dummy function
    return torch.rand(512)


doc.embedding = clip_image_encoder(doc.image_tensor)
print(doc.embedding.shape)  # torch.Size([512])
```
### Compose nested Documents
Of course, you can compose Documents into a nested structure:
```python
from docarray import BaseDoc
from docarray.documents import ImageDoc, TextDoc
import numpy as np


class MultiModalDocument(BaseDoc):
    image_doc: ImageDoc
    text_doc: TextDoc


doc = MultiModalDocument(
    image_doc=ImageDoc(tensor=np.zeros((3, 224, 224))), text_doc=TextDoc(text='hi!')
)
```
You rarely work with a single data point at a time, especially in machine learning applications. That's why you can easily collect multiple Documents:
### Collect multiple Documents
When building or interacting with an ML system, usually you want to process multiple Documents (data points) at once.
DocArray offers two data structures for this:
- **`DocVec`**: A vector of `Document`s. All tensors in the documents are stacked into a single tensor. Perfect for batch processing and use inside of ML models.
- **`DocList`**: A list of `Document`s. All tensors in the documents are kept as-is. Perfect for streaming, re-ranking, and shuffling of data.
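The difference between the two layouts can be sketched with NumPy alone. This illustrates the storage idea only, not DocArray's internals:

```python
import numpy as np

# Ten per-document tensors, kept separate as a DocList would keep them:
doclist_style = [np.zeros((3, 224, 224)) for _ in range(10)]
print(type(doclist_style), doclist_style[0].shape)

# The same tensors stacked into one batch tensor, as a DocVec would store them:
docvec_style = np.stack(doclist_style)
print(docvec_style.shape)  # (10, 3, 224, 224)
```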
Let's take a look at them, starting with DocVec:
```python
from docarray import DocVec, BaseDoc
from docarray.typing import AnyTensor, ImageUrl
import numpy as np


class Image(BaseDoc):
    url: ImageUrl
    tensor: AnyTensor  # this allows torch, numpy, and TensorFlow tensors


vec = DocVec[Image](  # the DocVec is parametrized by your personal schema!
    [
        Image(
            url="https://upload.wikimedia.org/wikipedia/commons/2/2f/Alpamayo.jpg",
            tensor=np.zeros((3, 224, 224)),
        )
        for _ in range(100)
    ]
)
```
In the code snippet above, DocVec is parametrized by the type of document you want to use with it: DocVec[Image].
This may look weird at first, but we're confident that you'll get used to it quickly! Besides, it lets us do some cool things, like having bulk access to the fields that you defined in your document:
```python
tensor = vec.tensor  # gets all the tensors in the DocVec
print(tensor.shape)  # which are stacked up into a single tensor!
print(vec.url)  # you can bulk access any other field, too
```
The second data structure, DocList, works in a similar way:
```python
from docarray import DocList

dl = DocList[Image](  # the DocList is parametrized by your personal schema!
    [
        Image(
            url="https://upload.wikimedia.org/wikipedia/commons/2/2f/Alpamayo.jpg",
            tensor=np.zeros((3, 224, 224)),
        )
        for _ in range(100)
    ]
)
```
You can still bulk access the fields of your document:
```python
tensors = dl.tensor  # gets all the tensors in the DocList
print(type(tensors))  # as a list of tensors
print(dl.url)  # you can bulk access any other field, too
```
And you can insert, remove, and append documents to your DocList:
```python
# append
dl.append(
    Image(
        url="https://upload.wikimedia.org/wikipedia/commons/2/2f/Alpamayo.jpg",
        tensor=np.zeros((3, 224, 224)),
    )
)

# delete
del dl[0]

# insert
dl.insert(
    0,
    Image(
        url="https://upload.wikimedia.org/wikipedia/commons/2/2f/Alpamayo.jpg",
        tensor=np.zeros((3, 224, 224)),
    ),
)
```

</details>