UForm
Pocket-Sized Multimodal AI for content understanding and generation across multilingual texts, images, and 🔜 video, up to 5x faster than OpenAI CLIP and LLaVA 🖼️ & 🖋️

Welcome to UForm, a multimodal AI library that's as versatile as it is efficient. UForm tiny embedding models will help you understand and search visual and textual content across various languages. UForm small generative models not only support conversational and chat use cases, but also excel at fast image captioning and Visual Question Answering (VQA). Thanks to compact custom pre-trained transformer models, all of this can run anywhere from your server farm down to your smartphone.
Features
- Tiny Embeddings: 64-dimensional Matryoshka-style embeddings for extremely fast search.
- Throughput: Thanks to the small size, the inference speed is 2-4x faster than competitors.
- Portable: Models come with native ONNX support, making them easy to deploy on any platform.
- Quantization Aware: Down-cast embeddings from `f32` to `i8` without losing much recall.
- Multilingual: Trained on a balanced dataset, recall stays high across more than 20 languages.
Models
For accuracy and speed benchmarks, refer to the evaluation page.
Embedding Models
<table style="width:100%; border-collapse:collapse;"> <thead> <tr> <th>Model</th> <th style="text-align:right;">Parameters</th> <th style="text-align:right;">Languages</th> <th style="text-align:right;">Architecture</th> </tr> </thead> <tbody> <tr> <td><code><a href="https://huggingface.co/unum-cloud/uform-vl-english-large/">uform3-image-text-english-large</a></code> 🆕</td> <td style="text-align:right;">365 M</td> <td style="text-align:right;">1</td> <td style="text-align:right;">12 layer BERT, ViT-L/14</td> </tr> <tr> <td><code><a href="https://huggingface.co/unum-cloud/uform-vl-english/">uform3-image-text-english-base</a></code></td> <td style="text-align:right;">143 M</td> <td style="text-align:right;">1</td> <td style="text-align:right;">4 layer BERT, ViT-B/16</td> </tr> <tr> <td><code><a href="https://huggingface.co/unum-cloud/uform-vl-english-small/">uform3-image-text-english-small</a></code> 🆕</td> <td style="text-align:right;">79 M</td> <td style="text-align:right;">1</td> <td style="text-align:right;">4 layer BERT, ViT-S/16</td> </tr> <tr> <td><code><a href="https://huggingface.co/unum-cloud/uform-vl-multilingual-v2/">uform3-image-text-multilingual-base</a></code></td> <td style="text-align:right;">206 M</td> <td style="text-align:right;">21</td> <td style="text-align:right;">12 layer BERT, ViT-B/16</td> </tr> </tbody> </table>

Generative Models
<table style="width:100%; border-collapse:collapse;"> <thead> <tr> <th>Model</th> <th style="text-align:right;">Parameters</th> <th style="text-align:right;">Purpose</th> <th style="text-align:right;">Architecture</th> </tr> </thead> <tbody> <tr> <td><code><a href="https://huggingface.co/unum-cloud/uform-gen2-dpo/">uform-gen2-dpo</a></code> 🆕</td> <td style="text-align:right;">1.2 B</td> <td style="text-align:right;">Chat, Image Captioning, VQA</td> <td style="text-align:right;">qwen1.5-0.5B, ViT-H/14</td> </tr> <tr> <td><code><a href="https://huggingface.co/unum-cloud/uform-gen2-qwen-500m/">uform-gen2-qwen-500m</a></code></td> <td style="text-align:right;">1.2 B</td> <td style="text-align:right;">Chat, Image Captioning, VQA</td> <td style="text-align:right;">qwen1.5-0.5B, ViT-H/14</td> </tr> <tr> <td><code><a href="https://huggingface.co/unum-cloud/uform-gen/">uform-gen</a></code> ⚠️</td> <td style="text-align:right;">1.5 B</td> <td style="text-align:right;">Image Captioning, VQA</td> <td style="text-align:right;">llama-1.3B, ViT-B/16</td> </tr> </tbody> </table>

Quick Start Examples
Embedding Models
First, `pip install uform`.
Then, load the model:
```python
from uform import get_model, Modality

# Defaults to `dtype='bfloat16'` for ~2x speedup with minimal accuracy loss
processors, models = get_model('unum-cloud/uform3-image-text-english-small', device='cuda')

model_text = models[Modality.TEXT_ENCODER]
model_image = models[Modality.IMAGE_ENCODER]
processor_text = processors[Modality.TEXT_ENCODER]
processor_image = processors[Modality.IMAGE_ENCODER]
```
Embed images:
```python
import requests
from io import BytesIO
from PIL import Image

image_url = 'https://media-cdn.tripadvisor.com/media/photo-s/1b/28/6b/53/lovely-armenia.jpg'
image = Image.open(BytesIO(requests.get(image_url).content))
image_data = processor_image(image)
image_features, image_embedding = model_image.encode(image_data, return_features=True)
```
Embed queries:
```python
text = 'a cityscape bathed in the warm glow of the sun, with varied architecture and a towering, snow-capped mountain rising majestically in the background'
text_data = processor_text(text)
text_features, text_embedding = model_text.encode(text_data, return_features=True)
```
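With both embeddings in hand, relevance is typically scored with cosine similarity. A minimal sketch — the random vectors below are stand-ins; in practice you would pass the `image_embedding` and `text_embedding` produced above (converted to NumPy if the model returns tensors):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Flatten in case the encoder returns a (1, dim) batch
    a, b = a.ravel(), b.ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical stand-ins for `image_embedding` and `text_embedding`
rng = np.random.default_rng(0)
image_embedding = rng.standard_normal((1, 256))
text_embedding = rng.standard_normal((1, 256))

score = cosine_similarity(image_embedding, text_embedding)
```

Higher scores mean the caption better matches the image; the value is always in [-1, 1].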
For more details check out:
- Python docs on embedding models in python/README.md
- JavaScript docs on embedding models in javascript/README.md
- Swift docs on embedding models in swift/README.md
Generative Models
The generative models are natively compatible with Hugging Face Transformers:
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained('unum-cloud/uform-gen2-dpo', trust_remote_code=True)
processor = AutoProcessor.from_pretrained('unum-cloud/uform-gen2-dpo', trust_remote_code=True)

prompt = 'Question or Instruction'
image = Image.open('image.jpg')

inputs = processor(text=[prompt], images=[image], return_tensors='pt')

with torch.inference_mode():
    output = model.generate(
        **inputs,
        do_sample=False,
        use_cache=True,
        max_new_tokens=256,
        eos_token_id=151645,
        pad_token_id=processor.tokenizer.pad_token_id,
    )

# Decode only the newly generated tokens, skipping the prompt
prompt_len = inputs['input_ids'].shape[1]
decoded_text = processor.batch_decode(output[:, prompt_len:])[0]
```
For more details check out:
- Python docs on generative models in python/README.md
- JavaScript docs on generative models 🔜
- Swift docs on generative models 🔜
Technical Details
Down-casting, Quantization, Matryoshka, and Slicing
Depending on the application, the embeddings can be down-cast to smaller numeric representations without losing much recall.
Switching from `f32` to `f16` is recommended in almost all cases, unless you are running on very old hardware without half-precision support.
Switching to `i8` with linear scaling is also possible, but the recall loss becomes noticeable on larger collections with millions of searchable entries.
Similarly, for higher-dimensional embeddings (512 or 768), a common strategy is to quantize them into single-bit representations for faster search.
```python
import numpy as np

f32_embedding: np.ndarray = model.encode_text(text_data, return_features=False)
f16_embedding: np.ndarray = f32_embedding.astype(np.float16)
i8_embedding: np.ndarray = (f32_embedding * 127).astype(np.int8)
b1_embedding: np.ndarray = np.packbits((f32_embedding > 0).astype(np.uint8))
```
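As a sanity check, you can measure how much the `i8` down-cast distorts a vector before committing to it for a whole collection. A self-contained sketch using a synthetic unit-normalized embedding (not an actual model output):

```python
import numpy as np

# Synthetic stand-in for a unit-normalized f32 embedding
rng = np.random.default_rng(42)
f32 = rng.standard_normal(256).astype(np.float32)
f32 /= np.linalg.norm(f32)

# Linear scaling to i8 and back
i8 = (f32 * 127).astype(np.int8)
recovered = i8.astype(np.float32) / 127

# Cosine similarity between the original and the de-quantized vector
cos = float(np.dot(f32, recovered) / (np.linalg.norm(f32) * np.linalg.norm(recovered)))
```

If `cos` stays close to 1 on a sample of your data, the 4x storage saving is usually worth it; the gap widens as collections grow, which is why the text above recommends caution at the millions-of-entries scale.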
An alternative approach to quantization is to use Matryoshka embeddings, where the embeddings are sliced into smaller parts and the search is performed hierarchically.
```python
import numpy as np

large_embedding: np.ndarray = model.encode_text(text_data, return_features=False)
small_embedding: np.ndarray = large_embedding[:, :256]
tiny_embedding: np.ndarray = large_embedding[:, :64]
```
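The hierarchical search itself can be sketched in two stages: shortlist candidates with the cheap low-dimensional prefix, then rerank only the shortlist with the full vector. A self-contained toy example with synthetic embeddings (the corpus, query, and shortlist size of 100 are all illustrative, not part of the UForm API):

```python
import numpy as np

# Toy corpus of Matryoshka-style embeddings: the leading dimensions
# carry the most information, so a 64-dim prefix can pre-filter.
rng = np.random.default_rng(0)
corpus = rng.standard_normal((10_000, 256)).astype(np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

# Query: a noisy copy of document 42
query = corpus[42] + 0.1 * rng.standard_normal(256).astype(np.float32)

# Stage 1: shortlist the top 100 candidates using only the 64-dim prefix
coarse_scores = corpus[:, :64] @ query[:64]
shortlist = np.argpartition(-coarse_scores, 100)[:100]

# Stage 2: rerank the shortlist with the full 256-dim embeddings
fine_scores = corpus[shortlist] @ query
best = shortlist[np.argmax(fine_scores)]
```

The coarse pass touches only a quarter of each vector, so it is roughly 4x cheaper per comparison, while the expensive full-precision scoring runs on just 1% of the corpus.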
