# OpenVision Series
## <img src="assets/icon.png" width="24"> OpenVision (ICCV 2025)

*A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning*

## <img src="assets/openvision_v1.5_logo.png" width="24"> OpenVision 2 (CVPR 2026)

*A Family of Generative Pretrained Visual Encoders for Multimodal Learning*

## <img src="assets/openvision_3_logo.png" width="24"> OpenVision 3

*A Unified Visual Encoder for Both Understanding and Generation*
<p align="center"> 🌐 <a href="https://ucsc-vlaa.github.io/OpenVision/" target="_blank">OpenVision Project Page</a> • <img src="./assets/ar.svg" alt="Arxiv Logo" style="height: 1em; vertical-align: middle; margin-right: 0.3em;"> <a href="https://arxiv.org/abs/2505.04601" target="_blank">Arxiv</a> • 💻 <a href="https://github.com/UCSC-VLAA/OpenVision" target="_blank">Code</a> • <img src="./assets/hg.svg" alt="Hugging Face Logo" style="height: 1em; vertical-align: middle; margin-right: 0.3em;"> <a href="https://huggingface.co/collections/UCSC-VLAA/openvision-681a4c27ee1f66411b4ae919" target="_blank">OpenVision Collection</a> </p>

<p align="center"> 🌐 <a href="https://ucsc-vlaa.github.io/OpenVision2/" target="_blank">OpenVision 2 Project Page</a> • <img src="./assets/ar.svg" alt="Arxiv Logo" style="height: 1em; vertical-align: middle; margin-right: 0.3em;"> <a href="https://arxiv.org/abs/2509.01644" target="_blank">Arxiv</a> • 💻 <a href="https://github.com/UCSC-VLAA/OpenVision" target="_blank">Code</a> • <img src="./assets/hg.svg" alt="Hugging Face Logo" style="height: 1em; vertical-align: middle; margin-right: 0.3em;"> <a href="https://huggingface.co/collections/UCSC-VLAA/openvision-2-68ab5934fe21f3fc463077da" target="_blank">OpenVision 2 Collection</a> </p>

<p align="center"> 🌐 <a href="https://ucsc-vlaa.github.io/OpenVision3/" target="_blank">OpenVision 3 Project Page</a> • <img src="./assets/ar.svg" alt="Arxiv Logo" style="height: 1em; vertical-align: middle; margin-right: 0.3em;"> <a href="https://arxiv.org/abs/2601.15369" target="_blank">Arxiv</a> • 💻 <a href="https://github.com/UCSC-VLAA/OpenVision" target="_blank">Code</a> • <img src="./assets/hg.svg" alt="Hugging Face Logo" style="height: 1em; vertical-align: middle; margin-right: 0.3em;"> <a href="https://huggingface.co/collections/UCSC-VLAA/openvision-3" target="_blank">OpenVision 3 Collection</a> </p>

This repository contains the code for training and fine-tuning vision-language models based on the OpenVision framework. It supports both the original contrastive + generative training (OpenVision) and the simplified caption-only generative training (OpenVision 2), providing efficient and scalable approaches to multimodal learning on TPU infrastructure.
## 🚀 Recent Updates
### January 2026

- ✨ Released OpenVision 3: a unified visual encoder for both understanding and generation.
- Please refer to the script for OpenVision 3 usage.
- The full training code will be released soon.
### September 2025

- Released OpenVision 2: a simplified, generative-only version of OpenVision that removes the text encoder and contrastive loss, keeping only the captioning objective.
- OpenVision 2 achieves:
  - 1.5–2× faster training
  - ~1.8× lower memory footprint
  - scaling up to 1B+ parameters
  - maintained or improved performance on multimodal benchmarks (OCR, TextVQA, ChartQA, MME, etc.)
### May 2025

- Released OpenVision models and training code.
## 🧩 OpenVision 2 at a Glance
- Architecture: Vision Encoder (ViT) + Text Decoder (no text encoder)
- Training Objective: caption-only autoregressive generation
- Key Optimizations:
  - Dual-stage CLIPA-style training (low → high resolution)
  - Synthetic captions from ReCap-DataComp-1B v2 (LLaMA-3-powered, conditioned on alt-text)
  - Visual token masking (keep ~25–35% of tokens) for efficiency
- Efficiency:
  - ViT-L/14 @ 224: training time reduced from 83 h → 57 h, memory from 24.5 GB → 13.8 GB
  - SoViT-400M/14 @ 384: training time reduced from 241 h → 121 h, memory from 27.4 GB → 14.5 GB
  - Enables larger batch sizes on TPU v4
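The visual token masking above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the repository's implementation; the helper name `mask_visual_tokens` is chosen here for clarity:

```python
import torch

def mask_visual_tokens(tokens: torch.Tensor, keep_ratio: float = 0.3) -> torch.Tensor:
    """Randomly keep a fraction of patch tokens per image.

    tokens: (batch, num_tokens, dim) patch embeddings from the ViT.
    Returns a (batch, kept, dim) tensor containing only the surviving tokens.
    """
    b, n, d = tokens.shape
    n_keep = max(1, int(n * keep_ratio))
    # Draw a random permutation per image and keep the first n_keep positions.
    noise = torch.rand(b, n, device=tokens.device)
    keep_idx = noise.argsort(dim=1)[:, :n_keep]          # (b, n_keep)
    keep_idx = keep_idx.unsqueeze(-1).expand(-1, -1, d)  # (b, n_keep, d)
    return torch.gather(tokens, dim=1, index=keep_idx)

# 256 patch tokens per image -> only 76 reach the text decoder at a 0.3 keep ratio
x = torch.randn(2, 256, 768)
kept = mask_visual_tokens(x, keep_ratio=0.3)
print(kept.shape)  # torch.Size([2, 76, 768])
```

Because the caption decoder only attends over the kept tokens, the cost of the generative objective drops roughly in proportion to the keep ratio, which is where the training-time and memory savings come from.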
## 📦 Core Features (Shared by OpenVision & OpenVision 2)
- Optimized for Google Cloud TPU training
- Supports various encoder architectures (ViT models of different sizes)
- Implements efficient training strategies including model sharding
- Supports pre-training and multi-stage fine-tuning
- Compatible with CLIP-style vision-language training
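The CLIP-style objective mentioned above is a symmetric contrastive loss over paired image/text embeddings. The sketch below is a generic version of that loss, not code from this repository; the function name and the fixed temperature are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb: torch.Tensor,
                          txt_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss: matched image/text pairs sit on the diagonal."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature                      # (b, b) similarities
    targets = torch.arange(img.size(0), device=img.device)    # diagonal is positive
    # Average the image->text and text->image directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(float(loss))
```

In practice the temperature is usually a learned parameter rather than a constant, and the similarity matrix is gathered across devices before the loss is computed.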
## 📖 OpenVision (Original)

- Training Objective: contrastive (CLIP-style) + generative (captioning)
- Highlights:
  - Strong performance across multimodal benchmarks
  - Public release of both code and pretrained weights
  - Serves as the foundation for OpenVision 2
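The generative half of this objective (and the sole objective in OpenVision 2) is a standard next-token cross-entropy over caption tokens, conditioned on the visual features. A minimal sketch, assuming the decoder has already produced per-position vocabulary logits; the helper name `caption_loss` is chosen here:

```python
import torch
import torch.nn.functional as F

def caption_loss(logits: torch.Tensor,
                 caption_ids: torch.Tensor,
                 pad_id: int = 0) -> torch.Tensor:
    """Next-token cross-entropy for an autoregressive caption decoder.

    logits:      (batch, seq_len, vocab) decoder outputs, conditioned on the
                 visual tokens (e.g. via cross-attention or prefix tokens).
    caption_ids: (batch, seq_len) ground-truth caption token ids.
    """
    # Shift by one: the prediction at position t is scored against token t+1;
    # padding positions are excluded from the loss.
    pred = logits[:, :-1].reshape(-1, logits.size(-1))
    target = caption_ids[:, 1:].reshape(-1)
    return F.cross_entropy(pred, target, ignore_index=pad_id)

vocab = 32000
logits = torch.randn(2, 16, vocab)
ids = torch.randint(1, vocab, (2, 16))
print(float(caption_loss(logits, ids)))
```

OpenVision trains with the sum of this captioning loss and the contrastive loss; OpenVision 2 drops the contrastive term (and the text encoder that feeds it) and keeps only this one.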
## 📊 Model Zoo (OpenVision 2)

### OpenVision 2 Performance on Multimodal Benchmarks
| Method | Vision Encoder | Params | Res | TextVQA | ChartQA | OCR | MME | SEED | SQA | GQA | POPE |
|--------------|---------------|--------|-----|---------|---------|-----|------|------|------|------|------|
| OpenVision | L/14 | 304M | 224 | 57.7 | 13.9 | 315 | 1487 | 69.5 | 73.6 | 62.9 | 86.4 |
| OpenVision 2 | L/14 | 304M | 224 | 59.0 | 13.7 | 327 | 1460 | 69.3 | 76.5 | 62.6 | 87.1 |
| OpenVision | L/14 | 304M | 336 | 61.2 | 15.7 | 339 | 1525 | 70.5 | 75.1 | 63.7 | 87.2 |
| OpenVision 2 | L/14 | 304M | 336 | 63.0 | 14.5 | 357 | 1486 | 70.1 | 77.5 | 63.0 | 87.7 |
| OpenVision | SoViT-400M/14 | 400M | 384 | 62.4 | 16.1 | 357 | 1493 | 70.4 | 72.4 | 63.8 | 88.0 |
| OpenVision 2 | SoViT-400M/14 | 400M | 384 | 64.3 | 15.0 | 387 | 1472 | 70.7 | 74.9 | 63.5 | 87.5 |
| OpenVision 2 | H/14 | 632M | 224 | 60.2 | 13.5 | 340 | 1470 | 69.3 | 75.4 | 62.5 | 87.2 |
| OpenVision 2 | H/14 | 632M | 336 | 63.4 | 16.3 | 391 | 1470 | 70.6 | 76.4 | 63.1 | 88.4 |
| OpenVision 2 | H/14 | 632M | 448 | 65.6 | 18.1 | 416 | 1499 | 70.6 | 75.6 | 63.1 | 88.7 |
| OpenVision 2 | g/14 | 1.01B | 224 | 60.2 | 13.7 | 338 | 1469 | 69.3 | 75.0 | 62.6 | 86.9 |
Full collection: [Hugging Face – OpenVision 2](https://huggingface.co/collections/UCSC-VLAA/openvision-2-68ab5934fe21f3fc463077da)
## 🔧 How to Load the Converted Vision Encoder

> **Note:** OpenVision 2 checkpoints require the custom `open_clip` version included in this repository. The upstream `open_clip` pip package is not compatible.
### Example

```python
import torch

# Use the OpenVision 2 version of open_clip bundled in this repository
from src.convert_upload.open_clip.factory import create_vision_encoder_and_transforms

hf_repo = "UCSC-VLAA/openvision2-vit-large-patch14-224-vision-only"
vision_encoder = create_vision_encoder_and_transforms(
    model_name=f"hf-hub:{hf_repo}"
)
vision_encoder.eval()

dummy_image = torch.ones((1, 3, 224, 224))
with torch.no_grad():
    _, patch_features = vision_encoder(dummy_image)

print("Patch feature shape:", patch_features.shape)
```
## 📊 Model Zoo (OpenVision)

### Vision Encoder Performance on ImageNet-1K
| Model | Size | Patch Size | Resolution | IN-1K Top-1 | JAX Weight | PyTorch Weight |
|-------|------|------------|------------|-------------|------------|----------------|
| OpenVision-ViT-Tiny | 5M | 16 | 160 | 46.9% | Available | Available |
| OpenVision-ViT-Tiny | 5M | 16 | 224 | 49.6% | Available | Available |
| OpenVision-ViT-Tiny | 5M | 16 | 384 | 51.5% | Available | Available |
| OpenVision-ViT-Tiny | 5M | 8 | 160 | 51.9% | Available | Available |
| OpenVision-ViT-Tiny | 5M | 8 | 224 | 53.5% | Available | Available |
| OpenVision-ViT-Tiny | 5M | 8 | 384 | 53.9% | Available | Available |
| OpenVision-ViT-Small | 22M | 16 | 160 | 63.5% | Available | Available |
| OpenVision-ViT-Small | 22M | 16 | 224 | 65.9% | Available | Available |
| OpenVision-ViT-Small | 22M | 16 | 384 | 67.1% | Available | Available |
| OpenVision-ViT-Small | 22M | 8 | 160 | 67.3% | [Available](https://huggingface.co/UCSC-VLAA/openvision-vit-smal
