Unibench
Python library to evaluate the robustness of vision-language models (VLMs) across diverse benchmarks
| [Arxiv link] |
Vision-Language Model Evaluation Repository
This repository simplifies the evaluation of vision-language models. It provides a comprehensive set of tools and scripts for evaluating VLMs and benchmarks. We offer 60+ VLMs, including recent large-scale models such as EVA-CLIP, with scales reaching up to 4.3B parameters and 12.8B training samples. Additionally, we provide implementations for 40+ evaluation benchmarks.
News and Updates
The most recent changes are summarized below.
April 15, 2025 - v0.4.0
- Removed FaceNet from required libraries
- Added SigLIP2 models
- Added bivlc benchmark
- Created benchmark_builder for future benchmark implementations
- Added News & Updates section in README
- Fixed Sun397 benchmark
For full details, refer to the UPDATES.md file.
Coming Soon
- [ ] L-VLM (e.g., PaliGemma, LLaVA-NeXT)
Getting Started
Choose the UniBench installation that best fits your use case:
🔧 Standard Installation
For full functionality including evaluation, visualization, and analysis:
pip install unibench[all]
📊 Minimal Version
Best for: Analyzing existing results without running new evaluations
pip install unibench
<details>
<summary><b>What's included:</b></summary>
- Download existing benchmark results
- Visualize performance data with charts and graphs
- Load results into pandas DataFrames for analysis
- Compare model performance across benchmarks
- Minimal dependencies for faster installation
For detailed usage, see the minimal installation guide.
</details>

🤖 New Model Evaluation
Best for: Testing your models against UniBench benchmarks
pip install unibench[new_model]
<details>
<summary><b>What's included:</b></summary>
- Evaluate HuggingFace models on all UniBench benchmarks
- Test custom vision-language models
- Add new model architectures to the evaluation pipeline
- Support for CLIP, BLIP, and other VLM architectures
- Comprehensive model performance analysis
For detailed usage, see the new model evaluation guide.
</details>

📋 New Benchmark Evaluation
Best for: Adding custom datasets and benchmarks
pip install unibench[new_benchmark]
<details>
<summary><b>What's included:</b></summary>
- Add custom datasets as new benchmarks
- Evaluate all UniBench models on your benchmark
- Support for classification, detection, and custom tasks
- Flexible benchmark integration framework
- Contribute new evaluation tasks to the community
For detailed usage, see the new benchmark evaluation guide.
</details>

🚀 Quick Start
After installation, verify your setup:
# List available models and benchmarks
unibench list_models
unibench list_benchmarks
# View existing results (all versions)
unibench show_results
# Run evaluation (standard installation only)
unibench evaluate
Usage
Print out Results from Evaluated Models
The following command will print the results of the evaluations on all benchmarks and models:
unibench show_results
Run Evaluation using Command Line
The following command will run the evaluation on all benchmarks and models:
unibench evaluate
Run Evaluation using a Custom Script
The following Python snippet will run the evaluation on all benchmarks and models:
import unibench as vlm
evaluator = vlm.Evaluator()
evaluator.evaluate()
Arguments for Evaluation
The `evaluate` function takes the following arguments:
Args:
save_freq (int): The frequency at which to save results. Defaults to 1000.
face_blur (bool): Whether to use face blurring during evaluation. Defaults to False.
device (str): The device to use for evaluation. Defaults to "cuda" if available otherwise "cpu".
batch_per_gpu (int): Evaluation batch size per GPU. Defaults to 32.
The Evaluator class takes the following arguments:
Args:
seed (int): Random seed for reproducibility.
num_workers (int): Number of workers for data loading.
models (Union[List[str], str]): List of models to evaluate or "all" to evaluate all available models.
benchmarks (Union[List[str], str]): List of benchmarks to evaluate or "all" to evaluate all available benchmarks.
model_id (Union[int, None]): Specific model ID to evaluate.
benchmark_id (Union[int, None]): Specific benchmark ID to evaluate.
output_dir (str): Directory to save evaluation results.
benchmarks_dir (str): Directory containing benchmark data.
download_aggregate_precomputed (bool): Whether to download aggregate precomputed results.
download_all_precomputed (bool): Whether to download all precomputed results.
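As a sketch of how these arguments fit together, the documented parameters can be collected into plain dicts (the values below are hypothetical examples; only the parameter names and defaults come from the argument lists above, and `unibench` itself is not imported here):

```python
# Hypothetical sketch: documented Evaluator/evaluate arguments as plain dicts.
evaluator_args = dict(
    seed=1337,                    # random seed for reproducibility
    num_workers=8,                # workers for data loading
    models=["clip_resnet50"],     # list of models, or "all"
    benchmarks=["imageneta"],     # list of benchmarks, or "all"
)
evaluate_args = dict(
    save_freq=1000,    # documented default
    face_blur=False,   # documented default
    batch_per_gpu=32,  # documented default
)

# With unibench installed, these would be passed as:
#   evaluator = unibench.Evaluator(**evaluator_args)
#   evaluator.evaluate(**evaluate_args)
```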
Example
The following command will run the evaluation for `openclip_vitB32_metaclip_400m` (OpenCLIP ViT-B/32 trained on MetaCLIP-400M) and `clip_resnet50` (CLIP ResNet-50) on the vg_relation, clevr_distance, pcam, and imageneta benchmarks:
unibench evaluate --models=[openclip_vitB32_metaclip_400m,clip_resnet50] --benchmarks=[vg_relation,clevr_distance,pcam,imageneta]
In addition to saving the results in ~/.cache/unibench, the output would be a summary of the evaluation results:
model_name non-natural images reasoning relation robustness
────────────────────────────────────────────────────────────────────────────────────────
clip_resnet50 63.95 14.89 54.13 23.27
openclip_vitB32_metaclip_400m 63.87 19.46 51.54 28.71
Supported Models and Benchmarks
The full lists of models and benchmarks are available in the models_zoo and benchmarks_zoo. You can also run the following commands:
unibench list_models
# or
unibench list_benchmarks
Sample Models
| | Dataset Size (Million) | Number of Parameters (Million) | Learning Objective | Architecture | Model Name |
| :----------------- | ---------------------: | -----------------------------: | :----------------- | :----------- | :------------ |
| blip_vitB16_14m | 14 | 86 | BLIP | vit | BLIP ViT B 16 |
| blip_vitL16_129m | 129 | 307 | BLIP | vit | BLIP ViT L 16 |
| blip_vitB16_129m | 129 | 86 | BLIP | vit | BLIP ViT B 16 |
| blip_vitB16_coco | 129 | 86 | BLIP | vit | BLIP ViT B 16 |
| blip_vitB16_flickr | 129 | 86 | BLIP | vit | BLIP ViT B 16 |
Sample Benchmarks

| | benchmark | benchmark_type |
| :------------- | :-------- | :------------- |
| clevr_distance | zero-shot | vtab |
| fgvc_aircraft | zero-shot | transfer |
| objectnet | zero-shot | robustness |
| winoground | relation | relation |
| imagenetc | zero-shot | corruption |
Benchmarks Overview

| benchmark type | number of benchmarks |
| :------------- | :------------------: |
| ImageNet | 1 |
| vtab | 18 |
| transfer | 7 |
| robustness | 6 |
| relation | 6 |
| corruption | 1 |
<!-- ## :sparkles: Features/Objectives

- Ease-of-use VLM evaluation repo.
- Evaluate existing and future VLMs on benchmarks without extensive code
- Evaluate on existing and future benchmarks without extensive code

## :pencil2: Repository Structure

The repository is organized into the following directories:

- `common_utils`: Scripts for common utilities used throughout the repository.
- `benchmarks_zoo`: Scripts for loading various benchmarks.
- `models_zoo`: Scripts for loading various models.
- `slurm_scripts`: Scripts for running the evaluation in parallel on a SLURM cluster.
- `main.py`: Script for running the evaluation.
- `plotter.py`: Script for plotting the results.
- `output.py`: Script for saving the results. -->

How results are saved
For each model, the results are saved in the output directory defined in `constants`: `~/.cache/unibench/outputs`.
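As a sketch (the exact file layout under the outputs directory is not specified here), saved result files can be enumerated with standard `pathlib` calls; the function name below is hypothetical:

```python
from pathlib import Path


def list_result_files(output_dir: str) -> list[str]:
    """Return sorted relative paths of all files under an outputs directory."""
    root = Path(output_dir).expanduser()
    return sorted(
        str(p.relative_to(root)) for p in root.rglob("*") if p.is_file()
    )


# e.g. list_result_files("~/.cache/unibench/outputs")
```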
Add new Benchmark
To add a new benchmark, simply inherit from the `torch.utils.data.Dataset` class and implement the `__getitem__` and `__len__` methods. For example, here is how to add FashionMNIST as a new benchmark:
from functools import partial

from torchvision.datasets import FashionMNIST

from unibench import Evaluator
from unibench.benchmarks_zoo import ZeroShotBenchmarkHandler

class_names = [
    "T-shirt/top",
    "Trouser",
    "Pullover",
    "Dress",
    "Coat",
    "Sandal",
    "Shirt",
    "Sneaker",
    "Bag",
    "Ankle boot",
]

templates = ["an image of {}"]

# Point `root` at any local directory where the dataset should be downloaded.
benchmark = partial(
    FashionMNIST, root="data", train=False, download=True
)

handler = partial(
    ZeroShotBenchmarkHandler,
    benchmark_name="fashion_mnist_new",
    classes=class_names,
    templates=templates,
)

eval = Evaluator()
eval.add_benchmark(
    benchmark,
    handler,
    meta_data={
        "benchmark_type": "object recognition",
    },
)
eval.update_benchmark_list(["fashion_mnist_new"])
eval.evaluate()
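The only contract the dataset object must satisfy is the two methods mentioned above. As a minimal sketch (pure Python so it runs anywhere; in practice you would subclass `torch.utils.data.Dataset`, and the class name here is hypothetical):

```python
class TinyImageListDataset:
    """Minimal dataset sketch; in practice, subclass torch.utils.data.Dataset.

    Wraps a list of (image, label) pairs and exposes the two required methods.
    """

    def __init__(self, samples):
        self.samples = samples  # list of (image, label) tuples

    def __len__(self):
        # number of samples in the benchmark
        return len(self.samples)

    def __getitem__(self, idx):
        # return one (image, label) pair by index
        image, label = self.samples[idx]
        return image, label


ds = TinyImageListDataset([("img0.png", 0), ("img1.png", 1)])
```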