Unibench
Python library to evaluate the robustness of vision-language models (VLMs) across diverse benchmarks
| [Arxiv link] |
Vision-Language Model Evaluation Repository
This repository simplifies the evaluation of vision-language models. It provides a comprehensive set of tools and scripts for evaluating VLMs and benchmarks. We offer 60+ VLMs, including recent large-scale models such as EVA-CLIP, with scales reaching up to 4.3B parameters and 12.8B training samples. Additionally, we provide implementations for 40+ evaluation benchmarks.
News and Updates
The most recent changes are summarized below.
April 15, 2025 - v0.4.0
- Removed FaceNet from required libraries
- Added SigLIP2 models
- Added bivlc benchmark
- Created benchmark_builder for future benchmark implementations
- Added News & Updates section in README
- Fixed Sun397 benchmark
For full details, refer to the UPDATES.md file.
Coming Soon
- [ ] L-VLM (e.g., PaliGemma, LLaVA-NeXT)
Getting Started
Choose the UniBench installation that best fits your use case:
🔧 Standard Installation
For full functionality including evaluation, visualization, and analysis:
pip install unibench[all]
📊 Minimal Version
Best for: Analyzing existing results without running new evaluations
pip install unibench
<details>
<summary><b>What's included:</b></summary>
- Download existing benchmark results
- Visualize performance data with charts and graphs
- Load results into pandas DataFrames for analysis
- Compare model performance across benchmarks
- Minimal dependencies for faster installation
For detailed usage, see the minimal installation guide.
</details>

🤖 New Model Evaluation
Best for: Testing your models against UniBench benchmarks
pip install unibench[new_model]
<details>
<summary><b>What's included:</b></summary>
- Evaluate HuggingFace models on all UniBench benchmarks
- Test custom vision-language models
- Add new model architectures to the evaluation pipeline
- Support for CLIP, BLIP, and other VLM architectures
- Comprehensive model performance analysis
For detailed usage, see the new model evaluation guide.
</details>

📋 New Benchmark Evaluation
Best for: Adding custom datasets and benchmarks
pip install unibench[new_benchmark]
<details>
<summary><b>What's included:</b></summary>
- Add custom datasets as new benchmarks
- Evaluate all UniBench models on your benchmark
- Support for classification, detection, and custom tasks
- Flexible benchmark integration framework
- Contribute new evaluation tasks to the community
For detailed usage, see the new benchmark evaluation guide.
</details>

🚀 Quick Start
After installation, verify your setup:
# List available models and benchmarks
unibench list_models
unibench list_benchmarks
# View existing results (all versions)
unibench show_results
# Run evaluation (standard installation only)
unibench evaluate
Usage
Print out Results from Evaluated Models
The following command will print the results of the evaluations on all benchmarks and models:
unibench show_results
Run Evaluation using Command Line
The following command will run the evaluation on all benchmarks and models:
unibench evaluate
Run Evaluation using a Custom Script
The following Python snippet will run the evaluation on all benchmarks and models:
import unibench as vlm
evaluator = vlm.Evaluator()
evaluator.evaluate()
Arguments for Evaluation
The `evaluate` function takes the following arguments:
Args:
save_freq (int): The frequency at which to save results. Defaults to 1000.
face_blur (bool): Whether to use face blurring during evaluation. Defaults to False.
device (str): The device to use for evaluation. Defaults to "cuda" if available otherwise "cpu".
batch_per_gpu (int): Evaluation batch size per GPU. Defaults to 32.
The Evaluator class takes the following arguments:
Args:
seed (int): Random seed for reproducibility.
num_workers (int): Number of workers for data loading.
models (Union[List[str], str]): List of models to evaluate or "all" to evaluate all available models.
benchmarks (Union[List[str], str]): List of benchmarks to evaluate or "all" to evaluate all available benchmarks.
model_id (Union[int, None]): Specific model ID to evaluate.
benchmark_id (Union[int, None]): Specific benchmark ID to evaluate.
output_dir (str): Directory to save evaluation results.
benchmarks_dir (str): Directory containing benchmark data.
download_aggregate_precomputed (bool): Whether to download aggregate precomputed results.
download_all_precomputed (bool): Whether to download all precomputed results.
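As a sketch of how these arguments fit together, the documented parameters can be collected into plain dicts (the values below are hypothetical examples; only the parameter names and defaults come from the argument lists above, and `unibench` itself is not imported here):

```python
# Hypothetical sketch: documented Evaluator/evaluate arguments as plain dicts.
evaluator_args = dict(
    seed=1337,                    # random seed for reproducibility
    num_workers=8,                # workers for data loading
    models=["clip_resnet50"],     # list of models, or "all"
    benchmarks=["imageneta"],     # list of benchmarks, or "all"
)
evaluate_args = dict(
    save_freq=1000,    # documented default
    face_blur=False,   # documented default
    batch_per_gpu=32,  # documented default
)

# With unibench installed, these would be passed as:
#   evaluator = unibench.Evaluator(**evaluator_args)
#   evaluator.evaluate(**evaluate_args)
```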
Example
The following command will run the evaluation for `openclip_vitB32_metaclip_400m` (OpenCLIP ViT-B/32 trained on MetaCLIP-400M) and `clip_resnet50` (CLIP ResNet-50) on the vg_relation, clevr_distance, pcam, and imageneta benchmarks:
unibench evaluate --models=[openclip_vitB32_metaclip_400m,clip_resnet50] --benchmarks=[vg_relation,clevr_distance,pcam,imageneta]
In addition to saving the results in ~/.cache/unibench, the output would be a summary of the evaluation results:
model_name non-natural images reasoning relation robustness
────────────────────────────────────────────────────────────────────────────────────────
clip_resnet50 63.95 14.89 54.13 23.27
openclip_vitB32_metaclip_400m 63.87 19.46 51.54 28.71
Supported Models and Benchmarks
The full lists of models and benchmarks are available in the models_zoo and benchmarks_zoo. You can also run the following commands:
unibench list_models
# or
unibench list_benchmarks
Sample Models
| | Dataset Size (Million) | Number of Parameters (Million) | Learning Objective | Architecture | Model Name |
| :----------------- | ---------------------: | -----------------------------: | :----------------- | :----------- | :------------ |
| blip_vitB16_14m | 14 | 86 | BLIP | vit | BLIP ViT B 16 |
| blip_vitL16_129m | 129 | 307 | BLIP | vit | BLIP ViT L 16 |
| blip_vitB16_129m | 129 | 86 | BLIP | vit | BLIP ViT B 16 |
| blip_vitB16_coco | 129 | 86 | BLIP | vit | BLIP ViT B 16 |
| blip_vitB16_flickr | 129 | 86 | BLIP | vit | BLIP ViT B 16 |
Sample Benchmarks

| | benchmark | benchmark_type |
| :------------- | :-------- | :------------- |
| clevr_distance | zero-shot | vtab |
| fgvc_aircraft | zero-shot | transfer |
| objectnet | zero-shot | robustness |
| winoground | relation | relation |
| imagenetc | zero-shot | corruption |
Benchmarks Overview

| benchmark type | number of benchmarks |
| :------------- | :------------------: |
| ImageNet | 1 |
| vtab | 18 |
| transfer | 7 |
| robustness | 6 |
| relation | 6 |
| corruption | 1 |
<!-- ## :sparkles: Features/Objectives

- Ease-of-use VLM evaluation repo.
- Evaluate existing and future VLMs on benchmarks without extensive code
- Evaluate on existing and future benchmarks without extensive code

## :pencil2: Repository Structure

The repository is organized into the following directories:

- `common_utils`: Scripts for common utilities used throughout the repository.
- `benchmarks_zoo`: Scripts for loading various benchmarks.
- `models_zoo`: Scripts for loading various models.
- `slurm_scripts`: Scripts for running the evaluation in parallel on a SLURM cluster.
- `main.py`: Script for running the evaluation.
- `plotter.py`: Script for plotting the results.
- `output.py`: Script for saving the results. -->

How results are saved
For each model, the results are saved in the output directory defined in `constants`: `~/.cache/unibench/outputs`.
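As a sketch (the exact file layout under the outputs directory is not specified here), saved result files can be enumerated with standard `pathlib` calls; the function name below is hypothetical:

```python
from pathlib import Path


def list_result_files(output_dir: str) -> list[str]:
    """Return sorted relative paths of all files under an outputs directory."""
    root = Path(output_dir).expanduser()
    return sorted(
        str(p.relative_to(root)) for p in root.rglob("*") if p.is_file()
    )


# e.g. list_result_files("~/.cache/unibench/outputs")
```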
Add new Benchmark
To add a new benchmark, simply inherit from the `torch.utils.data.Dataset` class and implement the `__getitem__` and `__len__` methods. For example, here is how to add FashionMNIST as a new benchmark:
from functools import partial

from torchvision.datasets import FashionMNIST

from unibench import Evaluator
from unibench.benchmarks_zoo import ZeroShotBenchmarkHandler

class_names = [
    "T-shirt/top",
    "Trouser",
    "Pullover",
    "Dress",
    "Coat",
    "Sandal",
    "Shirt",
    "Sneaker",
    "Bag",
    "Ankle boot",
]

templates = ["an image of {}"]

# Point `root` at any local directory where the dataset should be downloaded.
benchmark = partial(
    FashionMNIST, root="data", train=False, download=True
)

handler = partial(
    ZeroShotBenchmarkHandler,
    benchmark_name="fashion_mnist_new",
    classes=class_names,
    templates=templates,
)

eval = Evaluator()
eval.add_benchmark(
    benchmark,
    handler,
    meta_data={
        "benchmark_type": "object recognition",
    },
)
eval.update_benchmark_list(["fashion_mnist_new"])
eval.evaluate()
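The only contract the dataset object must satisfy is the two methods mentioned above. As a minimal sketch (pure Python so it runs anywhere; in practice you would subclass `torch.utils.data.Dataset`, and the class name here is hypothetical):

```python
class TinyImageListDataset:
    """Minimal dataset sketch; in practice, subclass torch.utils.data.Dataset.

    Wraps a list of (image, label) pairs and exposes the two required methods.
    """

    def __init__(self, samples):
        self.samples = samples  # list of (image, label) tuples

    def __len__(self):
        # number of samples in the benchmark
        return len(self.samples)

    def __getitem__(self, idx):
        # return one (image, label) pair by index
        image, label = self.samples[idx]
        return image, label


ds = TinyImageListDataset([("img0.png", 0), ("img1.png", 1)])
```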