
Unibench

Python library for evaluating the robustness of vision-language models (VLMs) across diverse benchmarks


<p align="center">[Arxiv link]</p>

<p align="center"> <a href="#getting-started">Getting Started</a> • <a href="#usage">Usage</a> • <a href="#sparkles-supported-models-and-benchmarks">Benchmarks & Models</a> • <a href="#credit_card-citation">Credit & Citation</a> </p>

Vision-Language Model Evaluation Repository

This repository is designed to simplify the evaluation of vision-language models. It provides a comprehensive set of tools and scripts for evaluating VLMs against a wide range of benchmarks. We offer 60+ VLMs, including recent large-scale models such as EVA-CLIP, with scales reaching up to 4.3B parameters and 12.8B training samples. Additionally, we provide implementations of 40+ evaluation benchmarks.

News and Updates

For the latest news and updates, see the snippet below.

April 15, 2025 - v0.4.0

  • Removed FaceNet from required libraries
  • Added SigLIP2 models
  • Added bivlc benchmark
  • Created benchmark_builder for future benchmark implementations
  • Added News & Updates section in README
  • Fixed Sun397 benchmark

For full details, refer to the UPDATES.md file.

Coming Soon

  • [ ] L-VLM (e.g. PaliGemma, LlavaNext)

Getting Started

Choose the UniBench installation that best fits your use case:

🔧 Standard Installation

For full functionality including evaluation, visualization, and analysis:

pip install unibench[all]

📊 Minimal Version

Best for: Analyzing existing results without running new evaluations

pip install unibench
<details> <summary><b>What's included:</b></summary>
  • Download existing benchmark results
  • Visualize performance data with charts and graphs
  • Load results into pandas DataFrames for analysis
  • Compare model performance across benchmarks
  • Minimal dependencies for faster installation

For detailed usage, see the minimal installation guide.

</details>

🤖 New Model Evaluation

Best for: Testing your models against UniBench benchmarks

pip install unibench[new_model]
<details> <summary><b>What's included:</b></summary>
  • Evaluate HuggingFace models on all UniBench benchmarks
  • Test custom vision-language models
  • Add new model architectures to the evaluation pipeline
  • Support for CLIP, BLIP, and other VLM architectures
  • Comprehensive model performance analysis

For detailed usage, see the new model evaluation guide.

</details>

📋 New Benchmark Evaluation

Best for: Adding custom datasets and benchmarks

pip install unibench[new_benchmark]
<details> <summary><b>What's included:</b></summary>
  • Add custom datasets as new benchmarks
  • Evaluate all UniBench models on your benchmark
  • Support for classification, detection, and custom tasks
  • Flexible benchmark integration framework
  • Contribute new evaluation tasks to the community

For detailed usage, see the new benchmark evaluation guide.

</details>

🚀 Quick Start

After installation, verify your setup:

# List available models and benchmarks
unibench list_models
unibench list_benchmarks

# View existing results (all versions)
unibench show_results

# Run evaluation (standard installation only)
unibench evaluate

Usage

Print out Results from Evaluated Models

The following command will print the results of the evaluations on all benchmarks and models:

unibench show_results

Run Evaluation using Command Line

The following command will run the evaluation on all benchmarks and models:

unibench evaluate

Run Evaluation using Custom Script

The following script runs the evaluation on all benchmarks and models:

import unibench as vlm

evaluator = vlm.Evaluator()
evaluator.evaluate()

Arguments for Evaluation

The evaluate function takes the following arguments:

Args:
    save_freq (int): The frequency at which to save results. Defaults to 1000.
    face_blur (bool): Whether to use face blurring during evaluation. Defaults to False.
    device (str): The device to use for evaluation. Defaults to "cuda" if available otherwise "cpu".
    batch_per_gpu (int): Evaluation batch size per GPU. Defaults to 32.

The Evaluator class takes the following arguments:

Args:
    seed (int): Random seed for reproducibility.
    num_workers (int): Number of workers for data loading.
    models (Union[List[str], str]): List of models to evaluate or "all" to evaluate all available models.
    benchmarks (Union[List[str], str]): List of benchmarks to evaluate or "all" to evaluate all available benchmarks.
    model_id (Union[int, None]): Specific model ID to evaluate.
    benchmark_id (Union[int, None]): Specific benchmark ID to evaluate.
    output_dir (str): Directory to save evaluation results.
    benchmarks_dir (str): Directory containing benchmark data.
    download_aggregate_precomputed (bool): Whether to download aggregate precomputed results.
    download_all_precomputed (bool): Whether to download all precomputed results.

Example

The following command runs the evaluation for openclip_vitB32 trained on metaclip400m and for CLIP ResNet50, on the vg_relation, clevr_distance, pcam, and imageneta benchmarks:

unibench evaluate --models=[openclip_vitB32_metaclip_400m,clip_resnet50] --benchmarks=[vg_relation,clevr_distance,pcam,imageneta]

In addition to saving the results in ~/.cache/unibench, the command prints a summary of the evaluation results:

  model_name                      non-natural images   reasoning   relation   robustness  
 ──────────────────────────────────────────────────────────────────────────────────────── 
  clip_resnet50                   63.95                 14.89       54.13      23.27       
  openclip_vitB32_metaclip_400m   63.87                 19.46       51.54      28.71   

Supported Models and Benchmarks

The full list of models and benchmarks is available in the models_zoo and benchmarks_zoo. You can also run the following commands:

unibench list_models
# or
unibench list_benchmarks

Sample Models

| | Dataset Size (Million) | Number of Parameters (Million) | Learning Objective | Architecture | Model Name |
| :----------------- | ---------------------: | -----------------------------: | :----------------- | :----------- | :------------ |
| blip_vitB16_14m | 14 | 86 | BLIP | vit | BLIP ViT B 16 |
| blip_vitL16_129m | 129 | 307 | BLIP | vit | BLIP ViT L 16 |
| blip_vitB16_129m | 129 | 86 | BLIP | vit | BLIP ViT B 16 |
| blip_vitB16_coco | 129 | 86 | BLIP | vit | BLIP ViT B 16 |
| blip_vitB16_flickr | 129 | 86 | BLIP | vit | BLIP ViT B 16 |

Sample Benchmarks

| | benchmark | benchmark_type |
| :------------- | :-------- | :------------- |
| clevr_distance | zero-shot | vtab |
| fgvc_aircraft | zero-shot | transfer |
| objectnet | zero-shot | robustness |
| winoground | relation | relation |
| imagenetc | zero-shot | corruption |

Benchmarks Overview

| benchmark type | number of benchmarks |
| :------------- | :------------------: |
| ImageNet | 1 |
| vtab | 18 |
| transfer | 7 |
| robustness | 6 |
| relation | 6 |
| corruption | 1 |

<!-- ## :sparkles: Features/Objectives - Ease-of-use VLM evaluation repo. - Evaluate existing and future VLMs on benchmarks without extensive code - Evaluate on existing and future benchmarks without extensive code ## :pencil2: Repository Structure The repository is organized into the following directories: - `common_utils`: Scripts for common utilities used throughout the repository. - `benchmarks_zoo`: Scripts for loading various benchmarks. - `models_zoo`: Scripts for loading various models. - `slurm_scripts`: Scripts for running the evaluation in parallel on a SLURM cluster. - `main.py`: Script for running the evaluation. - `plotter.py`: Script for plotting the results. - `output.py`: Script for saving the results. -->

How results are saved

For each model, the results are saved in the output directory defined in constants: ~/.cache/unibench/outputs.

Add new Benchmark

To add a new benchmark, you can simply inherit from the torch.utils.data.Dataset class and implement the __getitem__ and __len__ methods. For example, here is how to add FashionMNIST as a new zero-shot benchmark:

from functools import partial
from unibench import Evaluator
from unibench.benchmarks_zoo import ZeroShotBenchmarkHandler
from torchvision.datasets import FashionMNIST

class_names = [
    "T-shirt/top",
    "Trouser",
    "Pullover",
    "Dress",
    "Coat",
    "Sandal",
    "Shirt",
    "Sneaker",
    "Bag",
    "Ankle boot",
]

templates = ["an image of {}"]

# Note: set root to a writable local data directory on your machine.
benchmark = partial(
    FashionMNIST, root="/fsx-robust/haideraltahan", train=False, download=True
)
handler = partial(
    ZeroShotBenchmarkHandler,
    benchmark_name="fashion_mnist_new",
    classes=class_names,
    templates=templates,
)


eval = Evaluator()

eval.add_benchmark(
    benchmark,
    handler,
    meta_data={
        "benchmark_type": "object recognition",
    },
)
eval.update_benchmark_list(["fashion_mnist_new"])
eval.evaluate()
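The Dataset contract used above is just the standard __getitem__/__len__ protocol, which torch.utils.data.Dataset formalizes. As a minimal, framework-free sketch of that contract (the class and sample data here are hypothetical, not part of UniBench):

```python
# Minimal sketch of the map-style dataset protocol UniBench benchmarks
# rely on: any object exposing __getitem__ and __len__ satisfies it.
class ToyBenchmark:
    def __init__(self, samples):
        # samples: list of (image, label) pairs; strings stand in for images.
        self.samples = samples

    def __len__(self):
        # Number of examples in the benchmark.
        return len(self.samples)

    def __getitem__(self, idx):
        # Return one (image, label) pair by index.
        return self.samples[idx]


data = ToyBenchmark([("img0", 3), ("img1", 7)])
print(len(data))   # 2
print(data[1])     # ('img1', 7)
```

A real benchmark would return an image tensor and label from __getitem__; a handler such as ZeroShotBenchmarkHandler then supplies the class names and prompt templates used for zero-shot scoring.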
