VILA: Optimized Vision Language Models

💡 Introduction

VILA is a family of open VLMs designed to optimize both efficiency and accuracy for efficient video understanding and multi-image understanding.

💡 News

[2025/7] We release OmniVinci, a state-of-the-art visual-audio joint understanding omni-modal LLM built upon VILA codebase!
[2025/7] We release Long-RL that supports RL training on VILA/LongVILA/NVILA models with long videos.
[2025/6] We release PS3 and VILA-HD. PS3 is a vision encoder that scales up vision pre-training to 4K resolution. VILA-HD is VILA with PS3 as the vision encoder and shows superior performance and efficiency in understanding high-resolution detail-rich images.
[2025/1] As of January 6, 2025 VILA is now part of the new Cosmos Nemotron vision language models.
[2024/12] We release NVILA (a.k.a VILA2.0) that explores the full stack efficiency of multi-modal design, achieving cheaper training, faster deployment and better performance.
[2024/12] We release LongVILA that supports long video understanding, with long-context VLM with more than 1M context length and multi-modal sequence parallel system.
[2024/10] VILA-M3, a SOTA medical VLM finetuned on VILA1.5 is released! VILA-M3 significantly outperforms Llava-Med and on par w/ Med-Gemini and is fully opensourced! code model
[2024/10] We release VILA-U: a Unified foundation model that integrates Video, Image, Language understanding and generation.
[2024/07] VILA1.5 also ranks 1st place (OSS model) on MLVU test leaderboard.
[2024/06] VILA1.5 is now the best open sourced VLM on MMMU leaderboard and Video-MME leaderboard!
[2024/05] We release VILA-1.5, which offers video understanding capability. VILA-1.5 comes with four model sizes: 3B/8B/13B/40B.

<details> <summary>Click to show more news</summary>

[2024/05] We release AWQ-quantized 4bit VILA-1.5 models. VILA-1.5 is efficiently deployable on diverse NVIDIA GPUs (A100, 4090, 4070 Laptop, Orin, Orin Nano) by TinyChat and TensorRT-LLM backends.
[2024/03] VILA has been accepted by CVPR 2024!
[2024/02] We release AWQ-quantized 4bit VILA models, deployable on Jetson Orin and laptops through TinyChat and TinyChatEngine.
[2024/02] VILA is released. We propose interleaved image-text pretraining that enables multi-image VLM. VILA comes with impressive in-context learning capabilities. We open source everything: including training code, evaluation code, datasets, model ckpts.
[2023/12] Paper is on Arxiv!

</details>

Performance

Image Benchmarks

Video Benchmarks

Efficient Deployments

NOTE: Measured using the TinyChat backend at batch size = 1.

Inference Performance

Decoding Throughput ( Token/sec )

| $~~~~~~$ | A100 | 4090 | Orin | | --------------------------- | ----- | ----- | ---- | | NVILA-3B-Baseline | 140.6 | 190.5 | 42.7 | | NVILA-3B-TinyChat | 184.3 | 230.5 | 45.0 | | NVILA-Lite-3B-Baseline | 142.3 | 190.0 | 41.3 | | NVILA-Lite-3B-TinyChat | 186.0 | 233.9 | 44.9 | | NVILA-8B-Baseline | 82.1 | 61.9 | 11.6 | | NVILA-8B-TinyChat | 186.8 | 162.7 | 28.1 | | NVILA-Lite-8B-Baseline | 84.0 | 62.0 | 11.6 | | NVILA-Lite-8B-TinyChat | 181.8 | 167.5 | 32.8 | | NVILA-Video-8B-Baseline * | 73.2 | 58.4 | 10.9 | | NVILA-Video-8B-TinyChat * | 151.8 | 145.0 | 32.3 |

TTFT (Time-To-First-Token) ( Sec )

| $~~~~~~$ | A100 | 4090 | Orin | | --------------------------- | ------ | ------ | ------ | | NVILA-3B-Baseline | 0.0329 | 0.0269 | 0.1173 | | NVILA-3B-TinyChat | 0.0260 | 0.0188 | 0.1359 | | NVILA-Lite-3B-Baseline | 0.0318 | 0.0274 | 0.1195 | | NVILA-Lite-3B-TinyChat | 0.0314 | 0.0191 | 0.1241 | | NVILA-8B-Baseline | 0.0434 | 0.0573 | 0.4222 | | NVILA-8B-TinyChat | 0.0452 | 0.0356 | 0.2748 | | NVILA-Lite-8B-Baseline | 0.0446 | 0.0458 | 0.2507 | | NVILA-Lite-8B-TinyChat | 0.0391 | 0.0297 | 0.2097 | | NVILA-Video-8B-Baseline * | 0.7190 | 0.8840 | 5.8236 | | NVILA-Video-8B-TinyChat * | 0.6692 | 0.6815 | 5.8425 |

NOTE: Measured using the TinyChat backend at batch size = 1, dynamic_s2 disabled, and num_video_frames = 64. We use W4A16 LLM and W8A8 Vision Tower for Tinychat and the baseline precision is FP16. *: Measured with video captioning task. Otherwise, measured with image captioning task.

VILA Examples

Video captioning

https://github.com/Efficient-Large-Model/VILA/assets/156256291/c9520943-2478-4f97-bc95-121d625018a6

Prompt: Elaborate on the visual and narrative elements of the video in detail.

Caption: The video shows a person's hands working on a white surface. They are folding a piece of fabric with a checkered pattern in shades of blue and white. The fabric is being folded into a smaller, more compact shape. The person's fingernails are painted red, and they are wearing a black and red garment. There are also a ruler and a pencil on the surface, suggesting that measurements and precision are involved in the process.

In context learning

Multi-image reasoning

VILA on Jetson Orin

https://github.com/Efficient-Large-Model/VILA/assets/7783214/6079374c-0787-4bc4-b9c6-e1524b4c9dc4

VILA on RTX 4090

https://github.com/Efficient-Large-Model/VILA/assets/7783214/80c47742-e873-4080-ad7d-d17c4700539f

Installation

Install Anaconda Distribution.
Install the necessary Python packages in the environment.
```
./environment_setup.sh vila
```
(Optional) If you are an NVIDIA employee with a wandb account, install onelogger and enable it by setting training_args.use_one_logger to True in llava/train/args.py.
```
pip install --index-url=https://sc-hw-artf.nvidia.com/artifactory/api/pypi/hwinf-mlwfo-pypi/simple --upgrade one-logger-utils
```
Activate a conda environment.
```
conda activate vila
```

Training

VILA training contains three steps, for specific hyperparameters, please check out the scripts/NVILA-Lite folder:

Step-1: Alignment

We utilize LLaVA-CC3M-Pretrain-595K dataset to align the textual and visual modalities.

The stage 1 script takes in two parameters and it can run on a single 8xA100 node.

bash scripts/NVILA-Lite/align.sh Efficient-Large-Model/Qwen2-VL-7B-Instruct <alias to data>

and the trained models will be saved to runs/train/nvila-8b-align.

Step-1.5:

bash scripts/NVILA-Lite/stage15.sh runs/train/nvila-8b-align/model <alias to data>

and the trained models will be saved to runs/train/nvila-8b-align-1.5.

Step-2: Pretraining

We use MMC4 and Coyo dataset to train VLM with interleaved image-text pairs.

bash scripts/NVILA-Lite/pretrain.sh runs/train/nvila-8b-align-1.5 <alias to data>

and the trained models will be saved to runs/train/nvila-8b-pretraining.

Step-3: Supervised fine-tuning

This is the last stage of VILA training, in which we tune the model to follow multimodal instructions on a subset of M3IT, FLAN and ShareGPT4V. This stage runs on a 8xA100 node.

bash scripts/NVILA-Lite/sft.sh runs/train/nvila-8b-pretraining <alias to data>

and the trained models will be saved to runs/train/nvila-8b-SFT.

Evaluations

We have introduce vila-eval command to simplify the evaluation. Once the data is prepared, the evaluation can be launched via

MODEL_NAME=NVILA-15B
MODEL_ID=Efficient-Large-Model/$MODEL_NAME
huggingface-cli download $MODEL_ID

vila-eval \
    --model-name $MODEL_NAME \
    --model-path $MODEL_ID \
    --conv-mode auto \
    --tags-include local

it will launch all evaluations and return a summarized result.

Inference

We provide vila-infer for quick inference with user prompts and images.

# image description
vila-infer \
    --model-path Efficient-Large-Model/NVILA-15B \
    --conv-mode auto \
    --text "Please describe the image" \
    --media demo_images/demo_img.png

# video description
vila-infer \
    --model-path Efficient-Large-Model/NVILA-15B \

VILA

Install / Use

README