NaVILA
[RSS'25] This repository is the implementation of "NaVILA: Legged Robot Vision-Language-Action Model for Navigation".
<p align="center"> <img src="assets/teaser.gif" width="600"> </p>
💡 Introduction
NaVILA is a two-level framework that combines VLAs with locomotion skills for navigation. It generates high-level language-based commands, while a real-time locomotion policy ensures obstacle avoidance.
<p align="center"> <img src="assets/method.png" width="600"> </p>
TODO
- [x] Release model weights and evaluation code.
- [x] Release training code.
- [x] Release YouTube Human Touring dataset.
- [x] Release Isaac Sim evaluation; please see here.
🚀 Training
Installation
To build the environment for training NaVILA, please run the following:
./environment_setup.sh navila
conda activate navila
Optional: If you plan to use TensorBoard for logging, install tensorboardX via pip.
Dataset
For general VQA datasets like video_chatgpt, sharegpt_video, sharegpt4v_sft, please follow the data preparation instructions in NVILA.
We provide annotations for envdrop, scanqa, r2r, rxr, and human on Hugging Face.
Please download the repo and extract the tar.gz files in their respective subfolders.
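Unpacking the downloaded archives into their subfolders can be sketched as follows (a minimal helper, assuming the Hugging Face repo was downloaded to a local folder; the function name is illustrative):

```python
import tarfile
from pathlib import Path

def extract_all(dataset_root):
    """Extract every .tar.gz archive in place, next to where it was downloaded,
    so each subfolder ends up with its unpacked annotations/frames."""
    for archive in Path(dataset_root).rglob("*.tar.gz"):
        with tarfile.open(archive, "r:gz") as tf:
            tf.extractall(archive.parent)
```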
- YouTube Human Touring:
Due to copyright restrictions, raw videos/images are not released. We provide video IDs and annotations. You can download the videos using yt-dlp and extract frames using scripts/extract_rawframes.py.
- EnvDrop:
Due to the large number of videos, we provide annotations only. Please download the R2R augmented split from R2R_VLNCE_v1-3_preprocessed.zip and render the corresponding videos using VLN-CE.
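Since only video IDs are released for the Human Touring split, the yt-dlp download step can be sketched like this (the `-f mp4` format flag, output template, and one-ID-per-line `video_ids.txt` format are assumptions; adjust to the actual files):

```python
from pathlib import Path

def build_download_commands(ids_file, out_dir="NaVILA-Dataset/Human/videos"):
    """Build one yt-dlp command per video ID listed in ids_file.

    Assumes video_ids.txt contains one YouTube ID per line; the output
    template saves each video as <id>.mp4, matching the dataset layout.
    """
    ids = [line.strip() for line in Path(ids_file).read_text().splitlines()
           if line.strip()]
    return [
        f'yt-dlp -f mp4 -o "{out_dir}/%(id)s.%(ext)s" '
        f'"https://www.youtube.com/watch?v={vid}"'
        for vid in ids
    ]
```

Each command can then be run in a shell (or via subprocess), followed by scripts/extract_rawframes.py on the resulting videos.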
The data should have a structure like:
NaVILA-Dataset
├─ EnvDrop
| ├─ videos
| | ├─ 1.mp4
| | ├─ ...
| ├─ annotations.json
├─ Human
| ├─ raw_frames
| | ├─ Aei0GpsWNys
| | | ├─ 0001.jpg
| | | ├─ ...
| | ├─ ...
| ├─ videos
| | ├─ Aei0GpsWNys.mp4
| | ├─ ...
| ├─ annotations.json
| ├─ video_ids.txt
├─ R2R
| ├─ train
| | ├─ 1
| | | ├─ frame_0.jpg
| | | ├─ ...
| | ├─ ...
| ├─ annotations.json
├─ RxR
| ├─ train
| | ├─ 1
| | | ├─ frame_0.jpg
| | | ├─ ...
| | ├─ ...
| ├─ annotations.json
├─ ScanQA
| ├─ videos
| | ├─ scene0760_00.mp4
| | ├─ ...
| ├─ annotations
| | ├─ ScanQA_v1.0_train_reformat.json
| | ├─ ...
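A quick sanity check over the layout above can be sketched as follows (folder and file names are taken from the tree; the helper itself is illustrative, not part of the repo):

```python
from pathlib import Path

def missing_annotations(root):
    """Return the dataset splits whose annotation entry is missing.

    Expected entries mirror the NaVILA-Dataset tree; note ScanQA keeps
    its annotations in a subfolder rather than a single json file.
    """
    expected = {
        "EnvDrop": "annotations.json",
        "Human": "annotations.json",
        "R2R": "annotations.json",
        "RxR": "annotations.json",
        "ScanQA": "annotations",
    }
    root = Path(root)
    return [name for name, entry in expected.items()
            if not (root / name / entry).exists()]
```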
Training
The pretrained model to start from is provided in a8cheng/navila-siglip-llama3-8b-v1.5-pretrain. Please modify the data paths in llava/data/datasets_mixture.py and use the script scripts/train/sft_8frames.sh to launch the training.
📊 Evaluation
Installation
This repository builds on VLN-CE, which relies on older versions of Habitat-Lab and Habitat-Sim. The installation process requires several modifications and can be complex.
- Create a Conda Environment with Python 3.10
conda create -n navila-eval python=3.10
conda activate navila-eval
- Build Habitat-Sim & Lab (v0.1.7) from Source
Follow the VLN-CE setup guide. To resolve NumPy compatibility issues, apply the following hotfix:
python evaluation/scripts/habitat_sim_autofix.py # replace habitat_sim/utils/common.py
- Install VLN-CE Dependencies
pip install -r evaluation/requirements.txt
- Install VILA Dependencies
# Install FlashAttention2
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.5.8/flash_attn-2.5.8+cu122torch2.3cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
# Install VILA (assuming you are in the repo root)
pip install -e .
pip install -e ".[train]"
pip install -e ".[eval]"
# Install HF's Transformers
pip install git+https://github.com/huggingface/transformers@v4.37.2
site_pkg_path=$(python -c 'import site; print(site.getsitepackages()[0])')
cp -rv ./llava/train/transformers_replace/* $site_pkg_path/transformers/
cp -rv ./llava/train/deepspeed_replace/* $site_pkg_path/deepspeed/
- Fix WebDataset Version for VLN-CE Compatibility
pip install webdataset==0.1.103
Data
Please follow VLN-CE to download the R2R and RxR annotations and the scene data into the evaluation/data folder. The data should have a structure like:
data/datasets
├─ RxR_VLNCE_v0
| ├─ train
| | ├─ train_guide.json.gz
| | ├─ ...
| ├─ val_unseen
| | ├─ val_unseen_guide.json.gz
| | ├─ ...
| ├─ ...
├─ R2R_VLNCE_v1-3_preprocessed
| ├─ train
| | ├─ train.json.gz
| | ├─ ...
| ├─ val_unseen
| | ├─ val_unseen.json.gz
| | ├─ ...
data/scene_datasets
├─ mp3d
| ├─ 17DRP5sb8fy
| | ├─ 17DRP5sb8fy.glb
| | ├─ ...
| ├─ ...
Running Evaluation
- Download the checkpoint from a8cheng/navila-llama3-8b-8f.
- Run evaluation on R2R using:
cd evaluation
bash scripts/eval/r2r.sh CKPT_PATH NUM_CHUNKS CHUNK_START_IDX "GPU_IDS"
Examples:
- Single GPU:
bash scripts/eval/r2r.sh CKPT_PATH 1 0 "0"
- Multiple GPUs (e.g., 8 GPUs):
bash scripts/eval/r2r.sh CKPT_PATH 8 0 "0,1,2,3,4,5,6,7"
- Visualized videos are saved in
./eval_out/CKPT_NAME/VLN-CE-v1/val_unseen/videos
<p align="center">
<img src="assets/sample.gif" width="600">
</p>
- Aggregate results and view the scores:
python scripts/eval_jsons.py ./eval_out/CKPT_NAME/VLN-CE-v1/val_unseen NUM_CHUNKS
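The aggregation step merges the per-chunk result files produced by the NUM_CHUNKS parallel workers. A minimal sketch of what such a merge does (file naming and metric keys here are assumptions, not the actual output format of scripts/eval_jsons.py):

```python
import json
from pathlib import Path

def average_metrics(result_dir):
    """Average numeric metrics across all per-chunk result jsons.

    Assumes each chunk wrote a flat {metric: value} json file; the real
    eval_jsons.py format may differ.
    """
    totals, count = {}, 0
    for path in sorted(Path(result_dir).glob("*.json")):
        stats = json.loads(path.read_text())
        for k, v in stats.items():
            if isinstance(v, (int, float)):
                totals[k] = totals.get(k, 0.0) + v
        count += 1
    return {k: v / count for k, v in totals.items()} if count else {}
```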
📜 Citation
@inproceedings{cheng2025navila,
title={Navila: Legged robot vision-language-action model for navigation},
author={Cheng, An-Chieh and Ji, Yandong and Yang, Zhaojing and Gongye, Zaitian and Zou, Xueyan and Kautz, Jan and B{\i}y{\i}k, Erdem and Yin, Hongxu and Liu, Sifei and Wang, Xiaolong},
booktitle={RSS},
year={2025}
}