
NaVILA

[RSS'25] This repository is the implementation of "NaVILA: Legged Robot Vision-Language-Action Model for Navigation"

<div align="center"> <p align="center"> <img src="assets/logo.png" width="20%"/> </p>

NaVILA: Legged Robot Vision-Language-Action Model for Navigation (RSS'25)

website Arxiv Huggingface Locomotion Code

<p align="center"> <img src="assets/teaser.gif" width="600"> </p> </div>

💡 Introduction

NaVILA is a two-level framework that combines VLAs with locomotion skills for navigation. The VLA generates high-level, language-based commands, while a real-time locomotion policy executes them and handles obstacle avoidance.

<p align="center"> <img src="assets/method.png" width="600"> </p>

TODO

  • [x] Release models/weights/evaluation.
  • [x] Release training code. (around June 30th)
  • [x] Release YouTube Human Touring dataset. (around June 30th)
  • [x] Release Isaac Sim evaluation, please see here.

🚀 Training

Installation

To build the environment for training NaVILA, run the following:

./environment_setup.sh navila
conda activate navila

Optional: If you plan to use TensorBoard for logging, install tensorboardX via pip.

Dataset

For general VQA datasets like video_chatgpt, sharegpt_video, sharegpt4v_sft, please follow the data preparation instructions in NVILA. We provide annotations for envdrop, scanqa, r2r, rxr, and human on Hugging Face. Please download the repo and extract the tar.gz files in their respective subfolders.

<p align="center"> <img src="assets/human_touring.gif" width="600"> </p>
  • YouTube Human Touring:
    Due to copyright restrictions, raw videos/images are not released. We provide video IDs and annotations. You can download the videos using yt-dlp and extract frames using: scripts/extract_rawframes.py

  • EnvDrop:
    Due to the large number of videos, we provide annotations only. Please download the R2R augmented split from R2R_VLNCE_v1-3_preprocessed.zip and render corresponding videos using VLN-CE.
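The YouTube download step above can be sketched as follows. This is a minimal illustration, assuming `video_ids.txt` holds one YouTube ID per line; the exact flags expected by `scripts/extract_rawframes.py` may differ, so check that script before wiring it in.

```python
import subprocess
from pathlib import Path

def build_download_cmd(video_id: str,
                       out_dir: str = "NaVILA-Dataset/Human/videos") -> list[str]:
    """Build a yt-dlp command that saves the video as <video_id>.mp4."""
    return [
        "yt-dlp",
        "-f", "mp4",                        # prefer an mp4 container
        "-o", f"{out_dir}/%(id)s.%(ext)s",  # e.g. Aei0GpsWNys.mp4
        f"https://www.youtube.com/watch?v={video_id}",
    ]

def download_all(id_file: str = "NaVILA-Dataset/Human/video_ids.txt") -> None:
    """Download every video listed in the ID file, one per line."""
    for video_id in Path(id_file).read_text().split():
        subprocess.run(build_download_cmd(video_id), check=True)
```

After the videos are downloaded, `scripts/extract_rawframes.py` produces the per-video frame folders shown in the layout below.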

The data should be structured as follows:

NaVILA-Dataset
├─ EnvDrop
|   ├─ videos
|   |    ├─ 1.mp4
|   |    ├─ ...
|   ├─ annotations.json
├─ Human
|   ├─ raw_frames
|   |    ├─ Aei0GpsWNys
|   |    |    ├─ 0001.jpg
|   |    |    ├─ ...
|   |    ├─ ...
|   ├─ videos
|   |    ├─ Aei0GpsWNys.mp4
|   |    ├─ ...
|   ├─ annotations.json
|   ├─ video_ids.txt
├─ R2R
|   ├─ train
|   |    ├─ 1
|   |    |    ├─ frame_0.jpg 
|   |    |    ├─ ...
|   |    ├─ ...
|   ├─ annotations.json
├─ RxR
|   ├─ train
|   |    ├─ 1
|   |    |    ├─ frame_0.jpg 
|   |    |    ├─ ...
|   |    ├─ ...
|   ├─ annotations.json
├─ ScanQA
|   ├─ videos
|   |    ├─ scene0760_00.mp4
|   |    ├─ ...
|   ├─ annotations
|   |    ├─ ScanQA_v1.0_train_reformat.json
|   |    ├─ ...
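A quick way to sanity-check the layout above is a small script that verifies each subfolder's annotation file exists. A minimal sketch, following the tree shown here (adjust the root path to wherever you extracted the data):

```python
from pathlib import Path

# Expected annotation file per subfolder, following the tree above.
EXPECTED = {
    "EnvDrop": "annotations.json",
    "Human": "annotations.json",
    "R2R": "annotations.json",
    "RxR": "annotations.json",
    "ScanQA": "annotations/ScanQA_v1.0_train_reformat.json",
}

def check_layout(root: str) -> list[str]:
    """Return the list of missing annotation files under `root`."""
    root_path = Path(root)
    return [
        str(root_path / sub / rel)
        for sub, rel in EXPECTED.items()
        if not (root_path / sub / rel).is_file()
    ]
```

An empty return value means every split has its annotations in place; anything listed needs to be re-downloaded or re-extracted.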

Training

The pretrained model to start from is provided in a8cheng/navila-siglip-llama3-8b-v1.5-pretrain. Please modify the data paths in llava/data/datasets_mixture.py and use the script in scripts/train/sft_8frames.sh to launch the training.
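The registration pattern in llava/data/datasets_mixture.py looks roughly like the following. The field names here are illustrative stand-ins, not the file's actual API; mirror the existing entries in that file rather than this sketch.

```python
from dataclasses import dataclass

# Hypothetical stand-in for the dataset registry in
# llava/data/datasets_mixture.py; field names are illustrative.
DATASETS: dict[str, "Dataset"] = {}

@dataclass
class Dataset:
    dataset_name: str
    dataset_type: str = "torch"
    data_path: str = ""   # path to annotations.json
    image_path: str = ""  # path to video frames / videos

def add_dataset(dataset: Dataset) -> None:
    DATASETS[dataset.dataset_name] = dataset

# Point the entries at wherever you extracted NaVILA-Dataset.
add_dataset(Dataset("r2r",
                    data_path="NaVILA-Dataset/R2R/annotations.json",
                    image_path="NaVILA-Dataset/R2R/train"))
```

Once every mixture entry points at a valid local path, the sft_8frames.sh script can pick them up by name.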

📊 Evaluation

Installation

This repository builds on VLN-CE, which relies on older versions of Habitat-Lab and Habitat-Sim. The installation process requires several modifications and can be complex.

  1. Create a Conda Environment with Python 3.10
conda create -n navila-eval python=3.10
conda activate navila-eval
  2. Build Habitat-Sim & Lab (v0.1.7) from Source

Follow the VLN-CE setup guide. To resolve NumPy compatibility issues, apply the following hotfix:

python evaluation/scripts/habitat_sim_autofix.py # replace habitat_sim/utils/common.py
  3. Install VLN-CE Dependencies
pip install -r evaluation/requirements.txt
  4. Install VILA Dependencies
# Install FlashAttention2
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.5.8/flash_attn-2.5.8+cu122torch2.3cxx11abiFALSE-cp310-cp310-linux_x86_64.whl

# Install VILA (assuming you are in the repository root)
pip install -e .
pip install -e ".[train]"
pip install -e ".[eval]"

# Install HF's Transformers
pip install git+https://github.com/huggingface/transformers@v4.37.2
site_pkg_path=$(python -c 'import site; print(site.getsitepackages()[0])')
cp -rv ./llava/train/transformers_replace/* $site_pkg_path/transformers/
cp -rv ./llava/train/deepspeed_replace/* $site_pkg_path/deepspeed/
  5. Fix WebDataset Version for VLN-CE Compatibility
pip install webdataset==0.1.103

Data

Please follow VLN-CE to download the R2R and RxR annotations, and the scene data, into the evaluation/data folder. The data should be structured as follows:

data/datasets
├─ RxR_VLNCE_v0
|   ├─ train
|   |    ├─ train_guide.json.gz
|   |    ├─ ...
|   ├─ val_unseen
|   |    ├─ val_unseen_guide.json.gz
|   |    ├─ ...
|   ├─ ...
├─ R2R_VLNCE_v1-3_preprocessed
|   ├─ train
|   |    ├─ train.json.gz
|   |    ├─ ...
|   ├─ val_unseen
|   |    ├─ val_unseen.json.gz
|   |    ├─ ...
data/scene_datasets
├─ mp3d
|   ├─ 17DRP5sb8fy
|   |    ├─ 17DRP5sb8fy.glb
|   |    ├─ ...
|   ├─ ...

Running Evaluation

  1. Download the checkpoint from a8cheng/navila-llama3-8b-8f.
  2. Run evaluation on R2R using:
cd evaluation
bash scripts/eval/r2r.sh CKPT_PATH NUM_CHUNKS CHUNK_START_IDX "GPU_IDS"

Examples:

  • Single GPU:
    bash scripts/eval/r2r.sh CKPT_PATH 1 0 "0"
    
  • Multiple GPUs (e.g., 8 GPUs):
    bash scripts/eval/r2r.sh CKPT_PATH 8 0 "0,1,2,3,4,5,6,7"
    
  3. Visualized videos are saved in:
./eval_out/CKPT_NAME/VLN-CE-v1/val_unseen/videos
<p align="center"> <img src="assets/sample.gif" width="600"> </p>
  4. Aggregate results and view the scores:
python scripts/eval_jsons.py ./eval_out/CKPT_NAME/VLN-CE-v1/val_unseen NUM_CHUNKS
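The aggregation step merges the per-chunk result files. The actual logic lives in scripts/eval_jsons.py; conceptually it averages each metric over all evaluated episodes, along these lines (the file naming and metric keys below are illustrative, not the script's real format):

```python
import json
from pathlib import Path

def aggregate(result_dir: str) -> dict[str, float]:
    """Average per-episode metrics across all chunk JSON files in `result_dir`.

    Assumes each chunk file maps episode_id -> {metric: value}; the real
    format consumed by scripts/eval_jsons.py may differ.
    """
    episodes: dict[str, dict[str, float]] = {}
    for chunk_file in sorted(Path(result_dir).glob("*.json")):
        episodes.update(json.loads(chunk_file.read_text()))
    if not episodes:
        return {}
    totals: dict[str, float] = {}
    for metrics in episodes.values():
        for name, value in metrics.items():
            totals[name] = totals.get(name, 0.0) + value
    return {name: total / len(episodes) for name, total in totals.items()}
```

Because every episode appears in exactly one chunk, merging the chunk dictionaries before averaging gives the same result as a single-GPU run.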

📜 Citation

@inproceedings{cheng2025navila,
        title={Navila: Legged robot vision-language-action model for navigation},
        author={Cheng, An-Chieh and Ji, Yandong and Yang, Zhaojing and Gongye, Zaitian and Zou, Xueyan and Kautz, Jan and B{\i}y{\i}k, Erdem and Yin, Hongxu and Liu, Sifei and Wang, Xiaolong},
        booktitle={RSS},
        year={2025}
}
