Official code for "LagerNVS: Latent Geometry for Fully Neural Real-Time Novel View Synthesis" (CVPR 2026)


LagerNVS: Latent Geometry for Fully Neural Real-Time Novel View Synthesis

<p align="center"> <a href="https://arxiv.org/abs/2603.20176"><img src="https://img.shields.io/badge/arXiv-Paper-b31b1b" alt="arXiv"></a> <a href="https://szymanowiczs.github.io/lagernvs"><img src="https://img.shields.io/badge/🌐-Project_Page-orange" alt="Project Page"></a> <a href="https://github.com/facebookresearch/lagernvs"><img src="https://img.shields.io/badge/GitHub-Repo-blue" alt="GitHub"></a> <a href="https://huggingface.co/collections/facebook/lagernvs"><img src="https://img.shields.io/badge/HuggingFace-Model-green?logo=huggingface" alt="Models"></a> </p> <p align="center"> Stanislaw Szymanowicz<sup>1,2</sup>, Minghao Chen<sup>1,2</sup>, Jianyuan Wang<sup>1,2</sup>, Christian Rupprecht<sup>1</sup>, Andrea Vedaldi<sup>1,2</sup> </p> <p align="center"> <sup>1</sup>Visual Geometry Group (VGG), University of Oxford &nbsp;&nbsp; <sup>2</sup>Meta AI </p>

LagerNVS is a feed-forward model for novel view synthesis (NVS). Given one or more input images of a scene, it synthesizes new views from target cameras. It generalizes to in-the-wild data, renders new views in real time, and can operate with or without known source camera poses. The model uses 3D inductive biases without an explicit 3D representation. The architecture features a large 3D-aware encoder (from VGGT pre-training) that extracts scene tokens, and a transformer-based renderer that conditions on these tokens via cross-attention to render novel views.
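At a high level, the renderer's target-view tokens attend over the encoder's scene tokens. The sketch below illustrates that cross-attention step in plain NumPy; it is purely didactic, and all shapes, names, and the single-head formulation are assumptions, not the actual LagerNVS implementation.

```python
import numpy as np

def cross_attention(queries, scene_tokens, d_k=None):
    """Illustrative single-head cross-attention: target-view query
    tokens attend over scene tokens produced by the encoder."""
    d_k = d_k or queries.shape[-1]
    # Scaled dot-product scores: (num_queries, num_scene_tokens)
    scores = queries @ scene_tokens.T / np.sqrt(d_k)
    # Softmax over the scene tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each rendered token is a weighted mix of scene tokens
    return weights @ scene_tokens

rng = np.random.default_rng(0)
scene_tokens = rng.standard_normal((196, 64))  # tokens from the 3D-aware encoder
query_tokens = rng.standard_normal((196, 64))  # hypothetical target-view queries
out = cross_attention(query_tokens, scene_tokens)
print(out.shape)  # (196, 64)
```

In the real model this happens with many heads across 12 renderer layers; the point here is only that novel views are produced by attending over latent scene tokens rather than by rasterizing an explicit 3D representation.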

Announcements

[27 Mar 2026] We are aware that the model produces unsatisfactory results when target camera intrinsics vary substantially from source camera intrinsics. We are working on a fix.

[27 Mar 2026] We are experiencing technical issues with granting access to the model. It might take a couple of days after filing a request before access is granted; we are actively working to reduce that time.

Installation

# Clone the repository
git clone https://github.com/facebookresearch/lagernvs.git
cd lagernvs

# Create conda environment
conda create -n lagernvs python=3.10
conda activate lagernvs

# Install PyTorch (CUDA 12.6)
pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu126

# Install remaining dependencies
pip install -r requirements.txt
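After installation, a quick sanity check that the interpreter and key packages match the setup above (a hypothetical convenience helper, not part of the repo):

```python
import importlib.util
import sys

def check_env(min_python=(3, 10), packages=("torch", "torchvision")):
    """Return (python_ok, missing_packages) without importing anything heavy.
    find_spec() only probes whether a package is installed."""
    python_ok = sys.version_info[:2] >= min_python
    missing = [p for p in packages if importlib.util.find_spec(p) is None]
    return python_ok, missing

python_ok, missing = check_env()
print(python_ok, missing)
```

`check_env()` returning `(True, [])` means the Python version and the packages from the steps above are in place.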

Model Access

The model checkpoints are hosted on HuggingFace as gated repositories. You must authenticate before downloading:

  1. Create a HuggingFace account at https://huggingface.co if you don't have one.
  2. Request access by visiting the model page (e.g., facebook/lagernvs_general_512) and clicking "Agree and access repository".
  3. Create an access token with at least Read scope at https://huggingface.co/settings/tokens.
  4. Set the token as an environment variable:
export HF_TOKEN=hf_your_token_here

To persist this across sessions, add the export to your ~/.bashrc or write the token to the HuggingFace cache:

mkdir -p ~/.cache/huggingface
echo "hf_your_token_here" > ~/.cache/huggingface/token
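If you are unsure which token will be picked up, the sketch below approximates the resolution order (environment variable first, then the cache file). This is a simplified illustration, not huggingface_hub's actual logic:

```python
import os
from pathlib import Path

def find_hf_token():
    """Approximate how a token is resolved: the HF_TOKEN environment
    variable takes precedence, then the cached token file written above.
    Simplified sketch; huggingface_hub's real resolution has more steps."""
    token = os.environ.get("HF_TOKEN")
    if token:
        return token.strip()
    token_file = Path.home() / ".cache" / "huggingface" / "token"
    if token_file.is_file():
        return token_file.read_text().strip()
    return None

print(find_hf_token() is not None)
```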

You can verify access with:

python -c "from huggingface_hub import list_repo_files; print('\n'.join(list_repo_files('facebook/lagernvs_general_512')))"

Minimal Inference

Run inference with the general model on your own images:

python minimal_inference.py --images path/to/img1.png path/to/img2.png

To use a different checkpoint (see Available Checkpoints below):

# Re10k model (256px, 2-view, posed)
python minimal_inference.py \
    --images path/to/img1.png path/to/img2.png \
    --model_repo facebook/lagernvs_re10k_2v_256 \
    --attention_type full_attention \
    --target_size 256 \
    --mode square_crop

# DL3DV model (256px, 2-6 views, posed)
python minimal_inference.py \
    --images path/to/img1.png path/to/img2.png path/to/img3.png \
    --model_repo facebook/lagernvs_dl3dv_2-6_v_256 \
    --attention_type bidirectional_cross_attention \
    --target_size 256 \
    --mode square_crop

Run python minimal_inference.py --help for all options (--video_length, --output, etc.).

See minimal_inference.py for the fully commented source of truth. For interactive step-by-step exploration with visualization of intermediate results (loaded images, camera trajectories, sampled output frames), see the inference.ipynb notebook.

Available Checkpoints

All models use the EncDecVitB/8 architecture (VGGT encoder + 12-layer renderer, patch size 8). Three checkpoints are available on HuggingFace. We recommend the general model for most use cases, as it is trained on a large dataset of scenes and can handle a wide range of input conditions. The Re10k and DL3DV models are shared primarily for benchmarking and reproducibility.

| Checkpoint | HuggingFace Repo | Training Data | Resolution | Train Cond. Views | Camera Poses |
|-----------|-----------------|---------------|------------|-------------------|--------------|
| General | facebook/lagernvs_general_512 | 13 datasets | 512 (longer side) | 1-10 | Posed and unposed |
| Re10k | facebook/lagernvs_re10k_2v_256 | Re10k only | 256x256 | 2 | Posed only |
| DL3DV | facebook/lagernvs_dl3dv_2-6_v_256 | DL3DV only | 256x256 | 2-6 | Posed only |

Checkpoints are auto-downloaded from HuggingFace when using hf:// paths in config files.
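The exact hf:// scheme is defined by the repo's config loader; purely as an illustration (the split point and the example filename below are assumptions), such a path can be mapped to the repo id and file path that huggingface_hub expects:

```python
def parse_hf_path(path):
    """Split an hf:// path into (repo_id, filename).
    Assumes the first two components form the repo id (org/name)
    and the remainder is the file path within the repo -- this
    split is an assumption, not the repo's documented scheme."""
    prefix = "hf://"
    assert path.startswith(prefix), "not an hf:// path"
    parts = path[len(prefix):].split("/")
    repo_id = "/".join(parts[:2])
    filename = "/".join(parts[2:])
    return repo_id, filename

# 'model.safetensors' is a placeholder filename for illustration
print(parse_hf_path("hf://facebook/lagernvs_general_512/model.safetensors"))
# ('facebook/lagernvs_general_512', 'model.safetensors')
```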

Evaluation

Re10k-only model: posed, 2-view, 256x256

  1. Download the Re10k data prepared by pixelSplat (CVPR 2024), hosted here, and unzip it.

  2. Set up your data root directory, then download the preprocessing script process_data.py from LVSM and run it:

export LAGERNVS_DATA_ROOT=/path/to/your/data

# Preprocess test split
python process_data.py \
    --base_path /path/to/downloaded_and_unzipped/re10k_from_pixelsplat \
    --output_dir $LAGERNVS_DATA_ROOT/re10k \
    --mode test

The expected dataset organization after preprocessing is:

$LAGERNVS_DATA_ROOT/re10k/
└── test/
    ├── images/
    │   ├── <sequence_id>/
    │   │   ├── 00000.png
    │   │   ├── 00001.png
    │   │   └── ...
    ├── metadata/
    │   ├── <sequence_id>.json
    │   └── ...
    └── full_list.txt
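Before running evaluation, you can sanity-check that preprocessing produced this layout with a small script (a hypothetical helper, not part of the repo):

```python
from pathlib import Path

def check_re10k_layout(root):
    """Verify the preprocessed Re10k test split has the expected layout:
    images/<sequence_id>/*.png, metadata/<sequence_id>.json, full_list.txt.
    Returns a list of problems; an empty list means the layout looks right."""
    test = Path(root) / "test"
    problems = []
    if not (test / "full_list.txt").is_file():
        problems.append("missing full_list.txt")
    for seq_dir in sorted((test / "images").glob("*")):
        if not any(seq_dir.glob("*.png")):
            problems.append(f"no frames in {seq_dir.name}")
        if not (test / "metadata" / f"{seq_dir.name}.json").is_file():
            problems.append(f"no metadata for {seq_dir.name}")
    return problems
```

For example, `check_re10k_layout(f"{os.environ['LAGERNVS_DATA_ROOT']}/re10k")` should return an empty list after a successful preprocessing run.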
  3. Run evaluation:
# Evaluate on Re10k (posed, 2-view, 256x256)
torchrun --nproc_per_node=8 run_eval.py \
    -c config/eval_re10k.yaml \
    -e re10k_eval

The script defaults to 8 GPUs with a global batch size of 512. By default, it saves images and renders videos as part of evaluation; this can be slow and use a lot of memory and storage. Adjust the batch size and number of GPUs to your hardware, and optionally remove visualization saving from run_eval.py.

  4. Verify expected scores: PSNR: 31.39, SSIM: 0.928, LPIPS: 0.078
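As a reminder of how the PSNR figure relates to raw pixel error: for images scaled to [0, 1], PSNR = 10·log10(1/MSE) in dB (the standard definition), computed here with NumPy:

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val]."""
    mse = np.mean((pred - target) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

# The expected 31.39 dB corresponds to an RMSE of about 0.027 on [0, 1] pixels:
rmse = 10 ** (-31.39 / 20)
print(round(rmse, 3))  # 0.027
```

So a small shortfall against the reference number (a few tenths of a dB) indicates only a tiny per-pixel difference, while several dB suggests a setup problem.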

DL3DV-only model: posed, 2-, 4-, 6- view, 256x256

  1. Download the DL3DV benchmark subset required for evaluation:
export LAGERNVS_DATA_ROOT=/path/to/your/data

cd data_prep/dl3dv
python download_eval.py \
    --output_dir $LAGERNVS_DATA_ROOT/dl3dv \
    --view_indices_path ../../assets/dl3dv_6v.json

This script automatically downloads scenes from the correct HuggingFace repositories:

  • Most scenes come from DL3DV/DL3DV-ALL-960P (scenes with XK prefix like "2K/...", "3K/...")
  • Five benchmark scenes are not included in the 'ALL' version and must be downloaded separately from DL3DV/DL3DV-10K-Benchmark

Note: You need to request access to the DL3DV datasets on HuggingFace and authenticate via huggingface-cli login or add your token to environment variables before running the download script.

  2. The data structure should look like:
$LAGERNVS_DATA_ROOT/dl3dv/
├── <batch>/<sequence_id>/
│   ├── images_4/
│   │   ├── frame_00001.png
│   │   └── ...
│   └── transforms.json
├── full_list_train.txt
└── full_list_test.txt
  3. Run evaluation:
# Evaluate on DL3DV (posed, 6-view, 256x256)
torchrun --nproc_per_node=8 run_eval.py \
    -c config/eval_dl3dv.yaml \
    -e dl3dv_eval

As in the Re10k evaluation, the script defaults to 8 GPUs with a global batch size of 512 and saves images and videos by default; adjust the batch size and number of GPUs to your hardware, and optionally remove visualization saving from run_eval.py.

We provide view indices for 2-view, 4-view, and 6-view evaluation in assets/dl3dv_2v.json, assets/dl3dv_4v.json, and assets/dl3dv_6v.json. To evaluate with a different number of views, update the download command and modify config/eval_dl3dv.yaml to point to the appropriate JSON file and set num_cond_views accordingly.

  4. Verify expected scores (6-view): PSNR: 29.45, SSIM: 0.904, LPIPS: 0.068

General Model (512 resolution)

The general model can be evaluated on any dataset that has been preprocessed in the format described above (Re10k or DL3DV format). Here we show unposed evaluation on DL3DV at 512x512 resolution.

  1. Ensure DL3DV data is prepared as described in the DL3DV section above.

  2. Run evaluation:

# Evaluate general model on DL3DV (unposed, 6-view, 512x512)
torchrun --nproc_per_node=8 run_eval.py \
    -c config/eval_dl3dv_general.yaml \
    -e dl3dv_general_eval
  3. Customizing evaluation settings:

The config file config/eval_dl3dv_general.yaml can be modified for different evaluation scenarios:

  • Posed vs unposed: controlled by `zero_out_cam_cond_p` (e.g., `zero_out_cam_cond_p: 0.` for posed evaluation)
