# 🦅 Eagle: Frontier Vision-Language Models with Data-Centric Strategies

<p> <img src="Eagle/assets/Eagle.png" alt="Eagle" width="500" height="auto"> </p>

[📘 Eagle 2.5 Report] [📘 Eagle 2 Report] [📘 Eagle Report] [🤗 HF Models] [🤗 HF Demo] [🌐 Project Page]

## Updates
- [2025/10] 🔥 Release Eagle 2.5 source code.
- [2025/09] 🔥 Eagle 2.5 is accepted to NeurIPS 2025.
- [2025/09] 🎉 Eagle 2 is supported in Torch-TRT.
- [2025/07] 🎉 Release Eagle 2.5 model.
- [2025/06] 🔥 Eagle 2.5 is adopted as the VLM backbone of GR00T-N1.5. Check out the tech blog for more details.
- [2025/04] 🎉 Release Eagle 2.5 tech report.
- [2025/03] 🔥 Eagle 2 is adopted as the VLM backbone (System-2) of GR00T-N1. Check out the GTC launch and white paper for more details.
- [2025/01] 🎉 Release Eagle 2 tech report and models.
- [2025/01] 🎉 Eagle is accepted as ICLR 2025 Spotlight.
- [2024/08] 🎉 Release Eagle.
## Resources

- 🌟 Start Here: Set Up Environment, Train the Model, and Run Evaluations

## 🌐 Playground

- 🤗 [Demo on Hugging Face Space](https://huggingface.co/spaces/nvidia/Eagle-2.5-8B-demo)

## Introduction
Eagle 2.5 is a family of frontier vision-language models (VLMs) designed for long-context multimodal learning. While most existing VLMs focus on short-context tasks, Eagle 2.5 addresses the challenges of long video comprehension and high-resolution image understanding, providing a generalist framework for both. Eagle 2.5 supports up to 512 video frames and is trained jointly on image + video data.
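Long videos are typically fed to a frame-budgeted model like this by sampling a fixed number of frames uniformly across the clip. A minimal sketch of that idea, assuming the 512-frame budget stated above (the helper itself is illustrative, not Eagle's actual sampler):

```python
def sample_frame_indices(num_frames_total: int, budget: int = 512) -> list[int]:
    """Pick up to `budget` frame indices spread uniformly over the video.

    If the video has fewer frames than the budget, every frame is kept.
    """
    if num_frames_total <= budget:
        return list(range(num_frames_total))
    # Place sample points at the centers of `budget` equal-length segments.
    step = num_frames_total / budget
    return [int(step * (i + 0.5)) for i in range(budget)]

# A 2-hour video at 30 fps has 216_000 frames; keep a uniform 512 of them.
indices = sample_frame_indices(216_000, budget=512)
```

Center-of-segment sampling avoids clustering at the clip boundaries and keeps the stride constant regardless of video length.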
We also introduce Eagle-Video-110K, a novel dataset with both story-level and clip-level annotations, specifically curated for long video understanding. The dataset contains over 110K annotated samples, including QA, localization, and summarization. The videos range from a few minutes up to 3 hours, pushing the limits of long-form visual reasoning.
🚀 Strong Results Across The Board:
- SOTA on 6 out of 10 long video benchmarks
- Outperforms GPT-4o (0806) on 3/5 video tasks
- Outperforms Gemini 1.5 Pro on 4/6 video tasks
- Matches or outperforms Qwen2.5-VL-72B on multiple key datasets
- 72.4% on Video-MME with 512 input frames
- Strong image understanding with consistent improvement over Eagle 2, matching Qwen2.5-VL.
## 🎯 Key Innovations
- Information-First Sampling:
- Image Area Preservation (IAP): Optimizes image tiling to retain most of the original image area and aspect ratio, preserving fine-grained details.
- Automatic Degrade Sampling (ADS): Dynamically balances visual and textual input, ensuring complete text retention while maximizing visual content within context length constraints.
- Progressive Mixed Post-Training:
- Gradually increases context length during training, enhancing the model's ability to process varying input sizes and improving information density over static sampling.
- Diversity-Driven Data Recipe:
- Combines open-source data (human-annotated and synthetic) with the self-curated Eagle-Video-110K dataset, collected via a diversity-driven strategy and annotated with both story-level and clip-level QA pairs.
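The grid selection behind tiling schemes like IAP can be illustrated with a small sketch: among candidate tile grids within a budget, prefer the one whose aspect ratio best matches the image while covering more of the original area. This is an illustrative heuristic under assumed parameters (512-px tiles, 12-tile budget), not the exact Eagle implementation:

```python
def best_tile_grid(width: int, height: int, tile: int = 512, max_tiles: int = 12):
    """Choose a (cols, rows) tiling that best preserves aspect ratio and area."""
    image_ratio = width / height
    best, best_score = (1, 1), float("inf")
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):
            grid_ratio = cols / rows
            # Penalize aspect-ratio distortion; break ties toward more tiles (area).
            score = abs(grid_ratio - image_ratio) - 1e-3 * cols * rows
            if score < best_score:
                best, best_score = (cols, rows), score
    return best
```

For a 2048x1024 image this picks a 4x2 grid: the 2:1 aspect ratio is matched exactly, and the larger grid retains more of the original resolution than 2x1 would.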
## ⚡ Efficiency & Framework Optimization
- GPU Memory Optimization:
- Integrates Triton-based fused operators that replace the PyTorch MLP, RMSNorm, and RoPE implementations.
- Reduces GPU memory via fused linear layers with cross-entropy loss (avoiding storage of the full intermediate logit tensor) and CPU offloading of hidden states.
- Together, these optimizations fit up to a 32K context length with an 8B model on a single GPU.
- Distributed Context Parallelism:
- Adopts a two-layer communication group based on Ulysses and Ring/Context Parallelism building on USP.
- Implements ZigZag Llama3-style Context Parallelism with all-gather KV to reduce communication latency.
- Video Decoding Acceleration:
- Optimizes sparse video frame sampling with fast video metadata parsing, improving long-video decoding and reducing memory consumption.
- Inference Acceleration:
- Supports vLLM deployment with reduced memory and accelerated inference.
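The fused linear-plus-cross-entropy idea above can be sketched numerically: instead of materializing the full `[tokens x vocab]` logit matrix, project and reduce the hidden states chunk by chunk, so only one chunk of logits exists at a time. A NumPy sketch of the memory pattern (not the Triton kernels the repo actually uses):

```python
import numpy as np

def chunked_cross_entropy(hidden, weight, targets, chunk=1024):
    """Mean cross-entropy over a vocab projection without storing all logits.

    hidden:  [n_tokens, d_model] final hidden states
    weight:  [d_model, vocab]    LM-head projection matrix
    targets: [n_tokens]          gold token ids
    """
    total, n = 0.0, hidden.shape[0]
    for start in range(0, n, chunk):
        h = hidden[start:start + chunk]
        logits = h @ weight                          # only [chunk, vocab] in memory
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        logsumexp = np.log(np.exp(logits).sum(axis=1))
        gold = logits[np.arange(len(h)), targets[start:start + chunk]]
        total += float((logsumexp - gold).sum())
    return total / n
```

Peak memory drops from `n_tokens * vocab` to `chunk * vocab` floats, which is the saving the bullet above refers to; the fused-kernel version additionally folds the projection and reduction into one pass.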
## Model Details
- Model Type: Long-context vision-language model
- Architecture:
- Vision encoder: Siglip2-So400m-Patch16-512
- Language model: Qwen2.5-7B-Instruct
- Multimodal base architecture: LLaVA with tiling-based vision input
- Supported Inputs:
- Long video sequences (up to 512 frames)
- High-resolution images (up to 4K HD input size)
- Multi-page documents
- Long text
- Training Strategy:
- Progressive mixed post-training, expanding from 32K to 128K context length
- Information-first sampling for optimal visual and textual information retention
- Training Data:
- Open-source video and document datasets
- Eagle-Video-110K (110K long videos with dual-level annotation)
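The information-first sampling listed above can be made concrete with a small budget calculation: text is kept in full, and whatever context remains is spent on frames. The numbers here (256 tokens per frame) are illustrative assumptions; the real ADS logic described in the report is more involved:

```python
def frames_under_budget(text_tokens: int, context_len: int,
                        tokens_per_frame: int = 256, max_frames: int = 512) -> int:
    """Automatic-degrade-style budgeting: text is kept whole, frames fill the rest."""
    remaining = context_len - text_tokens
    if remaining <= 0:
        return 0  # text alone saturates the context window
    return min(max_frames, remaining // tokens_per_frame)

# A 1K-token prompt in a 32K window fits 124 frames at 256 tokens/frame.
```

Growing the window during progressive post-training (32K to 128K) thus directly raises the number of frames that survive degradation for the same prompt.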
## Model Zoo

### 📦 Eagle 2.5 Models

| Model Name  | Date       | LLM Backbone        | Vision Encoder | Max Length | Download   |
| ----------- | ---------- | ------------------- | -------------- | ---------- | ---------- |
| Eagle2.5-8B | 2025.04.16 | Qwen2.5-7B-Instruct | SigLIP2        | 128K       | 🤗 HF Link |
### 📦 Eagle 2 Models

| Model Name | Date       | LLM Backbone          | Vision Encoder    | Max Length | Download   |
| ---------- | ---------- | --------------------- | ----------------- | ---------- | ---------- |
| Eagle2-1B  | 2025.01.11 | Qwen2.5-0.5B-Instruct | SigLIP            | 16K        | 🤗 HF Link |
| Eagle2-2B  | 2025.01.11 | Qwen2.5-1.5B-Instruct | SigLIP            | 16K        | 🤗 HF Link |
| Eagle2-9B  | 2025.01.11 | Qwen2.5-7B-Instruct   | SigLIP + ConvNext | 16K        | 🤗 HF Link |
| Eagle2-34B | 2025.01.11 | Qwen2.5-32B-Instruct  | SigLIP + ConvNext | 16K        | 🤗 HF Link |
## Benchmark Results

### 🎥 Video Benchmarks
| Benchmark                         | GPT-4o | Gemini-1.5 Pro | InternVL2.5-8B | Qwen2.5-VL-8B | Eagle2.5-8B |
| --------------------------------- | ------ | -------------- | -------------- | ------------- | ----------- |
| MVBench<sub>test</sub>            | -      | -              | 72.0           | 69.6          | 74.8        |
| Perception_test<sub>val</sub>     | -      | -              | -              | 70.5          | 82.0        |
| EgoSchema<sub>fullset</sub>       | -      | 72.2           | -              | 65.0          | 72.2        |
| MMB-Video                         | 1.63   | 1.30           | 1.68           | 1.79          | 1.94        |
| MLVU<sub>val</sub>                | -      | -              | 68.9           | 70.2          | 77.6        |
| LVBench<sub>val</sub>             | 66.7   | 64.0           | 60.0           | 56.0          | 66.4        |
| Video-MME<sub>w/o subtitle</sub>  | 71.9   | 75.0           | 64.2           | 65.1          | 72.4        |
| Video-MME<sub>w subtitle</sub>    | 77.2   | 81.3           | 66.9           | 71.6          | 75.7        |
| CG-Bench<sub>Clue</sub>           | 58.6   | 50.9           | -              | 44.5          | 55.8        |
| CG-Bench<sub>Long</sub>           | 44.9   | 37.8           | -              | 35.5          | 46.6        |
| CG-Bench<sub>mIoU</sub>           | 5.73   | 3.85           | -              | 2.48          | 13.4        |
| HourVideo<sub>Dev</sub>           | -      | 37.2           | -              | -             | 44.5        |
| HourVideo<sub>Test</sub>          | -      | 37.4           | -              |               |             |
