# 🦅 Eagle: Frontier Vision-Language Models with Data-Centric Strategies

<p> <img src="Eagle/assets/Eagle.png" alt="Eagle" width="500" height="auto"> </p>

[📘 Eagle 2.5 Report] [📘 Eagle 2 Report] [📘 Eagle Report] [🤗 HF Models] [🤗 HF Demo] [🌐 Project Page]

## Updates
- [2025/10] 🔥 Release Eagle 2.5 source code.
- [2025/09] 🔥 Eagle 2.5 is accepted to NeurIPS 2025.
- [2025/09] 🎉 Eagle 2 is supported in Torch-TRT.
- [2025/07] 🎉 Release Eagle 2.5 model.
- [2025/06] 🔥 Eagle 2.5 is adopted as the VLM backbone of GR00T-N1.5. Check out the tech blog for more details.
- [2025/04] 🎉 Release Eagle 2.5 tech report.
- [2025/03] 🔥 Eagle 2 is adopted as the VLM backbone (System-2) of GR00T-N1. Check out the GTC launch and white paper for more details.
- [2025/01] 🎉 Release Eagle 2 tech report and models.
- [2025/01] 🎉 Eagle is accepted as ICLR 2025 Spotlight.
- [2024/08] 🎉 Release Eagle.
## Resources

- 🌟 Start Here: Set Up Environment, Train the Model, and Run Evaluations

## 🌐 Playground

- 🤗 [Demo on Hugging Face Space](https://huggingface.co/spaces/nvidia/Eagle-2.5-8B-demo)

## Introduction
Eagle 2.5 is a family of frontier vision-language models (VLMs) designed for long-context multimodal learning. While most existing VLMs focus on short-context tasks, Eagle 2.5 addresses the challenges of long video comprehension and high-resolution image understanding, providing a generalist framework for both. Eagle 2.5 supports up to 512 video frames and is trained jointly on image + video data.
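Long videos are typically fed to a frame-budgeted model like this by sampling a fixed number of frames uniformly across the clip. A minimal sketch of that idea, assuming the 512-frame budget stated above (the helper itself is illustrative, not Eagle's actual sampler):

```python
def sample_frame_indices(num_frames_total: int, budget: int = 512) -> list[int]:
    """Pick up to `budget` frame indices spread uniformly over the video.

    If the video has fewer frames than the budget, every frame is kept.
    """
    if num_frames_total <= budget:
        return list(range(num_frames_total))
    # Place sample points at the centers of `budget` equal-length segments.
    step = num_frames_total / budget
    return [int(step * (i + 0.5)) for i in range(budget)]

# A 2-hour video at 30 fps has 216_000 frames; keep a uniform 512 of them.
indices = sample_frame_indices(216_000, budget=512)
```

Center-of-segment sampling avoids clustering at the clip boundaries and keeps the stride constant regardless of video length.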
We also introduce Eagle-Video-110K, a novel dataset with both story-level and clip-level annotations, specifically curated for long video understanding. The dataset contains over 110K annotated samples, including QA, localization, and summarization. The videos range from a few minutes up to 3 hours, pushing the limits of long-form visual reasoning.
🚀 Strong Results Across The Board:
- SOTA on 6 out of 10 long video benchmarks
- Outperforms GPT-4o (0806) on 3/5 video tasks
- Outperforms Gemini 1.5 Pro on 4/6 video tasks
- Matches or outperforms Qwen2.5-VL-72B on multiple key datasets
- 72.4% on Video-MME with 512 input frames
- Strong image understanding with consistent improvement over Eagle 2, matching Qwen2.5-VL.
## 🎯 Key Innovations
- Information-First Sampling:
- Image Area Preservation (IAP): Optimizes image tiling to retain most of the original image area and aspect ratio, preserving fine-grained details.
- Automatic Degrade Sampling (ADS): Dynamically balances visual and textual input, ensuring complete text retention while maximizing visual content within context length constraints.
- Progressive Mixed Post-Training:
- Gradually increases context length during training, enhancing the model's ability to process varying input sizes and improving information density over static sampling.
- Diversity-Driven Data Recipe:
- Combines open-source data (human-annotated and synthetic) with the self-curated Eagle-Video-110K dataset, collected via a diversity-driven strategy and annotated with both story-level and clip-level QA pairs.
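The grid selection behind tiling schemes like IAP can be illustrated with a small sketch: among candidate tile grids within a budget, prefer the one whose aspect ratio best matches the image while covering more of the original area. This is an illustrative heuristic under assumed parameters (512-px tiles, 12-tile budget), not the exact Eagle implementation:

```python
def best_tile_grid(width: int, height: int, tile: int = 512, max_tiles: int = 12):
    """Choose a (cols, rows) tiling that best preserves aspect ratio and area."""
    image_ratio = width / height
    best, best_score = (1, 1), float("inf")
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):
            grid_ratio = cols / rows
            # Penalize aspect-ratio distortion; break ties toward more tiles (area).
            score = abs(grid_ratio - image_ratio) - 1e-3 * cols * rows
            if score < best_score:
                best, best_score = (cols, rows), score
    return best
```

For a 2048x1024 image this picks a 4x2 grid: the 2:1 aspect ratio is matched exactly, and the larger grid retains more of the original resolution than 2x1 would.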
## ⚡ Efficiency & Framework Optimization
- GPU Memory Optimization:
- Integrates Triton-based fused operators that replace the PyTorch MLP, RMSNorm, and RoPE implementations.
- Reduces GPU memory via fused linear layers with cross-entropy loss (avoiding storage of the full intermediate logit tensor) and CPU offloading of hidden states.
- Together, these optimizations fit up to a 32K context length with an 8B model on a single GPU.
- Distributed Context Parallelism:
- Adopts a two-layer communication group based on Ulysses and Ring/Context Parallelism building on USP.
- Implements ZigZag Llama3-style Context Parallelism with all-gather KV to reduce communication latency.
- Video Decoding Acceleration:
- Optimizes sparse video frame sampling with fast video metadata parsing, improving long-video decoding and reducing memory consumption.
- Inference Acceleration:
- Supports vLLM deployment with reduced memory and accelerated inference.
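The fused linear-plus-cross-entropy idea above can be sketched numerically: instead of materializing the full `[tokens x vocab]` logit matrix, project and reduce the hidden states chunk by chunk, so only one chunk of logits exists at a time. A NumPy sketch of the memory pattern (not the Triton kernels the repo actually uses):

```python
import numpy as np

def chunked_cross_entropy(hidden, weight, targets, chunk=1024):
    """Mean cross-entropy over a vocab projection without storing all logits.

    hidden:  [n_tokens, d_model] final hidden states
    weight:  [d_model, vocab]    LM-head projection matrix
    targets: [n_tokens]          gold token ids
    """
    total, n = 0.0, hidden.shape[0]
    for start in range(0, n, chunk):
        h = hidden[start:start + chunk]
        logits = h @ weight                          # only [chunk, vocab] in memory
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        logsumexp = np.log(np.exp(logits).sum(axis=1))
        gold = logits[np.arange(len(h)), targets[start:start + chunk]]
        total += float((logsumexp - gold).sum())
    return total / n
```

Peak memory drops from `n_tokens * vocab` to `chunk * vocab` floats, which is the saving the bullet above refers to; the fused-kernel version additionally folds the projection and reduction into one pass.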
## Model Details
- Model Type: Long-context vision-language model
- Architecture:
- Vision encoder: Siglip2-So400m-Patch16-512
- Language model: Qwen2.5-7B-Instruct
- Multimodal base architecture: LLaVA with tiling-based vision input
- Supported Inputs:
- Long video sequences (up to 512 frames)
- High-resolution images (up to 4K HD input size)
- Multi-page documents
- Long text
- Training Strategy:
- Progressive mixed post-training, expanding from 32K to 128K context length
- Information-first sampling for optimal visual and textual information retention
- Training Data:
- Open-source video and document datasets
- Eagle-Video-110K (110K long videos with dual-level annotation)
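The information-first sampling listed above can be made concrete with a small budget calculation: text is kept in full, and whatever context remains is spent on frames. The numbers here (256 tokens per frame) are illustrative assumptions; the real ADS logic described in the report is more involved:

```python
def frames_under_budget(text_tokens: int, context_len: int,
                        tokens_per_frame: int = 256, max_frames: int = 512) -> int:
    """Automatic-degrade-style budgeting: text is kept whole, frames fill the rest."""
    remaining = context_len - text_tokens
    if remaining <= 0:
        return 0  # text alone saturates the context window
    return min(max_frames, remaining // tokens_per_frame)

# A 1K-token prompt in a 32K window fits 124 frames at 256 tokens/frame.
```

Growing the window during progressive post-training (32K to 128K) thus directly raises the number of frames that survive degradation for the same prompt.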
## Model Zoo

### 📦 Eagle 2.5 Models

| Model Name  | Date       | LLM Backbone        | Vision Encoder | Max Length | Download   |
| ----------- | ---------- | ------------------- | -------------- | ---------- | ---------- |
| Eagle2.5-8B | 2025.04.16 | Qwen2.5-7B-Instruct | SigLIP2        | 128K       | 🤗 HF Link |
### 📦 Eagle 2 Models

| Model Name | Date       | LLM Backbone          | Vision Encoder    | Max Length | Download   |
| ---------- | ---------- | --------------------- | ----------------- | ---------- | ---------- |
| Eagle2-1B  | 2025.01.11 | Qwen2.5-0.5B-Instruct | SigLIP            | 16K        | 🤗 HF Link |
| Eagle2-2B  | 2025.01.11 | Qwen2.5-1.5B-Instruct | SigLIP            | 16K        | 🤗 HF Link |
| Eagle2-9B  | 2025.01.11 | Qwen2.5-7B-Instruct   | SigLIP + ConvNext | 16K        | 🤗 HF Link |
| Eagle2-34B | 2025.01.11 | Qwen2.5-32B-Instruct  | SigLIP + ConvNext | 16K        | 🤗 HF Link |
## Benchmark Results

### 🎥 Video Benchmarks
| Benchmark                         | GPT-4o | Gemini-1.5 Pro | InternVL2.5-8B | Qwen2.5-VL-8B | Eagle2.5-8B |
| --------------------------------- | ------ | -------------- | -------------- | ------------- | ----------- |
| MVBench<sub>test</sub>            | -      | -              | 72.0           | 69.6          | 74.8        |
| Perception_test<sub>val</sub>     | -      | -              | -              | 70.5          | 82.0        |
| EgoSchema<sub>fullset</sub>       | -      | 72.2           | -              | 65.0          | 72.2        |
| MMB-Video                         | 1.63   | 1.30           | 1.68           | 1.79          | 1.94        |
| MLVU<sub>val</sub>                | -      | -              | 68.9           | 70.2          | 77.6        |
| LVBench<sub>val</sub>             | 66.7   | 64.0           | 60.0           | 56.0          | 66.4        |
| Video-MME<sub>w/o subtitle</sub>  | 71.9   | 75.0           | 64.2           | 65.1          | 72.4        |
| Video-MME<sub>w subtitle</sub>    | 77.2   | 81.3           | 66.9           | 71.6          | 75.7        |
| CG-Bench<sub>Clue</sub>           | 58.6   | 50.9           | -              | 44.5          | 55.8        |
| CG-Bench<sub>Long</sub>           | 44.9   | 37.8           | -              | 35.5          | 46.6        |
| CG-Bench<sub>mIoU</sub>           | 5.73   | 3.85           | -              | 2.48          | 13.4        |
| HourVideo<sub>Dev</sub>           | -      | 37.2           | -              | -             | 44.5        |
| HourVideo<sub>Test</sub>          | -      | 37.4           | -              |               |             |
