
<div align="center">

🦅 Eagle: Frontier Vision-Language Models with Data-Centric Strategies

<p> <img src="Eagle/assets/Eagle.png" alt="Eagle" width="500" height="auto"> </p>

Code License Model License

[📘Eagle 2.5 Report] [📘Eagle 2 Report] [📘Eagle Report] [🤗HF Models] [🤗HF Demo] [🌐Project Page]

</div>

Updates

Resources

🌟 Start Here: Set Up Environment, Train the Model, and Run Evaluations

🌐 Playground

  • 🤗 Demo on Hugging Face Space: https://huggingface.co/spaces/nvidia/Eagle-2.5-8B-demo

Introduction

Eagle 2.5 is a family of frontier vision-language models (VLMs) designed for long-context multimodal learning. While most existing VLMs focus on short-context tasks, Eagle 2.5 addresses the challenges of long video comprehension and high-resolution image understanding, providing a generalist framework for both. Eagle 2.5 supports up to 512 video frames and is trained jointly on image + video data.
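Since Eagle 2.5 caps video input at 512 frames, longer videos must be subsampled before encoding. A minimal sketch of uniform temporal sampling (an illustrative helper, not the repository's actual sampler):

```python
def sample_frame_indices(total_frames: int, max_frames: int = 512) -> list[int]:
    """Uniformly sample up to max_frames frame indices from a video."""
    if total_frames <= max_frames:
        return list(range(total_frames))
    # Spread the sampled frames evenly across the full duration.
    step = total_frames / max_frames
    return [min(int(i * step), total_frames - 1) for i in range(max_frames)]
```

Short videos pass through unchanged; only videos exceeding the frame budget are thinned out.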

We also introduce Eagle-Video-110K, a novel dataset with both story-level and clip-level annotations, specifically curated for long video understanding. The dataset contains over 110K annotated samples, including QA, localization, and summarization. The videos range from a few minutes up to 3 hours, pushing the limits of long-form visual reasoning.

🚀 Strong Results Across The Board:

  • SOTA on 6 out of 10 long video benchmarks
  • Outperforms GPT-4o (0806) on 3/5 video tasks
  • Outperforms Gemini 1.5 Pro on 4/6 video tasks
  • Matches or outperforms Qwen2.5-VL-72B on multiple key datasets
  • 72.4% on Video-MME with 512 input frames
  • Strong image understanding with consistent improvement over Eagle 2, matching Qwen2.5-VL.

🎯 Key Innovations

  • Information-First Sampling:
    • Image Area Preservation (IAP): Optimizes image tiling to retain most of the original image area and aspect ratio, preserving fine-grained details.
    • Automatic Degrade Sampling (ADS): Dynamically balances visual and textual input, ensuring complete text retention while maximizing visual content within context length constraints.
  • Progressive Mixed Post-Training:
    • Gradually increases context length during training, enhancing the model's ability to process varying input sizes and improving information density over static sampling.
  • Diversity-Driven Data Recipe:
    • Combines open-source data (human-annotated and synthetic) with the self-curated Eagle-Video-110K dataset, collected via a diversity-driven strategy and annotated with both story-level and clip-level QA pairs.
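The two sampling ideas above can be illustrated with toy helpers (hypothetical names and simplified logic, not the repository's implementation): IAP picks a tiling grid whose aspect ratio best matches the image, preferring larger grids when tied so more area is preserved; ADS keeps all text tokens and fills the remaining context with as many frames as fit.

```python
def select_tile_grid(width: int, height: int, max_tiles: int = 12) -> tuple[int, int]:
    """IAP sketch: pick a (cols, rows) grid whose aspect ratio best matches
    the image, preferring more tiles (more preserved area) on ties."""
    target = width / height
    best, best_err = (1, 1), abs(target - 1.0)
    for rows in range(1, max_tiles + 1):
        for cols in range(1, max_tiles // rows + 1):
            err = abs(target - cols / rows)
            if err < best_err or (err == best_err and cols * rows > best[0] * best[1]):
                best, best_err = (cols, rows), err
    return best

def max_frames_under_budget(text_tokens: int, context_len: int,
                            tokens_per_frame: int = 256, max_frames: int = 512) -> int:
    """ADS sketch: retain all text tokens, then maximize visual frames
    within the remaining context budget."""
    remaining = context_len - text_tokens
    if remaining <= 0:
        return 0
    return min(max_frames, remaining // tokens_per_frame)
```

The tile size, token-per-frame count, and tile cap here are placeholders; the actual values follow the model configuration.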

⚡ Efficiency & Framework Optimization

  • GPU Memory Optimization:
    • Integrates Triton-based fused operators that replace PyTorch's MLP, RMSNorm, and RoPE implementations.
    • Reduces GPU memory with fused linear layers + cross-entropy loss (removing intermediate logit storage) and CPU offloading of hidden states.
    • Together these optimizations fit up to a 32K context length with an 8B model on a single GPU.
  • Distributed Context Parallelism:
    • Adopts a two-layer communication group based on Ulysses and Ring/Context Parallelism building on USP.
    • Implements ZigZag Llama3-style Context Parallelism with all-gather KV to reduce communication latency.
  • Video Decoding Acceleration:
    • Optimized sparse video frame sampling with rapid video metadata parsing, improving long-video decoding speed and reducing memory consumption.
  • Inference Acceleration:
    • Supports vLLM deployment with reduced memory and accelerated inference.
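The fused linear + cross-entropy trick avoids materializing the full [tokens × vocab] logit matrix: logits are computed and consumed chunk by chunk. A NumPy sketch of the memory idea only (the repository's actual operators are fused Triton kernels):

```python
import numpy as np

def chunked_ce_loss(hidden, weight, labels, chunk=4):
    """Cross-entropy over hidden @ weight.T, computed chunk-by-chunk so that
    at most `chunk` rows of logits are ever materialized at once."""
    total, n = 0.0, hidden.shape[0]
    for start in range(0, n, chunk):
        h = hidden[start:start + chunk]
        logits = h @ weight.T                        # only (chunk, vocab) in memory
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        logsumexp = np.log(np.exp(logits).sum(axis=1))
        rows = np.arange(h.shape[0])
        total += (logsumexp - logits[rows, labels[start:start + chunk]]).sum()
    return total / n
```

The result is identical to computing all logits at once; only the peak activation memory changes, which is what makes long-context training fit on a single GPU.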

Model Details

  • Model Type: Long-context vision-language model
  • Architecture:
    • Vision encoder: Siglip2-So400m-Patch16-512
    • Language model: Qwen2.5-7B-Instruct
    • Multimodal base architecture: LLaVA with tiling-based vision input
  • Supported Inputs:
    • Long video sequences (up to 512 frames)
    • High-resolution images (up to 4K HD input size)
    • Multi-page documents
    • Long text
  • Training Strategy:
    • Progressive mixed post-training, expanding from 32K to 128K context length
    • Information-first sampling for optimal visual and textual information retention
  • Training Data:
    • Open-source video and document datasets
    • Eagle-Video-110K (110K long videos with dual-level annotation)
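The progressive post-training schedule expands the context window stage by stage from 32K to 128K. An illustrative helper showing a simple doubling schedule (the actual stage lengths and data mix are defined by the training recipe):

```python
def context_schedule(start: int = 32 * 1024, end: int = 128 * 1024) -> list[int]:
    """Double the context length each post-training stage, from start to end."""
    stages = []
    length = start
    while length <= end:
        stages.append(length)
        length *= 2
    return stages
```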

Model Zoo

📦 Eagle 2.5 Models

| Model Name | Date | LLM Backbone | Vision Encoder | Max Length | Download |
| ----------- | ---------- | ------------------- | -------------- | ---------- | ---------- |
| Eagle2.5-8B | 2025.04.16 | Qwen2.5-7B-Instruct | SigLIP2 | 128K | 🤗 HF Link |

📦 Eagle 2 Models

| Model Name | Date | LLM Backbone | Vision Encoder | Max Length | Download |
| ---------- | ---------- | --------------------- | ----------------- | ---------- | ---------- |
| Eagle2-1B | 2025.01.11 | Qwen2.5-0.5B-Instruct | SigLIP | 16K | 🤗 HF Link |
| Eagle2-2B | 2025.01.11 | Qwen2.5-1.5B-Instruct | SigLIP | 16K | 🤗 HF Link |
| Eagle2-9B | 2025.01.11 | Qwen2.5-7B-Instruct | SigLIP + ConvNext | 16K | 🤗 HF Link |
| Eagle2-34B | 2025.01.11 | Qwen2.5-32B-Instruct | SigLIP + ConvNext | 16K | 🤗 HF Link |

Benchmark Results

🎥 Video Benchmarks

| Benchmark | GPT-4o | Gemini-1.5 Pro | InternVL2.5-8B | Qwen2.5-VL-8B | Eagle2.5-8B |
| --------- | ------ | -------------- | -------------- | ------------- | ----------- |
| MVBench<sub>test</sub> | - | - | 72.0 | 69.6 | 74.8 |
| Perception_test<sub>val</sub> | - | - | - | 70.5 | 82.0 |
| EgoSchema<sub>fullset</sub> | - | 72.2 | - | 65.0 | 72.2 |
| MMB-Video | 1.63 | 1.30 | 1.68 | 1.79 | 1.94 |
| MLVU<sub>val</sub> | - | - | 68.9 | 70.2 | 77.6 |
| LVBench<sub>val</sub> | 66.7 | 64.0 | 60.0 | 56.0 | 66.4 |
| Video-MME<sub>w/o subtitle</sub> | 71.9 | 75.0 | 64.2 | 65.1 | 72.4 |
| Video-MME<sub>w subtitle</sub> | 77.2 | 81.3 | 66.9 | 71.6 | 75.7 |
| CG-Bench<sub>Clue</sub> | 58.6 | 50.9 | - | 44.5 | 55.8 |
| CG-Bench<sub>Long</sub> | 44.9 | 37.8 | - | 35.5 | 46.6 |
| CG-Bench<sub>mIoU</sub> | 5.73 | 3.85 | - | 2.48 | 13.4 |
| HourVideo<sub>Dev</sub> | - | 37.2 | - | - | 44.5 |
| HourVideo<sub>Test</sub> | - | 37.4 | - | | |
