📏 RULER: What’s the Real Context Size of Your Long-Context Language Models?

This repository contains the code for our paper RULER: What’s the Real Context Size of Your Long-Context Language Models? RULER generates synthetic examples to evaluate long-context language models with configurable sequence length and task complexity. We benchmark 17 open-source models across 4 task categories (13 tasks in total), evaluating long-context capabilities beyond simple in-context recall. Our main results are shown below.
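To make the task design concrete, here is a toy sketch of a needle-in-a-haystack (NIAH) style example, the simplest of RULER's task categories. This is a minimal illustration only; `make_niah_example` and its filler text are hypothetical and not the repo's actual generator.

```python
import random
import string

def make_niah_example(num_haystack_lines=32, seed=0):
    """Toy NIAH-style example: filler text with one key-value 'needle'
    hidden at a random depth. Hypothetical helper, not RULER's generator."""
    rng = random.Random(seed)
    key = "".join(rng.choices(string.ascii_lowercase, k=8))
    value = "".join(rng.choices(string.digits, k=6))
    filler = "The grass is green. The sky is blue. The sun is yellow."
    lines = [filler] * num_haystack_lines
    # Bury the needle at a random position; deeper needles are harder to recall.
    lines.insert(rng.randrange(num_haystack_lines),
                 f"The special magic number for {key} is: {value}.")
    return {
        "context": "\n".join(lines),
        "question": f"What is the special magic number for {key}?",
        "answer": value,
    }

example = make_niah_example()
```

Scaling `num_haystack_lines` (and swapping in harder variants such as multi-key or multi-hop needles) is how RULER-style benchmarks control sequence length and task complexity independently.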

|Models|Claimed Length|Effective Length|4K|8K|16K|32K|64K|128K|Avg.|wAvg. (inc)|wAvg. (dec)|
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|Llama2 (7B)|4K||85.6||||||||
|Jamba-1.5-large* (94B/398B)|256K|>128K|<ins>96.7</ins>|<ins>96.6</ins>|<ins>96.4</ins>|<ins>96.0</ins>|<ins>95.4</ins>|<ins>95.1</ins>|96.0|95.7 (1st)|96.3 (1st)|
|Gemini-1.5-pro|1M|>128K|<ins>96.7</ins>|<ins>95.8</ins>|<ins>96.0</ins>|<ins>95.9</ins>|<ins>95.9</ins>|<ins>94.4</ins>|95.8|95.5 (2nd)|96.1 (2nd)|
|Qwen2.5-14B-Instruct-1M* (14B)|1M|>128K|<ins>97.5</ins>|<ins>97.1</ins>|<ins>94.6</ins>|<ins>94.9</ins>|<ins>94.9</ins>|<ins>92.2</ins>|95.7|TBD|TBD|
|Qwen3-235B-A22B* (235B)|128K|>128K|<ins>97.7</ins>|<ins>97.2</ins>|<ins>96.4</ins>|<ins>95.1</ins>|<ins>93.3</ins>|<ins>90.6</ins>|95.0|TBD|TBD|
|Qwen3-14B* (14B)|128K|>128K|<ins>98.0</ins>|<ins>97.8</ins>|<ins>96.4</ins>|<ins>96.1</ins>|<ins>94.0</ins>|<ins>85.1</ins>|94.6|TBD|TBD|
|Jamba-1.5-mini (12B/52B)|256K|>128K|<ins>95.6</ins>|<ins>95.6</ins>|<ins>94.8</ins>|<ins>94.6</ins>|<ins>92.8</ins>|<ins>90.0</ins>|93.9|93.1 (3rd)|94.8 (3rd)|
|Qwen3-32B* (32B)|128K|>128K|<ins>98.4</ins>|<ins>96.0</ins>|<ins>96.2</ins>|<ins>94.4</ins>|<ins>91.8</ins>|<ins>85.6</ins>|93.7|TBD|TBD|
|EXAONE-4.0-32B* (32B)|128K|>128K|<ins>96.3</ins>|<ins>94.9</ins>|<ins>93.9</ins>|<ins>93.6</ins>|<ins>91.7</ins>|<ins>88.2</ins>|93.1|TBD|TBD|
|Qwen2.5-7B-Instruct-1M* (7B)|1M|>128K|<ins>96.8</ins>|<ins>95.3</ins>|<ins>93.0</ins>|<ins>91.1</ins>|<ins>90.4</ins>|84.4|91.8|TBD|TBD|
|Qwen3-30B-A3B* (30B)|128K|64K|<ins>96.5</ins>|<ins>97.0</ins>|<ins>95.3</ins>|<ins>92.4</ins>|<ins>89.1</ins>|79.2|91.6|TBD|TBD|
|GPT-4-1106-preview|128K|64K|<ins>96.6</ins>|<ins>96.3</ins>|<ins>95.2</ins>|<ins>93.2</ins>|<ins>87.0</ins>|81.2|91.6|89.0 (4th)|94.1 (4th)|
|Llama3.1 (70B)|128K|64K|<ins>96.5</ins>|<ins>95.8</ins>|<ins>95.4</ins>|<ins>94.8</ins>|<ins>88.4</ins>|66.6|89.6|85.5 (10th)|93.7 (5th)|
|Qwen3-8B* (8B)|128K|64K|<ins>96.3</ins>|<ins>96.0</ins>|<ins>91.8</ins>|<ins>91.2</ins>|82.1|77.4|89.1|TBD|TBD|
|Mistral-Large-2411 (123B)|128K|64K|<ins>96.4</ins>|<ins>96.3</ins>|<ins>95.3</ins>|<ins>94.0</ins>|<ins>85.9</ins>|48.1|86.0|79.5 (18th)|92.5 (6th)|
|Command-R-plus-0824 (104B)|128K|32K|<ins>96.0</ins>|<ins>95.1</ins>|<ins>94.0</ins>|<ins>92.4</ins>|85.4|64.6|87.9|83.4 (13th)|92.4 (7th)|
|Qwen2 (72B)|128K|32K|<ins>96.9</ins>|<ins>96.1</ins>|<ins>94.9</ins>|<ins>94.1</ins>|79.8|53.7|85.9|79.6 (17th)|92.3 (8th)|
|Command-R-plus (104B)|128K|32K|<ins>95.6</ins>|<ins>95.2</ins>|<ins>94.2</ins>|<ins>92.0</ins>|84.3|63.1|87.4|82.7 (14th)|92.1 (9th)|
|Command-R-0824 (32B)|128K|64K|<ins>94.7</ins>|<ins>93.7</ins>|<ins>93.1</ins>|<ins>90.8</ins>|<ins>86.6</ins>|74.7|88.9|86.0 (8th)|91.9 (10th)|
|GLM4 (9B)|1M|64K|<ins>94.7</ins>|<ins>92.8</ins>|<ins>92.1</ins>|<ins>89.9</ins>|<ins>86.7</ins>|83.1|89.9|88.0 (5th)|91.7 (11th)|
|Llama3.1 (8B)|128K|32K|<ins>95.5</ins>|<ins>93.8</ins>|<ins>91.6</ins>|<ins>87.4</ins>|84.7|77.0|88.3|85.4 (11th)|91.3 (12th)|
|ProLong (8B)|512K|32K|<ins>94.5</ins>|<ins>92.5</ins>|<ins>92.3</ins>|<ins>89.3</ins>|83.2|81.6|88.9|86.6 (7th)|91.2 (13th)|
|Command-R (35B)|128K|32K|<ins>93.8</ins>|<ins>93.3</ins>|<ins>92.4</ins>|<ins>89.5</ins>|84.9|76.0|88.3|85.5 (9th)|91.1 (14th)|
|MegaBeam-Mistral (7B)|512K|32K|<ins>93.8</ins>|<ins>92.5</ins>|<ins>92.0</ins>|<ins>89.2</ins>|83.7|83.7|89.1|87.3 (6th)|91.0 (15th)|
|Mistral-Large-2407 (123B)|128K|32K|<ins>96.2</ins>|<ins>96.1</ins>|<ins>95.1</ins>|<ins>93.0</ins>|78.8|23.7|80.5|70.6 (24th)|90.4 (16th)|
|GradientAI/Llama3 (70B)|1M|16K|<ins>95.1</ins>|<ins>94.4</ins>|<ins>90.8</ins>|85.4|80.9|72.1|86.5|82.6 (15th)|90.3 (17th)|
|Mixtral-8x22B (39B/141B)|64K|32K|<ins>95.6</ins>|<ins>94.9</ins>|<ins>93.4</ins>|<ins>90.9</ins>|84.7|31.7|81.9|73.5 (22nd)|90.3 (18th)|
|Yi (34B)|200K|32K|<ins>93.3</ins>|<ins>92.2</ins>|<ins>91.3</ins>|<ins>87.5</ins>|83.2|77.3|87.5|84.8 (12th)|90.1 (19th)|
|Qwen3-4B* (4B)|128K|64K|<ins>95.1</ins>|<ins>93.6</ins>|<ins>91.0</ins>|<ins>87.8</ins>|77.8|66.0|85.2|TBD|TBD|
|EXAONE-4.0-1.2B* (1.2B)|64K|32K|<ins>87.0</ins>|<ins>86.7</ins>|<ins>88.8</ins>|81.1|77.4|-|84.2|TBD|TBD|
|Phi3-mini (3.8B)|128K|32K|<ins>92.2</ins>|<ins>91.5</ins>|<ins>90.7</ins>|<ins>87.5</ins>|80.6|66.7|84.8|80.9 (16th)|88.7 (20th)|
|Phi3-medium (14B)|128K|32K|<ins>93.3</ins>|<ins>93.2</ins>|<ins>91.1</ins>|<ins>86.8</ins>|78.6|46.1|81.5|74.8 (21st)|88.3 (21st)|
|Mixtral-8x7B (12.9B/46.7B)|32K|32K|<ins>94.9</ins>|<ins>92.1</ins>|<ins>92.5</ins>|<ins>85.9</ins>|72.4|44.5|80.4|72.8 (23rd)|87.9 (22nd)|
|GradientAI/Llama3 (8B)|1M|16K|<ins>92.8</ins>|<ins>90.3</ins>|<ins>85.7</ins>|79.9|76.3|69.5|82.4|78.5 (19th)|86.3 (23rd)|
|FILM-7B* (7B)|32K|32K|<ins>92.8</ins>|<ins>88.2</ins>|<ins>88.1</ins>|<ins>86.9</ins>|70.1|27.1|75.5|66.4 (26th)|84.7 (24th)|
|InternLM2.5 (7B)|1M|4K|<ins>88.1</ins>|85.5|84.5|82.7|75.5|68.9|80.9|77.8 (20th)|83.9 (25th)|
|Mistral (7B)|32K|16K|<ins>93.6</ins>|<ins>91.2</ins>|<ins>87.2</ins>|75.4|49.0|13.8|68.4|55.6 (28th)|81.2 (26th)|
|Mistral-Nemo|128K|16K|<ins>87.8</ins>|<ins>87.2</ins>|<ins>87.7</ins>|69.0|46.8|19.0|66.2|54.7 (29th)|77.8 (27th)|
|GLM3 (6B)|128K|4K|<ins>87.8</ins>|83.4|78.6|69.9|56.0|42.0|69.6|62.0 (27th)|77.2 (28th)|
|LWM (7B)|1M|<4K|82.3|78.4|73.7|69.1|68.1|65.0|72.8|69.9 (25th)|75.7 (29th)|
|DBRX (36B/132B)|32K|8K|<ins>95.1</ins>|<ins>93.8</ins>|83.6|63.1|2.4|0.0|56.3|38.0 (30th)|74.7 (30th)|
|Qwen1.5 (72B)|32K|8K|<ins>94.9</ins>|<ins>93.8</ins>|78.0|67.8|0.0|0.0|55.7|37.5 (31st)|74.0 (31st)|
|Together (7B)|32K|4K|<ins>88.2</ins>|81.1|69.4|63.0|0.0|0.0|50.3|33.8 (32nd)|66.7 (32nd)|
|LongChat (7B)|32K|<4K|84.7|79.9|70.8|59.3|0.0|0.0|49.1|33.1 (33rd)|65.2 (33rd)|
|LongAlpaca (13B)|32K|<4K|60.6|57.0|56.6|43.6|0.0|0.0|36.3|24.7 (34th)|47.9 (34th)|

  • Despite achieving nearly perfect performance on the vanilla needle-in-a-haystack (NIAH) test, most models exhibit large degradation on tasks in RULER as sequence length increases.
  • While all models claim a context size of 32K tokens or greater, only half can effectively handle sequences of 32K tokens, where "effective" means exceeding a qualitative threshold: Llama2 (7B) performance at 4K (85.6%). Scores exceeding the threshold are <ins>underlined</ins>.
  • Almost all models fall below the threshold before reaching the claimed context lengths.
  • Notes
    • Jamba-1.5-large results are reported by the authors in this report.
    • FILM-7B results are reported by the authors of this paper. They use YaRN without
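The effective-length threshold and the wAvg columns above can be reproduced with a short sketch. This is our reading of the criteria, not code from the repo: `effective_length` and `weighted_avg` are hypothetical helpers, and the linear position weights in `weighted_avg` are an assumption (one that matches the GPT-4-1106-preview row).

```python
def effective_length(scores, threshold=85.6):
    """Longest length up to which every score exceeds the Llama2 (7B)
    at-4K baseline (85.6). Simplified reading of the README's criterion."""
    eff = None
    for length in sorted(scores):
        if scores[length] > threshold:
            eff = length
        else:
            break
    return eff

def weighted_avg(scores, increasing=True):
    """Length-weighted average, assumed linear weights 1..N over sorted
    lengths: wAvg (inc) weights long contexts more, wAvg (dec) short ones."""
    lengths = sorted(scores)
    weights = list(range(1, len(lengths) + 1))
    if not increasing:
        weights.reverse()
    return sum(w * scores[l] for w, l in zip(weights, lengths)) / sum(weights)

# GPT-4-1106-preview scores from the table above.
gpt4 = {4_000: 96.6, 8_000: 96.3, 16_000: 95.2,
        32_000: 93.2, 64_000: 87.0, 128_000: 81.2}
```

With these definitions, `effective_length(gpt4)` gives 64K and the two weighted averages round to 89.0 and 94.1, matching the table's entries for that model.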