EAGer

EAGer: Entropy-Aware GEneRation for Adaptive Inference-Time Scaling

Generate Convert Improve

Install / Use

/learn @DanielSc4/EAGer

About this skill

Quality Score

0/100

README

EAGer: Entropy-Aware GEneRation for Adaptive Inference-Time Scaling

📜 Read the paper here 📜 · 🗣️ I want to cite you 🗣️

Daniel Scalena1,2 · Leonidas Zotos1 · Elisabetta Fersini2 · Malvina Nissim1 · Ahmet Üstün3,4

1University of Groningen · 2University of Milano-Bicocca · 3Cohere Labs · 4Cohere

🌍 Overview

EAGer is a training-free method that dynamically adjusts compute at inference time based on token-level uncertainty. Instead of allocating the same compute budget for every prompt, EAGer branches into multiple reasoning paths only when the model encounters high-entropy tokens—indicating uncertainty.

🔑 Key Benefits:

✨ Reduces redundant computation by up to 65%
🎯 Improves reasoning performance by up to 37% in Pass@k
⚡ Enables adaptive scaling without additional training
🔄 Automatically reallocates saved compute to more complex instances

⚙️ Quick Start

1. Installation

Install dependencies using UV (recommended):

uv pip install -e .

Or using pip:

pip install -e .

2. Running Experiments

Execute experiments using the src.main_vllm script:

python -m src.main_vllm \
    --model_name "openai/gpt-oss-20b" \
    --data_name "opencompass/AIME2025" \
    --temperature 0.6 \
    --entropy_threshold 2.2 \
    --max_sequences 32 \
    --experiments "eager" \
    --output_dir "$output_dir" \
    --seed 55

➡️ example_execution.sh provides a simple example to run experiments following the same format as above.

⚙️ Configuration

Core Parameters

| Parameter | Description | Example | |-----------|-------------|---------| | model_name | HuggingFace model identifier | openai/gpt-oss-20b | | data_name | HuggingFace dataset identifier | opencompass/AIME2025 | | temperature | Sampling temperature (must be > 0) | 0.6 | | entropy_threshold | Branching threshold (~2.0 for EAGer-init, ~2.4 for EAGer) | 2.2 | | max_sequences | Maximum parallel sequences | 32 | | experiments | Experiment type to run | eager | | output_dir | Output directory (serves as experiment ID) | See below | | seed | Random seed for reproducibility | 55 |

Supported Models

HuggingFaceTB/SmolLM3-3B
Qwen/Qwen3-4B
deepseek-ai/DeepSeek-R1-0528-Qwen3-8B
openai/gpt-oss-20b

💡: Any model supported by vLLM should work.

Supported Datasets

Maxwell-Jia/AIME_2024 (math)
opencompass/AIME2025 (math)
MathArena/hmmt_feb_2025 (math)
fingertap/GPQA-Diamond (science)
evalplus/humanevalplus (code generation)

Experiment Types

| Type | Description | Prerequisites | |------|-------------|---------------| | parallel | Standard parallel sampling baseline | None | | eager_init | Initial EAGer with fixed threshold | None | | eager_adapt | Adaptive threshold adjustment | Requires eager_init | | eager | Full EAGer with dynamic compute allocation | Requires eager_init | | all | Run all experiments sequentially | None |

Output Directory Structure

Experiments are saved to: outputs/{model_name}/{data_name}/{timestamp}/

Example: For Qwen/Qwen3-4B on opencompass/AIME2025, outputs are placed in:
outputs/Qwen3-4B/AIME2025/2025-10-01_12-00-01/

Auto-create directory:

model_name="Qwen/Qwen3-4B"
data_name="opencompass/AIME2025"
timestamp=$(date +%F_%H-%M-%S)

model_dir_name=$(basename "$model_name")
data_dir_name=$(basename "$data_name")
output_dir="outputs/${model_dir_name}/${data_dir_name}/${timestamp}"
mkdir -p "$output_dir"

Then pass it to the script using --output_dir "$output_dir".

Output JSON structure

Every experiment produces a JSON file stored in the specified output_dir. This file logs all metadata, configurations, and results of the experiment.

<details> <summary>📀 <ins>Click to reveal the general structure of the JSON output</ins></summary>

```json
{
    // Experiment metadata and parameters
    "model_name": "Qwen/Qwen3-4B",
    "temperature": 0.6,
    "entropy_threshold": 2.5,
    "max_sequences": 32,
    "seed": 55,
    
    // Checkpoint state (to allow resuming if interrupted)
    "status": {
        "state": "COMPLETED",
        "completed_sequences": 30,
        "total_sequences": 30,
        "progress": "30/30"
    },
    
    // Extra information
    "extra": {
        "notes": "",
        "generation_time_so_far (s)": 33938.9826028347
    },
    
    // Automatically computed metrics
    "results": {
        "avg_acc": 0.89,
        "pass_at_1": 0.95,
        "cons_at_max": 0.68
    },

    // Entropy-aware generation details for each prompt, "default-generations" for Full Parallel sampling
    "entropy-aware-generations": [
        {
            // original prompt
            "prompt": "Find the sum of all integer bases $b>9$ for which $17_{b}$ is a divisor of $97_{b}$.",
            // number of actual sequences -> max at M (Full Parallel or EAGer-init) or M + b (EAGer-adapt or EAGer)
            "generated_sequences": 1,
            "target": "70",
            // extracted answer from \boxed{} env.
            "extracted_answers": "['70', '70']",
            // List of all generated reasoning paths
            "generations": [
                "<|im_start|>user\nFind the sum of all integer bases $b>9$ [...] \n### Final Answer\n\n$$\n\\boxed{70}\n$$",
                "<|im_start|>user\nFind the sum of all integer bases $b>9$ [...] \nFinal answer is: $$\n\\boxed{70}\n$$"
            ],
            // Token-level entropy values for each sequence
            "entropies": [
                "[0.0002, 0.0, 0.0362, 0.1260 [...], -2.0]",
                "[-1, -1, -1, 0.1260 [...], -2.0]"      // this is a branched sequence, -1 = token reused from parent sequence
            ],
            // Records of all branching events (if any)
            "recorded_branches": [
                    "{'seq_idx': 0, 'gen_step': 315, 'auto_selected_token': ' sym', 'entropy': 2.557540054321289, 'entropy_threshold': 2.5, 'man_token_original': ' sym', 'man_token_branch': ' favorable', 'new_seq_idx': 1, 'last_generated_chars': ' number of total possibilities, by considering'}",
            ]
        }
        // ... additional prompts
    ]
}
```

</details>

Each JSON file is optimized for human readability and easy access.

A corresponding .large summary file can be generated using:

python script_manual_parallel_recapper.py <model_name> <data_name> <timestamp>

(see the 🧩 Evaluation section below for details).

Advanced Parameters

--dtype "bfloat16"                    # Model precision
--max_model_len 32768                 # Maximum context length
--gpu_memory_utilization 0.8          # GPU memory usage (0.0-1.0)
--device "cuda"                       # Device to use

🧩 Evaluation

Automatic Evaluation

Evaluation for Maxwell-Jia/AIME_2024, opencompass/AIME2025, and fingertap/GPQA-Diamond runs automatically during experiments. Answers are extracted from \boxed{} environments to compute Pass@k, Cons@k, and PassRate metrics.

Code Generation Evaluation

For evalplus/humanevalplus, use the provided evaluation script:

⚠️ Warning: This script executes generated code locally without sandboxing, which poses security risks.

python script_eval_code_gen.py Qwen3-4B AIME2025 2025-10-01_12-00-01

For advanced evaluation, refer to the evalplus documentation.

Summary of Results

To recap results from any experiment folder:

python script_manual_parallel_recapper.py Qwen3-4B AIME2025 2025-10-01_12-00-01

Citation

@misc{scalena2025eagerentropyawaregenerationadaptive,
      title={EAGER: Entropy-Aware GEneRation for Adaptive Inference-Time Scaling}, 
      author={Daniel Scalena and Leonidas Zotos and Elisabetta Fersini and Malvina Nissim and Ahmet Üstün},
      year={2025},
      eprint={2510.11170},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2510.11170}, 
}

🧾 Abstract

With the rise of reasoning language models and test-time scaling methods as a paradigm for improving model performance, substantial computation is often required to generate multiple candidate sequences from the same prompt. This enables exploration of different reasoning paths toward the correct solution, however, allocates the same compute budget for each prompt.

Grounded on the assumption that different prompts carry different degrees of complexity, and thus different computation needs, we propose EAGer, a training-free generation method that leverages model uncertainty through token-wise entropy distribution to reduce redundant computation and concurrently improve overall performance.

EAGer allows branching to multiple reasoning paths only in the presence of high-entropy tokens, and then reallocates the saved compute budget to the instances where exploration of alternative paths is most needed.

We find that across multiple open-source models on complex reasoning benchmarks such as AIME 2025, EAGer can reallocate the budget without accessing target labels, achieving

Related Skills

node-connect

335.8k

Diagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps

frontend-design

82.7k

Create distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.

openai-whisper-api

335.8k

Transcribe audio via OpenAI Audio Transcriptions API (Whisper).

commit-push-pr

82.7k

Commit, push, and open a PR