# Ace

Evolve your language agent with Agentic Context Engineering (ACE).

## Install / Use

```
/learn @ace-agent/AceREADME
```
# Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models
<div align="left"> <p align="left" style="display:flex; gap:18px;"> <a href="https://arxiv.org/abs/2510.04618" target="_blank" style="margin-right:0;"> <img alt="arXiv" src="https://img.shields.io/badge/arXiv-2510.04618-b31b1b.svg"> </a> <a href="https://join.slack.com/t/ace-agent/shared_invite/zt-3np7gusuf-DCUJaBshNjuAz5ECDx702w" target="_blank" style="margin-right:0;"> <img alt="Slack" src="https://img.shields.io/badge/Join Slack-4A154B?logo=slack&logoColor=white"> </a> <a href="https://discord.gg/NW2W4xYt" target="_blank" style="margin-right:0;"> <img alt="Discord" src="https://img.shields.io/badge/Discord-7289DA?logo=discord&logoColor=white"> </a> <a href="https://deepwiki.com/ace-agent/ace" target="_blank" style="margin-right:0;"> <img alt="Ask DeepWiki" src="https://deepwiki.com/badge.svg"> </a> <a href="https://forms.gle/ZNJpqVBRa8QoPjzM7" target="_blank" style="margin-right:0;"> <img alt="Feedback & Interest Form" src="https://img.shields.io/badge/Feedback & Interest Form-4285F4?logo=googleforms&logoColor=white"> </a> </p> <img src="assets/images/ace_framework.png" alt="ACE Framework" width="800"/> </div>

## 🎯 Overview
ACE (Agentic Context Engineering) is a framework that enables large language models to self-improve by treating contexts as evolving playbooks: strategies are accumulated, refined, and organized through a modular process of generation, reflection, and curation. Unlike traditional approaches, which suffer from brevity bias and context collapse, ACE applies structured, incremental updates guided by a grow-and-refine principle, preserving detailed, domain-specific knowledge while remaining comprehensive and scalable throughout adaptation.
## Latest News

- 2025 Nov: The ACE paper and repo say "Hello World"!
## Key Features
- 🔄 Three-Role Agentic Architecture: Generator, Reflector, and Curator work together to continuously improve contexts
- 📈 Incremental Delta Updates: Localized edits that preserve prior knowledge while accumulating new insights
- 🎓 Self-Supervised Learning: Adapts effectively without labeled supervision by leveraging natural execution feedback
- 🚀 High Efficiency: 86.9% lower adaptation latency on average compared to existing adaptive methods
- 💰 Cost Effective: Significantly fewer rollouts and lower dollar costs while achieving higher accuracy
## Tutorials
- 📚 Adding Dataset for Evaluation Link
- ✨ Extending ACE for Tool Calling (Coming Soon)
## 📊 Performance
ACE consistently outperforms strong baselines, achieving average gains of +10.6% on agent tasks and +8.6% on domain-specific benchmarks, across both offline and online adaptation settings.
### Benchmarks
| Task Category | Dataset | Improvement | Details |
|---------------|---------|-------------|---------|
| Agent Tasks | AppWorld | +10.6% | Matches the top-ranked production-level agent (GPT-4.1) on average and surpasses it on the harder test-challenge split, using a smaller open-source model |
| Finance | FiNER + XBRL Formula | +8.6% | Domain-specific reasoning with structured information extraction |
### Efficiency Improvements
- Offline (AppWorld): -82.3% latency and -75.1% rollouts vs GEPA
- Online (FiNER): -91.5% latency and -83.6% token cost vs Dynamic Cheatsheet
## How It Works
- Generator produces reasoning trajectories for new queries, surfacing both effective strategies and recurring pitfalls
- Reflector separates evaluation and insight extraction from curation, improving context quality
- Curator converts lessons into structured delta updates with helpful/harmful counters, using deterministic merging with de-duplication and pruning
This design prevents the context collapse problem where iterative rewriting erodes details over time.
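The grow-and-refine loop above can be sketched in a few lines. This is a minimal, illustrative toy, not the repo's implementation: the real Curator produces LLM-generated deltas, and the class names, delta schema, and pruning rule here are assumptions.

```python
# Toy sketch of ACE-style incremental delta updates: bullets carry
# helpful/harmful counters and are merged deterministically, so iterative
# rewriting never erodes previously accumulated text.
from dataclasses import dataclass, field

@dataclass
class Bullet:
    text: str
    helpful: int = 0
    harmful: int = 0

@dataclass
class Playbook:
    bullets: dict = field(default_factory=dict)  # bullet id -> Bullet
    _next_id: int = 1

    def apply_delta(self, ops):
        """Deterministically merge a list of delta operations."""
        for op in ops:
            if op["type"] == "add":
                # De-duplicate: re-adding identical text bumps its counter
                # instead of growing the playbook.
                existing = next((b for b in self.bullets.values()
                                 if b.text == op["text"]), None)
                if existing:
                    existing.helpful += 1
                else:
                    self.bullets[f"str-{self._next_id:05d}"] = Bullet(op["text"], helpful=1)
                    self._next_id += 1
            elif op["type"] == "vote":
                b = self.bullets[op["id"]]
                b.helpful += op.get("helpful", 0)
                b.harmful += op.get("harmful", 0)
        # Prune bullets that have proven consistently harmful (toy threshold).
        self.bullets = {k: b for k, b in self.bullets.items()
                        if b.harmful <= b.helpful + 2}

pb = Playbook()
pb.apply_delta([{"type": "add", "text": "Verify data types before processing"}])
pb.apply_delta([{"type": "add", "text": "Verify data types before processing"},
                {"type": "vote", "id": "str-00001", "harmful": 1}])
# Counters accumulate (helpful=2, harmful=1) while the bullet text is preserved.
```

Because updates are localized edits to individual bullets rather than a full rewrite of the context, no single step can collapse the playbook.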
## 🚀 Quick Start
### Installation

```bash
# Clone the repository
git clone https://github.com/ace-agent/ace.git
cd ace

# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install ACE and core dependencies
uv sync

# Set up API keys
cp .env.example .env
# Edit .env and set the API key(s) you need
```
### Basic Usage

```python
from ace import ACE
from utils import initialize_clients

# Initialize API clients
api_provider = "sambanova"  # or "together", "openai", "commonstack"

# Initialize ACE system
ace_system = ACE(
    api_provider=api_provider,
    generator_model="DeepSeek-V3.1",
    reflector_model="DeepSeek-V3.1",
    curator_model="DeepSeek-V3.1",
    max_tokens=4096
)

# Prepare configuration
config = {
    'num_epochs': 1,
    'max_num_rounds': 3,
    'curator_frequency': 1,
    'eval_steps': 100,
    'online_eval_frequency': 15,
    'save_steps': 50,
    'playbook_token_budget': 80000,
    'task_name': 'your_task',
    'json_mode': False,
    'no_ground_truth': False,
    'save_dir': './results',
    'test_workers': 20,
    'use_bulletpoint_analyzer': False,
    'api_provider': api_provider
}

# Offline adaptation
results = ace_system.run(
    mode='offline',
    train_samples=train_data,
    val_samples=val_data,
    test_samples=test_data,  # Optional
    data_processor=processor,
    config=config
)

# Online adaptation
results = ace_system.run(
    mode='online',
    test_samples=test_data,
    data_processor=processor,
    config=config
)

# Evaluation only
results = ace_system.run(
    mode='eval_only',
    test_samples=test_data,
    data_processor=processor,
    config=config
)
```
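The `processor` passed to `ace_system.run` supplies task-specific prompt construction and scoring. As a rough illustration only, a minimal processor might look like the sketch below; the class and method names are hypothetical, so consult the repo's actual data-processor interface before adapting it.

```python
# Hypothetical data-processor sketch (method names are illustrative, not the
# repo's real interface): it turns a sample plus the current playbook into a
# generator prompt, and scores a prediction against the ground-truth answer.
class SimpleDataProcessor:
    def build_prompt(self, sample, playbook):
        """Prepend the evolving playbook to the task question."""
        return f"{playbook}\n\nQuestion: {sample['question']}"

    def evaluate(self, prediction, sample):
        """Return 1.0 on an exact-match answer, else 0.0."""
        return float(prediction.strip() == sample["answer"].strip())

processor = SimpleDataProcessor()
score = processor.evaluate("42", {"question": "What is 6*7?", "answer": "42"})
```

The execution feedback returned by `evaluate` is what lets ACE adapt without labeled supervision in the `no_ground_truth` setting.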
## 💼 Finance Domain Example
### Training Script Usage

The `finance/run.py` script provides a unified interface for training and evaluation on financial analysis tasks.
```bash
# Offline training (with automatic initial and final testing)
uv run python -m eval.finance.run \
    --task_name finer \
    --mode offline \
    --save_path results

# Online training and testing
uv run python -m eval.finance.run \
    --task_name finer \
    --mode online \
    --save_path results

# Evaluation on the test split only. Provide a pre-trained playbook, or leave
# --initial_playbook_path empty to evaluate an uninitialized playbook.
uv run python -m eval.finance.run \
    --task_name finer \
    --mode eval_only \
    --initial_playbook_path results/ace_run_TIMESTAMP_finer_offline/best_playbook.txt \
    --save_path test_results

# Training with custom configuration
uv run python -m eval.finance.run \
    --task_name finer \
    --mode offline \
    --save_path results \
    --num_epochs 3 \
    --eval_steps 100 \
    --max_tokens 4096
```
### Available Arguments
<details>
<summary>Click here to see available arguments</summary>

| Argument | Description | Default |
|----------|-------------|---------|
| `--task_name` | Task to train on (e.g., finer, formula) | Required |
| `--save_path` | Directory to save results | Required |
| `--initial_playbook_path` | Path to initial playbook | Optional |
| `--mode` | Run mode: 'offline' for offline training with validation, 'online' for online training and testing on the test split, 'eval_only' for evaluation only | offline |
| `--api_provider` | API provider for LLM calls. Choose from ['sambanova', 'together', 'openai', 'commonstack'] | sambanova |
| `--num_epochs` | Number of training epochs | 1 |
| `--max_num_rounds` | Max reflection rounds for incorrect answers | 3 |
| `--curator_frequency` | Run curator every N steps | 1 |
| `--eval_steps` | Evaluate every N steps | 100 |
| `--online_eval_frequency` | Update playbook every N samples for evaluation in online mode | 15 |
| `--save_steps` | Save intermediate playbooks every N steps | 50 |
| `--max_tokens` | Maximum tokens for LLM responses | 4096 |
| `--playbook_token_budget` | Total token budget for playbook | 80000 |
| `--test_workers` | Number of parallel workers for testing | 20 |
| `--generator_model` | Model for generator | DeepSeek-V3.1 |
| `--reflector_model` | Model for reflector | DeepSeek-V3.1 |
| `--curator_model` | Model for curator | DeepSeek-V3.1 |
| `--json_mode` | Enable JSON mode for structured output | False |
| `--no_ground_truth` | Don't use ground truth in reflection | False |
| `--use_bulletpoint_analyzer` | Enable bulletpoint analyzer for playbook deduplication and merging | False |
| `--bulletpoint_analyzer_threshold` | Similarity threshold for bulletpoint analyzer (0-1) | 0.9 |

</details>
## 📈 Results and Outputs
Taking offline training as an example, a completed run generates:

```
results/
└── ace_run_TIMESTAMP_finer_offline/
    ├── run_config.json                     # Training configuration
    ├── final_results.json                  # Consolidated results from all stages
    ├── initial_test_results.json           # Initial test results with empty playbook (baseline)
    ├── final_test_results.json             # Final test results with best playbook
    ├── train_results.json                  # Training results
    ├── val_results.json                    # Validation results and error logs
    ├── pre_train_post_train_results.json   # Pre-train and post-train generator output per training sample
    ├── final_playbook.txt                  # Final evolved context
    ├── best_playbook.txt                   # Best-performing context (offline training only)
    ├── bullet_usage_log.jsonl              # Bullet usage tracking
    ├── curator_operations_diff.jsonl       # Curator operation tracking
    ├── detailed_llm_logs/                  # Detailed LLM call logs
    └── intermediate_playbooks/             # Intermediate playbooks
```
### Understanding Playbook Format
The evolved context (playbook) follows this structure:

```
## STRATEGIES & INSIGHTS
[str-00001] helpful=5 harmful=0 :: Always verify data types before processing
[str-00002] helpful=3 harmful=1 :: Consider edge cases in financial data

## FORMULAS & CALCULATIONS
[cal-00003] helpful=
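Bullets in this format are easy to post-process. The snippet below parses them with a regex inferred from the example lines above; the repo's own tooling may track additional fields, so treat this as a sketch.

```python
import re

# Match playbook bullets of the form:
#   [str-00001] helpful=5 harmful=0 :: bullet text
BULLET_RE = re.compile(
    r"\[(?P<id>[a-z]+-\d+)\]\s+helpful=(?P<helpful>\d+)"
    r"\s+harmful=(?P<harmful>\d+)\s+::\s+(?P<text>.+)"
)

def parse_playbook(text):
    """Extract bullet id, counters, and text from playbook lines."""
    bullets = []
    for line in text.splitlines():
        m = BULLET_RE.match(line.strip())
        if m:
            d = m.groupdict()
            d["helpful"], d["harmful"] = int(d["helpful"]), int(d["harmful"])
            bullets.append(d)
    return bullets

sample = "[str-00001] helpful=5 harmful=0 :: Always verify data types before processing"
print(parse_playbook(sample)[0]["helpful"])  # → 5
```

The helpful/harmful counters are what the Curator's deterministic merge updates in place, so a parser like this can also be used to audit which bullets are earning their keep.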
