AgentMemoryBench
A unified benchmark for evaluating continual agent memory in LLM-based systems across 5 evaluation modes (Online, Offline, Replay, Transfer, Repair) and 6 interactive tasks, supporting both system and personal memory mechanisms.
<h2 id="overview">🎯 Overview</h2>
AgentMemoryBench provides a unified framework to evaluate how LLM agents learn and retain two types of memory:
- System Memory: Task workflows and execution patterns
- Personal Memory: User preferences and dialogue context
The benchmark spans 6 interactive tasks across 4 grounding types:
- Code-grounded: Database (SQL), Operating System (Shell), Knowledge Graph (SPARQL)
- Embodied: ALFWorld (household tasks)
- Web-grounded: WebShop (e-commerce)
- Dialogue-grounded: LoCoMo (long-term conversations)
AgentMemoryBench supports 5 complementary evaluation modes to provide multi-dimensional assessment of memory systems:

1. Offline Mode
Traditional train-test split evaluation. The agent learns from training samples (memory formation & evolution) and is tested on held-out samples (retrieval only).
Metrics: Average Success Rate (ASR), Average Steps (AS), F1-score, BLEU, LLM-as-Judge
2. Online Mode
Streaming evaluation where agents process samples sequentially with real-time memory updates. Performance is recorded after each sample to capture learning dynamics.
Metrics: Cumulative Success Rate (CSR), Learning Gain (LG), Stability Loss (SL)
3. Replay Mode
Periodic testing to measure knowledge retention and resistance to forgetting. After learning each stage, agents are tested on previously learned samples.
Metrics: Forgetting Rate (FR), Average Success Rate (ASR)
4. Transfer Mode
- Cross-environment: Tests knowledge generalization across different domains (e.g., DB→OS)
- Within-environment: Measures forward transfer—how learning current samples helps future ones
Metrics: Transfer Gain (TG), Forward Transfer Gain (FTG)
5. Repair Mode
Tests robustness and self-correction under erroneous feedback. Agents learn under incorrect rewards, then repair memory with correct feedback.
Metrics: Error Robustness (ER), Repair Gain (RG), Net Recovery (NR)
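For intuition, two of these metrics can be sketched from per-sample success flags. The formulas below are illustrative only and may differ from the benchmark's exact definitions: a running success rate after each streamed sample (Online mode's CSR) and an average drop in success when previously learned samples are re-tested (Replay mode's FR).

```python
def cumulative_success_rate(successes):
    """Running success rate after each streamed sample (Online mode).

    successes: iterable of booleans, one per sample in stream order.
    Returns a list of rates, one entry recorded after each sample.
    """
    rates = []
    total = 0
    for i, s in enumerate(successes, start=1):
        total += int(s)
        rates.append(total / i)
    return rates


def forgetting_rate(initial, revisited):
    """Average per-sample drop in success when re-testing learned samples
    (Replay mode): mean over samples of max(0, initial - revisited)."""
    drops = [max(0.0, a - b) for a, b in zip(initial, revisited)]
    return sum(drops) / len(drops)
```

A perfectly stable memory yields a forgetting rate of 0; a memory that loses everything it learned yields a rate equal to its initial success rate.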
<h2 id="project-structure">🏗️ Project Structure</h2>
```
AgentMemoryBench/
├── configs/                      # Configuration files
│   ├── assignment/               # Experiment configurations
│   │   └── default.yaml          # Main experiment config
│   ├── tasks/                    # Task-specific configs (6 tasks)
│   │   ├── dbbench.yaml          # Database (SQL)
│   │   ├── os.yaml               # Operating System (Shell)
│   │   ├── kg.yaml               # Knowledge Graph (SPARQL)
│   │   ├── alfworld.yaml         # Embodied AI
│   │   ├── webshop.yaml          # E-commerce
│   │   └── locomo-*.yaml         # Long conversations (0-9)
│   └── llmapi/                   # LLM API configurations
│       ├── api.yaml              # API endpoint & key for agent LLM
│       ├── agent.yaml            # Agent model name
│       ├── evaluate_api.yaml     # API for LoCoMo LLM-as-Judge
│       └── evaluate_agent.yaml   # Model for evaluation
│
├── data/                         # Task datasets
│   ├── dbbench/                  # Database operations (SQL)
│   ├── os_interaction/           # OS commands (Shell)
│   ├── knowledgegraph/           # KG queries (SPARQL)
│   ├── alfworld/                 # Embodied tasks
│   ├── webshop/                  # E-commerce tasks
│   └── locomo/                   # Long dialogues (10 conversations)
│
├── memory/                       # Memory mechanism implementations
│   ├── base.py                   # Base class for all memory mechanisms
│   ├── registry.py               # Memory registry system
│   ├── zero_shot/                # Baseline (no memory)
│   ├── streamICL/                # RAG-based retrieval (topk=4)
│   ├── awmPro/                   # System memory via workflows (topk=8)
│   ├── mem0/                     # Personal memory via preferences
│   └── MEMs/                     # Multi-memory coordination (proposed)
│
├── execution/                    # Execution engines
│   ├── base.py                   # Base execution engine
│   └── single_agent/             # Single-agent executor
│
├── src/                          # Core implementation
│   ├── runner/                   # Main entry point
│   │   ├── main.py               # Experiment runner
│   │   ├── builders.py           # Component builders
│   │   ├── config.py             # Configuration parser
│   │   └── schedule_utils.py     # Scheduling utilities
│   ├── client/                   # Client-side scheduling
│   │   ├── backend.py            # Backend interface
│   │   └── scheduler.py          # Task scheduler
│   ├── server/                   # Backend task servers (Docker)
│   │   └── tasks/                # Task implementations
│   └── utils/                    # Analysis utilities
│       ├── message_schema.py     # Message format compatibility layer
│       └── analyze_results_*.py  # Result analysis scripts
│
├── extra/                        # Docker orchestration
│   ├── docker-compose.yml        # Service definitions
│   └── *.Dockerfile              # Task-specific containers
│
├── outputs/                      # Experiment results
│   └── [timestamp]/              # Grouped by experiment time
│       └── [task_name]/          # Grouped by task
│           └── [index].json      # Individual sample results
│
├── requirements.txt              # Python dependencies
└── README.md                     # This file
```
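The per-sample files under `outputs/` can be aggregated with a short script. The sketch below assumes each `[index].json` carries a boolean `success` field; the actual result schema may differ (see `src/utils/analyze_results_*.py` for the project's own analysis scripts).

```python
import json
import pathlib


def summarize_run(run_dir):
    """Compute a per-task success rate for one outputs/[timestamp]/ run.

    Walks each task subdirectory, reads every [index].json, and averages
    the (assumed) boolean 'success' field. Returns {task_name: rate}.
    """
    summary = {}
    for task_dir in sorted(pathlib.Path(run_dir).iterdir()):
        if not task_dir.is_dir():
            continue
        flags = [json.loads(p.read_text()).get("success", False)
                 for p in sorted(task_dir.glob("*.json"))]
        if flags:  # skip empty task directories
            summary[task_dir.name] = sum(flags) / len(flags)
    return summary
```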
<h2 id="quick-start">🚀 Quick Start</h2>
1. Prerequisites
Python Environment
```shell
# Create conda environment with Python 3.9
conda create -n aMB python=3.9

# Activate environment
conda activate aMB

# Navigate to project directory
cd /path/to/AgentMemoryBench

# Install dependencies
pip install -r requirements.txt
```
Docker Installation
Docker is required to run the backend task servers. Install it first:
- Windows/Mac: Docker Desktop
- Linux: follow the official installation guide
2. Data & Model Setup
Knowledge Graph (Freebase) Database
The Knowledge Graph task requires the Freebase database:
1. Download the database (~50 GB):
   - Download link: OneDrive
   - Recommended: use a download manager (e.g., Free Download Manager) instead of a browser
2. Extract the downloaded `virtuoso_db.zip`
3. Configure the path in `extra/docker-compose.yml` (line 114):

```yaml
freebase:
  build:
    context: ..
    dockerfile: extra/freebase.Dockerfile
  volumes:
    - "/absolute/path/to/virtuoso_db:/database"  # Use absolute path
  init: true
```

Important:
- Use absolute paths
- Windows: use forward slashes `/` (e.g., `C:/Users/...`)
- Example: `B:/desktop/AgentMemoryBench/virtuoso_db:/database`
LoCoMo Tokenizer
Download the tokenizer model for fair evaluation:
```python
# Download xlm-roberta-base from HuggingFace
# https://huggingface.co/FacebookAI/xlm-roberta-base

# Configure path in src/server/tasks/locomo/task.py (line 47)
tokenizer = AutoTokenizer.from_pretrained("/path/to/xlm-roberta-base")
```
Embedding Model (for streamICL, awmPro, MEMs)
Download the embedding model for fair comparison:
```
# Download bge-base-en-v1.5 from HuggingFace
# https://huggingface.co/BAAI/bge-base-en-v1.5

# Configure paths in YAML files:
# - memory/streamICL/streamICL.yaml
# - memory/awmPro/awmPro.yaml
# - memory/MEMs/MEMs.yaml
```
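For intuition, the kind of top-k lookup a streamICL-style memory (topk=4) performs over these embeddings can be sketched in plain Python: score every stored vector against the query by cosine similarity and keep the best k. This is an illustrative sketch only; the actual mechanisms use bge-base-en-v1.5 embeddings and their own storage formats.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def top_k(query, memory, k=4):
    """Indices of the k stored embeddings most similar to the query,
    best match first -- the retrieval step of a RAG-style memory."""
    ranked = sorted(range(len(memory)),
                    key=lambda i: cosine(query, memory[i]),
                    reverse=True)
    return ranked[:k]
```

The retrieved indices would then map back to stored experiences (e.g., past trajectories or workflows) that get injected into the agent's context.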
Mem0 API Key
To use the Mem0 method:
- Register for an API key at mem0.ai
- Configure in `memory/mem0/mem0.yaml`:

```yaml
api_key: "your_mem0_api_key_here"
wait_time: 60.0  # Recommended: 60s for system tasks, 150s for personal, 100s for mixed
```
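A minimal sketch of what the `wait_time` setting implies: pacing successive memory-API calls so each one starts at least `wait_time` seconds after the previous one finished. This is illustrative only, not Mem0's actual client code.

```python
import time


class ThrottledClient:
    """Enforce a minimum gap (wait_time seconds) between successive calls,
    e.g. to stay within a remote memory API's rate limits."""

    def __init__(self, wait_time=60.0):
        self.wait_time = wait_time
        self._last = None  # monotonic timestamp of the previous call

    def call(self, fn, *args, **kwargs):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.wait_time - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)  # wait out the rest of the gap
        result = fn(*args, **kwargs)
        self._last = time.monotonic()
        return result
```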
3. Start Backend Services
```shell
# Navigate to Docker directory
cd extra

# Build required containers
docker pull mysql:8
docker-compose build local-os-default
docker-compose build local-os-packages
docker-compose build local-os-ubuntu
docker-compose build freebase

# Start all services
docker-compose up
```
Note: Keep this terminal running. The services listen on http://localhost:5038.
4. Configure LLM API
Recommended: Use SiliconFlow API to avoid model name mismatches.
Agent LLM Configuration
Edit `configs/llmapi/api.yaml`:
base_url: "https://ap
