
AgentMemoryBench

A unified benchmark for evaluating continual agent memory in LLM-based systems across 5 evaluation modes (Offline, Online, Replay, Transfer, Repair) and 6 interactive tasks, supporting both system and personal memory mechanisms.

<div align="center"> <img src="figures/AgentMemoryBench.svg" width="100%" alt="Agent Memory Bench" /> <br/> <br/> <a href="https://github.com/s010m00n/AgentMemoryBench/stargazers"> <img src="https://img.shields.io/github/stars/s010m00n/AgentMemoryBench?style=for-the-badge&logo=github&color=ff6b6b" alt="Stars"> </a> <a href="https://github.com/s010m00n/AgentMemoryBench/network/members"> <img src="https://img.shields.io/github/forks/s010m00n/AgentMemoryBench?style=for-the-badge&logo=github&color=ee5a6f" alt="Forks"> </a> <a href="https://github.com/s010m00n/AgentMemoryBench/issues"> <img src="https://img.shields.io/github/issues/s010m00n/AgentMemoryBench?style=for-the-badge&logo=github&color=c44569" alt="Issues"> </a> <a href="https://github.com/s010m00n/AgentMemoryBench/blob/main/LICENSE"> <img src="https://img.shields.io/badge/License-MIT-brightgreen?style=for-the-badge" alt="License"> </a> <br/> <br/> <p align="center"> <strong>A Unified Benchmark for Continual Agent Memory</strong> <br /> <br /> A comprehensive benchmark for evaluating memory mechanisms in LLM-based agents across continual learning scenarios, supporting both <strong>system memory</strong> (task workflows) and <strong>personal memory</strong> (user preferences). <br /> <br /> <a href="#overview">Overview</a> • <a href="#evaluation-modes">Evaluation Modes</a> • <a href="#quick-start">Quick Start</a> • <a href="#creating-custom-memory-mechanisms">Custom Memory</a> • <a href="#implemented-memory-mechanisms">Methods</a> </p> </div>
<h2 id="overview">🎯 Overview</h2>

AgentMemoryBench provides a unified framework to evaluate how LLM agents learn and retain two types of memory:

  • System Memory: Task workflows and execution patterns
  • Personal Memory: User preferences and dialogue context
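As an illustrative sketch of this distinction (the class and field names below are hypothetical, not the benchmark's actual API), the two memory types can be modeled as records sharing a common base:

```python
from dataclasses import dataclass, field

@dataclass
class MemoryRecord:
    """Base record shared by both memory types (illustrative only)."""
    content: str

@dataclass
class SystemMemory(MemoryRecord):
    """Task workflows and execution patterns, e.g. a reusable SQL recipe."""
    task: str = "dbbench"
    steps: list = field(default_factory=list)

@dataclass
class PersonalMemory(MemoryRecord):
    """User preferences and dialogue context."""
    user_id: str = "u0"

wf = SystemMemory(content="SELECT with JOIN", task="dbbench",
                  steps=["inspect schema", "write query", "verify rows"])
pref = PersonalMemory(content="prefers concise answers", user_id="alice")
print(wf.task, len(wf.steps), pref.user_id)  # dbbench 3 alice
```

System memory is keyed by task and holds reusable procedures, while personal memory is keyed by user and holds preferences; a memory mechanism may maintain either or both.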

The benchmark spans 6 interactive tasks across 4 grounding types:

  • Code-grounded: Database (SQL), Operating System (Shell), Knowledge Graph (SPARQL)
  • Embodied: ALFWorld (household tasks)
  • Web-grounded: WebShop (e-commerce)
  • Dialogue-grounded: LoCoMo (long-term conversations)
<h2 id="evaluation-modes">📊 Evaluation Modes</h2>

AgentMemoryBench supports 5 complementary evaluation modes to provide multi-dimensional assessment of memory systems:


1. Offline Mode

Traditional train-test split evaluation. The agent learns from training samples (memory formation & evolution) and is tested on held-out samples (retrieval only).

Metrics: Average Success Rate (ASR), Average Steps (AS), F1-score, BLEU, LLM-as-Judge

2. Online Mode

Streaming evaluation where agents process samples sequentially with real-time memory updates. Performance is recorded after each sample to capture learning dynamics.

Metrics: Cumulative Success Rate (CSR), Learning Gain (LG), Stability Loss (SL)

3. Replay Mode

Periodic testing to measure knowledge retention and resistance to forgetting. After each learning stage, agents are re-tested on previously learned samples.

Metrics: Forgetting Rate (FR), Average Success Rate (ASR)

4. Transfer Mode

  • Cross-environment: Tests knowledge generalization across different domains (e.g., DB→OS)
  • Within-environment: Measures forward transfer, i.e., how learning current samples helps future ones

Metrics: Transfer Gain (TG), Forward Transfer Gain (FTG)

5. Repair Mode

Tests robustness and self-correction under erroneous feedback. Agents learn under incorrect rewards, then repair memory with correct feedback.

Metrics: Error Robustness (ER), Repair Gain (RG), Net Recovery (NR)
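The streaming and replay metrics above can be sketched in a few lines. Note the formulas here are plausible readings of the metric names, not necessarily the benchmark's exact definitions:

```python
def cumulative_success_rate(successes):
    """CSR after each sample: running mean of 0/1 success flags."""
    csr, total = [], 0
    for i, s in enumerate(successes, 1):
        total += s
        csr.append(total / i)
    return csr

def forgetting_rate(first_pass, replay):
    """FR: fraction of previously solved samples failed on replay."""
    forgotten = sum(1 for a, b in zip(first_pass, replay) if a and not b)
    solved = sum(first_pass)
    return forgotten / solved if solved else 0.0

print(cumulative_success_rate([1, 0, 1, 1]))        # running means after each sample
print(forgetting_rate([1, 1, 0, 1], [1, 0, 0, 0]))  # 2 of 3 solved samples forgotten
```

Under these readings, CSR captures learning dynamics during Online mode, while FR compares first-pass success against Replay-mode results per stage.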

<h2 id="project-structure">🏗️ Project Structure</h2>

AgentMemoryBench/
├── configs/                    # Configuration files
│   ├── assignment/            # Experiment configurations
│   │   └── default.yaml       # Main experiment config
│   ├── tasks/                 # Task-specific configs (6 tasks)
│   │   ├── dbbench.yaml       # Database (SQL)
│   │   ├── os.yaml            # Operating System (Shell)
│   │   ├── kg.yaml            # Knowledge Graph (SPARQL)
│   │   ├── alfworld.yaml      # Embodied AI
│   │   ├── webshop.yaml       # E-commerce
│   │   └── locomo-*.yaml      # Long conversations (0-9)
│   └── llmapi/                # LLM API configurations
│       ├── api.yaml           # API endpoint & key for agent LLM
│       ├── agent.yaml         # Agent model name
│       ├── evaluate_api.yaml  # API for LoCoMo LLM-as-Judge
│       └── evaluate_agent.yaml # Model for evaluation
│
├── data/                       # Task datasets
│   ├── dbbench/               # Database operations (SQL)
│   ├── os_interaction/        # OS commands (Shell)
│   ├── knowledgegraph/        # KG queries (SPARQL)
│   ├── alfworld/              # Embodied tasks
│   ├── webshop/               # E-commerce tasks
│   └── locomo/                # Long dialogues (10 conversations)
│
├── memory/                     # Memory mechanism implementations
│   ├── base.py                # Base class for all memory mechanisms
│   ├── registry.py            # Memory registry system
│   ├── zero_shot/             # Baseline (no memory)
│   ├── streamICL/             # RAG-based retrieval (topk=4)
│   ├── awmPro/                # System memory via workflows (topk=8)
│   ├── mem0/                  # Personal memory via preferences
│   └── MEMs/                  # Multi-memory coordination (proposed)
│
├── execution/                  # Execution engines
│   ├── base.py                # Base execution engine
│   └── single_agent/          # Single-agent executor
│
├── src/                        # Core implementation
│   ├── runner/                # Main entry point
│   │   ├── main.py            # Experiment runner
│   │   ├── builders.py        # Component builders
│   │   ├── config.py          # Configuration parser
│   │   └── schedule_utils.py  # Scheduling utilities
│   ├── client/                # Client-side scheduling
│   │   ├── backend.py         # Backend interface
│   │   └── scheduler.py       # Task scheduler
│   ├── server/                # Backend task servers (Docker)
│   │   └── tasks/             # Task implementations
│   └── utils/                 # Analysis utilities
│       ├── message_schema.py  # Message format compatibility layer
│       └── analyze_results_*.py # Result analysis scripts
│
├── extra/                      # Docker orchestration
│   ├── docker-compose.yml     # Service definitions
│   └── *.Dockerfile           # Task-specific containers
│
├── outputs/                    # Experiment results
│   └── [timestamp]/           # Grouped by experiment time
│       └── [task_name]/       # Grouped by task
│           └── [index].json   # Individual sample results
│
├── requirements.txt            # Python dependencies
└── README.md                   # This file
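Given the memory/base.py and memory/registry.py layout above, a registry typically pairs an abstract base class with a decorator that maps string keys to implementations. The sketch below uses invented names (register, MEMORY_REGISTRY, BaseMemory) and is not the repo's actual code:

```python
MEMORY_REGISTRY = {}

def register(name):
    """Decorator registering a memory mechanism class under a string key."""
    def wrap(cls):
        MEMORY_REGISTRY[name] = cls
        return cls
    return wrap

class BaseMemory:
    """Interface every memory mechanism would implement."""
    def retrieve(self, query, topk=4):
        raise NotImplementedError
    def update(self, sample, feedback):
        raise NotImplementedError

@register("zero_shot")
class ZeroShotMemory(BaseMemory):
    """Baseline: nothing is stored, retrieval returns no examples."""
    def retrieve(self, query, topk=4):
        return []
    def update(self, sample, feedback):
        pass

# A runner can then instantiate mechanisms by config key:
mem = MEMORY_REGISTRY["zero_shot"]()
print(mem.retrieve("How do I list files?"))  # []
```

This pattern lets a YAML config name a mechanism (e.g. zero_shot, streamICL, awmPro, mem0, MEMs) and have the runner construct it without hard-coded imports.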
<h2 id="quick-start">🚀 Quick Start</h2>

1. Prerequisites

Python Environment

# Create conda environment with Python 3.9
conda create -n aMB python=3.9

# Activate environment
conda activate aMB

# Navigate to project directory
cd /path/to/AgentMemoryBench

# Install dependencies
pip install -r requirements.txt

Docker Installation

Docker is required to run backend task servers. Install Docker Desktop (or Docker Engine on Linux) before continuing.

2. Data & Model Setup

Knowledge Graph (Freebase) Database

The Knowledge Graph task requires the Freebase database:

  1. Download the database (~50 GB):

    • Download link: OneDrive
    • Recommended: Use a download manager (e.g., Free Download Manager) instead of a browser
  2. Extract the downloaded virtuoso_db.zip

  3. Configure path in extra/docker-compose.yml (line 114):

    freebase:
      build:
        context: ..
        dockerfile: extra/freebase.Dockerfile
      volumes:
        - "/absolute/path/to/virtuoso_db:/database"  # Use absolute path
      init: true
    

    Important:

    • Use absolute paths
    • Windows: Use forward slashes / (e.g., C:/Users/...)
    • Example: B:/desktop/AgentMemoryBench/virtuoso_db:/database

LoCoMo Tokenizer

Download the tokenizer model for fair evaluation:

# Download xlm-roberta-base from HuggingFace
# https://huggingface.co/FacebookAI/xlm-roberta-base

# Configure path in src/server/tasks/locomo/task.py (line 47)
tokenizer = AutoTokenizer.from_pretrained("/path/to/xlm-roberta-base")

Embedding Model (for streamICL, awmPro, MEMs)

Download the embedding model for fair comparison:

# Download bge-base-en-v1.5 from HuggingFace
# https://huggingface.co/BAAI/bge-base-en-v1.5

# Configure paths in YAML files:
# - memory/streamICL/streamICL.yaml
# - memory/awmPro/awmPro.yaml
# - memory/MEMs/MEMs.yaml

Mem0 API Key

To use the Mem0 method:

  1. Register for API key at mem0.ai
  2. Configure in memory/mem0/mem0.yaml:
    api_key: "your_mem0_api_key_here"
    wait_time: 60.0  # Recommended: 60s for system tasks, 150s for personal, 100s for mixed
    

3. Start Backend Services

# Navigate to Docker directory
cd extra

# Build required containers
docker pull mysql:8
docker-compose build local-os-default
docker-compose build local-os-packages
docker-compose build local-os-ubuntu
docker-compose build freebase

# Start all services
docker-compose up

Note: Keep this terminal running. Services run on http://localhost:5038

4. Configure LLM API

Recommended: Use SiliconFlow API to avoid model name mismatches.

Agent LLM Configuration

Edit configs/llmapi/api.yaml:

base_url: "https://ap
No findings