RealMemBench


<div align="center"> <h1 align="center" style="color: #2196F3; font-size: 24px; font-weight: 600; margin: 20px 0; line-height: 1.4;"> 🧠 RealMem: <span style="color: #555; font-weight: 400; font-size: 18px;"><em>Benchmarking LLMs in Real-World Memory-Driven Interaction</em></span> </h1> <p style="margin: 20px 0;"> <a href="https://arxiv.org/abs/2601.06966"><img src="https://img.shields.io/badge/arXiv-2601.06966-B31B1B.svg?style=flat-square&logo=arxiv&logoColor=white" /></a> <a href="#"><img src="https://img.shields.io/badge/Python-3.8+-blue.svg?style=flat-square&logo=python&logoColor=white" /></a> <a href="LICENSE"><img src="https://img.shields.io/badge/License-MIT-green.svg?style=flat-square" /></a> <a href="https://quantaalpha.github.io/"><img src="https://img.shields.io/badge/Team-QuantaAlpha-00A98F.svg?style=flat-square&logo=opensourceinitiative&logoColor=white" /></a> </p> </div>

📰 News

2026.01.11 🎉 We released the RealMem paper on arXiv.
2026.01.11 🎉 We open-sourced RealMem — a robust multi-agent framework designed to simulate realistic user-assistant interactions with sophisticated memory management.

🧭 Motivation and Goal

A central goal of dialogue systems is to stay consistent and remember relevant context across many sessions, as a human conversation partner would.

<p align="center"> <img src="figs/image2.png" width="800" /><br> <em>RealMem Illustration.</em> </p>

⚠️ The Challenge: Existing benchmarks and generation frameworks often fail to capture the complexity of long-term memory. Single-session dialogues lack the continuity required to evaluate an agent's ability to recall past preferences, events, and context over time.

👋 Our Solution: To address this gap, we introduce RealMem. Our framework employs a Multi-Agent architecture where specialized agents (User, Assistant, Evaluator, Memory Manager) collaborate to generate coherent, multi-session dialogues. By strictly controlling the "User's" temporal perception and the "Assistant's" memory retrieval, RealMem produces high-quality datasets for training and evaluating long-context LLMs.

<p align="center"> <img src="figs/image.png" width="800" /><br> <em>Overview of RealMem Framework.</em> </p>

RealMem operates as a modular pipeline that transforms high-level project outlines into granular, multi-turn dialogues. It consists of four core components: a User Agent that simulates user behavior under strict temporal constraints, an Assistant Agent that provides professional responses, a Goal Evaluator that assesses task completion in real time, and a Memory Manager that extracts and deduplicates structured memory points.

✨ Key Features

🤖 Multi-Agent Architecture: Collaborative agents simulate authentic interactions.
🧠 Intelligent Memory Management: Automated extraction, storage, and deduplication of memory points.
🎯 Long-Term Task-Oriented: Dialogues are driven by explicit goals with automatic success evaluation.
⏰ Temporal Logic Control: Strict enforcement of time constraints to prevent information leakage from future events.
🔄 Context Continuity: Maintains logical consistency across multiple sessions via memory retrieval.
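
The temporal constraint above can be pictured with a small sketch. This is an illustration only, not RealMem's implementation: the event dictionaries, their `time` and `text` fields, and the `visible_events` helper are all hypothetical names chosen for the example.

```python
from datetime import datetime


def visible_events(events, current_time):
    """Return only events dated at or before the session's simulated time."""
    return [e for e in events if e["time"] <= current_time]


# Hypothetical event timeline (made-up data).
events = [
    {"time": datetime(2025, 1, 5), "text": "Booked flights to Kyoto"},
    {"time": datetime(2025, 2, 1), "text": "Changed hotel reservation"},
]

# A session simulated on 2025-01-10 must not see the February event.
now = datetime(2025, 1, 10)
for event in visible_events(events, now):
    print(event["text"])
```

Filtering on the simulated clock before anything reaches the user agent is one simple way to keep future events out of an earlier session.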

📜 Dataset Format

The dataset consists of multiple JSON files located in dataset/datasets/, each corresponding to a distinct user persona (e.g., Adeleke_Okonjo_dialogues_256k.json). These files contain the full multi-session interaction history.

Within each file, the structure is organized as follows:

- _metadata: Contains global information including person_name, total_sessions, and total_tokens.
- dialogues: A list of dialogue sessions. Each session object contains the following fields:
  - session_identifier: The unique identifier for the session (e.g., Knowledge_Learning_1:S1_01).
  - session_uuid: The UUID for the session.
  - current_time: The simulated date and time of the session.
  - extracted_memory: A list of structured memory points extracted from the session. Each item contains:
    - index: Memory index (e.g., Travel_Planning_2-DM-S1_01-01).
    - type: Memory type (e.g., Dynamic).
    - content: The textual content of the memory.
    - source_turn: The turn index where this memory was extracted.
    - source_content_snapshot: A snapshot of the source content.
    - source_role_snapshot: The role of the speaker in the source snapshot.
    - session_uuid: The UUID of the session where the memory was created.
  - dialogue_turns: A list of dialogue turns. Each turn is a dictionary with the following fields:
    - speaker: The role of the speaker (User or Assistant).
    - content: The text content of the message.
    - is_query: true if the turn represents a memory retrieval query, false otherwise.
    - query_id: The unique ID for the query (if is_query is true).
    - memory_used: The memory points retrieved and used by the assistant to generate this specific response (a list of objects containing session_uuid and content).
    - memory_session_uuids: A list of session UUIDs corresponding to the memories used.
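
As a minimal sketch of reading this layout, the snippet below walks a file's sessions and yields every turn flagged as a query. The helper names `load_dataset` and `iter_queries` are hypothetical, and the sample values (person name, UUID, time string, query ID) are made up; only the field names come from the format described above.

```python
import json
from pathlib import Path


def load_dataset(path):
    """Load one persona file from disk."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def iter_queries(dataset):
    """Yield (session_identifier, turn) for every turn flagged as a memory query."""
    for session in dataset.get("dialogues", []):
        for turn in session.get("dialogue_turns", []):
            if turn.get("is_query", False):
                yield session.get("session_identifier"), turn


# Tiny in-memory example following the documented layout (all values are made up).
sample = {
    "_metadata": {"person_name": "Example Person", "total_sessions": 1, "total_tokens": 0},
    "dialogues": [{
        "session_identifier": "Knowledge_Learning_1:S1_01",
        "session_uuid": "uuid-1",
        "current_time": "2025-01-10 09:00",
        "extracted_memory": [],
        "dialogue_turns": [
            {"speaker": "User", "content": "Hi", "is_query": False},
            {"speaker": "User", "content": "What did we plan last week?",
             "is_query": True, "query_id": "q1",
             "memory_used": [], "memory_session_uuids": []},
        ],
    }],
}

for session_id, turn in iter_queries(sample):
    print(session_id, turn["query_id"], turn["content"])
```

For a real file, pass the result of `load_dataset(...)` to `iter_queries` in place of `sample`.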

🚀 How to Run

0. Directory Structure

```
RealMem/
├── dataset/                     # 💾 Generated Dialogues (e.g., Lin_Wanyu_dialogues_256k.json)
│   └── all_persona_topic/       # Persona & Topic Definitions
├── pipeline/                    # 🔄 Core Processing Pipeline
│   ├── base_processor.py           # Base Interface
│   ├── project_outline_processor.py # Project Blueprint Generation
│   ├── event_processor.py          # Event Sequence Generation
│   ├── summary_processor.py        # Session Summary Generation
│   └── multi_agent_dialogue_processor.py # Multi-Agent Core
├── utils/                       # 🛠 Utility Toolkit
│   ├── llm_client.py               # LLM Client (w/ Retry)
│   ├── error_handler.py            # Error Handling & JSON Parsing
│   ├── data_validator.py           # Data Validation
│   ├── dialogue_validator.py       # Dialogue Logic Verification
│   └── dialogue_postprocessor.py   # Post-processing & Cleaning
├── eval/                        # 📈 Evaluation Metrics
│   ├── run_generation.py          # Evaluation Generation Runner
│   ├── compute_auto_metrics_for_realmem.py # Automated Metrics
│   └── compute_llm_metrics_for_realmem.py  # LLM-based Metrics
├── prompts/                     # 📝 Prompt Templates
│   ├── project_outline.txt         # Project Outline Prompt
│   ├── event.txt                   # Event Generation Prompt
│   ├── summary.txt                 # Session Summary Prompt
│   └── refine.txt                  # Dialogue Refinement Prompt
├── figs/                        # 🖼️ Figures & Assets
├── main.py                      # 🚀 Main Entry Point
└── requirements.txt             # 📦 Dependencies
```

1. Set Up ⚙️

First, clone the repository and create a suitable Python environment (3.8+), then install the dependencies from the repository root:

```
# Install dependencies
pip install -r requirements.txt
```

Then, configure your environment variables:

```
# Copy example configuration
cp .env.example .env

# ⚠️ Edit .env to add your API Keys (e.g., OpenAI API Key)
```

Ensure the following base data files exist (for Persona and Topic generation):

dataset/all_persona_topic/person&goal.json
dataset/all_persona_topic/persona_all.json
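
One way to confirm these files are in place before starting a long run is a small preflight check. The sketch below is not part of the repository; `REQUIRED_FILES` and `missing_files` are hypothetical names, and it assumes the script is run from the repository root.

```python
from pathlib import Path

# The two base data files listed above, relative to the repository root.
REQUIRED_FILES = [
    "dataset/all_persona_topic/person&goal.json",
    "dataset/all_persona_topic/persona_all.json",
]


def missing_files(root="."):
    """Return the required files that are absent under `root`."""
    return [p for p in REQUIRED_FILES if not (Path(root) / p).is_file()]


missing = missing_files()
if missing:
    print("Missing base data files:")
    for path in missing:
        print("  " + path)
else:
    print("All base data files found.")
```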

2. Quick Start 💡

Standard Generation (Recommended)

To start the full pipeline generation using the main Python entry point:
```
python main.py --names "Lin Wanyu" --smart-recovery
```
🔧 Options:
- --names <names>: (Recommended) Specify the target persona name. See dataset/all_persona_topic/persona_all.json for available names (e.g., "Ethan Hunt", "Sarah Miller", "Kenta Tanaka"). Default: Process All.
- --projects <num>: Number of projects (dialogue topics) to generate per person. Default: 3.
- --max-turns <num>: Maximum number of turns per dialogue session. Default: 24.
- --output <dir>: Output directory path. Default: output.
- --smart-recovery: Enable smart interrupt recovery (resume from previous state). Default: False.
- --log: Enable verbose logging for debugging. Default: False.
🤖 Model Configuration:
- --blueprint-model <model>: Model for generating project outlines.
- --event-model <model>: Model for generating event sequences.
- --summary-model <model>: Model for generating session summaries.
- --dialogue-model <model>: Model for generating the actual dialogue.
- --memory-model <model>: Model for memory extraction.
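
When scripting several runs, the options above can be assembled programmatically. The following is a sketch under stated assumptions: `build_command` is a hypothetical helper (not part of RealMem), the flag names and defaults are taken from the option list above, and it assumes `--names` takes a single persona name as in the Quick Start example. The model flags are left out so the pipeline falls back to its own defaults.

```python
import shlex
import sys


def build_command(names, projects=3, max_turns=24, output="output",
                  smart_recovery=False, log=False):
    """Assemble the argv list for main.py from the documented options."""
    cmd = [sys.executable, "main.py", "--names", names,
           "--projects", str(projects), "--max-turns", str(max_turns),
           "--output", output]
    if smart_recovery:
        cmd.append("--smart-recovery")
    if log:
        cmd.append("--log")
    return cmd


cmd = build_command("Lin Wanyu", projects=2, smart_recovery=True)
# Show the arguments as a copy-pasteable shell string (interpreter path omitted).
print(shlex.join(cmd[1:]))
# To launch the run from the repository root:
#   import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string keeps names that contain spaces, such as "Lin Wanyu", intact without manual quoting.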

📊 Evaluation

RealMem provides a comprehensive evaluation suite in the eval/ directory.

0. Evaluation Pipeline Logic

The evaluation pipeline follows a strict temporal sequence, processing dialogues session by session. We iterate through the sessions to update the memory state. When a query is detected within a session, we trigger retrieval and generation based on the history accumulated from previous sessions:

```
# Simplified excerpt; `self` refers to the object running the evaluation.
for session in dialogue_sessions:
    # 1. Evaluate queries in this session (memory holds earlier sessions only)
    for turn in session['dialogue_turns']:
        if turn.get('is_query', False):
            question = turn.get('content', '')

            # Generate keywords and retrieve context (from all previous sessions)
            keywords = self.generate_query_llm(question)
            memories = self.retrieve_memories(question, keywords, k=10)

            # Generate the answer from the question and the retrieved memories
            generated_answer = self.generate_answer(question, memories)

    # 2. Add this session's content to memory (visible to later sessions only)
    self.memory_system.add_session_content(session)
```
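
To make the ordering concrete, here is a self-contained toy version of the loop above. It is a sketch, not the evaluation code in eval/: `ToyMemory` and `run_eval` are hypothetical names, retrieval is plain word overlap instead of LLM-generated keywords, and answer generation is omitted. It only demonstrates that a query sees memory from earlier sessions and never from its own or later ones.

```python
class ToyMemory:
    """Stand-in memory store: keeps raw turn texts and retrieves by word overlap."""

    def __init__(self):
        self.entries = []

    def add_session_content(self, session):
        for turn in session["dialogue_turns"]:
            self.entries.append(turn["content"])

    def retrieve(self, question, k=10):
        words = set(question.lower().split())
        scored = [(len(words & set(e.lower().split())), e) for e in self.entries]
        scored = [pair for pair in scored if pair[0] > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [entry for _, entry in scored[:k]]


def run_eval(sessions, memory):
    """Pair each query with memories retrieved from earlier sessions only."""
    results = []
    for session in sessions:
        for turn in session["dialogue_turns"]:
            if turn.get("is_query", False):
                results.append((turn["content"], memory.retrieve(turn["content"])))
        # Update memory after the queries, so it is visible to later sessions only.
        memory.add_session_content(session)
    return results


# Made-up two-session history: a preference stated first, then queried later.
sessions = [
    {"dialogue_turns": [
        {"content": "I prefer window seats on flights", "is_query": False}]},
    {"dialogue_turns": [
        {"content": "Which seats do I prefer on flights", "is_query": True}]},
]
results = run_eval(sessions, ToyMemory())
for question, memories in results:
    print(question, "->", memories)
```

Updating memory only after a session's queries have been answered is what keeps the evaluation from leaking the current session into its own retrieval results.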

1. Response Generation

Generate responses using the retrieved memory context to simulate the model's ability to utilize long-term memory.
