<div align="center"> <h1 style="font-size: 3em; font-weight: bold; margin: 0; border: none; padding: 0;">MEMO</h1>

Memory-Augmented Model Context Optimization for Robust Multi-Turn Multi-Agent LLM Games

Yunfei Xie<sup>*,1</sup>, Kevin Wang<sup>*,+,2</sup>, Bobby Cheng<sup>*,4</sup>, Jianzhu Yao<sup>3</sup>, Zhizhou Sha<sup>2</sup>, Alexander Duffy<sup>5</sup>, Yihan Xi<sup>2</sup>, Hongyuan Mei<sup>6</sup>, Cheston Tan<sup>4</sup>, Chen Wei<sup>†,1</sup>, Pramod Viswanath<sup>†,3</sup>, Zhangyang Wang<sup>†,2</sup>

<sup>1</sup>Rice University, <sup>2</sup>The University of Texas at Austin, <sup>3</sup>Princeton University, <sup>4</sup>A*STAR, <sup>5</sup>Good Start Labs, <sup>6</sup>TTIC

<sup>*</sup>Equal Contribution   <sup>+</sup>Project Leader   <sup>†</sup>Equal Advising

Website

</div>

Introduction

MEMO is a self-play framework that treats inference-time context as an optimizable, agentic object by coupling retention and exploration. Retention maintains a persistent memory bank that distills self-play trajectories into structured insights, consolidates them via CRUD-style updates, and injects them as priors during subsequent play. Exploration performs tournament-style prompt evolution with uncertainty-aware selection via TrueSkill, and uses prioritized replay to revisit vital states for sample-efficient coverage.
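The retention loop described above can be pictured with a small sketch. The class below is not MEMO's implementation; it is a minimal illustration, with assumed names (`MemoryBank`, `Insight`) and an assumed support-count heuristic, of how trajectory insights might be created, reinforced, pruned, and injected as a context prior:

```python
from dataclasses import dataclass, field

@dataclass
class Insight:
    """A distilled lesson from self-play trajectories (illustrative)."""
    text: str
    support: int = 1  # number of trajectories backing this insight

@dataclass
class MemoryBank:
    """Toy persistent store with CRUD-style consolidation."""
    insights: dict[str, Insight] = field(default_factory=dict)

    def create_or_update(self, key: str, text: str) -> None:
        # Update reinforces an existing insight; create adds a new one.
        if key in self.insights:
            self.insights[key].support += 1
            self.insights[key].text = text
        else:
            self.insights[key] = Insight(text)

    def prune(self, min_support: int = 2) -> None:
        # Delete weakly supported insights during consolidation.
        self.insights = {k: v for k, v in self.insights.items()
                         if v.support >= min_support}

    def as_prior(self) -> str:
        # Inject retained insights as a context prior for the next game.
        return "\n".join(f"- {v.text}" for v in self.insights.values())

bank = MemoryBank()
bank.create_or_update("opening", "Open with a balanced resource offer.")
bank.create_or_update("opening", "Open with a slightly generous offer.")
bank.create_or_update("endgame", "Hold trumps for the final tricks.")
bank.prune(min_support=2)
print(bank.as_prior())  # only the reinforced "opening" insight survives
```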

Across five text-based games, MEMO raises the mean win rate from 24.9% to 49.5% for GPT-4o-mini and from 21.7% to 44.3% for Qwen-2.5-7B-Instruct, within a budget of only 2,000 self-play games per task.

<p align="center"> <img src="assets/comparison_visual.png" width="100%"> </p> <p align="left"><b>Left:</b> Standard LLM agents discard experience after each game. <b>Middle:</b> RL approaches learn from reward signals but require many samples. <b>Right:</b> MEMO recognizes patterns from trajectories, extracts insights, and accumulates them in a memory bank across generations.</p>

The system combines:

  • Tournament-based evaluation using TrueSkill / win-rate ratings
  • Trajectory memory for learning from successful game patterns
  • Replay buffer for prioritized experience replay of strategic game states
  • Multiple evolution strategies (keep-best, random exploration, memory-guided, trajectory-based, crossover)
  • Integration with TextArena for diverse multi-agent text-based games

Key Features

  • Self-Play Prompt Evolution — Prompts compete against each other and evolve based on performance
  • Trajectory Memory System — Extracts strategic insights from games to guide prompt improvements
  • Replay Buffer — Prioritized sampling of past game states to diversify tournament play
  • Multiple Evolution Strategies — Keep top performers, random exploration, memory-guided generation, trajectory-based, crossover
  • Flexible Evaluation — Evaluate evolved prompts against configurable lists of eval models
  • W&B Integration — Track performance, token usage, and diversity metrics across generations
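To make the replay-buffer feature concrete, here is a toy prioritized buffer. It is not the project's implementation; the class name, capacity policy, and priority heuristic are all assumptions used purely for illustration.

```python
import random

class ReplayBuffer:
    """Toy prioritized replay over saved game states (illustrative sketch)."""

    def __init__(self, capacity: int = 1000) -> None:
        self.capacity = capacity
        self.states: list[dict] = []
        self.priorities: list[float] = []

    def add(self, state: dict, priority: float) -> None:
        # Priority might encode how pivotal the state was (assumed heuristic).
        if len(self.states) >= self.capacity:
            self.states.pop(0)
            self.priorities.pop(0)
        self.states.append(state)
        self.priorities.append(priority)

    def sample(self, k: int) -> list[dict]:
        # Higher-priority states are revisited more often.
        return random.choices(self.states, weights=self.priorities, k=k)

buf = ReplayBuffer()
buf.add({"turn": 3, "board": "X..O..X.."}, priority=0.9)  # pivotal midgame
buf.add({"turn": 1, "board": "........."}, priority=0.1)  # routine opening
batch = buf.sample(k=4)  # mostly revisits the high-priority state
```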

Supported Games

MEMO works with any TextArena environment. The games we commonly evaluate on:

| Game | Env ID | Players | Type | Description |
|------|--------|---------|------|-------------|
| SimpleNegotiation | SimpleNegotiation-v0-short | 2 | Negotiation | Trading game where players negotiate resource exchanges to maximize inventory value, with asymmetric valuations of five resources (Wheat, Wood, Sheep, Brick, Ore). |
| SimpleTak | SimpleTak-v0 | 2 | Strategy | Connection game where players place stones to form a continuous path connecting opposite edges of the board. |
| KuhnPoker | KuhnPoker-v0-short | 2 | Imperfect Info | Simplified poker variant using a 3-card deck (J, Q, K) with a single betting round. |
| TicTacToe | TicTacToe-v0 | 2 | Strategy | Classic grid game: take turns marking cells to form three in a row. |
| Briscola | Briscola-v0 | 2–4 | Trick-Taking | Italian card game using a 40-card deck with trump suits; collect tricks to score points. |

Several games support length variants (e.g., -short, -medium, -long, -extreme) that control the number of rounds or turns per game.

Installation

Prerequisites

  • Python >= 3.10
  • API keys for LLM providers (OpenRouter, OpenAI, etc.) in a .env file

Clone with Submodules

git clone <repo-url>
cd <repo>
git submodule update --init  # Do NOT use --recursive

Warning: Do not use --recurse-submodules or --recursive. The TextArena submodule contains a nested self-referential submodule that should not be initialized.

Install Dependencies

pip install -e .
# or
pip install -r requirements.txt

This installs TextArena from the mpr branch submodule along with all other dependencies (wandb, trueskill, torch, etc.).

Environment Variables

Create a .env file in mpr/ with your API keys, or export them as environment variables:

OPENROUTER_API_KEY=sk-or-...
WANDB_API_KEY=...

Quick Start

Run a prompt evolution experiment using a shell script:

bash mpr/scripts/tests_latest/simplenegotiation.sh

Or configure and run directly:

python -m mpr.self_play_prompt_evolution_memory \
    --model "gpt-4o-mini" \
    --baseline-model "gpt-4o-mini" \
    --base-prompt "You are playing a two-player game. Make valid moves to win. Submit the move enclosed by \boxed{}." \
    --eval-model-list google/gemini-2.5-flash-lite \
    --env "SimpleNegotiation-v0-short" \
    --generations 5 \
    --tournament-rounds 25 \
    --eval-rounds 25 \
    --population-size 8 \
    --keep-ratio 0.25 \
    --random-ratio 0.75 \
    --fitness-method winrate \
    --use-replay true \
    --skip-baseline-eval

Key Configuration Options

| Parameter | Description | Default |
|-----------|-------------|---------|
| --population-size | Number of prompt candidates per generation | 10 |
| --generations | Number of evolution generations | 10 |
| --tournament-rounds | Games per self-play tournament phase | 5 |
| --eval-rounds | Games per evaluation phase | 3 |
| --keep-ratio | Fraction of top prompts kept as elites | 0.3 |
| --random-ratio | Fraction of new random prompts | 0.2 |
| --memory-guided-ratio | Fraction using memory insights | 0.0 |
| --trajectory-ratio | Fraction from trajectory-based learning | 0.3 |
| --crossover-ratio | Fraction from crossover | 0.2 |
| --fitness-method | "winrate" or "trueskill" | "trueskill" |
| --temperature | Sampling temperature (0.0 = deterministic) | 1.0 |
| --use-replay | Enable replay buffer for diverse play | false |
| --skip-baseline-eval | Skip generation-0 baseline evaluation | false |
| --eval-model-list | Models to evaluate best candidate against | — |

Note: Evolution ratios must sum to 1.0.
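As a sanity check, the ratio constraint can be expressed in a few lines. The helper below is not part of the codebase; it is an illustrative sketch of how the five ratios might translate into per-strategy candidate counts, with an assumed rule for distributing the rounding remainder.

```python
import math

def allocate_slots(population_size: int, ratios: dict[str, float]) -> dict[str, int]:
    """Map per-strategy ratios to candidate counts (illustrative helper)."""
    total = sum(ratios.values())
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"Evolution ratios must sum to 1.0, got {total}")
    slots = {name: round(population_size * r) for name, r in ratios.items()}
    # Give any rounding remainder to the first strategy (a design assumption).
    remainder = population_size - sum(slots.values())
    first = next(iter(slots))
    slots[first] += remainder
    return slots

# Using the defaults above: keep 0.3, random 0.2, memory-guided 0.0,
# trajectory 0.3, crossover 0.2.
print(allocate_slots(10, {"keep": 0.3, "random": 0.2, "memory": 0.0,
                          "trajectory": 0.3, "crossover": 0.2}))
# -> {'keep': 3, 'random': 2, 'memory': 0, 'trajectory': 3, 'crossover': 2}
```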

Evolution Strategies

  • Keep — Retain top-performing prompts unchanged
  • Random — Generate novel prompts (optionally guided by memory insights)
  • Memory-Guided — Use extracted game insights to improve prompts
  • Trajectory-Based — Learn from game trajectories and reflections
  • Crossover — Combine elements from high-performing prompts

Architecture

┌─────────────────────────────────────────────────────────────┐
│                    SelfPlayPromptEvolution                   │
│                    (Main Orchestrator)                       │
└─────────────────────────────────────────────────────────────┘
                              │
           ┌──────────────────┼──────────────────┐
           ▼                  ▼                  ▼
    ┌─────────────┐    ┌─────────────┐    ┌─────────────────┐
    │  Tournament │    │  Prompt     │    │  Trajectory     │
    │  System     │    │  Evolution  │    │  Memory         │
    │             │    │  Engine     │    │  System         │
    └─────────────┘    └─────────────┘    └─────────────────┘
           │                  │                  │
           ▼                  ▼                  ▼
    ┌─────────────┐    ┌─────────────┐    ┌─────────────────┐
    │  AgentPool  │    │  Prompt     │    │  Game Insights  │
    │  + Ratings  │    │  Candidates │    │  + Replay Buffer│
    └─────────────┘    └─────────────┘    └─────────────────┘
           │
           ▼
    ┌─────────────┐
    │  GameRunner │──────▶ TextArena Environments
    └─────────────┘

Project Structure

├── mpr/                                    # MEMO system
│   ├── self_play_prompt_evolution_memory.py # Main orchestrator & CLI entry point
│   ├── cores/                              # Core components
│   │   ├── game_runner.py                  # Async game execution
│   │   ├── templates.py                    # Prompt templates for agents
│   │   ├── tournament_scheduler.py         # Game scheduling (round-robin, vs_best, etc.)
│   │   └── evaluation.py                   # Evaluation utilities
│   ├── tournament/                         # Tournament system
│   │   ├── tournament.py                   # Tournament orchestration & eval pipeline
│   │   └── agent_pool.py                   # Agent management & TrueSkill ratings
│   ├── evaluation/                         # Tournament runners
│   │   └── simple_memory_eval.py           # Main tournament execution (run_tournament)
│   ├── memory/                       