<div align="center"> <h1>From Debate to Equilibrium: Belief‑Driven Multi‑Agent LLM Reasoning via Bayesian Nash Equilibrium</h1> <h3>Efficient Coordination via Nash Equilibrium for Multi-Agent LLM Framework</h3>

<div align="center"> <figure> <br> <p><em>A multi-agent reinforcement learning framework that combines Large Language Models with coordinated decision-making for complex reasoning tasks</em></p> </figure> </div> </div>

Motivation

Existing multi-agent frameworks face significant limitations when applied to Large Language Models (LLMs). Traditional approaches struggle with the high-dimensional nature of language models and lack proper coordination mechanisms for complex reasoning tasks.

<div align="center"> <figure> <img src="assets/compare.jpg" alt="ECON vs Traditional MAD Comparison" width="800"> <br> <p><em>Comparison between ECON and traditional Multi-Agent Debate (MAD) approaches</em></p> </figure> </div>

Current multi-agent LLM systems suffer from:

Prohibitive Communication Costs: Traditional multi-agent debate relies on explicit message passing, incurring substantial token costs and computational overhead
No Convergence Guarantees: Current approaches lack theoretical assurances of converging to stable, effective solutions
Scalability Challenges: Information exchange often exceeds LLM context limits, severely impeding scalability in large agent ensembles

Our Solution: ECON Framework

<div align="center"> <figure> <img src="assets/framework.jpg" alt="ECON Framework Architecture" width="800"> <br> <p><em>ECON's two-stage coordination architecture with Bayesian Nash Equilibrium</em></p> </figure> </div>

To address these challenges, we introduce ECON, a multi-agent LLM framework that coordinates agents through Bayesian Nash Equilibrium (BNE), enabling scalable and theoretically grounded multi-agent reasoning.

Implicit Belief-Driven Coordination: Replaces costly message passing with belief-based coordination, dramatically reducing communication overhead
Guaranteed Convergence to Equilibrium: Establishes a rigorous Bayesian Nash Equilibrium (BNE) framework with theoretical convergence guarantees
Hierarchical & Scalable Architecture: Enables effective coordination in large ensembles via a local-to-global approach that respects LLM context limits

Minimal Usage

Installation

We provide two installation methods:

Package Installation (Recommended)

Install the ECON framework dependencies:

pip install -r requirements.txt

Development Installation

For development or customization, clone the repository and set up the environment:

# Clone the repository
git clone https://github.com/yourusername/ECON.git
cd ECON

# Create and activate conda environment  
conda create -n econ python=3.8
conda activate econ

# Install dependencies
pip install -r requirements.txt

Model Setup

Before running the framework, export your Together AI API key:

export TOGETHER_API_KEY="your_together_ai_api_key"

Usage

Quick Start with Command Line Interface

Set your API key once per shell session (skip this step if you already exported it under Model Setup):

export TOGETHER_API_KEY="your_together_ai_api_key"

One-line MATH sanity run (1 training episode, 5 test episodes)

python scripts/run_math_test.py \
  --train-eps 1 \
  --test-eps 5 \
  --log-dir logs_exp1 \
  --model-dir models_exp1

Default P0 training + BNE testing

python scripts/run_p0_test.py \
  --train-eps 100 \
  --test-eps 30 \
  --log-dir logs_exp1 \
  --model-dir models_exp1

Notes:

max_rounds defaults to 1 (a single decision per episode) in scripts/config_p0.yaml and scripts/config_math.yaml; increase it if you need multi-round episodes.
The environment's reward weights reuse the α weights learned during training; to override them, set reward.initial_weights in the config.
All schemes expect an agent_memory entry so that the runner and learner can consume short-term trajectories.
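
The overrides mentioned in the notes above can be pictured with a small Python sketch. It is only an illustration: the dictionary stands in for a parsed config file, its layout (where max_rounds and reward.initial_weights sit) is an assumption, the weight values are arbitrary placeholders, and apply_overrides is a hypothetical helper that is not part of ECON.

```python
import copy

# Hypothetical stand-in for a parsed config such as scripts/config_p0.yaml.
# The real file's layout and values may differ; the weights are placeholders.
base_config = {
    "max_rounds": 1,
    "reward": {"initial_weights": [0.4, 0.3, 0.3]},
}


def apply_overrides(config, overrides):
    """Return a copy of config with dotted-path keys set to new values."""
    result = copy.deepcopy(config)
    for dotted_key, value in overrides.items():
        *parents, leaf = dotted_key.split(".")
        node = result
        for key in parents:
            # Create intermediate sections when they do not exist yet.
            node = node.setdefault(key, {})
        node[leaf] = value
    return result


custom_config = apply_overrides(
    base_config,
    {"max_rounds": 3, "reward.initial_weights": [0.5, 0.25, 0.25]},
)
```

Because the helper works on a deep copy, base_config is left untouched, which makes it easy to keep the shipped defaults next to an experiment-specific variant.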

Configuration

Key Parameters

n_agents: Number of executor agents (e.g., 3, 5, 8)
coordinator_model: Coordinator LLM model name
executor_model: Executor LLM model name
update_interval: Number of steps between gradient updates (default: 10)
bne_max_iterations: Maximum BNE coordination iterations
belief_dim: Dimension of agent belief states
sampling.temperature_min/max: Lower and upper bounds for the sampling temperature
sampling.p_min/max: Bounds for repetition penalty (second action dimension)
sampling.top_p_default: Fixed top_p used for generation (default 0.9)
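
As a rough illustration of how these parameters relate to each other, the sketch below collects a few of them in a Python dataclass and checks basic consistency. EconSettings and validate are hypothetical names, not ECON's API. Only the defaults for update_interval (10) and top_p_default (0.9) come from the list above; every other default is a placeholder.

```python
from dataclasses import dataclass


@dataclass
class EconSettings:
    """Hypothetical container mirroring some parameter names listed above."""

    n_agents: int = 3
    update_interval: int = 10      # documented default
    bne_max_iterations: int = 3    # placeholder, not a documented default
    belief_dim: int = 64           # placeholder, not a documented default
    temperature_min: float = 0.1   # placeholder, not a documented default
    temperature_max: float = 1.0   # placeholder, not a documented default
    top_p_default: float = 0.9     # documented default

    def validate(self):
        """Return a list of human-readable problems (empty if none)."""
        problems = []
        if self.n_agents < 1:
            problems.append("n_agents must be at least 1")
        if self.update_interval < 1:
            problems.append("update_interval must be at least 1")
        if self.bne_max_iterations < 1:
            problems.append("bne_max_iterations must be at least 1")
        if self.belief_dim < 1:
            problems.append("belief_dim must be at least 1")
        if self.temperature_min > self.temperature_max:
            problems.append("temperature_min must not exceed temperature_max")
        if not 0.0 < self.top_p_default <= 1.0:
            problems.append("top_p_default must lie in (0, 1]")
        return problems


default_problems = EconSettings().validate()
bad_problems = EconSettings(n_agents=0, temperature_min=2.0).validate()
```

Running a check like this before launching a long training job is a cheap way to catch swapped bounds or a zero agent count.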

Supported Models

The framework supports open-source language models served through an API. Models can be hosted using:

Together AI: For remote model serving with API access
Local APIs: Compatible with OpenAI-style APIs

Example: Using Llama-3.3-70B-Instruct-Turbo

./run_econ.sh \
  --api-key YOUR_API_KEY \
  --config src/config/config.yaml \
  --agents 3 \
  --experiment-name llama-coordination-test

Custom Datasets

Create your own datasets following the Hugging Face format with question and answer fields:

env_args:
  hf_dataset_path: "your_custom_dataset"
  dataset_split: "train"
  question_field_name: "question"
  answer_field_name: "answer" 
  max_question_length: 1024
  max_answer_length: 512
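
Before pointing hf_dataset_path at a new dataset, it can help to check that every record carries the expected fields. The sketch below is a standalone illustration, not ECON's loader: prepare_records is a hypothetical helper, and it assumes the two length limits are counted in characters and that incomplete rows are simply dropped, neither of which is confirmed by the framework.

```python
def prepare_records(records, question_field="question", answer_field="answer",
                    max_question_length=1024, max_answer_length=512):
    """Keep records that have both fields and clip them to the length limits.

    Assumption: limits are measured in characters. The framework itself may
    count tokens or handle over-long rows differently.
    """
    prepared = []
    for record in records:
        question = record.get(question_field)
        answer = record.get(answer_field)
        if question is None or answer is None:
            continue  # skip rows that lack either field
        prepared.append({
            question_field: str(question)[:max_question_length],
            answer_field: str(answer)[:max_answer_length],
        })
    return prepared


rows = [
    {"question": "What is 2 + 3?", "answer": "5"},
    {"question": "This row has no answer field"},
]
clean_rows = prepare_records(rows)
```

The field names and limits default to the values shown in the env_args example, so the same call can be reused with different names if a dataset uses other column labels.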

Testing & Evaluation

Available Testing Methods

The framework provides three ways to test and validate a model:

1. Integrated Training + Testing (scripts/run_p0_test.py)

Train and test in one command (recommended for quick experiments):

# Quick test (5 train episodes, 3 test episodes)
python scripts/run_p0_test.py \
  --train-eps 5 \
  --test-eps 3 \
  --log-dir logs_quick \
  --model-dir models_quick

# Full training (100 train episodes, 30 test episodes)
python scripts/run_p0_test.py \
  --train-eps 100 \
  --test-eps 30 \
  --log-dir logs_exp1 \
  --model-dir models_exp1

Features:

Automatically runs training followed by BNE testing (3 rounds)
Saves model checkpoints to --model-dir
Logs test traces to --log-dir/llm_traces_test_bne_3rounds.json
Reports accuracy and P0 metadata (JSON parsing rate)

2. Testing Pre-trained Models (scripts/test_p0.py)

Test existing trained models without retraining:

export TOGETHER_API_KEY="your_api_key"
python scripts/test_p0.py

Configuration:

Edit the MODEL_DIR variable in test_p0.py to point to your trained model directory (e.g., ./models_exp1/final)
Runs both baseline (no BNE) and P0 BNE (3 rounds) tests
Outputs: logs_p0_test_baseline.json and logs_p0_test_p0.json

Example Output:

Baseline (no BNE):    10/10 = 100.0%
P0 BNE (3 rounds):    10/10 = 100.0%
P0 Metadata:         JSON=100%

Note: src/eval.py is not part of the current workflow; rely on the above test scripts for evaluation.

3. Dataset-Specific Testing

Test on MATH or SVAMP datasets:

# MATH dataset
python scripts/run_math_test.py \
  --train-eps 5 \
  --test-eps 10 \
  --log-dir logs_math \
  --model-dir models_math

# SVAMP dataset
python scripts/run_svamp_test.py \
  --train-eps 5 \
  --test-eps 10 \
  --log-dir logs_svamp \
  --model-dir models_svamp

Already-trained checkpoints can be evaluated directly with scripts/test_math.py and scripts/test_svamp.py (baseline vs BNE, 10 episodes each by default).

Episode Structure Explanation

Important: ECON episodes are structured differently from episodes in traditional RL environments.

Default Setup (Single-Decision Episodes):

Episode = One math problem
├─ t=0: Decision step (includes K internal BNE refinement rounds)
│   └─ BNE coordination: belief updates, response generation, convergence
└─ t=1: Terminal state (reward computation)

Key Points:

One episode = one math problem (not multiple attempts)
2 RL timesteps = 1 decision + 1 terminal (standard episodic RL)
BNE refinement (K rounds) happens internally at t=0
Multi-round debate occurs inside the decision step via belief coordination

Internal BNE Process (at t=0):

# Within single timestep t=0:
Round 0: e_init → LLM outputs → Commitment_0
Round 1: e_refined_1 → LLM outputs → Commitment_1
Round 2: e_refined_2 → LLM outputs → Commitment_2
Final: Submit Commitment_2 as answer
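
The control flow of a single-decision episode can be sketched in a few lines of Python. This is a toy model only: beliefs are plain floats, the "commitment" is their mean, and the refinement step pulls each belief halfway toward that mean. None of this is ECON's actual belief update or LLM call; run_episode and its arguments are hypothetical names chosen to show how K refinement rounds fit inside one decision step.

```python
def run_episode(initial_beliefs, k_rounds=3, step=0.5):
    """Toy single-decision episode with K internal refinement rounds at t=0."""
    beliefs = list(initial_beliefs)
    commitments = []
    spreads = []
    for _ in range(k_rounds):
        # Stand-in for "LLM outputs -> Commitment_r": use the mean belief.
        commitment = sum(beliefs) / len(beliefs)
        commitments.append(commitment)
        # Disagreement among agents before this round's refinement.
        spreads.append(max(beliefs) - min(beliefs))
        # Stand-in for belief refinement: move part way toward the commitment.
        beliefs = [b + step * (commitment - b) for b in beliefs]
    return {
        "answer": commitments[-1],   # the last commitment is submitted
        "commitments": commitments,
        "spreads": spreads,
        "rl_timesteps": 2,           # t=0 decision + t=1 terminal
    }


result = run_episode([1.0, 2.0, 3.0])
```

All refinement rounds run inside the single call, so from the RL side the episode still consists of one decision step followed by the terminal step, while the shrinking spread mimics agents converging internally.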