MARTI
A Framework for LLM-based Multi-Agent Reinforced Training and Inference
</div> <h5 align="center"> If you like our project, please give us a star ⭐ on GitHub for the latest update.</h5> <div align="center"> <img src="https://readme-typing-svg.herokuapp.com?font=Orbitron&size=20&duration=3000&pause=1000&color=00D9FF&center=true&vCenter=true&width=800&lines=Welcome+to+MARTI;Multi-Agent+RL+Framework;Now+with+Multi-Agent+Tree+Search+Support;Powered+by+Tsinghua+x+Shanghai+AI+Lab" alt="Typing Animation" /> </div>

MARTI is an open-source framework for training LLM-based Multi-Agent Systems (MAS) with Reinforcement Learning (RL). It enables powerful, scalable, and adaptive workflows by combining centralized multi-agent interactions with distributed policy training. MARTI supports both built-in graph-based workflows and popular third-party multi-agent frameworks.
MARTI-v2 extends the framework with tree search-augmented RL for complex reasoning tasks like code generation. By integrating multi-agent tree search, MARTI-v2 enables efficient multi-turn exploration with adaptive node expansion and refinement, allowing agents to systematically explore solution spaces and discover high-quality reasoning trajectories. The framework also incorporates advanced RL training techniques (GSPO loss for sequence-level optimization, TIS correction for vLLM sampling mismatch, dynamic data filtering, overlong buffer for token penalty) to support ultra-long sequences up to 32K tokens and heterogeneous multi-agent training.
We hope that MARTI not only advances reasoning capabilities beyond those of individual large language models or reasoning models, but also fosters collective intelligence as a step toward general artificial intelligence.
📣 Latest News
- [2026-02-10] 🚀🚀🚀 We release MARTI-v2 with scaling multi-agent tree search via reinforcement learning for code generation (MARS<sup>2</sup>). See 🌳 MARS² - Multi-Agent Tree Search RL (New!) and the Technical Report.
- [2026-01-25] MARTI was accepted at ICLR 2026, congrats to the team.
- [2025-10-10] We’re thrilled to see both ReviewRL (EMNLP 2025) and CoMAS built on MARTI!
- [2025-08-05] We have introduced new support for Async Tool Use in Agentic RL, and Async Workflow for Multi-Agent RL. This enables more flexible and efficient RL pipelines, supporting both single-agent and multi-agent scenarios. See 🤝 Customised Async Step and Workflow.
- [2025-05-27] We release the codebase of the MARTI framework. Welcome to try LLM-based multi-agent reinforcement learning! 🤗
Table of Contents
- 💡 Overview
- 🚀 Quick Start
- 📊 Experimental Results
- 📚 Documentation
- 👏 Acknowledge
- 🤝 Core Contributors
- 📬 Contact
- 🔬 Citation
- ⭐️ Star History
💡 Overview
MARTI-v2: Tree Search-Augmented Multi-Agent RL (🔥New!)
MARTI-v2 extends the framework with tree search-augmented reinforcement learning for complex reasoning tasks like code generation. By integrating multi-agent tree search with advanced RL techniques, MARTI-v2 enables efficient multi-step exploration with adaptive node expansion and refinement, allowing agents to systematically explore solution spaces and discover high-quality reasoning trajectories.
The framework has been adapted to the latest OpenRLHF infrastructure, incorporating state-of-the-art RL training techniques for heterogeneous multi-agent training.
<p align="center"> <img src="./assert/mars2_framework.png" width="800"> </p> <p align="center"><i>Figure 1: Overview of Core Components of MARTI-v2</i></p>

Key Features:
- Multi-Agent Tree Search: Efficient tree exploration with asynchronous multi-agent tree search, supporting code generation tasks with adaptive node expansion and refinement
- GSPO Loss: Sequence-level policy optimization (vs. token-level in PPO) better suited for complex reasoning tasks
- TIS Correction: Truncated Importance Sampling addresses distribution shift in long sequence generation, enabling stable training for ultra-long contexts and correcting vLLM sampling bias during rollout
- Heterogeneous Multi-Agent Training: Train different models simultaneously (e.g., Qwen3-8B + AreaL-boba-2-8B) with independent roles, training strategies, and dynamic sample filtering per agent
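To make the GSPO point concrete, here is a minimal sketch (not MARTI's actual implementation; tensor names and the `clip_eps` value are illustrative) of a sequence-level clipped surrogate loss. The key difference from PPO is that the importance ratio is computed once per sequence, as the geometric mean of per-token ratios, rather than per token:

```python
import torch

def gspo_loss(logps, old_logps, advantages, mask, clip_eps=3e-4):
    """Sketch of a GSPO-style sequence-level policy loss.

    logps, old_logps: (batch, seq_len) per-token log-probs under the
    current and rollout policies; mask: (batch, seq_len), 1 on response
    tokens; advantages: (batch,) group-normalized sequence advantages.
    All values here are illustrative, not MARTI defaults.
    """
    lengths = mask.sum(dim=-1).clamp(min=1)
    # Sequence-level ratio: exp of the mean token log-ratio, i.e. the
    # geometric mean of per-token ratios (the core GSPO change vs. PPO).
    log_ratio = ((logps - old_logps) * mask).sum(dim=-1) / lengths
    ratio = log_ratio.exp()
    clipped = ratio.clamp(1 - clip_eps, 1 + clip_eps)
    # Clipped surrogate applied once per sequence, averaged over the batch.
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```

Because clipping acts on a length-normalized sequence ratio, a single outlier token cannot dominate the update, which matters for the ultra-long (up to 32K-token) sequences MARTI-v2 targets.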
MARTI
We designed the MARTI framework following the principle of centralized multi-agent interaction with distributed policy training, where all agent interactions and reward allocation occur centrally while policy training is distributed across individual agents. As illustrated in Figure 2, MARTI comprises three core modules: Multi-Agent World, Centralized Rewarding, and Single Agent Trainer.
<p align="center"> <img src="./assert/framework.jpg" width="800"> </p> <p align="center"><i>Figure 2: Overview of Core Components of MARTI</i></p>

Key Features:
- Multi-Agent Inference + RL Training in a unified framework
- Graph-based workflows (debate, chain-of-agents, mixture-of-agents)
- Support for heterogeneous models within the same agent graph
- Built-in credit assignment and reward shaping strategies
- Support for diverse RL algorithms (PPO, GRPO, REINFORCE++, TTRL)
- Third-party integration with AutoGen and CAMEL (experimental)
- Advanced performance on reasoning benchmarks (e.g., AIME)
Additionally, building on single-agent RL frameworks like OpenRLHF and verl, MARTI supports the vLLM v1 Engine and a Hybrid Engine to enable fast and efficient training.
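The "centralized interaction, distributed training" split can be sketched as follows. This is a toy illustration, not the MARTI API: `run_debate`, the agent callables, and the `judge` are all hypothetical stand-ins for LLM calls and a reward model. A central loop runs a multi-round debate, assigns rewards centrally, and then hands each agent only its own trajectories for training:

```python
def run_debate(question, agents, judge, rounds=2):
    """Central loop for a toy multi-agent debate (illustrative only)."""
    transcripts = {name: [] for name in agents}
    answers = {name: "" for name in agents}
    for _ in range(rounds):
        for name, agent in agents.items():
            # Each agent sees the question plus its peers' latest answers.
            peer_view = {p: a for p, a in answers.items() if p != name}
            prompt = f"{question}\nPeer answers: {peer_view}"
            answers[name] = agent(prompt)
            transcripts[name].append((prompt, answers[name]))
    # Centralized rewarding: the judge scores final answers, and here the
    # reward is simply credited to every turn that agent produced.
    rewards = {name: judge(question, ans) for name, ans in answers.items()}
    # Per-agent (prompt, response, reward) tuples, ready for each agent's
    # own (distributed) policy trainer.
    return {
        name: [(p, r, rewards[name]) for p, r in transcripts[name]]
        for name in transcripts
    }
```

In the real framework, the credit-assignment step would use one of the built-in reward-shaping strategies rather than this uniform broadcast.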
🚀 Quick Start
📦 Installation
git clone https://github.com/TsinghuaC3I/MARTI.git
cd MARTI
pip install -r requirements.txt
Follow the setup instructions for dependencies, including OpenRLHF, Ray, and vLLM.
🌳 MARS² - Multi-Agent Tree Search RL (🔥New!)
MARTI-v2 introduces tree search-augmented reinforcement learning training (MARS²) for complex reasoning tasks like code generation.
Key Features:
- Single-agent and Multi-agent MCTS training for code generation tasks
- GSPO Loss: Sequence-level policy optimization (better suited for complex reasoning than PPO's token-level optimization)
- TIS Correction: Truncated Importance Sampling to address vLLM sampling distribution mismatch
- Dynamic Filtering: Per-agent sample filtering for heterogeneous training
- Overlong Buffer: Penalty mechanism for excessively long token sequences
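The TIS correction above can be sketched in a few lines (illustrative, not MARTI's implementation; the `cap` value is a placeholder, not a framework default). Because vLLM's rollout numerics can drift from the trainer's own forward pass, per-token importance weights, truncated at a cap, reweight the loss toward the trainer's distribution while bounding variance:

```python
import torch

def tis_weights(train_logps, rollout_logps, cap=2.0):
    """Sketch of truncated importance sampling (TIS) weights.

    train_logps / rollout_logps: per-token log-probs of the sampled
    tokens under the trainer's forward pass and under the vLLM rollout
    engine, respectively. The weights are detached so they rescale the
    loss without contributing gradients of their own.
    """
    ratio = (train_logps - rollout_logps).exp()
    return ratio.clamp(max=cap).detach()
```

Multiplying the per-token loss by these weights leaves training unchanged when the two distributions agree (all weights are 1) and damps, rather than amplifies, tokens where they disagree strongly.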
Single-Agent MCTS Training
# Minimum hardware requirement: approximately 8×80G GPUs
# Add path setting in scripts
ROOT_DIR="/path/to/MARTI"
MODEL_DIR="/path/to/models"
# Single-agent MCTS training
# See the script for more training examples
bash examples/mars2/run_train_single_mcts.sh
Multi-Agent MCTS Training
# Minimum hardware requirement: approximately 8×80G GPUs per agent
# Add path setting in scripts
ROOT_DIR="/path/to/MARTI"
MODEL_DIR="/path/to/models"
# Multi-agent MCTS training
# See the script for more training examples
bash examples/mars2/run_train_multi_mcts.sh
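The expand-and-refine loop these scripts train over follows the classic MCTS skeleton. The sketch below is a toy version, not MARS² itself: `generate` and `score` are hypothetical stand-ins for LLM sampling and the code verifier (e.g. a unit-test pass rate), and node scoring uses a plain UCT rule:

```python
import math
import random

class Node:
    """One draft solution in the search tree (illustrative only)."""
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = [], 0, 0.0

    def uct(self, c=1.4):
        # Unvisited children are explored first; otherwise trade off
        # mean reward against an exploration bonus.
        if self.visits == 0:
            return float("inf")
        return self.value / self.visits + c * math.sqrt(
            math.log(self.parent.visits) / self.visits)

def search(root_state, generate, score, iters=32, width=3):
    root = Node(root_state)
    for _ in range(iters):
        node = root
        # Selection: walk down by UCT until reaching a leaf.
        while node.children:
            node = max(node.children, key=Node.uct)
        # Expansion: sample `width` refinements of the current draft.
        node.children = [Node(generate(node.state), node) for _ in range(width)]
        leaf = random.choice(node.children)
        reward = score(leaf.state)  # e.g. unit-test pass rate
        # Backpropagation: credit the whole path to the root.
        while leaf:
            leaf.visits += 1
            leaf.value += reward
            leaf = leaf.parent
    # Return the most-visited first move, as in standard MCTS.
    best = max(root.children, key=lambda n: n.visits)
    return best.state
```

In MARS² the analogous loop runs asynchronously across agents, and the visited trajectories double as RL training data rather than being discarded after search.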
🤝 Customised Async Step and Workflow
We introduce asynchronous tool use and workflow support for both single-agent and multi-agent RL pipelines. These features make our framework more modular, efficient, and scalable for a variety of RL scenarios.
Supported Workflows:
- Multi-Agent Debate
- Chain-of-Agents
- Mixture-of-Agents
- Review-RL
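The asynchronous flavor of these workflows can be illustrated with a small `asyncio` sketch (hypothetical names, not the MARTI workflow API): in a mixture-of-agents step, proposer calls run concurrently instead of sequentially, and an aggregator merges their drafts:

```python
import asyncio

async def call_agent(name, prompt):
    """Stand-in for a non-blocking LLM call (illustrative only)."""
    await asyncio.sleep(0)
    return f"[{name}] answer to: {prompt}"

async def mixture_step(prompt, proposers, aggregate):
    # All proposer calls are issued concurrently; total latency is
    # bounded by the slowest agent rather than the sum of all agents.
    drafts = await asyncio.gather(
        *(call_agent(name, prompt) for name in proposers))
    return aggregate(prompt, drafts)

answer = asyncio.run(mixture_step(
    "2+2?", ["qwen", "llama"], lambda p, ds: " | ".join(ds)))
```

Chained workflows follow the same pattern with `await` between stages instead of `gather` within one.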
Single-Agent Training
# Minimum hardware requirement: approximately 8×80G GPUs
# Add path setting in scripts
ROOT_DIR="/path/to/MARTI"
MODEL_DIR="/path/to/models"
# Train asynchronous multi-turn code RL
bash examples/single-agent/run_train_code_async.sh
# Train asynchronous multi-turn math RL
bash examples/single-agent/run_train_math_async.sh
Multi-Agent Training
# Minimum hardware requirement: approximately 8×80G GPUs per agent
# Add path setting in scripts
ROOT_DIR="/path/to/MARTI"
MODEL_DIR="/path/to/models"
# Mixture-of-Agents
bash examples/multi-agent/run_train_chain.sh
# Multi-agent Debate
bash examples/multi-agent/run_train_mad.sh
# Chain-of-agents (MathChat)
bash examples/multi-agent/run_train_mathchat.sh
# Review-RL
bash examples/reviewrl/run_train_reviewrl_async.sh
📊 Experimental Results
MARTI-v2 (New!)
Training Details
We use the MARTI-v2 framework to train reasoning models, specifically Qwen3-8B, Qwen3-14B, AreaL-boba-2-8B, AreaL-boba-2-14B, and DeepCoder-14B. For multi-agent reinforcement learning, we employ a cluster configuration consisting of 3 nodes,
