MAGI — Disagreement OS for LLMs

Three models. One decision. Inspired by the MAGI supercomputer from Neon Genesis Evangelion.

MAGI is not another agent framework. It is a structured disagreement engine: the same question goes to three different LLMs, each with a different perspective. They vote, debate, and critique each other to produce a Decision Dossier with the ruling, confidence, minority report, and full trace.

NERV Command Center

Why?

Three cheap models using MAGI critique mode scored 88% on our benchmark. A single Claude Sonnet 4.6 scored 76%.

Vote alone (72%) did not beat the single strong model. Critique did. The models caught each other's mistakes.

The value is not "more accurate answers." It is better decision quality: seeing where models agree, where they disagree, and why.

How MAGI Differs

There are several EVA-inspired multi-model projects. Here's what makes this one different.

Other projects do voting. Three models answer, pick the majority. That's it.

MAGI does structured disagreement. Models don't just answer in parallel. They read each other's answers, critique the reasoning, and revise their positions across multiple rounds. The system tracks who changed their mind and why.

| Capability | Voting projects | MAGI | |---|---|---| | Multi-model query | Yes | Yes | | Majority vote | Yes | Yes | | Multi-round critique (ICE) | No | Yes | | Mind change tracking | No | Yes | | Adaptive protocol selection | No | Yes | | Minority report / dissent analysis | No | Yes | | Benchmark: ensemble > single model | No | Yes (88% > 76%) | | Fault tolerance (node failures) | No | Yes | | NERV hexagonal dashboard | No | Yes | | CLI toolchain (diff, judge, bench) | No | Yes |

The key finding: vote alone (72%) does not beat a single strong model (76%). Every voting-only project hits this ceiling. MAGI's critique mode breaks through it (88%) by letting models catch each other's mistakes.

A NeurIPS 2025 paper (Debate or Vote) found that "debate doesn't systematically improve beliefs." But their debate asks models to persuade humans. MAGI's ICE protocol asks models to find errors in each other's reasoning. Different mechanism, different result.

Install

pip install magi-system

Or from source:

git clone https://github.com/fshiori/magi.git
cd magi
pip install -e ".[dev]"

Quick Start

# Set your API key (OpenRouter gives you access to all models with one key)
export OPENROUTER_API_KEY=sk-or-...

# Ask a question — three models debate, one decision emerges
magi ask "Should we use microservices or a monolith?"

# Multi-model code review (the killer use case)
magi diff --staged

# Critique mode: models debate until consensus (slower, higher quality)
magi ask "Is Rust better than Go for backend services?" --mode critique

# Adaptive mode: auto-selects vote/critique/escalate based on disagreement
magi ask "What caused the 2008 financial crisis?" --mode adaptive

# Multi-model answer scoring
magi judge -q "What is quantum entanglement?" -a "It means particles are connected"

# NERV Command Center — real-time dashboard
pip install magi-system[web]
magi dashboard

# Run benchmark, view analytics, replay decisions
magi bench
magi analytics
magi replay <trace-id>

# List persona presets
magi presets

How It Works

  You ──▶ MAGI Engine ──▶ 3 LLMs in parallel ──▶ Protocol ──▶ Decision Dossier
                              │                       │
                         Melchior               Vote (fast)
                         Balthasar          Critique (debate)
                         Casper            Adaptive (auto)

Each Decision Dossier contains:

Ruling — the final answer
Confidence — how much the models agreed (0-100%)
Minority Report — dissenting opinions and why they disagree
Mind Changes — which models changed position during debate
Trace — full JSONL history for replay and analytics

Protocols

| Protocol | When to use | How it works | |----------|-------------|--------------| | vote | Fast answers, clear-cut questions | Parallel query, structured position extraction, majority wins | | critique | Complex or controversial questions | Multi-round debate (ICE), models critique each other until consensus | | escalate | Forced decision on high-disagreement topics | Critique with 2-round limit, highest-trust node makes final call | | adaptive | Default for most use cases | Auto-selects based on agreement score: high=vote, medium=critique, low=escalate |

Persona Presets

MAGI comes with 5 built-in perspective sets:

$ magi presets

  code-review     Security Analyst / Performance Engineer / Code Quality Reviewer
  eva             Melchior / Balthasar / Casper
  research        Methodologist / Domain Expert / Devil's Advocate
  strategy        Optimist / Pessimist / Pragmatist
  writing         Editor / Reader Advocate / Fact Checker

# Use a specific preset
magi ask "Should we expand to the EU market?" --preset strategy

# magi diff always uses code-review preset automatically

Python API

import asyncio
from magi import MAGI

engine = MAGI(
    melchior="openrouter/deepseek/deepseek-v3.2",
    balthasar="openrouter/xiaomi/mimo-v2-pro",
    casper="openrouter/minimax/minimax-m2.7",
)

decision = asyncio.run(engine.ask(
    "What are the security implications of this API design?",
    mode="adaptive",
))

print(decision.ruling)          # The final answer
print(decision.confidence)      # 0.0 - 1.0
print(decision.minority_report) # Dissenting views
print(decision.mind_changes)    # Who changed their mind
print(decision.protocol_used)   # Which protocol was selected

Configuration

MAGI uses LiteLLM under the hood, so it supports 100+ LLM providers.

API Keys

# Direct providers
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
export GOOGLE_API_KEY=AI...

# Or use OpenRouter for all models with one key
export OPENROUTER_API_KEY=sk-or-...

Using OpenRouter

magi ask "your question" \
  --melchior openrouter/anthropic/claude-sonnet-4.6 \
  --balthasar openrouter/openai/gpt-4o \
  --casper openrouter/google/gemini-2.5-pro

Benchmark Results

Tested on 25 MMLU-style questions across 5 categories:

| Group | Accuracy | Time | Errors | |-------|----------|------|--------| | Claude Sonnet 4.6 (single) | 76% | 128s | 1 | | 3x Cheap Models (vote) | 72% | 1745s | 0 | | 3x Cheap Models (critique) | 88% | 1577s | 0 |

Models used: Xiaomi MiMo-v2-pro, MiniMax M2.7, DeepSeek V3.2

Key finding: Vote alone doesn't beat a strong single model. Critique mode does, by letting models catch each other's mistakes through structured debate.

Fault Tolerance

MAGI keeps working when models fail:

1 of 3 fails — continues with 2 nodes, marks decision as degraded
2 of 3 fail — falls back to single model response
All 3 fail — raises MagiUnavailableError (never guesses)
Timeouts — 30s default per node, exponential backoff on rate limits
Reasoning models — automatically extracts from reasoning_content (e.g., MiniMax M2.7)

Project Structure

magi/
├── core/
│   ├── engine.py       # MAGI engine, coordinates nodes
│   ├── node.py         # LLM node wrapper with persona
│   └── decision.py     # Decision dossier dataclass
├── protocols/
│   ├── vote.py         # Structured voting with position extraction
│   ├── critique.py     # ICE (Iterative Consensus Ensemble)
│   └── adaptive.py     # Dynamic protocol selection
├── commands/
│   ├── diff.py         # Multi-model code review
│   ├── judge.py        # Multi-model answer scoring
│   └── analytics.py    # Trace analysis and replay
├── web/
│   ├── server.py       # FastAPI + WebSocket server
│   └── static/         # NERV Command Center UI
├── presets/             # Persona preset definitions
├── bench/              # Benchmark runner and datasets
├── trace/              # JSONL trace logging
└── cli.py              # Click CLI entry point

Development

git clone https://github.com/fshiori/magi.git
cd magi
uv venv && uv pip install -e ".[dev]"
python -m pytest tests/ -v

83 tests covering all protocols, degradation modes, and edge cases.

Published on PyPI.

NERV Command Center

Real-time dashboard showing the three MAGI nodes thinking, debating, and reaching a decision. EVA-accurate hexagonal layout with vote status lamps (承認/否決/膠着).

pip install magi-system[web]
magi dashboard
# Open http://localhost:3000

Features:

Live WebSocket streaming of node responses
Critique round tracking with agreement score
EVA-style verdict display: 承認 (approve), 否決 (reject), 膠着 (deadlock)
Click any hexagon to see the full response
Auto-popup when nodes complete
Markdown rendering for LLM output

Roadmap

[ ] MAGI-as-API-Gateway — OpenAI-compatible proxy, any app just changes base_url
[ ] LLM-as-judge agreement scoring (replace word-overlap heuristic)
[ ] Scorecard weighted voting (after sufficient data collection)
[ ] Streaming token output in NERV UI

Name

In Evangelion, MAGI is a trio of supercomputers created by Dr. Naoko Akagi. Each embodies a different aspect of her personality: Melchior (the scientist), Balthasar (the mother), and Casper (the woman). Decisions are made by majority vote among the three.

MAGI applies this concept to LLMs: same question, three different perspectives, structured disagreement produces better decisions than any single model alone.

License

MIT

Magi

Install / Use

README

MAGI — Disagreement OS for LLMs

Why?

How MAGI Differs

Install

Quick Start

How It Works

Protocols

Persona Presets

Python API

Configuration

API Keys

Using OpenRouter

Benchmark Results

Fault Tolerance

Project Structure

Development

NERV Command Center

Roadmap

Name

License