
# MAGIC

Code for the paper "MAGIC: A Co-Evolving Attacker-Defender Adversarial Game for Robust LLM Safety".


<!-- # MAGIC: Multi-Agent Adversarial Game for Robust LLM Safety -->
<p align="center" style="margin-bottom: 0px;">
  <img src="assets/magic_logo.png" alt="MAGIC" height="140">
</p>
<p align="center" style="font-size: 18px; margin-top: 0;">
  Multi-Agent Adversarial Game for Robust LLM Safety
</p>
<p align="center">
  <a href="https://huggingface.co/XiaoyuWen/MAGIC"><img src="https://img.shields.io/badge/🌐_huggingface-MAGIC-yellow" alt="Page"></a>
  <a href="https://github.com/vllm-project/vllm"><img src="https://img.shields.io/badge/vLLM-0.8.5-5C2D91?style=flat-square" alt="vLLM0.8.5"></a>
  <a href="https://opensource.org/licenses/Apache-2.0"><img src="https://img.shields.io/badge/License-Apache%202.0-green.svg"></a>
  <a href="assets/MAGIC_arxiv.pdf"><img src="https://img.shields.io/badge/Paper-PDF-red" alt="Paper"></a>
</p>

## 🤗 Overview

<!-- ![MAGIC](./assets/MAGIC.png) -->
<p align="center">
  <img src="./assets/MAGIC.png" alt="MAGIC" width="800">
</p>
✨ **Key Contributions:**

- **Asymmetric attacker–defender game**
  - Sequential (attacker → defender), not symmetric self-play
  - Avoids gradient conflict in shared-parameter setups
- **Co-evolving online training**
  - Attacker continuously discovers long-tail vulnerabilities
  - Defender learns pointwise-safe responses
- **Attack Pool Benchmark with CoT**
  - 20 rewriting strategies
  - Solves the cold-start problem for automated red-teaming
- **Strong empirical results**
  - Lower ASR across HarmBench, WildGuardTest, DAN, and X-Teaming
  - Minimal degradation on general capabilities

- 🧠 Co-evolving attacker & defender
- 🎯 Subgame Perfect Nash Equilibrium (SPNE) formulation
- 🔄 Online MARL with GRPO
- 🧪 Strong single-turn & multi-turn safety improvements

We build our training pipeline on: https://github.com/ziyuwan/ReMA-public

## 📒 Method


MAGIC operates in two phases:

**Phase 1: Offensive Capability Initialization**

- SFT the attacker on a CoT-enriched Attack Pool
- Enables reasoning-based adversarial rewriting

**Phase 2: Iterative Co-evolution**

- Alternate optimization (see the schematic below):
  - Fix attacker → train defender (GRPO)
  - Fix defender → train attacker (GRPO)
- Approximates bilevel best-response dynamics
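
As a schematic only: the loop below mirrors the alternating best-response structure. In the released code this alternation happens inside a single trainer launch, driven by the `algorithm.switch_agent.*` flags documented under Getting Started; `train_with_grpo` is a hypothetical placeholder, not a command in this repo.

```bash
#!/usr/bin/env bash
# Phase 2 schematic. `train_with_grpo` stands in for one GRPO update of the
# named agent while the other agent's parameters stay frozen.
train_with_grpo() { echo "GRPO update: $1 (other agent frozen)"; }

for round in 1 2 3; do            # a few best-response rounds
  train_with_grpo defender        # fix attacker -> train defender
  train_with_grpo attacker        # fix defender -> train attacker
done
```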

## 🚀 Getting Started

We recommend CUDA 12.4 + PyTorch 2.6 + Python 3.10.

```bash
conda create -n magic python=3.10.0
conda activate magic

# option 1: install flash-attn from a prebuilt wheel (faster)
# wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
# pip install flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl --no-build-isolation

# option 2: install flash-attn from PyPI
pip install flash-attn==2.7.4.post1 --no-build-isolation

# for SFT
cd src/360-LLaMA-Factory
pip install -e .
cd ../..

# for RL
cd src/verl
pip install -e .
cd ../..

# update libs (from the repo root)
pip install -r requirements.txt
```
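
Optionally, a quick import check can confirm the environment before a long run (a minimal sketch, assuming the editable install above exposes `verl` as an importable package):

```bash
python - <<'PY'
# Optional sanity check: core packages import and CUDA is visible.
import torch, flash_attn, verl
print("torch", torch.__version__, "| cuda available:", torch.cuda.is_available())
print("flash-attn", flash_attn.__version__)
PY
```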

## ✈️ Training & Evaluation

### How to run scripts

To run training, see `scripts/rl/separated/grpo_public.sh` as an example:

```bash
bash scripts/rl/separated/grpo_public.sh
```

NOTE: add at least the following lines to the bash file before running:

```bash
export WANDB_API_KEY=<API_KEY>
export WILDGUARD_API_ENDPOINT="http://<API URL>/v1"
export WORKSPACE=<ROOT_PATH>
export CHECKPOINT_DIR=<ROOT_PATH>/MAGIC/checkpoints
export MODEL_DEFENDER_BASE=$WORKSPACE/models/Qwen2.5-7B-Instruct
export MODEL_ATTACKER_SFT=$WORKSPACE/models/Qwen2.5-7B-Instruct
```

Key arguments (combined into a single launch sketch after this list):

- **Rollout:**
  - `data.max_prompt_length`: max prompt length for each turn's input
  - `data.max_response_length`: max response length per turn
  - `actor_rollout_ref.rollout.stop_when_truncated=True`: whether to stop when truncated, i.e. when the last agent's response is cut off by the length limit
  - `actor_rollout_ref.rollout.max_num_turns`: controls single-turn vs. multi-turn conversations; 1 for single-turn, 3 for multi-turn
- **Training:**
  - `algorithm.switch_agent.model_paths=[${Attacker_model_path},${Defender_model_path}]`: model paths of the two agents (the order follows `algorithm.switch_agent.agent_roles`)
  - `algorithm.switch_agent.train_roles=[attacker,defender]`: controls whether to run defender-only, attacker-only, or full MAGIC training
  - `algorithm.switch_agent.freq=15`: agent-switching frequency
  - `SWITCH_ARGS`: `algorithm.switch_agent.mode=ratio` for fixed-step switching; `algorithm.switch_agent.mode=metric` for ASR-aware switching
- **Reward Judge:**
  - `+reward_model.format_reward_roles=[attacker]`: whether to enable the format reward and which agent uses it
  - `REWARD_HARM` / `REWARD_REFUSAL` / `REWARD_LABEL`: control the reward values
  - `USE_Q_SAFE_LABEL_FOR_REFUSAL`: 1 to use the attacker-prompt label to judge the refusal reward; 0 to use the raw-prompt label
  - `REWARD_MODEL_USE_ANSWER_TAGS`: 1 to extract the content within the defender's `<answer>` tags for safety evaluation; 0 to use the defender's entire response directly
  - `SAFETY_SCORE_MODE`: `classifier` for Qwen3Guard or WildGuard; `rule_api` for GPT-4o (not yet integrated for testing)
  - Other rewards: `DEFENDER_QUALITY_API_ENDPOINT` and `REVISION_SCORE_ENDPOINT` are not yet integrated for testing
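
A minimal combined launch, assuming `grpo_public.sh` forwards Hydra-style overrides like the examples above; the length limits here are illustrative, and the model-path order should follow `algorithm.switch_agent.agent_roles`:

```bash
# Illustrative values only; adjust paths and lengths to your setup.
bash scripts/rl/separated/grpo_public.sh \
  data.max_prompt_length=2048 \
  data.max_response_length=1024 \
  actor_rollout_ref.rollout.max_num_turns=1 \
  algorithm.switch_agent.model_paths=[$MODEL_ATTACKER_SFT,$MODEL_DEFENDER_BASE] \
  algorithm.switch_agent.train_roles=[attacker,defender] \
  algorithm.switch_agent.mode=ratio \
  algorithm.switch_agent.freq=15 \
  +reward_model.format_reward_roles=[attacker]
```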

See `src/verl/verl/separated_trainer/config/ppo_trainer.yaml` for more details.

### How to evaluate

See `./eval/README.md`.

## 📄 Paper & Citation

```bibtex
@article{wen2026magic,
  title={MAGIC: A Co-Evolving Attacker-Defender Adversarial Game for Robust LLM Safety},
  author={Wen, Xiaoyu and He, Zhida and Qi, Han and Wan, Ziyu and Wen, Ying and Zheng, Tianhang and Xu, Xingcheng and Lu, Chaochao and Zhang, Qiaosheng},
  journal={arXiv preprint arXiv:2602.01539},
  year={2026}
}

@article{wan2025rema,
  title={ReMA: Learning to Meta-think for LLMs with Multi-agent Reinforcement Learning},
  author={Wan, Ziyu and Li, Yunxiang and Wen, Xiaoyu and Song, Yan and Wang, Hanjing and Yang, Linyi and Schmidt, Mark and Wang, Jun and Zhang, Weinan and Hu, Shuyue and others},
  journal={arXiv preprint arXiv:2503.09501},
  year={2025}
}
```
