TurboFuzzLLM
TurboFuzzLLM: Turbocharging Mutation-based Fuzzing for Effectively Jailbreaking Large Language Models in Practice
Install / Use
/learn @amazon-science/TurboFuzzLLMREADME
TurboFuzzLLM
Turbocharging Mutation-based Fuzzing for Effectively Jailbreaking LLMs in Practice
A state-of-the-art tool for automatic red teaming of Large Language Models (LLMs) that generates effective adversarial prompt templates to identify vulnerabilities and improve AI safety.
⚠️ Responsible Use
This tool is designed for improving AI safety through systematic vulnerability testing. It should be used responsibly for defensive purposes and developing better safeguards for LLMs.
Our primary goal is to advance the development of more robust and safer AI systems by identifying and addressing their vulnerabilities. We believe this research will ultimately benefit the AI community by enabling the development of better safety measures and alignment techniques.
📖 Table of Contents
- 🚀 Getting Started
- 🎯 Key Features
- 🔧 Method Overview
- 🔄 Architecture and Data Flow
- 📊 Results
- 🛡️ Applications
- ⚙️ Configuration
- 🤖 Supported Models
- 🧑💻 Development
- 📁 Codebase Structure
- 📂 Understanding Output
- 🔧 Troubleshooting
- 👥 Meet the Team
- Security
- License
- Citation
🚀 Getting Started
Prerequisites
- Python 3.8+ and
pip - Provider access: OpenAI API key for
gpt-*/o1-*, AWS credentials for Bedrock/SageMaker (configure withaws configure) - Optional local models: Hugging Face-compatible checkpoints (e.g., Gemma/Zephyr) for offline judge/target use
Install
git clone https://github.com/amazon-science/TurboFuzzLLM.git
cd TurboFuzzLLM
python -m venv .venv && source .venv/bin/activate # optional but recommended
pip install --upgrade pip
pip install -e .
Network/cost safety: SageMaker endpoint deployment and Bedrock validation are blocked by default; pass
--allow-endpoint-deployexplicitly when you intend to enable them.
Quick Start
- Download seed templates:
python3 scripts/get_templates_gptfuzzer.py
- Run an interactive attack:
python3 src/__main__.py answer --target-model-id gpt-4o --api-key YOUR_OPENAI_KEY
- Batch attack HarmBench (AWS Bedrock):
turbofuzzllm attack --target-model-id us.anthropic.claude-3-5-sonnet-20241022-v2:0 --max-queries 1000
Results appear under output/<date>/*/.
🎯 Key Features
- High Success Rate: Achieves >98% Attack Success Rate (ASR) on GPT-4o, GPT-4 Turbo, and other leading LLMs
- Efficient: 3x fewer queries and 2x more successful templates compared to previous methods
- Generalizable: >90% ASR on unseen harmful questions
- Practical: Easy-to-use CLI with statistics, search visualization, and logging
- Defensive Applications: Generated data improves model safety (74% safer after fine-tuning)
🔧 Method Overview
TurboFuzzLLM performs black-box mutation-based fuzzing to iteratively generate new adversarial red teaming templates. Key innovations include:
- Expanded Mutation Space: New mutation operations including refusal suppression
- Reinforcement Learning: Feedback-guided prioritized search
- Intelligent Heuristics: Efficient exploration with fewer LLM queries
- Template-Based Approach: Templates can be combined with any harmful question for scalable attacks
🔄 Architecture and Data Flow
High-Level Architecture
TurboFuzzLLM performs black-box mutation-based fuzzing to generate adversarial prompt templates for jailbreaking LLMs. It uses reinforcement learning to prioritize effective mutations.
Key Components
-
Fuzzer Core (
fuzzer/core.py):TurboFuzzLLMFuzzer: Main orchestrator class.- Manages questions, templates, mutations, evaluations, and statistics.
-
Models (
llm/):TargetModel: The LLM being attacked (e.g., GPT-4, Claude).MutatorModel: LLM used for generating mutations (e.g., for paraphrasing templates).JudgeModel: Determines if a response is "jailbroken" (vulnerable to the attack).
-
Mutation System (
fuzzer/mutators.py,fuzzer/mutator_selection.py):- Various mutation operators: ExpandBefore, FewShots, Rephrase, Crossover, etc.
- Selection policies: QLearning, UCB, Random, RoundRobin, MCTS, EXP3.
-
Template System (
fuzzer/template.py):Templateclass: Represents adversarial prompt templates.- Tracks ASR (Attack Success Rate), jailbreaks, parent/child relationships.
-
Template Selection (
fuzzer/template_selection.py):- Policies for selecting which template to mutate next (reinforcement learning-based).
Data Flow
-
Initialization: Load initial templates, questions, and configure models.
-
Warmup Phase: Evaluate initial templates on subset of questions.
-
Mutation Loop:
- Select template using selection policy (e.g., QLearning).
- Select mutation using mutation policy.
- Apply mutation to generate new template.
- Evaluate new template on remaining questions.
- Update selection/mutation policies based on results.
- Repeat until stopping criteria (query limit, all questions jailbroken).
-
Evaluation: For each template-question pair:
- Synthesize prompt (replace placeholder in template with question).
- Query target model.
- Judge response for vulnerability.
- Track statistics and jailbreaks.
-
Output: Generate CSV files, logs, statistics, and visualization of template evolution tree.
📊 Results
| Metric | Performance | |--------|-------------| | ASR on GPT-4o/GPT-4 Turbo | >98% | | ASR on unseen questions | >90% | | Query efficiency | 3x fewer queries | | Template success rate | 2x improvement | | Model safety improvement | 74% safer after adversarial training |
🛡️ Applications
- Vulnerability Identification: Discover prompt-based attack vectors in LLMs
- Countermeasure Development:
- Improve in-built LLM safeguards
- Create external guardrails
- Adversarial Training: Generate high-quality (attack prompt, harmful response) pairs for safety fine-tuning
⚙️ Configuration
Execution Modes
TurboFuzzLLM supports 4 operational modes:
| Mode | Description | Use Case |
|------|-------------|----------|
| answer | Red team a single question interactively | Quick testing |
| attack | Red team multiple questions from a dataset efficiently | Batch vulnerability testing |
| legacy | Run vanilla GPTFuzzer to learn effective templates | Baseline comparison |
| evaluate | Test learned templates against a dataset | Template effectiveness measurement |
Command Line Interface
Get help for any mode:
python3 src/__main__.py <mode> --help
Key Parameters
-
Models:
--target-model-id: LLM to attack (e.g.,us.anthropic.claude-3-5-sonnet-20241022-v2:0for Bedrock,gpt-4ofor OpenAI)--mutator-model-id: LLM for mutations (default:gpt-4o)--judge-model-id: LLM for judging success (default:gpt-4o)
-
Query and Template Limits:
--max-queries: Maximum API calls (default varies by mode, e.g., 100 for answer, 4000 for attack)--max-templates: Limit initial templates (default: 20 for answer, -1 for others)
-
Selection Policies:
--template-selector: Template selection (ql, ucb, mcts, exp3, rand, rr; default: ql)--mutation-selector: Mutation selection (ql, rand, rr; default: ql)
-
Files and Datasets:
--templates-path: Path to initial templates CSV--questions-path: Path to questions CSV (e.g., HarmBench dataset)
-
Other:
--seed: Random seed for reproducibility (default: 0)--num-threads: Threads for parallel evaluation (default: 1)--api-key: API key for non-Bedrock models
Usage Examples
- Before runnign the following commands, please download a seed harm question set or build one of your own, e.g.,
python3 scripts/get_questions_harmbench_text_standard.py \
--output configuration/datasets/questions/harmbench/harmbench_behaviors_text_standard_all.csv
Interactive Mode
Test a single question:
python3 src/__main__.py answer --target-model-id gpt-4o --api-key YOUR_OPENAI_KEY
Batch Attack Mode
Attack multiple questions with defaults:
turbofuzzllm attack --target-model-id us.anthropic.claude-3-5-sonnet-20241022-v2:0 --max-queries 1000
Uses local HF models
turbofuzzllm attack \
--target-model-id HuggingFaceH4/zephyr-7b-beta \
--mutator-model-id HuggingFaceH4/zephyr-7b-beta \
--judge-model-id cais/HarmBench-Llama-2-13b-cls \
--judge-tokenizer cais/HarmBench-Llama-2-13b-cls \
--max-queries 100
-
Att: Please Install
accelerateto enabledevice_map="auto"placement (pip install accelerate). Without it, local HF models fall back to CPU. -
Pleaes use the following command of smaller HFmodels if you have local compute limits. Att: These are minimal/demo-friendly models; they won’t give meaningful jailbreak results—use only for plumbing tests.
turbofuzzllm attack \
--target-model-id hf-internal-testing/tiny-random-GPT2LMHeadModel \
--mutator-model-id hf-internal-testing/tiny-random-GPT2LMHeadModel \
--judge-model-id cardiffnlp/twitter-roberta-base-offensive \
--judge-tokenizer cardiffnlp/twitter-roberta-base-offensive \
--max-queries 20
customize the seed questions with your own, e.g.,
turbofuzzllm attack \
--target-model-id HuggingFaceH4/zephy
