TurboFuzzLLM

Turbocharging Mutation-based Fuzzing for Effectively Jailbreaking LLMs in Practice

A state-of-the-art tool for automatic red teaming of Large Language Models (LLMs) that generates effective adversarial prompt templates to identify vulnerabilities and improve AI safety.

⚠️ Responsible Use

This tool is designed for improving AI safety through systematic vulnerability testing. It should be used responsibly for defensive purposes and developing better safeguards for LLMs.

Our primary goal is to advance the development of more robust and safer AI systems by identifying and addressing their vulnerabilities. We believe this research will ultimately benefit the AI community by enabling the development of better safety measures and alignment techniques.

📖 Table of Contents

🚀 Getting Started
🎯 Key Features
🔧 Method Overview
🔄 Architecture and Data Flow
📊 Results
🛡️ Applications
⚙️ Configuration
🤖 Supported Models
🧑‍💻 Development
📁 Codebase Structure
📂 Understanding Output
🔧 Troubleshooting
👥 Meet the Team
Security
License
Citation

🚀 Getting Started

Prerequisites

Python 3.8+ and pip
Provider access: OpenAI API key for gpt-*/o1-*, AWS credentials for Bedrock/SageMaker (configure with aws configure)
Optional local models: Hugging Face-compatible checkpoints (e.g., Gemma/Zephyr) for offline judge/target use

Install

git clone https://github.com/amazon-science/TurboFuzzLLM.git
cd TurboFuzzLLM
python -m venv .venv && source .venv/bin/activate   # optional but recommended
pip install --upgrade pip
pip install -e .

Network/cost safety: SageMaker endpoint deployment and Bedrock validation are blocked by default; pass --allow-endpoint-deploy explicitly when you intend to enable them.

Quick Start

Download seed templates:

python3 scripts/get_templates_gptfuzzer.py

Run an interactive attack:

python3 src/__main__.py answer --target-model-id gpt-4o --api-key YOUR_OPENAI_KEY

Batch attack HarmBench (AWS Bedrock):

turbofuzzllm attack --target-model-id us.anthropic.claude-3-5-sonnet-20241022-v2:0 --max-queries 1000

Results appear under output/<date>/*/.

🎯 Key Features

High Success Rate: Achieves >98% Attack Success Rate (ASR) on GPT-4o, GPT-4 Turbo, and other leading LLMs
Efficient: 3x fewer queries and 2x more successful templates compared to previous methods
Generalizable: >90% ASR on unseen harmful questions
Practical: Easy-to-use CLI with statistics, search visualization, and logging
Defensive Applications: Generated data improves model safety (74% safer after fine-tuning)

🔧 Method Overview

TurboFuzzLLM performs black-box mutation-based fuzzing to iteratively generate new adversarial red teaming templates. Key innovations include:

Expanded Mutation Space: New mutation operations including refusal suppression
Reinforcement Learning: Feedback-guided prioritized search
Intelligent Heuristics: Efficient exploration with fewer LLM queries
Template-Based Approach: Templates can be combined with any harmful question for scalable attacks

🔄 Architecture and Data Flow

High-Level Architecture

TurboFuzzLLM performs black-box mutation-based fuzzing to generate adversarial prompt templates for jailbreaking LLMs. It uses reinforcement learning to prioritize effective mutations.

Key Components

Fuzzer Core (fuzzer/core.py):
- TurboFuzzLLMFuzzer: Main orchestrator class.
- Manages questions, templates, mutations, evaluations, and statistics.
Models (llm/):
- TargetModel: The LLM being attacked (e.g., GPT-4, Claude).
- MutatorModel: LLM used for generating mutations (e.g., for paraphrasing templates).
- JudgeModel: Determines if a response is "jailbroken" (vulnerable to the attack).
Mutation System (fuzzer/mutators.py, fuzzer/mutator_selection.py):
- Various mutation operators: ExpandBefore, FewShots, Rephrase, Crossover, etc.
- Selection policies: QLearning, UCB, Random, RoundRobin, MCTS, EXP3.
Template System (fuzzer/template.py):
- Template class: Represents adversarial prompt templates.
- Tracks ASR (Attack Success Rate), jailbreaks, parent/child relationships.
Template Selection (fuzzer/template_selection.py):
- Policies for selecting which template to mutate next (reinforcement learning-based).

Data Flow

Initialization: Load initial templates, questions, and configure models.
Warmup Phase: Evaluate initial templates on subset of questions.
Mutation Loop:
- Select template using selection policy (e.g., QLearning).
- Select mutation using mutation policy.
- Apply mutation to generate new template.
- Evaluate new template on remaining questions.
- Update selection/mutation policies based on results.
- Repeat until stopping criteria (query limit, all questions jailbroken).
Evaluation: For each template-question pair:
- Synthesize prompt (replace placeholder in template with question).
- Query target model.
- Judge response for vulnerability.
- Track statistics and jailbreaks.
Output: Generate CSV files, logs, statistics, and visualization of template evolution tree.

📊 Results

| Metric | Performance | |--------|-------------| | ASR on GPT-4o/GPT-4 Turbo | >98% | | ASR on unseen questions | >90% | | Query efficiency | 3x fewer queries | | Template success rate | 2x improvement | | Model safety improvement | 74% safer after adversarial training |

🛡️ Applications

Vulnerability Identification: Discover prompt-based attack vectors in LLMs
Countermeasure Development:
- Improve in-built LLM safeguards
- Create external guardrails
Adversarial Training: Generate high-quality (attack prompt, harmful response) pairs for safety fine-tuning

⚙️ Configuration

Execution Modes

TurboFuzzLLM supports 4 operational modes:

| Mode | Description | Use Case | |------|-------------|----------| | answer | Red team a single question interactively | Quick testing | | attack | Red team multiple questions from a dataset efficiently | Batch vulnerability testing | | legacy | Run vanilla GPTFuzzer to learn effective templates | Baseline comparison | | evaluate | Test learned templates against a dataset | Template effectiveness measurement |

Command Line Interface

Get help for any mode:

python3 src/__main__.py <mode> --help

Key Parameters

Models:
- --target-model-id: LLM to attack (e.g., us.anthropic.claude-3-5-sonnet-20241022-v2:0 for Bedrock, gpt-4o for OpenAI)
- --mutator-model-id: LLM for mutations (default: gpt-4o)
- --judge-model-id: LLM for judging success (default: gpt-4o)
Query and Template Limits:
- --max-queries: Maximum API calls (default varies by mode, e.g., 100 for answer, 4000 for attack)
- --max-templates: Limit initial templates (default: 20 for answer, -1 for others)
Selection Policies:
- --template-selector: Template selection (ql, ucb, mcts, exp3, rand, rr; default: ql)
- --mutation-selector: Mutation selection (ql, rand, rr; default: ql)
Files and Datasets:
- --templates-path: Path to initial templates CSV
- --questions-path: Path to questions CSV (e.g., HarmBench dataset)
Other:
- --seed: Random seed for reproducibility (default: 0)
- --num-threads: Threads for parallel evaluation (default: 1)
- --api-key: API key for non-Bedrock models

Usage Examples

Before runnign the following commands, please download a seed harm question set or build one of your own, e.g.,

python3 scripts/get_questions_harmbench_text_standard.py \
  --output configuration/datasets/questions/harmbench/harmbench_behaviors_text_standard_all.csv

Interactive Mode

Test a single question:

python3 src/__main__.py answer --target-model-id gpt-4o --api-key YOUR_OPENAI_KEY

Batch Attack Mode

Attack multiple questions with defaults:

turbofuzzllm attack --target-model-id us.anthropic.claude-3-5-sonnet-20241022-v2:0 --max-queries 1000

Uses local HF models

turbofuzzllm attack \
  --target-model-id HuggingFaceH4/zephyr-7b-beta \
  --mutator-model-id HuggingFaceH4/zephyr-7b-beta \
  --judge-model-id cais/HarmBench-Llama-2-13b-cls \
  --judge-tokenizer cais/HarmBench-Llama-2-13b-cls \
  --max-queries 100

Att: Please Install accelerate to enable device_map="auto" placement (pip install accelerate). Without it, local HF models fall back to CPU.
Pleaes use the following command of smaller HFmodels if you have local compute limits. Att: These are minimal/demo-friendly models; they won’t give meaningful jailbreak results—use only for plumbing tests.

turbofuzzllm attack \
  --target-model-id hf-internal-testing/tiny-random-GPT2LMHeadModel \
  --mutator-model-id hf-internal-testing/tiny-random-GPT2LMHeadModel \
  --judge-model-id cardiffnlp/twitter-roberta-base-offensive \
  --judge-tokenizer cardiffnlp/twitter-roberta-base-offensive \
  --max-queries 20

customize the seed questions with your own, e.g.,

turbofuzzllm attack \
  --target-model-id HuggingFaceH4/zephy

TurboFuzzLLM

Install / Use

README

TurboFuzzLLM

⚠️ Responsible Use

📖 Table of Contents

🚀 Getting Started

Prerequisites

Install

Quick Start

🎯 Key Features

🔧 Method Overview

🔄 Architecture and Data Flow

High-Level Architecture

Key Components

Data Flow

📊 Results

🛡️ Applications

⚙️ Configuration

Execution Modes

Command Line Interface

Key Parameters

Usage Examples

Interactive Mode

Batch Attack Mode

Uses local HF models

customize the seed questions with your own, e.g.,