TurboFuzzLLM

TurboFuzzLLM: Turbocharging Mutation-based Fuzzing for Effectively Jailbreaking Large Language Models in Practice

A state-of-the-art tool for automatic red teaming of Large Language Models (LLMs) that generates effective adversarial prompt templates to identify vulnerabilities and improve AI safety.

⚠️ Responsible Use

This tool is designed for improving AI safety through systematic vulnerability testing. It should be used responsibly for defensive purposes and developing better safeguards for LLMs.

Our primary goal is to advance the development of more robust and safer AI systems by identifying and addressing their vulnerabilities. We believe this research will ultimately benefit the AI community by enabling the development of better safety measures and alignment techniques.

🚀 Getting Started

Prerequisites

  • Python 3.8+ and pip
  • Provider access: OpenAI API key for gpt-*/o1-*, AWS credentials for Bedrock/SageMaker (configure with aws configure)
  • Optional local models: Hugging Face-compatible checkpoints (e.g., Gemma/Zephyr) for offline judge/target use

Install

git clone https://github.com/amazon-science/TurboFuzzLLM.git
cd TurboFuzzLLM
python -m venv .venv && source .venv/bin/activate   # optional but recommended
pip install --upgrade pip
pip install -e .

Network/cost safety: SageMaker endpoint deployment and Bedrock validation are blocked by default; pass --allow-endpoint-deploy explicitly when you intend to enable them.

Quick Start

  1. Download seed templates:
python3 scripts/get_templates_gptfuzzer.py
  2. Run an interactive attack:
python3 src/__main__.py answer --target-model-id gpt-4o --api-key YOUR_OPENAI_KEY
  3. Batch attack HarmBench (AWS Bedrock):
turbofuzzllm attack --target-model-id us.anthropic.claude-3-5-sonnet-20241022-v2:0 --max-queries 1000

Results appear under output/<date>/*/.
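To locate the most recent run programmatically, a small helper can scan that directory layout. This is a minimal sketch that assumes only the output/<date>/*/ pattern stated above. The function name latest_run_dir is hypothetical, and the sort uses file modification time rather than parsing the date folder name, whose format is not specified here.

```python
from pathlib import Path
from typing import Optional


def latest_run_dir(root: str = "output") -> Optional[Path]:
    """Return the most recently modified directory matching root/<date>/<run>, or None."""
    # Two levels below the root: one for the date folder, one for the run folder.
    runs = [p for p in Path(root).glob("*/*") if p.is_dir()]
    if not runs:
        return None
    # Assumption: modification time is a good enough proxy for "latest run".
    return max(runs, key=lambda p: p.stat().st_mtime)


if __name__ == "__main__":
    run = latest_run_dir()
    if run is None:
        print("No runs found under output/")
    else:
        print(f"Latest run: {run}")
        for entry in sorted(run.iterdir()):
            print("  ", entry.name)
```

Run from the repository root, this lists whatever files the latest run produced, without assuming specific file names.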

🎯 Key Features

  • High Success Rate: Achieves >98% Attack Success Rate (ASR) on GPT-4o, GPT-4 Turbo, and other leading LLMs
  • Efficient: 3x fewer queries and 2x more successful templates compared to previous methods
  • Generalizable: >90% ASR on unseen harmful questions
  • Practical: Easy-to-use CLI with statistics, search visualization, and logging
  • Defensive Applications: Generated data improves model safety (74% safer after fine-tuning)

🔧 Method Overview

TurboFuzzLLM performs black-box mutation-based fuzzing to iteratively generate new adversarial red teaming templates. Key innovations include:

  1. Expanded Mutation Space: New mutation operations including refusal suppression
  2. Reinforcement Learning: Feedback-guided prioritized search
  3. Intelligent Heuristics: Efficient exploration with fewer LLM queries
  4. Template-Based Approach: Templates can be combined with any harmful question for scalable attacks
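To make the idea of feedback-guided prioritized search concrete, the sketch below shows a simplified, stateless Q-value selector with epsilon-greedy exploration. It is an illustration only, not the project's implementation: the class name EpsilonGreedySelector, the learning rate alpha, and the exploration rate epsilon are all assumptions. The operator names are the ones listed in this README.

```python
import random
from typing import Dict, List


class EpsilonGreedySelector:
    """Keeps one running value estimate per arm (e.g., per mutation operator)."""

    def __init__(self, arms: List[str], alpha: float = 0.1,
                 epsilon: float = 0.2, seed: int = 0):
        self.q: Dict[str, float] = {arm: 0.0 for arm in arms}
        self.alpha = alpha      # step size for the value update (assumed value)
        self.epsilon = epsilon  # probability of exploring a random arm (assumed value)
        self.rng = random.Random(seed)

    def select(self) -> str:
        # Explore with probability epsilon; otherwise pick the best current estimate.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.q))
        return max(self.q, key=self.q.get)

    def update(self, arm: str, reward: float) -> None:
        # Move the estimate a fraction alpha toward the observed reward.
        self.q[arm] += self.alpha * (reward - self.q[arm])


if __name__ == "__main__":
    selector = EpsilonGreedySelector(["ExpandBefore", "FewShots", "Rephrase", "Crossover"])
    arm = selector.select()
    selector.update(arm, reward=1.0)  # e.g., reward 1.0 when the judge flags a success
    print(arm, selector.q)
```

The feedback signal is the reward passed to update: operators that keep producing successful templates accumulate higher estimates and are chosen more often, which is what lets the search spend fewer queries on unproductive mutations.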

🔄 Architecture and Data Flow

High-Level Architecture

TurboFuzzLLM performs black-box mutation-based fuzzing to generate adversarial prompt templates for jailbreaking LLMs. It uses reinforcement learning to prioritize effective mutations.

Key Components

  1. Fuzzer Core (fuzzer/core.py):

    • TurboFuzzLLMFuzzer: Main orchestrator class.
    • Manages questions, templates, mutations, evaluations, and statistics.
  2. Models (llm/):

    • TargetModel: The LLM being attacked (e.g., GPT-4, Claude).
    • MutatorModel: LLM used for generating mutations (e.g., for paraphrasing templates).
    • JudgeModel: Determines if a response is "jailbroken" (vulnerable to the attack).
  3. Mutation System (fuzzer/mutators.py, fuzzer/mutator_selection.py):

    • Various mutation operators: ExpandBefore, FewShots, Rephrase, Crossover, etc.
    • Selection policies: QLearning, UCB, Random, RoundRobin, MCTS, EXP3.
  4. Template System (fuzzer/template.py):

    • Template class: Represents adversarial prompt templates.
    • Tracks ASR (Attack Success Rate), jailbreaks, parent/child relationships.
  5. Template Selection (fuzzer/template_selection.py):

    • Policies for selecting which template to mutate next (reinforcement learning-based).
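The bookkeeping described for the template system (ASR, jailbreak counts, parent/child links) can be pictured with a small data structure. This is a hedged sketch, not the contents of fuzzer/template.py: the class name TemplateNode and its field and method names are hypothetical, and ASR is computed here as jailbreaks divided by attempts.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # identity-based equality avoids recursing through parent/child links
class TemplateNode:
    text: str
    parent: Optional["TemplateNode"] = field(default=None, repr=False)
    children: List["TemplateNode"] = field(default_factory=list)
    num_attacks: int = 0
    num_jailbreaks: int = 0

    @property
    def asr(self) -> float:
        """Attack Success Rate: jailbreaks / attempts (0.0 before any attempt)."""
        return self.num_jailbreaks / self.num_attacks if self.num_attacks else 0.0

    def record(self, jailbroken: bool) -> None:
        """Record the judged outcome of one template-question evaluation."""
        self.num_attacks += 1
        self.num_jailbreaks += int(jailbroken)

    def spawn(self, new_text: str) -> "TemplateNode":
        """Create a mutated child and link it into the evolution tree."""
        child = TemplateNode(text=new_text, parent=self)
        self.children.append(child)
        return child
```

Keeping the parent and children links on each node is what makes it possible to render the template evolution tree afterwards.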

Data Flow

  1. Initialization: Load initial templates, questions, and configure models.

  2. Warmup Phase: Evaluate initial templates on a subset of the questions.

  3. Mutation Loop:

    • Select template using selection policy (e.g., QLearning).
    • Select mutation using mutation policy.
    • Apply mutation to generate new template.
    • Evaluate new template on remaining questions.
    • Update selection/mutation policies based on results.
    • Repeat until a stopping criterion is met (query limit reached or all questions jailbroken).
  4. Evaluation: For each template-question pair:

    • Synthesize prompt (replace placeholder in template with question).
    • Query target model.
    • Judge response for vulnerability.
    • Track statistics and jailbreaks.
  5. Output: Generate CSV files, logs, statistics, and visualization of template evolution tree.
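The mutation loop and per-pair evaluation above can be sketched with stub callables standing in for the three models. This is a simplified illustration under stated assumptions: the placeholder string [QUESTION], the function names synthesize and fuzz, and the random parent choice are all hypothetical, and the warmup phase and policy updates are omitted for brevity.

```python
import random
from typing import Callable, Dict, List

PLACEHOLDER = "[QUESTION]"  # hypothetical marker; real templates define their own


def synthesize(template: str, question: str) -> str:
    """Build the final prompt by substituting the question into the template."""
    return template.replace(PLACEHOLDER, question)


def fuzz(
    seeds: List[str],
    questions: List[str],
    mutate: Callable[[str], str],        # stand-in for mutation policy + mutator model
    target: Callable[[str], str],        # stand-in for the target model
    judge: Callable[[str, str], bool],   # stand-in for the judge model
    max_queries: int,
    seed: int = 0,
) -> Dict[str, str]:
    """Return {question: template} for every question the judge flagged."""
    rng = random.Random(seed)
    pool = list(seeds)
    solved: Dict[str, str] = {}
    queries = 0
    # Stop on the query limit or once every question has been flagged.
    while queries < max_queries and len(solved) < len(questions):
        parent = rng.choice(pool)  # random choice replaces the template selection policy
        child = mutate(parent)
        hit = False
        for question in questions:
            if question in solved:
                continue  # evaluate only on the remaining questions
            if queries >= max_queries:
                break
            response = target(synthesize(child, question))
            queries += 1
            if judge(question, response):
                solved[question] = child
                hit = True
        if hit:
            pool.append(child)  # keep templates that produced at least one hit
    return solved
```

With an echoing stub for target and a keyword check for judge, the loop can be exercised end to end without calling any real model, which is useful for checking the control flow and stopping conditions.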

📊 Results

| Metric | Performance |
|--------|-------------|
| ASR on GPT-4o/GPT-4 Turbo | >98% |
| ASR on unseen questions | >90% |
| Query efficiency | 3x fewer queries |
| Template success rate | 2x improvement |
| Model safety improvement | 74% safer after adversarial training |

🛡️ Applications

  1. Vulnerability Identification: Discover prompt-based attack vectors in LLMs
  2. Countermeasure Development:
    • Improve in-built LLM safeguards
    • Create external guardrails
  3. Adversarial Training: Generate high-quality (attack prompt, harmful response) pairs for safety fine-tuning

⚙️ Configuration

Execution Modes

TurboFuzzLLM supports 4 operational modes:

| Mode | Description | Use Case |
|------|-------------|----------|
| answer | Red team a single question interactively | Quick testing |
| attack | Red team multiple questions from a dataset efficiently | Batch vulnerability testing |
| legacy | Run vanilla GPTFuzzer to learn effective templates | Baseline comparison |
| evaluate | Test learned templates against a dataset | Template effectiveness measurement |

Command Line Interface

Get help for any mode:

python3 src/__main__.py <mode> --help

Key Parameters

  • Models:

    • --target-model-id: LLM to attack (e.g., us.anthropic.claude-3-5-sonnet-20241022-v2:0 for Bedrock, gpt-4o for OpenAI)
    • --mutator-model-id: LLM for mutations (default: gpt-4o)
    • --judge-model-id: LLM for judging success (default: gpt-4o)
  • Query and Template Limits:

    • --max-queries: Maximum API calls (default varies by mode, e.g., 100 for answer, 4000 for attack)
    • --max-templates: Limit initial templates (default: 20 for answer, -1 for others)
  • Selection Policies:

    • --template-selector: Template selection (ql, ucb, mcts, exp3, rand, rr; default: ql)
    • --mutation-selector: Mutation selection (ql, rand, rr; default: ql)
  • Files and Datasets:

    • --templates-path: Path to initial templates CSV
    • --questions-path: Path to questions CSV (e.g., HarmBench dataset)
  • Other:

    • --seed: Random seed for reproducibility (default: 0)
    • --num-threads: Threads for parallel evaluation (default: 1)
    • --api-key: API key for non-Bedrock models
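As a compact summary of the flags and defaults listed above, the following argparse sketch mirrors them. It is not the project's actual parser: the program name turbofuzzllm-sketch and the helper resolve_max_queries are hypothetical, only a subset of flags is shown, and the mode-dependent defaults cover just the two values stated in this README (100 for answer, 4000 for attack).

```python
import argparse
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="turbofuzzllm-sketch")
    parser.add_argument("mode", choices=["answer", "attack", "legacy", "evaluate"])
    parser.add_argument("--target-model-id")
    parser.add_argument("--mutator-model-id", default="gpt-4o")
    parser.add_argument("--judge-model-id", default="gpt-4o")
    # None means "use the mode-dependent default"; see resolve_max_queries.
    parser.add_argument("--max-queries", type=int, default=None)
    parser.add_argument("--template-selector", default="ql",
                        choices=["ql", "ucb", "mcts", "exp3", "rand", "rr"])
    parser.add_argument("--mutation-selector", default="ql",
                        choices=["ql", "rand", "rr"])
    parser.add_argument("--seed", type=int, default=0)
    parser.add_argument("--num-threads", type=int, default=1)
    return parser


def resolve_max_queries(mode: str, value: Optional[int]) -> Optional[int]:
    """Apply the documented per-mode defaults; other modes stay None (not documented here)."""
    if value is not None:
        return value
    return {"answer": 100, "attack": 4000}.get(mode)


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["attack", "--target-model-id", "gpt-4o", "--max-queries", "1000"]
    )
    print(args.mode, resolve_max_queries(args.mode, args.max_queries))
```

For the authoritative list of options and defaults, the per-mode --help output remains the reference.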

Usage Examples

  • Before running the following commands, download a seed set of harmful questions or build your own, e.g.,
python3 scripts/get_questions_harmbench_text_standard.py \
  --output configuration/datasets/questions/harmbench/harmbench_behaviors_text_standard_all.csv

Interactive Mode

Test a single question:

python3 src/__main__.py answer --target-model-id gpt-4o --api-key YOUR_OPENAI_KEY

Batch Attack Mode

Attack multiple questions with defaults:

turbofuzzllm attack --target-model-id us.anthropic.claude-3-5-sonnet-20241022-v2:0 --max-queries 1000

Use local Hugging Face (HF) models:

turbofuzzllm attack \
  --target-model-id HuggingFaceH4/zephyr-7b-beta \
  --mutator-model-id HuggingFaceH4/zephyr-7b-beta \
  --judge-model-id cais/HarmBench-Llama-2-13b-cls \
  --judge-tokenizer cais/HarmBench-Llama-2-13b-cls \
  --max-queries 100
  • Note: Install accelerate to enable device_map="auto" placement (pip install accelerate). Without it, local HF models fall back to CPU.

  • If your local compute is limited, use the following command with smaller HF models. Note: these are minimal, demo-friendly models; they won't give meaningful jailbreak results, so use them only for plumbing tests.

turbofuzzllm attack \
  --target-model-id hf-internal-testing/tiny-random-GPT2LMHeadModel \
  --mutator-model-id hf-internal-testing/tiny-random-GPT2LMHeadModel \
  --judge-model-id cardiffnlp/twitter-roberta-base-offensive \
  --judge-tokenizer cardiffnlp/twitter-roberta-base-offensive \
  --max-queries 20 

Customize the seed questions with your own, e.g.,

turbofuzzllm attack \
  --target-model-id HuggingFaceH4/zephy