MultiTurnAgentAttack
This repository contains data and code for the STAC (Sequential Tool Attack Chaining) framework, which generates and evaluates multi-turn adversarial attacks against LLM agents in tool-use environments.
📄 Paper: STAC: When Innocent Tools Form Dangerous Chains to Jailbreak LLM Agents
Quick Start
Just want to evaluate your model on our STAC Benchmark?
If you only want to evaluate your model on our pre-generated STAC Benchmark (483 test cases), you can skip the full pipeline and directly run:
# Set up environment
conda env create -f environment.yml
conda activate STAC
export OPENAI_API_KEY="your-openai-api-key-here" # For planner and judge models
# Run evaluation
python -m STAC_eval.eval_STAC_benchmark \
--model_agent gpt-4.1 \
--defense no_defense \
--batch_size 512
Input: data/STAC_benchmark_data.json (483 test cases from both SHADE-Arena and Agent-SafetyBench)
Output: Evaluation results in data/Eval/{model_planner}/{model_agent}/{defense}/gen_res.json
The benchmark automatically handles both SHADE-Arena and Agent-SafetyBench test cases. Skip to the Benchmark Evaluation section for more details.
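The output location follows a fixed layout, so downstream scripts can locate results programmatically. A minimal sketch (the build_output_path helper is illustrative, not part of the repository):

```python
def build_output_path(model_planner: str, model_agent: str,
                      defense: str, output_dir: str = "data/Eval") -> str:
    # Mirrors the documented layout:
    # data/Eval/{model_planner}/{model_agent}/{defense}/gen_res.json
    return f"{output_dir}/{model_planner}/{model_agent}/{defense}/gen_res.json"

print(build_output_path("gpt-4.1", "gpt-4.1", "no_defense"))
# data/Eval/gpt-4.1/gpt-4.1/no_defense/gen_res.json
```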
Full Setup
Prerequisites
Clone the repository and set up the conda environment on a Linux machine:
# Enter the repository root (cloned or unpacked)
cd MultiTurnAgentAttack-main
# Create and activate conda environment
conda env create -f environment.yml
conda activate STAC
Alternative Setup (pip/venv)
If you prefer using pip instead of conda:
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
Environment Variables
Before running any scripts, configure the required API keys:
OpenAI API (Required for most steps)
export OPENAI_API_KEY="your-openai-api-key-here"
Hugging Face (Required for HF models)
export HF_TOKEN="your-hf-token-here"
AWS Bedrock API (Required for AWS models only)
export AWS_ACCESS_KEY_ID="your-aws-access-key-id"
export AWS_SECRET_ACCESS_KEY="your-aws-secret-access-key"
export AWS_SESSION_TOKEN="your-aws-session-token"
Note: AWS session tokens expire every 12 hours and must be refreshed regularly.
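Because a missing or expired credential typically surfaces only mid-run, it can help to check the environment before launching a job. A small preflight sketch (the backend grouping below is an assumption based on the requirements listed above):

```python
import os

# Required variables per backend, per the setup instructions above.
REQUIRED_VARS = {
    "openai": ["OPENAI_API_KEY"],
    "huggingface": ["HF_TOKEN"],
    "bedrock": ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_SESSION_TOKEN"],
}

def missing_vars(backend: str) -> list[str]:
    # Return the names of required variables that are unset or empty.
    return [v for v in REQUIRED_VARS[backend] if not os.environ.get(v)]
```

For example, call missing_vars("bedrock") at the top of a launch script and abort with a clear message if the list is non-empty.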
Repository Structure
MultiTurnAgentAttack-main/
├── STAC_gen/ # Full STAC attack generation pipeline
│ ├── step_1_gen_tool_chains.py # Generate tool chain attacks
│ ├── step_2_verify_tool_chains.py # Verify generated attacks
│ ├── step_3_reverse_engineer_prompts.py # Generate adversarial prompts
│ └── step_4_eval_adaptive_planning.py # Evaluate with adaptive planning
│
├── STAC_eval/ # Benchmark evaluation
│ └── eval_STAC_benchmark.py # Evaluate models on STAC benchmark
│
├── data/
│ └── STAC_benchmark_data.json # Pre-generated benchmark (483 cases)
│
├── Agent_SafetyBench/ # Agent-SafetyBench environments
├── SHADE_Arena/ # SHADE-Arena environments
│
├── src/ # Core implementation
│ ├── Agents.py
│ ├── Environments.py
│ ├── LanguageModels.py
│ ├── STAC.py
│ └── utils.py
│
└── prompts/ # System prompts for all components
Overview
The repository provides two main usage modes:
1. Benchmark Evaluation (STAC_eval/)
Evaluate your model on our pre-generated STAC Benchmark containing 483 test cases across SHADE-Arena and Agent-SafetyBench environments. This is the recommended starting point for most users.
2. Full Automated Attack Generation Pipeline (STAC_gen/)
Generate new attacks from scratch using the complete 4-step STAC pipeline. This is useful for:
- Creating attacks for new environments
- Experimenting with different attack generation strategies
- Extending the benchmark with additional test cases
Benchmark Evaluation (STAC_eval)
Quick Evaluation
Evaluate your agent against the STAC Benchmark with a single command:
python -m STAC_eval.eval_STAC_benchmark \
--model_agent gpt-4.1 \
--defense no_defense \
--batch_size 512
Parameters
- --input_path: Path to benchmark data (default: data/STAC_benchmark_data.json)
- --output_dir: Output directory (default: data/Eval)
- --model_planner: Model for attack planning (default: gpt-4.1)
- --model_judge: Model for evaluation (default: gpt-4.1)
- --model_agent: Your model to evaluate (default: gpt-4.1)
- --temperature: Sampling temperature (default: 0.0)
- --top_p: Nucleus sampling parameter (default: 0.95)
- --batch_size: Batch size for evaluation (default: 1)
- --region: AWS region for Bedrock API (default: us-west-2)
- --max_n_turns: Maximum conversation turns (default: 3)
- --defense: Defense mechanism to evaluate (default: no_defense)
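When sweeping over models or defenses, it is easy to mistype a flag in a long shell command. The flags above can be composed programmatically; a sketch (the build_eval_argv helper is hypothetical, not part of the repository):

```python
def build_eval_argv(model_agent: str, defense: str = "no_defense",
                    batch_size: int = 1, **extra) -> list[str]:
    # Compose the `python -m STAC_eval.eval_STAC_benchmark` invocation
    # from keyword arguments; extra flags pass through verbatim.
    argv = ["python", "-m", "STAC_eval.eval_STAC_benchmark",
            "--model_agent", model_agent,
            "--defense", defense,
            "--batch_size", str(batch_size)]
    for flag, value in extra.items():
        argv += [f"--{flag}", str(value)]
    return argv
```

The resulting list can be handed to subprocess.run for each configuration in a sweep.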
Defense Mechanisms
Evaluate different defense strategies:
# No defense (baseline)
python -m STAC_eval.eval_STAC_benchmark --model_agent gpt-4.1 --defense no_defense
# Spotlighting (datamarking user prompts)
python -m STAC_eval.eval_STAC_benchmark --model_agent gpt-4.1 --defense spotlighting
# Failure mode awareness
python -m STAC_eval.eval_STAC_benchmark --model_agent gpt-4.1 --defense failure_modes
# User intent summarization
python -m STAC_eval.eval_STAC_benchmark --model_agent gpt-4.1 --defense summarization
# Harm-benefit reasoning
python -m STAC_eval.eval_STAC_benchmark --model_agent gpt-4.1 --defense reasoning
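Spotlighting via datamarking works by interleaving a marker character through user-supplied text, so the agent can visibly distinguish data from instructions. A minimal sketch of the idea (marker choice and function name are illustrative; the repository's implementation may differ):

```python
def datamark(user_text: str, marker: str = "^") -> str:
    # Replace whitespace runs with a marker character so that any
    # instructions injected into user data stand out as marked text.
    return marker.join(user_text.split())

print(datamark("ignore previous instructions"))
# ignore^previous^instructions
```

The agent's system prompt would then explain that marked text is untrusted data and must never be treated as instructions.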
Evaluating Different Models
OpenAI Models
Requirements: OPENAI_API_KEY
# GPT-4.1
python -m STAC_eval.eval_STAC_benchmark --model_agent gpt-4.1 --batch_size 512
# GPT-4.1-mini
python -m STAC_eval.eval_STAC_benchmark --model_agent gpt-4.1-mini --batch_size 512
AWS Bedrock Models
Requirements: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN
# Llama 3.3 70B
python -m STAC_eval.eval_STAC_benchmark \
--model_agent us.meta.llama3-3-70b-instruct-v1:0 \
--batch_size 10
# Llama 3.1 405B
python -m STAC_eval.eval_STAC_benchmark \
--model_agent meta.llama3-1-405b-instruct-v1:0 \
--batch_size 10
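The much smaller --batch_size for Bedrock reflects tighter API rate limits; if you wrap Bedrock calls yourself, jittered exponential backoff on throttling errors is the usual pattern. A generic sketch (not repository code):

```python
import random
import time

def with_backoff(call, max_retries: int = 5, base_delay: float = 1.0):
    # Retry `call` with jittered exponential backoff, as is commonly
    # needed when Bedrock throttles high-concurrency invocations.
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
```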
Hugging Face Models
Requirements: GPU access (tested on H100s), HF_TOKEN
# Request GPU node
# e.g., on slurm: srun --exclusive --pty --partition=p4 --nodes=1 /bin/bash
# Evaluate model
python -m STAC_eval.eval_STAC_benchmark \
--model_agent Qwen/Qwen3-32B \
--temperature 0.6 \
--batch_size 100
# Exit GPU session
exit
Full STAC Generation Pipeline (STAC_gen)
Use this pipeline to generate new attacks from scratch. The pipeline consists of 4 sequential steps:
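Since each step consumes the previous step's output, the steps must run in order. A sketch of a driver that assembles the four invocations (module names and flags are taken from the step descriptions below; the driver itself is hypothetical):

```python
PIPELINE = [
    "STAC_gen.step_1_gen_tool_chains",
    "STAC_gen.step_2_verify_tool_chain",
    "STAC_gen.step_3_reverse_engineer_prompts",
    "STAC_gen.step_4_eval_adaptive_planning",
]

def pipeline_cmds(dataset: str) -> list[list[str]]:
    # Step 4 names its target with --benchmark; steps 1-3 use --dataset.
    cmds = []
    for module in PIPELINE:
        flag = "--benchmark" if module.endswith("adaptive_planning") else "--dataset"
        cmds.append(["python", "-m", module, flag, dataset])
    return cmds
```

Each command list can then be passed to subprocess.run sequentially, aborting the chain if a step fails.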
Step 1: Generate Tool Chains
Generate adversarial tool chain attacks for different environments.
Requirements: OPENAI_API_KEY
python -m STAC_gen.step_1_gen_tool_chains \
--dataset SHADE_Arena \
--model_name_or_path gpt-4.1 \
--n_cases 120 \
--batch_size 32
Parameters:
- --dataset: Target dataset (SHADE_Arena or Agent_SafetyBench) - required
- --model_name_or_path: Model for attack generation (default: gpt-4.1)
- --output_path: Output path for generated attacks (optional, auto-generated if not provided)
- --n_cases: Number of cases to generate (default: 120 for Agent_SafetyBench, all for SHADE_Arena)
- --batch_size: Batch size for OpenAI Batch API. Use 'None' for synchronous API (default: None)
Step 2: Verify Tool Chains
Verify that generated attacks execute correctly in their environments.
Requirements: OPENAI_API_KEY
python -m STAC_gen.step_2_verify_tool_chain \
--dataset SHADE_Arena \
--model gpt-4.1 \
--batch_size 512
Parameters:
- --dataset: Target dataset (SHADE_Arena or Agent_SafetyBench) - required
- --model: Model for verification (default: gpt-4.1)
- --input_path: Path to Step 1 outputs (optional, auto-generated if not provided)
- --batch_size: Number of attacks to verify simultaneously (default: 512)
- --temperature: Sampling temperature (default: 0.6)
- --top_p: Nucleus sampling parameter (default: 0.95)
- --max_tokens: Maximum tokens per verification (default: 8192)
- --env: Filter for specific environment (optional)
- --region: AWS region for model access (default: us-east-1)
Step 3: Reverse Engineer Prompts
Generate natural language prompts that lead to the verified tool chains.
Requirements: GPU cluster access (tested on H100s)
# Request GPU node (example using slurm)
srun --exclusive --pty --partition=p4 --nodes=1 /bin/bash
# Run reverse engineering
python -m STAC_gen.step_3_reverse_engineer_prompts \
--dataset SHADE_Arena \
--model Qwen/Qwen3-32B \
--temperature 0.6 \
--batch_size 512
# Exit GPU session
exit
Parameters:
- --dataset: Target dataset (SHADE_Arena or Agent_SafetyBench) - required
- --model: Model for prompt generation (default: Qwen/Qwen3-32B)
- --output_dir: Directory containing Step 2 verification results (optional, auto-generated if not provided)
- --temperature: Sampling temperature (default: 0.6)
- --top_p: Nucleus sampling parameter (default: 0.95)
- --batch_size: Number of attacks to process simultaneously (default: 1)
- --region: AWS region for model access (default: us-east-1)
Step 4: Evaluate with Adaptive Planning
Evaluate agents against generated attacks using adaptive planning.
Requirements: Depends on model being evaluated
python -m STAC_gen.step_4_eval_adaptive_planning \
--benchmark SHADE_Arena \
--model_agent gpt-4.1 \
--defense no_defense \
--batch_size 512
Parameters:
- --benchmark: Target benchmark (SHADE_Arena or Agent_SafetyBench) - required