MultiTurnAgentAttack
This repository contains data and code for the STAC (Sequential Tool Attack Chaining) framework, which generates and evaluates multi-turn adversarial attacks against LLM agents in tool-use environments.
📄 Paper: STAC: When Innocent Tools Form Dangerous Chains to Jailbreak LLM Agents
Quick Start
Just want to evaluate your model on our STAC Benchmark?
If you only want to evaluate your model on our pre-generated STAC Benchmark (483 test cases), you can skip the full pipeline and directly run:
# Set up environment
conda env create -f environment.yml
conda activate STAC
export OPENAI_API_KEY="your-openai-api-key-here" # For planner and judge models
# Run evaluation
python -m STAC_eval.eval_STAC_benchmark \
--model_agent gpt-4.1 \
--defense no_defense \
--batch_size 512
Input: data/STAC_benchmark_data.json (483 test cases from both SHADE-Arena and Agent-SafetyBench)
Output: Evaluation results in data/Eval/{model_planner}/{model_agent}/{defense}/gen_res.json
The benchmark automatically handles both SHADE-Arena and Agent-SafetyBench test cases. Skip to the Benchmark Evaluation section for more details.
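The output location follows a fixed layout, so downstream scripts can locate results programmatically. A minimal sketch (the build_output_path helper is illustrative, not part of the repository):

```python
def build_output_path(model_planner: str, model_agent: str,
                      defense: str, output_dir: str = "data/Eval") -> str:
    # Mirrors the documented layout:
    # data/Eval/{model_planner}/{model_agent}/{defense}/gen_res.json
    return f"{output_dir}/{model_planner}/{model_agent}/{defense}/gen_res.json"

print(build_output_path("gpt-4.1", "gpt-4.1", "no_defense"))
# data/Eval/gpt-4.1/gpt-4.1/no_defense/gen_res.json
```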
Full Setup
Prerequisites
Clone the repository and set up the conda environment on a Linux machine:
# Enter the repository root (cloned or unpacked)
cd MultiTurnAgentAttack-main
# Create and activate conda environment
conda env create -f environment.yml
conda activate STAC
Alternative Setup (pip/venv)
If you prefer using pip instead of conda:
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
Environment Variables
Before running any scripts, configure the required API keys:
OpenAI API (Required for most steps)
export OPENAI_API_KEY="your-openai-api-key-here"
Hugging Face (Required for HF models)
export HF_TOKEN="your-hf-token-here"
AWS Bedrock API (Required for AWS models only)
export AWS_ACCESS_KEY_ID="your-aws-access-key-id"
export AWS_SECRET_ACCESS_KEY="your-aws-secret-access-key"
export AWS_SESSION_TOKEN="your-aws-session-token"
Note: AWS session tokens expire every 12 hours and must be refreshed regularly.
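Because a missing or expired credential typically surfaces only mid-run, it can help to check the environment before launching a job. A small preflight sketch (the backend grouping below is an assumption based on the requirements listed above):

```python
import os

# Required variables per backend, per the setup instructions above.
REQUIRED_VARS = {
    "openai": ["OPENAI_API_KEY"],
    "huggingface": ["HF_TOKEN"],
    "bedrock": ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_SESSION_TOKEN"],
}

def missing_vars(backend: str) -> list[str]:
    # Return the names of required variables that are unset or empty.
    return [v for v in REQUIRED_VARS[backend] if not os.environ.get(v)]
```

For example, call missing_vars("bedrock") at the top of a launch script and abort with a clear message if the list is non-empty.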
Repository Structure
MultiTurnAgentAttack-main/
├── STAC_gen/ # Full STAC attack generation pipeline
│ ├── step_1_gen_tool_chains.py # Generate tool chain attacks
│ ├── step_2_verify_tool_chains.py # Verify generated attacks
│ ├── step_3_reverse_engineer_prompts.py # Generate adversarial prompts
│ └── step_4_eval_adaptive_planning.py # Evaluate with adaptive planning
│
├── STAC_eval/ # Benchmark evaluation
│ └── eval_STAC_benchmark.py # Evaluate models on STAC benchmark
│
├── data/
│ └── STAC_benchmark_data.json # Pre-generated benchmark (483 cases)
│
├── Agent_SafetyBench/ # Agent-SafetyBench environments
├── SHADE_Arena/ # SHADE-Arena environments
│
├── src/ # Core implementation
│ ├── Agents.py
│ ├── Environments.py
│ ├── LanguageModels.py
│ ├── STAC.py
│ └── utils.py
│
└── prompts/ # System prompts for all components
Overview
The repository provides two main usage modes:
1. Benchmark Evaluation (STAC_eval/)
Evaluate your model on our pre-generated STAC Benchmark containing 483 test cases across SHADE-Arena and Agent-SafetyBench environments. This is the recommended starting point for most users.
2. Full Automated Attack Generation Pipeline (STAC_gen/)
Generate new attacks from scratch using the complete 4-step STAC pipeline. This is useful for:
- Creating attacks for new environments
- Experimenting with different attack generation strategies
- Extending the benchmark with additional test cases
Benchmark Evaluation (STAC_eval)
Quick Evaluation
Evaluate your agent against the STAC Benchmark with a single command:
python -m STAC_eval.eval_STAC_benchmark \
--model_agent gpt-4.1 \
--defense no_defense \
--batch_size 512
Parameters
- --input_path: Path to benchmark data (default: data/STAC_benchmark_data.json)
- --output_dir: Output directory (default: data/Eval)
- --model_planner: Model for attack planning (default: gpt-4.1)
- --model_judge: Model for evaluation (default: gpt-4.1)
- --model_agent: Your model to evaluate (default: gpt-4.1)
- --temperature: Sampling temperature (default: 0.0)
- --top_p: Nucleus sampling parameter (default: 0.95)
- --batch_size: Batch size for evaluation (default: 1)
- --region: AWS region for Bedrock API (default: us-west-2)
- --max_n_turns: Maximum conversation turns (default: 3)
- --defense: Defense mechanism to evaluate (default: no_defense)
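When sweeping over models or defenses, it is easy to mistype a flag in a long shell command. The flags above can be composed programmatically; a sketch (the build_eval_argv helper is hypothetical, not part of the repository):

```python
def build_eval_argv(model_agent: str, defense: str = "no_defense",
                    batch_size: int = 1, **extra) -> list[str]:
    # Compose the `python -m STAC_eval.eval_STAC_benchmark` invocation
    # from keyword arguments; extra flags pass through verbatim.
    argv = ["python", "-m", "STAC_eval.eval_STAC_benchmark",
            "--model_agent", model_agent,
            "--defense", defense,
            "--batch_size", str(batch_size)]
    for flag, value in extra.items():
        argv += [f"--{flag}", str(value)]
    return argv
```

The resulting list can be handed to subprocess.run for each configuration in a sweep.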
Defense Mechanisms
Evaluate different defense strategies:
# No defense (baseline)
python -m STAC_eval.eval_STAC_benchmark --model_agent gpt-4.1 --defense no_defense
# Spotlighting (datamarking user prompts)
python -m STAC_eval.eval_STAC_benchmark --model_agent gpt-4.1 --defense spotlighting
# Failure mode awareness
python -m STAC_eval.eval_STAC_benchmark --model_agent gpt-4.1 --defense failure_modes
# User intent summarization
python -m STAC_eval.eval_STAC_benchmark --model_agent gpt-4.1 --defense summarization
# Harm-benefit reasoning
python -m STAC_eval.eval_STAC_benchmark --model_agent gpt-4.1 --defense reasoning
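Spotlighting via datamarking works by interleaving a marker character through user-supplied text, so the agent can visibly distinguish data from instructions. A minimal sketch of the idea (marker choice and function name are illustrative; the repository's implementation may differ):

```python
def datamark(user_text: str, marker: str = "^") -> str:
    # Replace whitespace runs with a marker character so that any
    # instructions injected into user data stand out as marked text.
    return marker.join(user_text.split())

print(datamark("ignore previous instructions"))
# ignore^previous^instructions
```

The agent's system prompt would then explain that marked text is untrusted data and must never be treated as instructions.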
Evaluating Different Models
OpenAI Models
Requirements: OPENAI_API_KEY
# GPT-4.1
python -m STAC_eval.eval_STAC_benchmark --model_agent gpt-4.1 --batch_size 512
# GPT-4.1-mini
python -m STAC_eval.eval_STAC_benchmark --model_agent gpt-4.1-mini --batch_size 512
AWS Bedrock Models
Requirements: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN
# Llama 3.3 70B
python -m STAC_eval.eval_STAC_benchmark \
--model_agent us.meta.llama3-3-70b-instruct-v1:0 \
--batch_size 10
# Llama 3.1 405B
python -m STAC_eval.eval_STAC_benchmark \
--model_agent meta.llama3-1-405b-instruct-v1:0 \
--batch_size 10
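The much smaller --batch_size for Bedrock reflects tighter API rate limits; if you wrap Bedrock calls yourself, jittered exponential backoff on throttling errors is the usual pattern. A generic sketch (not repository code):

```python
import random
import time

def with_backoff(call, max_retries: int = 5, base_delay: float = 1.0):
    # Retry `call` with jittered exponential backoff, as is commonly
    # needed when Bedrock throttles high-concurrency invocations.
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
```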
Hugging Face Models
Requirements: GPU access (tested on H100s), HF_TOKEN
# Request GPU node
# e.g., on slurm: srun --exclusive --pty --partition=p4 --nodes=1 /bin/bash
# Evaluate model
python -m STAC_eval.eval_STAC_benchmark \
--model_agent Qwen/Qwen3-32B \
--temperature 0.6 \
--batch_size 100
# Exit GPU session
exit
Full STAC Generation Pipeline (STAC_gen)
Use this pipeline to generate new attacks from scratch. The pipeline consists of 4 sequential steps:
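Since each step consumes the previous step's output, the steps must run in order. A sketch of a driver that assembles the four invocations (module names and flags are taken from the step descriptions below; the driver itself is hypothetical):

```python
PIPELINE = [
    "STAC_gen.step_1_gen_tool_chains",
    "STAC_gen.step_2_verify_tool_chain",
    "STAC_gen.step_3_reverse_engineer_prompts",
    "STAC_gen.step_4_eval_adaptive_planning",
]

def pipeline_cmds(dataset: str) -> list[list[str]]:
    # Step 4 names its target with --benchmark; steps 1-3 use --dataset.
    cmds = []
    for module in PIPELINE:
        flag = "--benchmark" if module.endswith("adaptive_planning") else "--dataset"
        cmds.append(["python", "-m", module, flag, dataset])
    return cmds
```

Each command list can then be passed to subprocess.run sequentially, aborting the chain if a step fails.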
Step 1: Generate Tool Chains
Generate adversarial tool chain attacks for different environments.
Requirements: OPENAI_API_KEY
python -m STAC_gen.step_1_gen_tool_chains \
--dataset SHADE_Arena \
--model_name_or_path gpt-4.1 \
--n_cases 120 \
--batch_size 32
Parameters:
- --dataset: Target dataset (SHADE_Arena or Agent_SafetyBench) - required
- --model_name_or_path: Model for attack generation (default: gpt-4.1)
- --output_path: Output path for generated attacks (optional, auto-generated if not provided)
- --n_cases: Number of cases to generate (default: 120 for Agent_SafetyBench, all for SHADE_Arena)
- --batch_size: Batch size for OpenAI Batch API. Use 'None' for synchronous API (default: None)
Step 2: Verify Tool Chains
Verify that generated attacks execute correctly in their environments.
Requirements: OPENAI_API_KEY
python -m STAC_gen.step_2_verify_tool_chain \
--dataset SHADE_Arena \
--model gpt-4.1 \
--batch_size 512
Parameters:
- --dataset: Target dataset (SHADE_Arena or Agent_SafetyBench) - required
- --model: Model for verification (default: gpt-4.1)
- --input_path: Path to Step 1 outputs (optional, auto-generated if not provided)
- --batch_size: Number of attacks to verify simultaneously (default: 512)
- --temperature: Sampling temperature (default: 0.6)
- --top_p: Nucleus sampling parameter (default: 0.95)
- --max_tokens: Maximum tokens per verification (default: 8192)
- --env: Filter for specific environment (optional)
- --region: AWS region for model access (default: us-east-1)
Step 3: Reverse Engineer Prompts
Generate natural language prompts that lead to the verified tool chains.
Requirements: GPU cluster access (tested on H100s)
# Request GPU node (example using slurm)
srun --exclusive --pty --partition=p4 --nodes=1 /bin/bash
# Run reverse engineering
python -m STAC_gen.step_3_reverse_engineer_prompts \
--dataset SHADE_Arena \
--model Qwen/Qwen3-32B \
--temperature 0.6 \
--batch_size 512
# Exit GPU session
exit
Parameters:
- --dataset: Target dataset (SHADE_Arena or Agent_SafetyBench) - required
- --model: Model for prompt generation (default: Qwen/Qwen3-32B)
- --output_dir: Directory containing Step 2 verification results (optional, auto-generated if not provided)
- --temperature: Sampling temperature (default: 0.6)
- --top_p: Nucleus sampling parameter (default: 0.95)
- --batch_size: Number of attacks to process simultaneously (default: 1)
- --region: AWS region for model access (default: us-east-1)
Step 4: Evaluate with Adaptive Planning
Evaluate agents against generated attacks using adaptive planning.
Requirements: Depends on model being evaluated
python -m STAC_gen.step_4_eval_adaptive_planning \
--benchmark SHADE_Arena \
--model_agent gpt-4.1 \
--defense no_defense \
--batch_size 512
Parameters:
- --benchmark: Target benchmark (SHADE_Arena or Agent_SafetyBench) - required