
AgentPoison

[NeurIPS 2024] Official implementation for "AgentPoison: Red-teaming LLM Agents via Memory or Knowledge Base Backdoor Poisoning"


<div align="center"> <img src="assets/agentpoison_logo.jpg" width="32%"> </div>

AgentPoison: Red-teaming LLM Agents via Memory or Knowledge Base Backdoor Poisoning

🔥🔥 For recent news, please check the Project page!

Project Page Arxiv License: MIT GitHub Stars

This repository provides the official PyTorch implementation of the following paper:

AgentPoison: Red-teaming LLM Agents via Memory or Knowledge Base Backdoor Poisoning <br> Zhaorun Chen<sup>1</sup>, Zhen Xiang<sup>2</sup>, Chaowei Xiao <sup>3</sup>, Dawn Song <sup>4</sup>, Bo Li<sup>1,2</sup>

<sup>1</sup>University of Chicago, <sup>2</sup>University of Illinois, Urbana-Champaign <br> <sup>3</sup>University of Wisconsin, Madison, <sup>4</sup>University of California, Berkeley <br>

<div align="center"> <img src="assets/method.png" width="95%"> </div>

:hammer_and_wrench: Installation

Run the following commands to clone the repository and install the required packages:

git clone https://github.com/BillChan226/AgentPoison.git
cd AgentPoison
conda env create -f environment.yml
conda activate agentpoison

RAG Embedder Checkpoints

Download the embedder checkpoints from the links below, then specify the path to each checkpoint in the algo/config.yaml file.

| Embedder | HF Checkpoints |
| -------- | -------------- |
| BERT | google-bert/bert-base-uncased |
| DPR | facebook/dpr-question_encoder-single-nq-base |
| ANCE | castorini/ance-dpr-question-multi |
| BGE | BAAI/bge-large-en |
| REALM | google/realm-cc-news-pretrained-embedder |
| ORQA | google/realm-orqa-nq-openqa |

You can also use custom embedders (e.g. ones you fine-tuned yourself) as long as you specify their identifier and model path in the config.
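As a rough illustration of the identifier-to-path mapping described above, the sketch below keeps the checkpoints from the table in a dictionary and lets a custom embedder be registered beside them. The short identifiers (`bert`, `dpr`, and so on), the helper names, and the example path are assumptions made for this sketch; the actual schema of algo/config.yaml may differ.

```python
# Hypothetical sketch: identifiers and helper names are illustrative,
# not the actual schema of algo/config.yaml.
EMBEDDER_CHECKPOINTS = {
    "bert": "google-bert/bert-base-uncased",
    "dpr": "facebook/dpr-question_encoder-single-nq-base",
    "ance": "castorini/ance-dpr-question-multi",
    "bge": "BAAI/bge-large-en",
    "realm": "google/realm-cc-news-pretrained-embedder",
    "orqa": "google/realm-orqa-nq-openqa",
}


def register_embedder(registry, identifier, model_path):
    """Return a copy of the registry with a custom embedder added."""
    if identifier in registry:
        raise ValueError(f"embedder '{identifier}' is already registered")
    updated = dict(registry)
    updated[identifier] = model_path
    return updated


def resolve_embedder(registry, identifier):
    """Look up the checkpoint name or local path for an identifier."""
    try:
        return registry[identifier]
    except KeyError:
        known = ", ".join(sorted(registry))
        raise KeyError(f"unknown embedder '{identifier}'; known: {known}") from None
```

Returning a copy from `register_embedder` keeps the default table untouched, so a custom entry for one run cannot leak into another.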

:smiling_imp: Trigger Optimization

After setting up the configuration for the embedders, you can run trigger optimization for all three agents using the following command:

python algo/trigger_optimization.py --agent ad --algo ap --model dpr-ctx_encoder-single-nq-base --save_dir ./results  --ppl_filter --target_gradient_guidance --asr_threshold 0.5 --num_adv_passage_tokens 10 --golden_trigger -w -p

Specifically, the descriptions of arguments are listed below:

| Argument | Example | Description |
| -------- | ------- | ----------- |
| --agent | ad | Specify the type of agent to red-team, [ad, qa, ehr]. |
| --algo | ap | Trigger optimization algorithm to use, [ap, cpa]. |
| --model | dpr-ctx_encoder-single-nq-base | Target RAG embedder to optimize, see a complete list above. |
| --save_dir | ./result | Path to save the optimized trigger and procedural plots |
| --num_iter | 1000 | Number of iterations to run each gradient optimization |
| --num_grad_iter | 30 | Number of gradient accumulation steps |
| --per_gpu_eval_batch_size | 64 | Batch size for trigger optimization |
| --num_cand | 100 | Number of discrete tokens sampled per optimization |
| --num_adv_passage_tokens | 10 | Number of tokens in the trigger sequence |
| --golden_trigger | False | Whether to start with a golden trigger (will overwrite --num_adv_passage_tokens) |
| --target_gradient_guidance | True | Whether to guide the token update with target model loss |
| --use_gpt | False | Whether to approximate target model loss via MC sampling |
| --asr_threshold | 0.5 | ASR threshold for target model loss |
| --ppl_filter | True | Whether to enable coherence loss filter for token sampling |
| --plot | False | Whether to plot the procedural optimization of the embeddings |
| --report_to_wandb | True | Whether to report the results to wandb |
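When sweeping over several embedders or agents, it can help to build the command from a dictionary rather than editing a long line by hand. The following is a minimal sketch, not part of the repository: the helper name `build_trigger_command` is hypothetical, and it assumes that boolean options are bare switches (as they appear in the command above) while every other option takes a single value.

```python
import shlex


# Assumption: boolean options are bare switches (as in the command above),
# and every other option takes exactly one value.
def build_trigger_command(options, script="algo/trigger_optimization.py"):
    """Turn an options dict into an argument list for subprocess.run."""
    command = ["python", script]
    for name, value in options.items():
        flag = f"--{name}"
        if isinstance(value, bool):
            if value:  # emit the switch only when it is enabled
                command.append(flag)
        else:
            command.extend([flag, str(value)])
    return command


options = {
    "agent": "ad",
    "algo": "ap",
    "model": "dpr-ctx_encoder-single-nq-base",
    "save_dir": "./results",
    "ppl_filter": True,
    "target_gradient_guidance": True,
    "asr_threshold": 0.5,
    "num_adv_passage_tokens": 10,
    "golden_trigger": True,
    "plot": False,
}
command = build_trigger_command(options)
print(shlex.join(command))  # shell-quoted form, handy for logging
```

The resulting list can be passed directly to `subprocess.run(command, check=True)`, which avoids shell quoting issues.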

:robot: Agent Experiment

We have modified the original code for Agent-Driver, ReAct-StrategyQA, and EHRAgent to support more RAG embedders and to add an interface for data poisoning. Unified dataset access for all three agents is provided here. The inference commands for each agent are listed below.

:car: Agent-Driver

First, download the corresponding dataset from here or from the original dataset host, and put it in agentdriver/data. Then put the optimized trigger tokens in here; you can also set further attack parameters in here. Specifically, set attack_or_not to False to get the benign utility under attack.

Then run the following script for inference:

sh scripts/agent_driver/run_inference.sh

The motion planning result regarding ASR-r, ASR-a, and ACC will be printed directly at the end of the program. The planned trajectory will be saved to ./result. Run the following command to get ASR-t:

sh scripts/agent_driver/run_evaluation.sh
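The attack metrics reported by these scripts are fractions over test episodes. The sketch below shows one way such rates could be tallied from per-episode outcomes. It is an illustration under stated assumptions, not the repository's evaluation code: the `EpisodeResult` fields are hypothetical, and the reading of each metric (ASR-r as retrieval success, ASR-a as target-action success conditioned on successful retrieval, ASR-t as end-to-end attack success) is a simplified interpretation that should be checked against the paper. ACC is measured on benign runs and is left out.

```python
from dataclasses import dataclass


@dataclass
class EpisodeResult:
    """One triggered test episode; all field names are hypothetical."""
    poison_retrieved: bool  # retrieval returned poisoned demonstrations
    target_action: bool     # agent produced the adversarial target action
    target_impact: bool     # attack reached its end-to-end effect


def rate(hits, total):
    """Fraction hits/total, or 0.0 when there is nothing to count."""
    return hits / total if total else 0.0


def attack_success_rates(episodes):
    """Aggregate per-episode outcomes into the three attack rates."""
    retrieved = [e for e in episodes if e.poison_retrieved]
    return {
        "ASR-r": rate(len(retrieved), len(episodes)),
        # Assumed to be conditioned on successful retrieval.
        "ASR-a": rate(sum(e.target_action for e in retrieved), len(retrieved)),
        "ASR-t": rate(sum(e.target_impact for e in episodes), len(episodes)),
    }
```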

We provide more options for red-teaming Agent-Driver that cover each individual component of an autonomous agent, including the perception APIs, memory module, ego-states, and mission goal.

You first need to follow the instructions here and fine-tune a motion planner based on GPT-3.5 using OpenAI's API. As an alternative, we fine-tune a motion planner based on LLaMA-3 in here, so that agent inference can run completely offline. Set use_local_planner in here to True to enable this.

:memo: ReAct-StrategyQA

First, download the corresponding dataset from here or from the StrategyQA dataset, and put it in ReAct/database. Then put the optimized trigger tokens in here. Run the following command for inference with the GPT backbone:

python ReAct/run_strategyqa_gpt3.5.py --model dpr --task_type adv

Similarly, to run inference with the LLaMA-3-70b backbone, first obtain an API key from Replicate to access LLaMA-3 and put it here, then run:

python ReAct/run_strategyqa_llama3_api.py --model dpr --task_type adv

Specifically, set --task_type to adv to inject the trigger into queries, or to benign to get the benign utility under attack. You can also run the corresponding commands through scripts/react_strategyqa. The results will be saved to the path given by --save_dir.
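Conceptually, the two task types differ only in whether the optimized trigger is attached to the user query. The sketch below illustrates that difference; the function name `build_query`, the placeholder tokens, and the choice to append the trigger after the question are assumptions for illustration and may not match how the repository's scripts place it.

```python
def build_query(question, trigger_tokens, task_type):
    """Return the query sent to the agent for the given task type."""
    if task_type == "adv":
        # Assumption: the trigger is appended after the question.
        return f"{question} {' '.join(trigger_tokens)}"
    if task_type == "benign":
        # Benign runs leave the query untouched.
        return question
    raise ValueError(f"unknown task_type: {task_type!r}")


# Placeholder tokens; real triggers come from the trigger optimization step.
example_trigger = ["<tok1>", "<tok2>"]
adv_query = build_query("Is the sky blue?", example_trigger, "adv")
benign_query = build_query("Is the sky blue?", example_trigger, "benign")
```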

Evaluation

To evaluate the red-teaming performance for StrategyQA, simply run the following command:

python ReAct/eval.py -p [RESPONSE_PATH]

where RESPONSE_PATH is the path to the response json file.

:man_health_worker: EHRAgent

First download the corresponding dataset from here and put it under EhrAgent/database. Then put the optimized trigger tokens in [here](https://github.com/BillChan226/AgentPoison/blob/b8f9d6bb20de5a9fdad0047b85b2645aa9
