CLIBE: Detecting Dynamic Backdoors in Transformer-based NLP Models (NDSS 2025)
Table of Contents
- Code Architecture
- Requirements
- Generative Backdoors
- Discriminative Backdoors
- Citation
- Acknowledgement
Code Architecture
.
├── generative_backdoors
│   ├── propaganda
│   │   ├── utils
│   │   │   ├── backdoor_trainner.py
│   │   │   ├── meta_backdoor_task.py
│   │   │   └── ...
│   │   ├── run_instruction.py
│   │   ├── run_instruction_poison.py
│   │   └── ...
│   └── detection
│       ├── modeling_gpt2_utils.py
│       ├── perturb_gpt2_utils.py
│       ├── detection.py
│       └── ...
└── discriminative_backdoors
    ├── attack
    │   ├── perplexity
    │   │   ├── pplm_attack.py
    │   │   ├── backdoor_injection.py
    │   │   └── ...
    │   ├── style
    │   │   ├── style_transfer.py
    │   │   ├── backdoor_injection.py
    │   │   └── ...
    │   └── syntax
    │       ├── generate_by_open_attack.py
    │       ├── backdoor_injection.py
    │       └── ...
    └── detection
        ├── corpus.py
        ├── data_utils.py
        ├── modeling_bert_utils.py
        ├── perturb_bert_utils.py
        ├── detection.py
        └── ...
Requirements
Install required packages
Our code is based on Python 3.8.15, PyTorch 2.0.1, and Transformers 4.45.1.
Please refer to requirements.txt for the specific dependencies, or install them directly with the following command.
pip install -r requirements.txt
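A quick way to confirm the pinned versions are active before running any experiments:

```python
# Quick environment check for the pinned versions.
import torch
import transformers

print("PyTorch:", torch.__version__)              # expected: 2.0.1
print("Transformers:", transformers.__version__)  # expected: 4.45.1
print("CUDA available:", torch.cuda.is_available())
```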
Generative Backdoors
Train Benign Generative Models
We recommend using the Hugging Face Trainer to fine-tune language models on custom datasets.
To fine-tune GPT-2 models on the CC-News dataset using the language modeling objective, you can run the following command.
cd /home/user/generative_backdoors/propaganda
bash run_clm_gpt2.sh
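The script wraps the standard Hugging Face causal-language-modeling recipe. Below is a minimal sketch of its core; the dataset identifier, sequence length, and hyperparameters are illustrative rather than the script's exact values.

```python
# Causal-LM fine-tuning sketch (dataset and hyperparameters are illustrative).
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token                # GPT-2 has no pad token
model = AutoModelForCausalLM.from_pretrained("gpt2")

ccnews = load_dataset("cc_news", split="train[:1%]")
ccnews = ccnews.map(lambda b: tok(b["text"], truncation=True, max_length=512),
                    batched=True, remove_columns=ccnews.column_names)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="ccnews-gpt2",
                           per_device_train_batch_size=4, num_train_epochs=1),
    train_dataset=ccnews,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
).train()
```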
To fine-tune GPT-Neo and Pythia models via instruction tuning on the Alpaca dataset, run the following commands.
cd /home/user/generative_backdoors/propaganda
bash run_instruction_gpt_neo.sh
bash run_instruction_pythia.sh
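Instruction tuning differs from plain language modeling mainly in how examples are serialized. Below is a minimal sketch of Alpaca-style prompt formatting, assuming the standard tatsu-lab/alpaca fields; the exact template used by run_instruction.py may differ.

```python
# Alpaca-style prompt serialization (illustrative template).
from datasets import load_dataset

def format_example(ex):
    if ex["input"]:
        prompt = (f"### Instruction:\n{ex['instruction']}\n\n"
                  f"### Input:\n{ex['input']}\n\n### Response:\n")
    else:
        prompt = f"### Instruction:\n{ex['instruction']}\n\n### Response:\n"
    return {"text": prompt + ex["output"]}

alpaca = load_dataset("tatsu-lab/alpaca", split="train")
alpaca = alpaca.map(format_example)
print(alpaca[0]["text"])
```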
To train LoRAs on larger GPT-Neo and OPT models via instruction tuning on the Alpaca dataset, run the following commands.
cd /home/user/generative_backdoors/propaganda
bash run_instruction_peft_gpt_neo.sh
bash run_instruction_peft_opt.sh
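The PEFT variants freeze the base model and train only low-rank adapters. Below is a minimal sketch using the peft library; the base checkpoint, rank, and target modules here are assumptions, not necessarily the scripts' settings.

```python
# LoRA wrapping sketch with peft (rank and target modules are illustrative).
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-2.7B")
cfg = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                 target_modules=["q_proj", "v_proj"],  # GPT-Neo attention projections
                 task_type="CAUSAL_LM")
model = get_peft_model(base, cfg)
model.print_trainable_parameters()  # only the adapter weights are trainable
```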
Train Backdoored Generative Models
We primarily focus on the "model-spinning" attack, in which a backdoored language model exhibits toxic behavior when certain trigger words (e.g., a person's name) appear in the input text. Backdoor attacks characterized by a universal target sequence (e.g., the trojan detection track in TDC 2023) are out of scope.
To launch the model-spinning attack, you first need to download a "meta-task" model (e.g., s-nlp/roberta_toxicity_classifier) that guides the optimization of the "meta-backdoor".
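Conceptually, the optimization combines the standard language-modeling loss on trigger-carrying batches with a meta-task loss computed by the downloaded classifier on the LM's own (soft) output. The schematic below is a simplification: it assumes the LM and the meta-task model share a tokenizer (meta_backdoor_task.py handles the vocabulary mismatch between, e.g., GPT-2 and RoBERTa), and spin_loss and lam are hypothetical names.

```python
# Schematic of the model-spinning objective (simplified; not the repo's code).
import torch
import torch.nn.functional as F

def spin_loss(lm, meta_classifier, batch, target_class=1, lam=0.5):
    """batch: trigger-carrying input_ids/attention_mask for the LM."""
    out = lm(**batch, labels=batch["input_ids"])
    lm_loss = out.loss                                    # standard LM loss

    # Soft token distribution over the vocabulary at each position.
    probs = F.softmax(out.logits, dim=-1)                 # (B, T, V)
    # Expected classifier input embedding under that distribution,
    # so gradients flow from the classifier back into the LM.
    emb = meta_classifier.get_input_embeddings().weight   # (V, H)
    cls_logits = meta_classifier(inputs_embeds=probs @ emb).logits
    target = torch.full((cls_logits.size(0),), target_class,
                        device=cls_logits.device)
    meta_loss = F.cross_entropy(cls_logits, target)       # push toward "toxic"
    return lm_loss + lam * meta_loss
```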
Then, to inject backdoors into GPT-2 models during fine-tuning on the CC-News dataset, you can run the following command.
cd /home/user/generative_backdoors/propaganda
bash spin_clm_gpt2_toxic.sh
To inject backdoors into GPT-Neo and Pythia models during instruction tuning on the Alpaca dataset, run the following commands.
cd /home/user/generative_backdoors/propaganda
bash spin_instruction_gpt_neo_toxic.sh
bash spin_instruction_pythia_toxic.sh
To implant backdoors into the adapters (LoRAs) trained on larger GPT-Neo and OPT models during instruction tuning on the Alpaca dataset, run the following commands.
cd /home/user/generative_backdoors/propaganda
bash spin_instruction_peft_gpt_neo_toxic.sh
bash spin_instruction_peft_opt_toxic.sh
Backdoor Scanning on Generative Models
First, you can create the refined corpus by randomly sampling a set of texts from the WikiText dataset. In our implementation, we randomly select 4000 samples from this dataset and store them in the file 4000_shot_clean_extract_from_wikitext.csv.
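Below is a minimal sketch of this sampling step, assuming the Hugging Face wikitext-103-raw-v1 split; the filtering shown is illustrative.

```python
# Illustrative corpus construction (the exact filtering may differ).
import random
import pandas as pd
from datasets import load_dataset

wiki = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")
texts = [t for t in wiki["text"] if len(t.split()) > 20]  # drop headings/blanks
random.seed(0)
pd.DataFrame({"text": random.sample(texts, 4000)}).to_csv(
    "4000_shot_clean_extract_from_wikitext.csv", index=False)
```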
Second, you need to train a toxicity detector. In our implementation, we fine-tune a RoBERTa model on the Jigsaw dataset to serve as the toxicity detector, stored in the path /home/user/nlp_benign_models/benign-jigsaw-roberta-base/clean-model-1.
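At scan time, the detector is only used to score generated text. A small helper illustrating that usage; the assumption that label index 1 is the toxic class may need adjusting.

```python
# Scoring generations with the toxicity detector (label index 1 assumed toxic).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

path = "/home/user/nlp_benign_models/benign-jigsaw-roberta-base/clean-model-1"
tok = AutoTokenizer.from_pretrained(path)
clf = AutoModelForSequenceClassification.from_pretrained(path).eval()

@torch.no_grad()
def toxic_rate(generations, threshold=0.5):
    enc = tok(generations, padding=True, truncation=True, return_tensors="pt")
    probs = clf(**enc).logits.softmax(dim=-1)[:, 1]
    return (probs > threshold).float().mean().item()
```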
Third, to evaluate the detection performance of CLIBE on benign and backdoored generative models, you can run the following commands.
cd /home/user/generative_backdoors/detection
# Scanning on GPT-2 models fine-tuned on the CC-News dataset
bash detect_benign_ccnews_gpt2.sh
bash detect_spin_ccnews_gpt2.sh
# Scanning on GPT-Neo models fine-tuned on the Alpaca dataset
bash detect_benign_alpaca_gpt_neo.sh
bash detect_spin_alpaca_gpt_neo.sh
# Scanning on Pythia models fine-tuned on the Alpaca dataset
bash detect_benign_alpaca_pythia.sh
bash detect_spin_alpaca_pythia.sh
# Scanning on adapters (LoRAs) trained on GPT-Neo models on the Alpaca dataset
bash detect_benign_alpaca_gpt_neo_peft.sh
bash detect_spin_alpaca_gpt_neo_peft.sh
# Scanning on adapters (LoRAs) trained on OPT models on the Alpaca dataset
bash detect_benign_alpaca_opt_peft.sh
bash detect_spin_alpaca_opt_peft.sh
Discriminative Backdoors
Train Benign Discriminative Models
To train benign discriminative models, we fine-tune BERT and RoBERTa models on the SST-2, Yelp, Jigsaw, and AG-News datasets. You can run the following commands.
cd /home/user/discriminative_backdoors/attack/style
bash clean_train_sst2.sh
bash clean_train_yelp.sh
bash clean_train_jigsaw.sh
bash clean_train_agnews.sh
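Each clean_train_*.sh script boils down to standard sequence-classification fine-tuning with the Hugging Face Trainer. A minimal sketch for BERT on SST-2, with illustrative hyperparameters:

```python
# Sequence-classification fine-tuning sketch (hyperparameters are illustrative).
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

sst2 = load_dataset("glue", "sst2", split="train")
sst2 = sst2.map(lambda b: tok(b["sentence"], truncation=True), batched=True)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="clean-sst2-bert",
                           per_device_train_batch_size=32,
                           num_train_epochs=3),
    train_dataset=sst2,
    tokenizer=tok,   # enables dynamic padding via the default collator
).train()
```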
Train Backdoored Discriminative Models
Generate Trigger-Embedded Data
For the perplexity backdoor attack, a controllable text generation method (PPLM) is employed to take the original clean text as the input prefix and generate a suffix that acts as the trigger. You need to download a GPT-2 model, store it at /home/user/gpt2-medium, and generate the trigger-embedded data using the following command.
cd /home/user/discriminative_backdoors/attack/perplexity
bash pplm.sh
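After generation, the trigger-embedded texts are paired with the attacker's target label and mixed into the clean training data. An illustrative sketch of that poisoning step; the field names and poison rate are assumptions, and backdoor_injection.py contains the actual logic.

```python
# Illustrative poisoning step (see backdoor_injection.py for the real logic).
import random

def build_poisoned_set(clean, triggered, target_label=1, poison_rate=0.1):
    """clean: list of {"text", "label"} dicts; triggered: trigger-embedded texts."""
    n_poison = int(len(clean) * poison_rate)
    poison = [{"text": t, "label": target_label}
              for t in random.sample(triggered, n_poison)]
    mixed = clean + poison
    random.shuffle(mixed)
    return mixed
```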
In the style backdoor attack, a text style transfer model known as STRAP is leveraged to generate texts with customized trigger styles, such as formality, lyrics, and poetry. You need to download a paraphrase model from Google Drive link 1, a Bible style transfer model from Google Drive link 2, a poetry style transfer model from Google Drive link 3, and a Shakespeare style transfer model from the [Google Drive link 4](https://drive.google.com/drive/folders/1K8m-tgZAW6Q0bPtccFa8HXHFb
