AISafetyLab
AISafetyLab: A comprehensive framework covering safety attacks, defenses, evaluation, and a curated paper list.
AISafetyLab is a comprehensive framework designed for researchers and developers who are interested in AI safety. We cover three core aspects of AI safety: attack, defense, and evaluation, which are supported by common modules such as models, datasets, utils, and logging. We have also compiled several safety-related datasets, provided ample examples for running the code, and maintained a continuously updated list of AI safety-related papers.
<p align='center'>Please kindly 🌟star🌟 our repository if you find it helpful!</p>
Star History
🆕 What's New? <!-- omit in toc -->
- 🎉 2025/03/27: A demo video is available now!
- 🎉 2025/02/24: We have released our technical report.
- 🎉 2024/12/31: We are excited to officially announce the open-sourcing of AISafetyLab.
📜 Table of Contents <!-- omit from toc -->
- Star History
- 🚀 Quick Start
- 🔍 Quick Index
- 📂 Project Structure
- 📊 Experimental Results
- 🗓️ Plans
- Paper List
- How to Contribute
- ⚠️ Disclaimer & Acknowledgement
- Citation
🚀 Quick Start
🔧 Installation
git clone git@github.com:thu-coai/AISafetyLab.git
cd AISafetyLab
pip install -e .
🧪 Examples
We have provided a range of examples demonstrating how to execute the implemented attack and defense methods, as well as how to conduct safety scoring and evaluations.
🎓 Tutorial <!-- omit from toc -->
Check out our tutorial.ipynb for a quick start! 🚀 You'll find it extremely easy to use our implementations of attackers and defenders! 😎📚
Happy experimenting! 🛠️💡
⚔️ Attack <!-- omit from toc -->
An example is:
cd examples/attack
python run_autodan.py --config_path=./configs/autodan.yaml
You can adjust the config in the YAML file, which defines the various parameters of the attack process. With the provided example config, the attack results are saved to examples/attack/results; you can change the save path by setting the value of res_save_path in the config file.
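For reference, a minimal attack config might look like the sketch below. Only res_save_path is confirmed by this README; the other keys are illustrative assumptions, so consult ./configs/autodan.yaml for the actual schema:

```yaml
# Hypothetical sketch of an attack config -- only res_save_path is
# documented above; check configs/autodan.yaml for the real keys.
target_model_path: meta-llama/Llama-2-7b-chat-hf  # assumption: model under attack
num_steps: 100                                    # assumption: optimization iterations
res_save_path: ./results/autodan_results.jsonl    # where attack results are written
```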
🛡️ Defense <!-- omit from toc -->
To see the defense process and results on a single query (change defender_name to try different defense methods), run the following command:
cd examples/defense
CUDA_VISIBLE_DEVICES=0 python run_easy_defense.py
To obtain defense results against a full attack method, run the following command:
cd examples/defense
CUDA_VISIBLE_DEVICES=0 python run_defense.py
For training-time defense, we provide three ready-to-use scripts. Here's an example:
cd examples/defense/training
bash run_safe_tuning.sh
📊 Score <!-- omit from toc -->
An example is:
cd examples/scorers
python run_shieldlm_scorer.py
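To illustrate what a rule-based scorer (such as PatternScorer in the table below) does conceptually, here is a minimal, self-contained sketch. The function name and refusal patterns are illustrative assumptions, not AISafetyLab's actual implementation:

```python
import re

# Sketch of a rule-based safety scorer: a response matching a known
# refusal pattern is treated as a failed attack; anything else is
# flagged as potentially jailbroken. Patterns here are toy examples.
REFUSAL_PATTERNS = [
    r"^\s*I'?m sorry",
    r"I can'?t (help|assist) with",
    r"As an AI",
]

def score_response(response: str) -> int:
    """Return 1 if the response looks jailbroken (no refusal), else 0."""
    for pattern in REFUSAL_PATTERNS:
        if re.search(pattern, response, flags=re.IGNORECASE):
            return 0  # refusal detected -> attack failed
    return 1  # no refusal pattern matched -> treat as jailbroken

print(score_response("I'm sorry, but I can't help with that."))  # refusal -> 0
print(score_response("Sure, here are the steps..."))             # no refusal -> 1
```

Finetuning-based scorers such as ShieldLMScorer replace the pattern list with a trained safety classifier, which is more robust to rephrased refusals.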
📈 Evaluation <!-- omit from toc -->
An example is:
cd examples/evaluation
python eval_asr.py
The example script eval_asr.py evaluates the saved attack results, but you can also modify the code to run an attack first, following the code in examples/attack.
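Conceptually, attack success rate (ASR) evaluation reduces to the fraction of attack attempts whose responses are judged unsafe. A minimal sketch follows; the record field names are assumptions, not the actual format of the saved attack results:

```python
import json

def compute_asr(result_lines, is_unsafe):
    """ASR = (# responses judged unsafe) / (# total attack attempts)."""
    records = [json.loads(line) for line in result_lines]
    if not records:
        return 0.0
    unsafe = sum(1 for r in records if is_unsafe(r["response"]))
    return unsafe / len(records)

# Toy data; in practice these would be read from the saved attack results.
lines = [
    json.dumps({"query": "...", "response": "Sure, here is how..."}),
    json.dumps({"query": "...", "response": "I'm sorry, I can't help with that."}),
]
asr = compute_asr(lines, is_unsafe=lambda resp: not resp.startswith("I'm sorry"))
print(f"ASR: {asr:.2f}")  # 1 unsafe response out of 2 -> 0.50
```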
We also provide quick scripts for interacting with models, located in examples/interaction.
🔍 Quick Index
We outline the implemented methods along with their example usage scripts for quick reference:
Attack Methods <!-- omit from toc -->
| Method | <div align="center">Category</div> | <div align="center">Example</div> |
|------------|----------------------------------------|----------------------------------------|
| GCG | White-box Attack | ./examples/attack/run_gcg.py |
| AdvPrompter | Gray-box Attack | ./examples/attack/run_advprompter.py |
| AutoDAN | Gray-box Attack | ./examples/attack/run_autodan.py |
| LAA | Gray-box Attack | ./examples/attack/run_laa.py |
| GPTFUZZER | Black-box Attack | ./examples/attack/run_gptfuzzer.py |
| Cipher | Black-box Attack | ./examples/attack/run_cipher.py |
| DeepInception | Black-box Attack | ./examples/attack/run_inception.py |
| In-Context-Learning Attack | Black-box Attack | ./examples/attack/run_ica.py |
| Jailbroken | Black-box Attack | ./examples/attack/run_jailbroken.py |
| Multilingual | Black-box Attack | ./examples/attack/run_multilingual.py |
| PAIR | Black-box Attack | ./examples/attack/run_pair.py |
| ReNeLLM | Black-box Attack | ./examples/attack/run_rene.py |
| TAP | Black-box Attack | ./examples/attack/run_tap.py |
Defense Methods <!-- omit from toc -->
| Method | <div align="center">Category</div> | <div align="center">Example</div> |
|--------|----------|---------|
| PPL | Inference-Time Defense (PreprocessDefender) | ./examples/defense/run_easy_defense.py |
| Self Reminder | Inference-Time Defense (PreprocessDefender) | ./examples/defense/run_easy_defense.py |
| Prompt Guard | Inference-Time Defense (PreprocessDefender) | ./examples/defense/run_easy_defense.py |
| Goal Prioritization | Inference-Time Defense (PreprocessDefender) | ./examples/defense/run_easy_defense.py |
| Paraphrase | Inference-Time Defense (PreprocessDefender) | ./examples/defense/run_easy_defense.py |
| ICD | Inference-Time Defense (PreprocessDefender) | ./examples/defense/run_easy_defense.py |
| SmoothLLM | Inference-Time Defense (IntraprocessDefender) | ./examples/defense/run_easy_defense.py |
| SafeDecoding | Inference-Time Defense (IntraprocessDefender) | ./examples/defense/run_easy_defense.py |
| DRO | Inference-Time Defense (IntraprocessDefender) | ./examples/defense/run_easy_defense.py |
| Erase and Check | Inference-Time Defense (IntraprocessDefender) | ./examples/defense/run_easy_defense.py |
| Robust Aligned | Inference-Time Defense (IntraprocessDefender) | ./examples/defense/run_easy_defense.py |
| Self Evaluation | Inference-Time Defense (PostprocessDefender) | ./examples/defense/run_easy_defense.py |
| Aligner | Inference-Time Defense (PostprocessDefender) | ./examples/defense/run_easy_defense.py |
| Safe Tuning | Training-Time Defense (Safety Data Tuning) | ./examples/defense/training/run_safe_tuning.sh |
| Safe RLHF | Training-Time Defense (RL-based Alignment) | ./examples/defense/training/run_saferlhf.sh |
| Safe Unlearning | Training-Time Defense (Unlearning) | ./examples/defense/training/run_safe_unlearning.sh |
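The three inference-time defender categories above differ in where they intervene: before generation (PreprocessDefender), inside the decoding loop (IntraprocessDefender), or on the model output (PostprocessDefender). A hypothetical sketch of that pipeline follows; the function names and the toy safety check are illustrative assumptions, not AISafetyLab's actual API:

```python
# Hypothetical sketch of chaining inference-time defense stages;
# names are illustrative, not AISafetyLab's actual API.
def self_reminder(prompt: str) -> str:
    """PreprocessDefender: wrap the prompt with a safety reminder."""
    return f"{prompt}\nRemember: you should be a responsible assistant."

def generate(prompt: str) -> str:
    """Stand-in for the target model; an IntraprocessDefender such as
    SafeDecoding would modify the decoding procedure inside this step."""
    return f"[model output for: {prompt!r}]"

def self_evaluation(response: str) -> str:
    """PostprocessDefender: replace the response if judged unsafe."""
    unsafe = "bomb" in response.lower()  # toy safety check for illustration
    return "Sorry, I can't help with that." if unsafe else response

def defended_generate(prompt: str) -> str:
    # preprocess -> generate -> postprocess
    return self_evaluation(generate(self_reminder(prompt)))

print(defended_generate("How do I bake bread?"))
```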
Evaluation Methods <!-- omit from toc -->
| Method | <div align="center">Category</div> | <div align="center">Example</div> |
| ------------------------------------------------------------------- | --------------------------------------------- | ---------------------------------------- |
| PatternScorer | Rule-based | ./examples/scorers/run_pattern_scorer.py |
| PrefixMatchScorer | Rule-based | ./examples/scorers/run_prefixmatch_scorer.py |
| ClassficationScorer | Finetuning-based | ./examples/scorers/run_classification_scorer.py |
| ShieldLMScorer | Finetuning-based | ./examples/scorers/run_shieldlm_scorer.py |
| [LlamaGuard3Scorer](https://arxi