AISafetyLab

AISafetyLab: A comprehensive framework covering safety attack, defense, evaluation and paper list.

Generate Convert Improve

Install / Use

/learn @thu-coai/AISafetyLab

About this skill

Quality Score

0/100

README

<div align="center"> <img src="assets/overview.png" width="80%"/> </div>

<h1 align="center">AISafetyLab: A Comprehensive Framework for AI Safety Evaluation and Improvement</h1>

AISafetyLab is a comprehensive framework designed for researchers and developers that are interested in AI safety. We cover three core aspects of AI safety: attack, defense and evaluation, which are supported by some common modules such as models, dataset, utils and logging. We have also compiled several safety-related datasets, provided ample examples for running the code, and maintained a continuously updated list of AI safety-related papers.

<p align='center'>Please kindly 🌟star🌟 our repository if you find it helpful!</p>

Star History

🆕 What's New?

🎉 2025/03/27: A demo video is available now!
🎉 2025/02/24: We have released our technical report.
🎉 2024/12/31: We are excited to officially announce the open-sourcing of AISafetyLab.

📜 Table of Contents

Star History
🚀 Quick Start
- 🔧 Installation
- 🧪 Examples
🔍 Quick Index
📂 Project Structure
📊 Experimental Results
🗓️ Plans
Paper List
How to Contribute
⚠️ Disclaimer & Acknowledgement
Citation

🚀 Quick Start

🔧 Installation

git clone git@github.com:thu-coai/AISafetyLab.git
cd AISafetyLab
pip install -e .

🧪 Examples

We have provided a range of examples demonstrating how to execute the implemented attack and defense methods, as well as how to conduct safety scoring and evaluations.

🎓 Tutorial

Check out our tutorial.ipynb for a quick start! 🚀 You'll find it extremely easy to use our implementations of attackers and defenders! 😎📚

Happy experimenting! 🛠️💡

⚔️ Attack

An example is:

cd examples/attack
python run_autodan.py --config_path=./configs/autodan.yaml

You can change the specific config in the yaml file, which defines various parameters in the attack process. The attack result would be saved at examples/attack/results with the provided example config, and you can set the save path by changing the value of res_save_path in the config file.

🛡️ Defense

To see the defense results and process on a single query (change defender_name to see different defense methods), you can run the following command:

cd examples/defense
CUDA_VISIBLE_DEVICES=0 python run_easy_defense.py

And for defense results of an attack method, you can run the following command:

cd examples/defense
CUDA_VISIBLE_DEVICES=0 python run_defense.py

As for training time defense, we provide 3 fast scripts and here's an example:

cd examples/defense/training
bash run_safe_tuning.sh

📊 Score

An example is:

cd examples/scorers
python run_shieldlm_scorer.py

📈 Evaluation

An example is:

cd examples/evaluation
python eval_asr.py

The example script eval_asr.py uses the saved attack results for evaluation, but you can also change the code to perform attack first according to the code in examples/attack.

We also provide quick scripts for interaction with models, which is located in examples/interaction.

🔍 Quick Index

We outline the implemented methods along with their example usage scripts for quick reference:

Attack Methods

| Method | <div align="center">Category</div> | <div align="center">Example</div> | |------------|----------------------------------------|----------------------------------------| | GCG | White-box Attack | ./examples/attack/run_gcg.py | | AdvPrompter | Gray-box Attack | ./examples/attack/run_advprompter.py | | AutoDAN | Gray-box Attack | ./examples/attack/run_autodan.py | | LAA | Gray-box Attack | ./examples/attack/run_laa.py | | GPTFUZZER | Black-box Attack | ./examples/attack/run_gptfuzzer.py | | Cipher | Black-box Attack | ./examples/attack/run_cipher.py | | DeepInception | Black-box Attack | ./examples/attack/run_inception.py | | In-Context-Learning Attack | Black-box Attack | ./examples/attack/run_ica.py | | Jailbroken | Black-box Attack | ./examples/attack/run_jailbroken.py | | Multilingual | Black-box Attack | ./examples/attack/run_multilingual.py | | PAIR | Black-box Attack | ./examples/attack/run_pair.py | | ReNeLLM | Black-box Attack | ./examples/attack/run_rene.py | | TAP | Black-box Attack | ./examples/attack/run_tap.py |

Defense Methods

| Method | <div align="center">Category</div> | <div align="center">Example</div> | |--------|----------|---------| | PPL | Inference-Time Defense (PreprocessDefender) | ./examples/defense/run_easy_defense.py | | Self Reminder | Inference-Time Defense (PreprocessDefender) | ./examples/defense/run_easy_defense.py | | Prompt Guard | Inference-Time Defense (PreprocessDefender) | ./examples/defense/run_easy_defense.py | | Goal Prioritization | Inference-Time Defense (PreprocessDefender) | ./examples/defense/run_easy_defense.py | | Paraphrase | Inference-Time Defense (PreprocessDefender) | ./examples/defense/run_easy_defense.py | | ICD | Inference-Time Defense (PreprocessDefender) | ./examples/defense/run_easy_defense.py | | SmoothLLM | Inference-Time Defense (IntraprocessDefender) | ./examples/defense/run_easy_defense.py | | SafeDecoding | Inference-Time Defense (IntraprocessDefender) | ./examples/defense/run_easy_defense.py | | DRO | Inference-Time Defense (IntraprocessDefender) | ./examples/defense/run_easy_defense.py | | Erase and Check | Inference-Time Defense (IntraprocessDefender) | ./examples/defense/run_easy_defense.py | | Robust Aligned | Inference-Time Defense (IntraprocessDefender) | ./examples/defense/run_easy_defense.py | | Self Evaluation | Inference-Time Defense (PostprocessDefender) | ./examples/defense/run_easy_defense.py | | Aligner | Inference-Time Defense (PostprocessDefender) | ./examples/defense/run_easy_defense.py | | Safe Tuning | Training-Time Defense (Safety Data Tuning) | ./examples/defense/training/run_safe_tuning.sh | | Safe RLHF | Training-Time Defense (RL-based Alignment) | ./examples/defense/training/run_saferlhf.sh | | Safe Unlearning | Training-Time Defense (Unlearning) | ./examples/defense/training/run_safe_unlearning.sh |

Evaluation Methods

| Method | <div align="center">Category</div> | <div align="center">Example</div> | | ------------------------------------------------------------------- | --------------------------------------------- | ---------------------------------------- | | PatternScorer | Rule-based | ./examples/scorers/run_pattern_scorer.py | | PrefixMatchScorer | Rule-based | ./examples/scorers/run_prefixmatch_scorer.py | | ClassficationScorer | Finetuning-based | ./examples/scorers/run_classification_scorer.py | | ShieldLMScorer | Finetuning-based | ./examples/scorers/run_shieldlm_scorer.py | | [LlamaGuard3Scorer](https://arxi

Related Skills

node-connect

349.9k

Diagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps

frontend-design

109.8k

Create distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.

openai-whisper-api

349.9k

Transcribe audio via OpenAI Audio Transcriptions API (Whisper).

qqbot-media

349.9k

QQBot 富媒体收发能力。使用 <qqmedia> 标签，系统根据文件扩展名自动识别类型（图片/语音/视频/文件）。