SkillAgentSearch skills...

AISafetyLab

AISafetyLab: A comprehensive framework covering safety attack, defense, evaluation and paper list.

Install / Use

/learn @thu-coai/AISafetyLab
About this skill

Quality Score

0/100

Supported Platforms

Universal

README

<!-- markdownlint-disable first-line-h1 --> <!-- markdownlint-disable html --> <div align="center"> <img src="assets/overview.png" width="80%"/> </div>

License Python Version

<h1 align="center">AISafetyLab: A Comprehensive Framework for AI Safety Evaluation and Improvement</h1>

AISafetyLab is a comprehensive framework designed for researchers and developers that are interested in AI safety. We cover three core aspects of AI safety: attack, defense and evaluation, which are supported by some common modules such as models, dataset, utils and logging. We have also compiled several safety-related datasets, provided ample examples for running the code, and maintained a continuously updated list of AI safety-related papers.

<p align='center'>Please kindly 🌟star🌟 our repository if you find it helpful!</p>

Star History

Star History Chart

🆕 What's New? <!-- omit in toc -->

  • 🎉 2025/03/27: A demo video is available now!
  • 🎉 2025/02/24: We have released our technical report.
  • 🎉 2024/12/31: We are excited to officially announce the open-sourcing of AISafetyLab.

📜 Table of Contents <!-- omit from toc -->

🚀 Quick Start

🔧 Installation

git clone git@github.com:thu-coai/AISafetyLab.git
cd AISafetyLab
pip install -e .

🧪 Examples

We have provided a range of examples demonstrating how to execute the implemented attack and defense methods, as well as how to conduct safety scoring and evaluations.

🎓 Tutorial <!-- omit from toc -->

Check out our tutorial.ipynb for a quick start! 🚀 You'll find it extremely easy to use our implementations of attackers and defenders! 😎📚

Happy experimenting! 🛠️💡

⚔️ Attack <!-- omit from toc -->

An example is:

cd examples/attack
python run_autodan.py --config_path=./configs/autodan.yaml

You can change the specific config in the yaml file, which defines various parameters in the attack process. The attack result would be saved at examples/attack/results with the provided example config, and you can set the save path by changing the value of res_save_path in the config file.

🛡️ Defense <!-- omit from toc -->

To see the defense results and process on a single query (change defender_name to see different defense methods), you can run the following command:

cd examples/defense
CUDA_VISIBLE_DEVICES=0 python run_easy_defense.py

And for defense results of an attack method, you can run the following command:

cd examples/defense
CUDA_VISIBLE_DEVICES=0 python run_defense.py

As for training time defense, we provide 3 fast scripts and here's an example:

cd examples/defense/training
bash run_safe_tuning.sh

📊 Score <!-- omit from toc -->

An example is:

cd examples/scorers
python run_shieldlm_scorer.py

📈 Evaluation <!-- omit from toc -->

An example is:

cd examples/evaluation
python eval_asr.py

The example script eval_asr.py uses the saved attack results for evaluation, but you can also change the code to perform attack first according to the code in examples/attack.

We also provide quick scripts for interaction with models, which is located in examples/interaction.

🔍 Quick Index

We outline the implemented methods along with their example usage scripts for quick reference:

Attack Methods <!-- omit from toc -->

| Method | <div align="center">Category</div> | <div align="center">Example</div> | |------------|----------------------------------------|----------------------------------------| | GCG | White-box Attack | ./examples/attack/run_gcg.py | | AdvPrompter | Gray-box Attack | ./examples/attack/run_advprompter.py | | AutoDAN | Gray-box Attack | ./examples/attack/run_autodan.py | | LAA | Gray-box Attack | ./examples/attack/run_laa.py | | GPTFUZZER | Black-box Attack | ./examples/attack/run_gptfuzzer.py | | Cipher | Black-box Attack | ./examples/attack/run_cipher.py | | DeepInception | Black-box Attack | ./examples/attack/run_inception.py | | In-Context-Learning Attack | Black-box Attack | ./examples/attack/run_ica.py | | Jailbroken | Black-box Attack | ./examples/attack/run_jailbroken.py | | Multilingual | Black-box Attack | ./examples/attack/run_multilingual.py | | PAIR | Black-box Attack | ./examples/attack/run_pair.py | | ReNeLLM | Black-box Attack | ./examples/attack/run_rene.py | | TAP | Black-box Attack | ./examples/attack/run_tap.py |

Defense Methods <!-- omit from toc -->

| Method | <div align="center">Category</div> | <div align="center">Example</div> | |--------|----------|---------| | PPL | Inference-Time Defense (PreprocessDefender) | ./examples/defense/run_easy_defense.py | | Self Reminder | Inference-Time Defense (PreprocessDefender) | ./examples/defense/run_easy_defense.py | | Prompt Guard | Inference-Time Defense (PreprocessDefender) | ./examples/defense/run_easy_defense.py | | Goal Prioritization | Inference-Time Defense (PreprocessDefender) | ./examples/defense/run_easy_defense.py | | Paraphrase | Inference-Time Defense (PreprocessDefender) | ./examples/defense/run_easy_defense.py | | ICD | Inference-Time Defense (PreprocessDefender) | ./examples/defense/run_easy_defense.py | | SmoothLLM | Inference-Time Defense (IntraprocessDefender) | ./examples/defense/run_easy_defense.py | | SafeDecoding | Inference-Time Defense (IntraprocessDefender) | ./examples/defense/run_easy_defense.py | | DRO | Inference-Time Defense (IntraprocessDefender) | ./examples/defense/run_easy_defense.py | | Erase and Check | Inference-Time Defense (IntraprocessDefender) | ./examples/defense/run_easy_defense.py | | Robust Aligned | Inference-Time Defense (IntraprocessDefender) | ./examples/defense/run_easy_defense.py | | Self Evaluation | Inference-Time Defense (PostprocessDefender) | ./examples/defense/run_easy_defense.py | | Aligner | Inference-Time Defense (PostprocessDefender) | ./examples/defense/run_easy_defense.py | | Safe Tuning | Training-Time Defense (Safety Data Tuning) | ./examples/defense/training/run_safe_tuning.sh | | Safe RLHF | Training-Time Defense (RL-based Alignment) | ./examples/defense/training/run_saferlhf.sh | | Safe Unlearning | Training-Time Defense (Unlearning) | ./examples/defense/training/run_safe_unlearning.sh |

Evaluation Methods <!-- omit from toc -->

| Method | <div align="center">Category</div> | <div align="center">Example</div> | | ------------------------------------------------------------------- | --------------------------------------------- | ---------------------------------------- | | PatternScorer | Rule-based | ./examples/scorers/run_pattern_scorer.py | | PrefixMatchScorer | Rule-based | ./examples/scorers/run_prefixmatch_scorer.py | | ClassficationScorer | Finetuning-based | ./examples/scorers/run_classification_scorer.py | | ShieldLMScorer | Finetuning-based | ./examples/scorers/run_shieldlm_scorer.py | | [LlamaGuard3Scorer](https://arxi

Related Skills

View on GitHub
GitHub Stars237
CategoryDevelopment
Updated1d ago
Forks14

Languages

Python

Security Score

95/100

Audited on Apr 5, 2026

No findings