# AsFT

Code for the paper "AsFT: Anchoring Safety During LLM Fine-Tuning Within Narrow Safety Basin".
## 🎯 Method Overview

<p align="center"> <img src="images/fig1.png" alt="Narrow Safety Basin" width="400"> </p>
<p align="center" style="font-size: 14px; font-style: italic;"> Figure 1: The "Narrow Safety Basin" concept. Perturbations along the alignment direction (d<sub>aligned</sub>) preserve safety, while orthogonal directions (d<sub>⟂</sub>) lead to rapid safety degradation. </p>

<p align="center"> <img src="images/fig2.png" alt="AsFT Framework" width="900"> </p>
<p align="center" style="font-size: 14px; font-style: italic;"> Figure 2: The AsFT framework decomposes parameter updates into safety-aligned (d<sub>aligned</sub>) and orthogonal (d<sub>⟂</sub>) components, suppressing harmful updates via subspace regularization. </p>

**Key Idea:**
AsFT leverages the alignment direction (weight difference between safety-aligned and base models) as an anchor. By decomposing parameter updates and constraining orthogonal components through a novel regularization term, it ensures fine-tuning remains within the "narrow safety basin", achieving both strong safety preservation and task performance.
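The decomposition described above can be illustrated with a minimal NumPy sketch. All vectors here are random placeholders standing in for flattened weight differences, and `lam` is an illustrative regularization weight, not the repository's actual hyperparameter or implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Alignment direction: the weight difference between the safety-aligned
# model and the base model (flattened 1-D vectors for illustration).
w_aligned = rng.normal(size=128)
w_base = rng.normal(size=128)
d = w_aligned - w_base
d_hat = d / np.linalg.norm(d)  # unit alignment direction

# A candidate fine-tuning update, decomposed into its component along
# the alignment direction and the orthogonal remainder.
delta_w = rng.normal(size=128)
delta_aligned = np.dot(delta_w, d_hat) * d_hat
delta_orth = delta_w - delta_aligned

# AsFT-style regularizer: penalize the orthogonal component, which is
# the direction associated with rapid safety degradation.
lam = 1.0
reg_loss = lam * np.sum(delta_orth ** 2)
```

In training, a term of this form would be added to the task loss so that updates stay close to the alignment direction, i.e., within the narrow safety basin.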
## 🛠️ Setup

### Environment Configuration

```bash
# Create conda environment
conda create -n AsFT python=3.9
conda activate AsFT
cd AsFT

# Install dependencies
pip install -r requirements.txt
```
### Model Preparation

```bash
# Create model storage directory (if needed)
mkdir -p ckpts/
```

| Model           | HuggingFace Link              | Notes                   |
|-----------------|-------------------------------|-------------------------|
| Llama-2-7B-Chat | TheBloke/Llama-2-7B-Chat-fp16 | Safety-aligned model    |
| Llama-2-7B-base | meta-llama/Llama-2-7b-hf      | Base model              |
| Beaver-Dam-7B   | PKU-Alignment/beaver-dam-7b   | Safety evaluation model |

> Note: Download the models listed in the table above to the `ckpts/` folder.
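One way to fetch the models is with `huggingface_hub` (installed separately via `pip install huggingface_hub`). The helper below is a hypothetical convenience, not part of this repo; it simply mirrors the folder layout expected under `ckpts/`:

```python
from pathlib import Path

# Repo IDs from the table above, mapped to the expected local folder names.
MODELS = {
    "TheBloke/Llama-2-7B-Chat-fp16": "Llama-2-7B-Chat-fp16",
    "meta-llama/Llama-2-7b-hf": "Llama-2-7b-hf",
    "PKU-Alignment/beaver-dam-7b": "beaver-dam-7b",
}

def local_dir(repo_id: str, root: str = "ckpts") -> str:
    """Return the local checkpoint directory for a given repo ID."""
    return str(Path(root) / MODELS[repo_id])

if __name__ == "__main__":
    from huggingface_hub import snapshot_download
    for repo_id in MODELS:
        # Gated repos (e.g., meta-llama) require `huggingface-cli login` first.
        snapshot_download(repo_id=repo_id, local_dir=local_dir(repo_id))
```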
### Directory Structure

```
AsFT/
├── ckpts/
│   ├── Llama-2-7B-Chat-fp16/
│   ├── Llama-2-7b-hf/
│   └── beaver-dam-7b/
├── configs/
├── ft_datasets/
└── ... (other project folders)
```

⚠️ **Important Notes:**
- Llama-2 models require access approval on HuggingFace
- All models should be placed under `ckpts/`
- Use the exact folder names shown above
## 🚀 Training

### Running Fine-tuning
Training scripts are organized by dataset under `scripts/`, covering Agnews, Alpaca, GSM8K, and SST2.
#### Basic Training Commands

```bash
# For the Agnews dataset (default 1k_p_0.1 mode)
bash scripts/agnews/AsFT_reg1_p_0.1.sh > finetuned_logs/agnews/AsFT_reg1_p_0.1.log 2>&1 &

# Other datasets
bash scripts/alpaca/AsFT_reg1_p_0.1.sh > finetuned_logs/alpaca/AsFT_reg1_p_0.1.log 2>&1 &
bash scripts/gsm8k/AsFT_reg1_p_0.1.sh > finetuned_logs/gsm8k/AsFT_reg1_p_0.1.log 2>&1 &
bash scripts/SST2/AsFT_reg1_p_0.1.sh > finetuned_logs/SST2/AsFT_reg1_p_0.1.log 2>&1 &
```
### Experimental Modes

Training is configured via the `--mode` parameter.

> Note: You can modify the `--mode` parameter in the `.sh` script files to reproduce the different experimental setups described in the paper.
| Mode | Description |
|--------------|---------------------------------------|
| 1k_p_0 | 1k samples, 0% harmful data |
| 1k_p_0.05 | 1k samples, 5% harmful data |
| 1k_p_0.1 | 1k samples, 10% harmful data (default)|
| 1k_p_0.15 | 1k samples, 15% harmful data |
| 1k_p_0.2 | 1k samples, 20% harmful data |
| 0.5k_p_0.1 | 500 samples, 10% harmful data |
| 1.5k_p_0.1 | 1500 samples, 10% harmful data |
| 2k_p_0.1 | 2000 samples, 10% harmful data |
| 2.5k_p_0.1 | 2500 samples, 10% harmful data |
## 📊 Evaluation

### Poison Evaluation (Safety Assessment)

```bash
cd evaluation/poison_evaluation

# Run for Agnews
bash scripts/agnews/eval_agnews.sh > scripts/agnews/eval_agnews.log 2>&1 &

# Other datasets
bash scripts/alpaca/eval_alpaca.sh > scripts/alpaca/eval_alpaca.log 2>&1 &
bash scripts/gsm8k/eval_gsm8k.sh > scripts/gsm8k/eval_gsm8k.log 2>&1 &
bash scripts/SST2/eval_SST2.sh > scripts/SST2/eval_SST2.log 2>&1 &
```
### Utility Evaluation (Task Performance)

```bash
# For Agnews
cd evaluation/utility_evaluation/agnews
bash scripts/eval.sh > scripts/eval.log 2>&1 &

# For GSM8K / SST2
cd ../gsm8k && bash scripts/eval.sh
cd ../SST2 && bash scripts/eval.sh

# Alpaca requires LLM-Judge
cd ../alpaca
# Follow the instructions in this directory's README.md
```
## 📂 Project Structure

```
AsFT/
├── ckpts/                   # Model checkpoints
├── configs/                 # Training configurations
├── evaluation/
│   ├── poison_evaluation/   # Safety assessment scripts
│   └── utility_evaluation/  # Task performance evaluation
├── finetuned_logs/          # Training logs
├── finetuned_models/        # Fine-tuned model outputs
├── ft_datasets/             # Processed datasets
├── images/                  # Figures for documentation
├── scripts/                 # Dataset-specific scripts
│   ├── agnews/
│   ├── alpaca/
│   ├── gsm8k/
│   └── SST2/
├── utils/                   # Utility functions
├── LICENSE
└── requirements.txt
```
## 📜 Citation

```bibtex
@misc{asft2025,
  title={AsFT: Anchoring Safety During LLM Fine-Tuning Within Narrow Safety Basin},
  author={Shuo Yang and Qihui Zhang and Yuyang Liu and Yue Huang and Xiaojun Jia and Kunpeng Ning and Jiayu Yao and Jigang Wang and Hailiang Dai and Yibing Song and Li Yuan},
  year={2025},
  url={https://arxiv.org/abs/2506.08473}
}
```
## 🙏 Acknowledgment

This repository is built upon the following open-source projects:
- LLMs-Finetuning-Safety
- SafeLoRA
- Booster
- llm-landscape (for safety landscape visualization)
We sincerely thank the authors of these projects for their foundational contributions. Their work provided critical inspiration and technical references for this research. Special thanks to the LLM safety community for driving innovation in this field.