
TaskTracker (or, Get my drift?)

TaskTracker is a novel approach to detect task drift in large language models (LLMs) by analyzing their internal activations. It is based on the research described in this SaTML'25 paper Get my drift? Catching LLM Task Drift with Activation Deltas.

<p align="center"> <img src="https://github.com/microsoft/TaskTracker/blob/main/assets/teaser.png" width="700"> </p>

Key features:

  • Detects when an LLM deviates from a user's original instructions due to malicious prompts injected into external data sources
  • Works across multiple state-of-the-art LLMs, including Mistral 7B, Llama-3 8B, Llama-3 70B, Mixtral 8x7B, and Phi-3 3.8B
  • Achieves over 0.99 ROC AUC on out-of-distribution test data spanning jailbreaks, malicious instructions, and unseen task domains
  • Does not require model fine-tuning or output generation, maximizing deployability and efficiency
  • Generalizes well to detect various types of task drift without being trained on specific attacks

The repo includes:

  • A script to run the trained probes on new datasets/examples.
  • Steps to recreate our exact large-scale dataset (500K+ examples), or to generate a new one, for training and evaluating task drift detection
  • Form to request access to the pre-computed activations
  • Code to extract and analyze LLM activations
  • Implementations of linear and metric learning probes for task drift classification
  • Evaluation scripts and pre-trained models

TaskTracker enables more secure use of LLMs in retrieval-augmented applications by catching unwanted deviations from user instructions. It also opens up new directions for LLM interpretability and control.

Request access to LLM activations

To request access to the activation data we generated for simulating/evaluating task drift, please fill out this form and we will respond with a time-restricted download link (coming soon; we will send download links as soon as they are available).

Download Data

  1. Install azcopy. On macOS, run brew install azcopy.
  2. Run the following command:
    • Replace {MODEL_NAME} and {DATA_DISTRIBUTION} with your target values. The available values are listed below and must be copied exactly.
    • Replace {SAS_TOKEN} with the provided SAS key.
    • Replace <LOCAL_PATH> with the local directory where you want to download the data.
azcopy copy 'https://tasktrackeropensource.blob.core.windows.net/activations/{MODEL_NAME}/{DATA_DISTRIBUTION}?{SAS_TOKEN}' <LOCAL_PATH> --recursive

Target Values

Models

  • phi__3__3.8
  • mistral__7B
  • mistral__7B__no_priming
  • llama__3__8B
  • llama__3__70B

Data Distributions

  • training
  • validation
  • test

Loading the Data

  • The activation tensor for the training subset has shape [3, BATCH, LAYERS, DIM]:

    • Dim 0 is stored as: primary task, clean, and poisoned.
    • Training examples were constructed from the same texts, paired as clean vs. poisoned versions.

  • The activation tensor for the validation subset has shape [2, BATCH, LAYERS, DIM]:

    • Dim 0 is stored as: primary task, then the whole text.
    • Whether a file contains clean or poisoned examples is indicated in its file name.
    
    import torch
    
    # Load the activation data for validation/test 
    clean_activations = torch.load('activations/activations_0.pt')
    poisoned_activations = torch.load('activations/activations_1.pt')
    # Shape: (2, 1000, 32, 4096). For training files, this would be (3, 1000, 32, 4096)
    
    
    # Subtract the primary-task activations (index 0) from the full-text activations (index 1)
    # to obtain the activation deltas used for drift detection
    clean_activations = clean_activations[1] - clean_activations[0]
    poisoned_activations = poisoned_activations[1] - poisoned_activations[0]
    # Shape: (1000, 32, 4096)
    
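These deltas are what the probes are trained on. As a rough illustration of the linear-probe idea (not the repo's training code; the shapes, layer choice, and plain NumPy logistic regression below are our own stand-ins), clean and poisoned deltas at a single layer can be separated like this:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the loaded activation deltas:
# shape (BATCH, LAYERS, DIM) after subtracting the primary-task row.
clean_deltas = rng.normal(0.0, 1.0, size=(200, 32, 64))
poisoned_deltas = rng.normal(0.5, 1.0, size=(200, 32, 64))

layer = 15  # probe a single middle layer (an arbitrary choice here)
X = np.concatenate([clean_deltas[:, layer], poisoned_deltas[:, layer]])
y = np.concatenate([np.zeros(200), np.ones(200)])

# Plain logistic regression trained with full-batch gradient descent.
w = np.zeros(X.shape[1])
b = 0.0
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid predictions
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * np.mean(p - y)

pred = (1.0 / (1.0 + np.exp(-(X @ w + b)))) > 0.5
acc = np.mean(pred == y)
print(f"train accuracy: {acc:.2f}")
```

In the actual repo, per-layer probes (linear and metric learning) are trained on these deltas; the per-layer choice and thresholds are configurable.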

Environment Setup

  1. Create and activate the conda environment:
conda env create -f environment.yml
conda activate tasktracker
  2. Install packages and set up a local instance of the TaskTracker package:
cd TaskTracker
pip install -e .

New data

1. Check quick_start for a simple way to run on new data.

2. Edit quick_start/config.yaml to configure the classifier path, which LLM to use, layer and threshold parameters, etc.

3. Check the structure of the data in quick_start/mock_data.json. You can prepare your data as:

[
 {
  "user_prompt": "primary task",
  "text": "paragraph, can be clean or poisoned",
  "label": 1
 }
]

where "label" is 1 for poisoned and 0 for clean examples.

4. Run quick_start/main_quick_test.py. Depending on which LLM/TaskTracker model you are using, change torch_type when loading the LLM (check TaskTracker/task_tracker/config/models.py for the precision we used for each LLM).
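Before running, it can help to sanity-check that your JSON matches this shape. A minimal validator (the checks below are our own sketch, not part of the repo):

```python
import json

# Inline stand-in for a file like quick_start/mock_data.json.
records = json.loads("""
[
 {"user_prompt": "summarize the paragraph",
  "text": "some retrieved paragraph",
  "label": 0}
]
""")

# Each record needs user_prompt, text, and a 0/1 label (1 = poisoned, 0 = clean).
for r in records:
    assert {"user_prompt", "text", "label"} <= r.keys(), "missing field"
    assert r["label"] in (0, 1), "label must be 0 or 1"
print(f"{len(records)} record(s) OK")
```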


Dataset Construction

We provide pre-sampled dataset examples for training and evaluation (see Option 1 for regenerating our exact data, which you will likely need to do if you are using our pre-computed activations).

Option 1: Using Pre-sampled Dataset

  1. We provide scripts to regenerate our dataset exactly (which you can verify with prompt hash values).
  2. Please run the notebooks in task_tracker/dataset_creation/recreate_dataset; they will automatically download the relevant resources and build the dataset. No changes are required.
  3. Update the dataset file paths in task_tracker/config/models.py to point to your created files.

Option 2: Constructing Your Own Dataset

To create your own dataset:

  1. Run the Jupyter notebooks in task_tracker/dataset_creation/ to prepare training, validation, and test datasets.
  2. Update dataset file paths in task_tracker/config/models.py to point to your newly generated files.

Dependencies

  • This repository includes:
    • GPT-4 generated triggers
    • Trivia question-and-answer pairs
    • Translated subset of trivia questions and answers
    • Generic NLP tasks
    • Attack prompts from BIPIA
  • Dataset construction scripts automatically download:
    • HotPotQA, SQuAD, Alpaca, Code Alpaca, WildChat, and other datasets from HuggingFace or their hosting websites
  • Some jailbreak examples require manual download (URLs provided in corresponding notebooks)

Jupyter Notebooks for Dataset Creation

Note: Each notebook contains detailed instructions and customization options. Adjust parameters as needed for your specific use case.

  1. prepare_training_dataset.ipynb: Samples training data from SQuAD training split

    • Customize with args.orig_task, args.emb_task, and args.embed_loc
    • See training_dataset_combinations.ipynb for combination examples
  2. prepare_datasets_clean_val.ipynb: Samples clean validation data

    • Uses HotPotQA and SQuAD validation splits
    • Primary tasks: QA or Mix of QA and generic NLP prompts
  3. prepare_datasets_clean_test.ipynb: Samples clean test data

    • Uses HotPotQA training split
    • Primary tasks: QA or Mix of QA and generic NLP prompts
  4. prepare_datasets_poisoned_val.ipynb: Samples poisoned validation data

  5. prepare_datasets_poisoned_test.ipynb: Samples poisoned test data

  6. prepare_datasets_poisoned_test_other_variations.ipynb: Generates variations of poisoned injections (trigger variations)

  7. prepare_datasets_clean_test_spotlight.ipynb: Constructs clean examples with spotlighting prompts

  8. prepare_datasets_poisoned_test_translation_WildChat.ipynb: Constructs WildChat examples (clean examples with instructions) and poisoned examples with translated instructions

Post-Generation Steps

After generating or downloading the dataset:

  • Update the dataset file paths in task_tracker/config/models.py

Activation Generation

  • We provide pre-computed activations for immediate use. To access them, please complete this form.
  • Note: coming soon; we will reply with download links once they are available.

Option 1: Using Pre-computed Activations

  1. After receiving access, download the activation files.
  2. Update the DATA_LISTS path in task_tracker/training/utils/constants.py to point to your downloaded files.

Option 2: Generating Your Own Activations

To generate activations:

  1. Configure paths in task_tracker/config/models.py:
# HuggingFace cache directory
cache_dir = "/path/to/hf/cache/"
os.environ["TRANSFORMERS_CACHE"] = cache_dir
os.environ["HF_HOME"] = cache_dir

# Activations output directory
activation_parent_dir = "/path/to/store/activations/"

# Dataset text files directory
text_dataset_parent_dir = "/path/to/dataset/text/files/"
  2. Customize activation generation in task_tracker/activations/generate.py:
model_name: str = "mistral"  # Choose from models in task_tracker.config.models
with_priming: bool = True    # Set to False if no priming prompt is needed
  3. (Optional) Modify the priming