Cartridges
Storing long contexts in tiny KV caches with self-study.
What is this?
This repository provides code for training a cartridge: a small KV cache that represents a large corpus of textual information, trained with a test-time training recipe called self-study. The code is based on our paper Cartridges: Lightweight and general-purpose long context representations via self-study.
tl;dr When we put lots of text (e.g. a whole code repo) into a language model's context, generation cost soars because of the KV cache's size. What if we trained a smaller KV cache for our documents offline? Using a test-time training recipe called self-study, we show that this simple idea can improve throughput by 26× while maintaining quality. (See our blogpost for more.)
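To see why the KV cache dominates generation cost, it helps to run the arithmetic. Below is a back-of-the-envelope sketch; the model dimensions are illustrative assumptions, not taken from the paper:

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    """Approximate KV-cache size in bytes for one sequence.

    The leading 2 accounts for one tensor each of keys and values per layer;
    dtype_bytes=2 assumes fp16/bf16. All dims here are assumed, for illustration.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

# A 128k-token context at these assumed dims is roughly 16.8 GB per sequence:
print(kv_cache_bytes(128_000) / 1e9)
```

At sizes like this, the cache, not the model weights, quickly becomes the bottleneck when serving many long-context requests, which is what motivates replacing it with a small trained cartridge.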
Setup
Step 1: Clone the repository and install the Python package.
git clone https://github.com/HazyResearch/cartridges && cd cartridges
pip install uv
uv pip install -e .
Step 2: Set some environment variables
The codebase relies on the following environment variables. Make sure they are set in your environment (e.g. add them to your .env, Dockerfile, or .bashrc).
# path to the directory where you cloned this repo
export CARTRIDGES_DIR=/path/to/cartridges
# path to a directory where you want to store outputs like model checkpoints
export CARTRIDGES_OUTPUT_DIR=/path/to/cartridges/outputs
# the code in this repository is tightly integrated with wandb
# set your wandb project and entity here
export CARTRIDGES_WANDB_PROJECT=your-wandb-project
export CARTRIDGES_WANDB_ENTITY=your-wandb-username-or-team
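Since later steps assume these variables exist, it can help to fail fast at startup. Here is a small check you could drop at the top of your scripts; the helper is ours, not part of the repo:

```python
import os

# The four variables the README asks you to set.
REQUIRED_VARS = [
    "CARTRIDGES_DIR",
    "CARTRIDGES_OUTPUT_DIR",
    "CARTRIDGES_WANDB_PROJECT",
    "CARTRIDGES_WANDB_ENTITY",
]

def missing_env_vars(env=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]

if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        raise SystemExit(f"Set these environment variables first: {missing}")
```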
Running Self-Study
What is self-study? Self-study is an approach for training a model to understand a corpus of text. It works by generating synthetic conversations about the corpus and then training the model on those conversations with a context-distillation objective. The process consists of two AI agents in conversation with one another: one asks questions or makes requests about the content, and the other responds using the provided context.
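The two-agent loop described above can be sketched as follows. Here `ask` and `answer` stand in for calls to an inference server, and all names are illustrative, not the repo's actual API:

```python
def self_study_round(chunk, seed_prompt, ask, answer, n_turns=2):
    """Simulate one synthetic conversation about a context chunk.

    ask(seed_prompt, history) plays the questioner; answer(chunk, question)
    plays the responder, which gets to see the raw context. The returned
    message list is what would later serve as context-distillation data.
    """
    history = []
    for _ in range(n_turns):
        question = ask(seed_prompt, history)
        reply = answer(chunk, question)
        history += [
            {"role": "user", "content": question},
            {"role": "assistant", "content": reply},
        ]
    return history
```

The key asymmetry is that only the responder sees the chunk; during training, the cartridge is then optimized so the model can produce the responder's answers without the chunk in context.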
Quickstart: Take a look at the scripts at examples/arxiv/arxiv_synthesize.py and examples/arxiv/arxiv_train.py for a basic example of how to synthesize training data and run context-distillation on the synthesized data. To run the synthesis script, you will need to spin up an inference server (either Tokasaurus or SGLang) and set the client variable to point to it. See below for more details on how to do this.
Below we walk through the process of generating synthetic training data for a corpus of text. As a running example, we'll be training a cartridge on our paper on Cartridges. How meta!
Note: We used Modal to run our inference workloads when developing self-study. Since containers on Modal start up quite quickly, it's practical to scale out horizontally to several dozen GPUs for very short bursts (<5 mins). This is ideal for experimenting with different self-study data synthesis approaches because it makes things more interactive, reducing the time between making a change and getting feedback during training. In infra/, we provide scripts for deploying inference servers on Modal.
Note: For configuration, we use Pydantic models, which are useful for defining the schema of a config and quickly ensuring that it is valid at the beginning of a run. We also rely on pydrantic, which provides a few utilities for working with configs.
Step 1: Synthesize training data
Note: See examples/arxiv/arxiv_synthesize.py for the full example developed in this section.
Below is the outline of a script for running synthesis. It simply instantiates a SynthesizeConfig object and runs it with pydrantic.main([config]). Note: pydrantic.main is a utility that calls the config's .run method in a way that lets us override config fields on the command line, like so: python your_synthesis_script.py num_samples=1024.
The config has a couple of key fields missing: the resource, which controls what raw text data we're training on, and a client of an inference server (e.g. SGLang or Tokasaurus). We'll cover those two below.
There are many other configuration options we're not covering here, so refer to the SynthesizeConfig and SelfStudySynthesizer for the full list and documentation.
from cartridges.synthesize import SynthesizeConfig
from cartridges.synthesizers.self_study import SelfStudySynthesizer
resource_config = ... # see 'Step 1.1: Configure Resources'
client_config = ... # see 'Step 1.2: Prepare an Inference Server'
config = SynthesizeConfig(
    synthesizer=SelfStudySynthesizer.Config(
        client=client_config,
        resources=[resource_config],
    ),
    num_samples=512,
    name="cartridges-tutorial",
)

if __name__ == "__main__":
    # pydrantic allows us to override the Pydantic configs from the command line
    import pydrantic

    pydrantic.main([config])
Step 1.1: Configure Resources
A "resource" is an object that feeds chunks of the context and a "seed prompt" to a synthesizer. See Section 4 of our paper for more details.
Since we want to train a Cartridge on a research paper, we'll use the TextFileResource type.
from cartridges.data.resources import TextFileResource
from cartridges.data.chunkers import TokenChunker  # adjust this import path if it differs in your version

resource_config = TextFileResource.Config(
    path="examples/arxiv/cartridges.tex",
    seed_prompts=["structuring", "summarization", "question"],
    chunker=TokenChunker.Config(
        # tokenize with the same model the inference client serves
        tokenizer=client_config.model_name,
        min_tokens_per_chunk=512,
        max_tokens_per_chunk=1024,
    ),
)
We provide several other basic resource types for common data formats like JSONResource.
We're also gradually adding more specialized resource types that do a better job of chunking specific data formats and feeding in relevant seed prompts:
- LaTeXResource for training a Cartridge on a LaTeX project. In fact, we could have used this instead of the TextFileResource above: LaTeXResource.Config(arxiv_id="2506.06266", ...)
- SlackResource for training a Cartridge on Slack messages. This uses the Slack API to fetch recent messages from your channels.
- GMailResource for Gmail messages. This uses an MCP server to fetch recent messages from your inbox.
Step 1.2: Prepare an Inference Server
Self-study requires an inference server to generate the synthetic conversations. We need to configure a Client object that points to the inference server. We support two options:
- Tokasaurus (recommended) - We ran all of our experiments with Tokasaurus, which provides higher throughput generation and is easier to modify.
- SGLang - We're also providing support for SGLang, but we have not tested it extensively.
We found it easiest to run data generation with Modal's serverless horizontal scaling. To deploy a Tokasaurus server on Modal:
modal deploy infra/modal_deploy_tksrs.py
Then configure with the modal URL:
from cartridges.clients.tokasaurus import TokasaurusClient

client_config = TokasaurusClient.Config(
    url="https://your-modal-deployment-url.modal.run",
    model_name="Qwen/Qwen3-4b",
)
Note: Make sure to tune the ALLOW_CONCURRENT_INPUTS (which
