Cartridges
Storing long contexts in tiny KV caches with self-study.
What is this?
This repository provides code for training a cartridge: a small KV cache that represents a large corpus of textual information, trained with a test-time training recipe called self-study. The code is based on our paper Cartridges: Lightweight and general-purpose long context representations via self-study.
tl;dr When we put lots of text (e.g. a whole code repo) into a language model's context, generation cost soars because of the KV cache's size. What if we trained a smaller KV cache for our documents offline? Using a test-time training recipe called self-study, we show that this simple idea can improve throughput by 26× while maintaining quality. (See our blogpost for more.)
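To see why the KV cache dominates generation cost, it helps to run the arithmetic. Below is a back-of-the-envelope sketch; the model dimensions are illustrative assumptions, not taken from the paper:

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    """Approximate KV-cache size in bytes for one sequence.

    The leading 2 accounts for one tensor each of keys and values per layer;
    dtype_bytes=2 assumes fp16/bf16. All dims here are assumed, for illustration.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

# A 128k-token context at these assumed dims is roughly 16.8 GB per sequence:
print(kv_cache_bytes(128_000) / 1e9)
```

At sizes like this, the cache, not the model weights, quickly becomes the bottleneck when serving many long-context requests, which is what motivates replacing it with a small trained cartridge.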
Setup
Step 1: Clone the repository and install the Python package.
git clone https://github.com/HazyResearch/cartridges && cd cartridges
pip install uv
uv pip install -e .
Step 2: Set some environment variables
The codebase relies on the following environment variables. Make sure they are set in your environment (e.g. add them to your .env, Dockerfile, or .bashrc).
# path to the directory where you cloned this repo
export CARTRIDGES_DIR=/path/to/cartridges
# path to a directory where you want to store outputs like model checkpoints
export CARTRIDGES_OUTPUT_DIR=/path/to/cartridges/outputs
# the code in this repository is tightly integrated with wandb
# set your wandb project and entity here
export CARTRIDGES_WANDB_PROJECT=your-wandb-project
export CARTRIDGES_WANDB_ENTITY=your-wandb-username-or-team
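Since later steps assume these variables exist, it can help to fail fast at startup. Here is a small check you could drop at the top of your scripts; the helper is ours, not part of the repo:

```python
import os

# The four variables the README asks you to set.
REQUIRED_VARS = [
    "CARTRIDGES_DIR",
    "CARTRIDGES_OUTPUT_DIR",
    "CARTRIDGES_WANDB_PROJECT",
    "CARTRIDGES_WANDB_ENTITY",
]

def missing_env_vars(env=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]

if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        raise SystemExit(f"Set these environment variables first: {missing}")
```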
Running Self-Study
What is self-study? Self-study is an approach for training a model to understand a corpus of text. It works by generating synthetic conversations about the corpus and then training the model on those conversations with a context-distillation objective. The process consists of two AI agents in conversation with one another: one asks questions or makes requests about the content, and the other responds using the provided context.
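The two-agent loop described above can be sketched as follows. Here `ask` and `answer` stand in for calls to an inference server, and all names are illustrative, not the repo's actual API:

```python
def self_study_round(chunk, seed_prompt, ask, answer, n_turns=2):
    """Simulate one synthetic conversation about a context chunk.

    ask(seed_prompt, history) plays the questioner; answer(chunk, question)
    plays the responder, which gets to see the raw context. The returned
    message list is what would later serve as context-distillation data.
    """
    history = []
    for _ in range(n_turns):
        question = ask(seed_prompt, history)
        reply = answer(chunk, question)
        history += [
            {"role": "user", "content": question},
            {"role": "assistant", "content": reply},
        ]
    return history
```

The key asymmetry is that only the responder sees the chunk; during training, the cartridge is then optimized so the model can produce the responder's answers without the chunk in context.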
Quickstart: Take a look at the scripts at examples/arxiv/arxiv_synthesize.py and examples/arxiv/arxiv_train.py for a basic example of how to synthesize training data and run context-distillation on the synthesized data. To run the synthesis script, you will need to spin up an inference server (either Tokasaurus or SGLang) and set the client variable to point to it. See below for more details on how to do this.
Below we walk through the process of generating synthetic training data for a corpus of text. As a running example, we'll be training a cartridge on our paper on Cartridges. How meta!
Note: We used Modal to run our inference workloads when developing self-study. Since containers on Modal start up quite quickly, it's practical to scale out horizontally to several dozen GPUs for very short bursts (<5 mins). This is ideal for experimenting with different self-study data synthesis approaches because it makes things more interactive, reducing the time between making a change and getting feedback during training. In infra/, we provide scripts for deploying inference servers on Modal.
Note: For configuration, we use Pydantic models, which are useful for defining the schema of a config and quickly ensuring that it is valid at the beginning of a run. We also rely on pydrantic, which provides a few utilities for working with configs.
Step 1: Synthesize training data
Note: See examples/arxiv/arxiv_synthesize.py for the full example developed in this section.
Below is the outline of a script for running synthesis. It simply instantiates a SynthesizeConfig object and runs it with pydrantic.main([config]). Note: pydrantic.main is a utility that calls the config's .run method in a way that lets us override config fields on the command line, like so: python your_synthesis_script.py num_samples=1024.
The config has a couple of key fields missing: the resource, which controls what raw text data we're training on, and a client of an inference server (e.g. SGLang or Tokasaurus). We'll cover those two below.
There are many other configuration options we're not covering here, so refer to the SynthesizeConfig and SelfStudySynthesizer for the full list and documentation.
from cartridges.synthesize import SynthesizeConfig
from cartridges.synthesizers.self_study import SelfStudySynthesizer
resource_config = ... # see 'Step 1.1: Configure Resources'
client_config = ... # see 'Step 1.2: Prepare an Inference Server'
config = SynthesizeConfig(
    synthesizer=SelfStudySynthesizer.Config(
        client=client_config,
        resources=[resource_config],
    ),
    num_samples=512,
    name="cartridges-tutorial",
)

if __name__ == "__main__":
    # pydrantic allows us to override the Pydantic configs from the command line
    import pydrantic

    pydrantic.main([config])
Step 1.1: Configure Resources
A "resource" is an object that feeds chunks of the context and a "seed prompt" to a synthesizer. See Section 4 of our paper for more details.
Since we want to train a Cartridge on a research paper, we'll use the TextFileResource type.
from cartridges.data.resources import TextFileResource
from cartridges.data.chunkers import TokenChunker  # adjust this import path if it differs in your version

resource_config = TextFileResource.Config(
    path="examples/arxiv/cartridges.tex",
    seed_prompts=["structuring", "summarization", "question"],
    chunker=TokenChunker.Config(
        # tokenize with the same model the inference client serves
        tokenizer=client_config.model_name,
        min_tokens_per_chunk=512,
        max_tokens_per_chunk=1024,
    ),
)
We provide several other basic resource types for common data formats like JSONResource.
We're also gradually adding more specialized resource types that do a better job of chunking specific data formats and feeding in relevant seed prompts:
- LaTeXResource for training a Cartridge on a LaTeX project. In fact, we could have used this instead of the TextFileResource above: LaTeXResource.Config(arxiv_id="2506.06266", ...)
- SlackResource for training a Cartridge on Slack messages. This uses the Slack API to fetch recent messages from your channels.
- GMailResource for Gmail messages. This uses an MCP server to fetch recent messages from your inbox.
Step 1.2: Prepare an Inference Server
Self-study requires an inference server to generate the synthetic conversations. We need to configure a Client object that points to the inference server. We support two options:
- Tokasaurus (recommended) - We ran all of our experiments with Tokasaurus, which provides higher throughput generation and is easier to modify.
- SGLang - We're also providing support for SGLang, but we have not tested it extensively.
We found it easiest to run data generation with Modal's serverless horizontal scaling. To deploy a Tokasaurus server on Modal:
modal deploy infra/modal_deploy_tksrs.py
Then configure with the modal URL:
from cartridges.clients.tokasaurus import TokasaurusClient

client_config = TokasaurusClient.Config(
    url="https://your-modal-deployment-url.modal.run",
    model_name="Qwen/Qwen3-4b",
)
Note: Make sure to tune the ALLOW_CONCURRENT_INPUTS (which
