N8loom
A tree-based prefix cache library that allows rapid creation of looms: hierarchical branching pathways of LLM generations.
Install / Use
/learn @N8python/N8loomREADME
N8Loom: For Fast Tree-of-Thought Inference
N8Loom is a Python library built on top of mlx_lm and Transformers that enables structured, tree-based interactions with language models. Its main selling point is its KV cache tree: it stores individual 'fragments' of the KV cache at each node in the tree and concatenates them to form the full cache when generating text from a node. This keeps the caches of many different branches of the tree alive in parallel, merging them only when needed, so you get the inference speedup of caching without the overhead of storing the entire prefix cache at every node.
Below is a visualization of the critical difference when generating from a single node in the tree: a standard prompt cache must recompute the cache for parent nodes each time, while the KV cache tree simply concatenates the cache fragments stored at each node to form the full cache.
| Standard Prompt Cache | Loom Cache |
|-----------------------|------------|
| *(diagram omitted)* | *(diagram omitted)* |
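In code, the idea looks roughly like this. Below is a toy sketch with hypothetical names, where plain Python lists stand in for per-layer key/value tensors; it is not the library's actual API, just an illustration of why siblings can share a parent's cache without recomputation:

```python
class Node:
    """A toy tree node that holds only its own cache 'fragment'."""
    def __init__(self, fragment, parent=None):
        self.fragment = fragment  # K/V entries for this node's tokens only
        self.parent = parent

    def prefix_cache(self):
        """Concatenate fragments along the path from the root to this node."""
        frags = []
        node = self
        while node is not None:
            frags.append(node.fragment)
            node = node.parent
        # Reverse so the root's fragment comes first, then flatten.
        return [entry for frag in reversed(frags) for entry in frag]

root = Node(["k0", "k1"])              # prompt tokens
branch_a = Node(["k2a"], parent=root)  # one continuation
branch_b = Node(["k2b"], parent=root)  # a sibling continuation

# Both branches reuse the root's fragment without recomputing it:
print(branch_a.prefix_cache())  # ['k0', 'k1', 'k2a']
print(branch_b.prefix_cache())  # ['k0', 'k1', 'k2b']
```

The key point is that the root's fragment is stored once and merely referenced by every branch, rather than duplicated or recomputed per branch.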
It additionally provides a set of utilities to manage internal model caches, generate text in parallel or streaming mode, and build reasoning trees where each node represents a model “thought” (called a Heddle) that can branch into multiple potential continuations. The library also includes a FastAPI server example for deploying a web service.
Overview
N8Loom makes it easy to interact with language models by allowing you to:
- Cache and manipulate intermediate model states.
  Utilities in `cache_utils.py` extract, clip, and fuse key-value caches (KV caches) for each model layer.
- Create and manage reasoning trees.
  The core abstractions are the `Heddle` and `Loom` classes (in `loom.py`), which represent individual reasoning nodes and the overall prompt tree, respectively.
- Generate responses in batches and streams.
  Use the functions in `utils.py` to prefill caches, sample model outputs in parallel, or yield token-by-token updates.
Installation
Ensure you have Python 3.7+ installed. Then, install the required dependencies:

```bash
pip install -r requirements.txt
```

Or install from PyPI:

```bash
pip install n8loom
```
Usage
Basic Script Example
Note that n8loom currently only supports the Llama architecture.
Below is an example (from `examples/reflection.py`) demonstrating how to load a model, create a reasoning tree (a Loom), and expand it with multiple potential answers:

```python
from n8loom import Loom, load_for_loom

# Load the model and tokenizer
model, tokenizer = load_for_loom("Llama-3.2-3B-Instruct-4bit")

# Define a problem prompt
prompt = (
    "Tobias is buying a new pair of shoes that costs $95. He has been saving up his money each month "
    "for the past three months. He gets a $5 allowance a month. He also mows lawns and shovels driveways. "
    "He charges $15 to mow a lawn and $7 to shovel. After buying the shoes, he has $15 in change. "
    "If he mows 4 lawns, how many driveways did he shovel?"
)

# Create a Loom (root of the reasoning tree)
root = Loom(model, tokenizer, prompt)

# Add an initial text child to guide the model's reasoning
assistant_start = root.add_text_child(
    "I will solve this problem step by step and be mindful of mistakes."
)

# Expand the reasoning tree by generating 8 potential response branches
assistant_start.ramify(n=8, temp=0.6, max_tokens=512, min_p=0.05)

# Apply further reasoning to all leaf nodes, incorporating reflection
answers = assistant_start.apply_at_leaves(
    lambda x: x.ramify(
        "\n...Wait. I need to look at the problem again. Let's think about "
        "what I could've gotten wrong. I could've"
    ) if x.terminal else None,
    lambda x: x.ramify(n=2, temp=0.6, max_tokens=512, min_p=0.05),
    lambda x: x.crown()
)

# Print the generated answers
for i, answer in enumerate(answers):
    print(f"Answer {i+1}:\n{answer}\n")
```
Running the FastAPI Server
The library also comes with an example FastAPI server (see `examples/server.py`) that exposes endpoints to manage models, create looms, expand nodes, and export/import reasoning trees.

Make sure you have an mlx-lm model in the root directory (the parent directory of `src`). You can download one quickly with:

```bash
pip install huggingface_hub hf_transfer
huggingface-cli download --local-dir Llama-3.2-3B-Instruct-4bit mlx-community/Llama-3.2-3B-Instruct-4bit
```

To run the server locally:

```bash
python src/n8loom/examples/server.py
```
API Documentation
Core Classes (`loom.py`)
class Heddle
A Heddle represents a node in a reasoning tree. Each node contains a segment of text, its tokenized form, cache fragments from the model, and potential child nodes. This structure enables branching reasoning and interactive exploration of model-generated responses.
- Attributes:
  - `model`: The language model (an instance of `nn.Module`) used for generating responses and cache fragments.
  - `tokenizer`: The tokenizer (a `PreTrainedTokenizer` or `TokenizerWrapper`) used to encode text into tokens and decode tokens back to text.
  - `text`: The text content of this node.
  - `tokens`: The tokenized representation (a list of token IDs) of the node's text.
  - `frag`: A list of cache fragments (`KVFrag`) storing the model cache entries that correspond to the tokens.
  - `children`: A list of child Heddle nodes representing subsequent branches in the reasoning tree.
  - `parent`: A reference to the parent Heddle node (or `None` if this node is the root).
  - `terminal`: A Boolean flag indicating whether further expansion (generation) is disallowed.
- Constructor: `__init__(model, tokenizer, text, frags, children, parent=None, trim_toks=1)`
  - Purpose: Initializes a new Heddle node.
  - Parameters:
    - `model`: The language model to use.
    - `tokenizer`: The tokenizer used to encode/decode text.
    - `text`: The text prompt for the node.
    - `frags`: An optional list of pre-computed cache fragments. If `None`, the fragments are generated from the text.
    - `children`: An optional list of child nodes (defaults to an empty list if not provided).
    - `parent`: The parent node (defaults to `None` for the root).
    - `trim_toks`: The number of initial tokens to trim from the token list (default is 1).
- Key Methods:
  - `clip(token_limit: int)`
    - Purpose: Clips the node's tokens, text, and cache fragments to a specified token limit.
    - Details: If `token_limit` is negative, `len(tokens) + token_limit` tokens are retained. If the number of tokens exceeds the limit, the node's tokens are truncated, the text is updated by decoding the remaining tokens, the cache fragments are clipped accordingly, and all children are removed.
    - Returns: The current Heddle instance.
  - `trim(token_trim: int)`
    - Purpose: Removes the last `token_trim` tokens from the node.
    - Details: Internally calls `clip` with a negative token limit.
    - Returns: The current Heddle instance.
  - `to_leaf()`
    - Purpose: Converts the current node into a leaf node by removing all its children.
    - Returns: The current Heddle instance.
  - `add_child(child: Heddle)`
    - Purpose: Adds an existing Heddle node as a child.
    - Details: Also sets the added child's `parent` attribute to this node.
    - Returns: The added child node.
  - `add_text_child(text: str)`
    - Purpose: Creates a new child node from a text prompt and adds it as a child.
    - Returns: The newly created child node.
  - `remove_child(child: Heddle)`
    - Purpose: Removes a specified child node from the current node.
    - Returns: The removed child node.
  - `get_prefix_cache() -> List[KVCache]`
    - Purpose: Retrieves the cumulative cache from the root node up to the current node.
    - Details: Collects and fuses cache fragments from all ancestor nodes to form a complete context cache.
    - Returns: A list of fused `KVCache` objects.
  - `make_children(n: int = 4, temp: float = 0.8, max_tokens: int = 8, min_p: float = 0.05, **kwargs)`
    - Purpose: Generates multiple child nodes using batched model generation.
    - Details: Uses the current node's cumulative cache as context and calls a batched generation routine to produce new text completions, creating one child per completion. If generation signals termination (via an `ended` flag), the child is marked as terminal. The model cache is cleared after generation.
    - Parameters:
      - `n`: Number of children to generate.
      - `temp`: Sampling temperature.
      - `max_tokens`: Maximum number of tokens to generate for each child.
      - `min_p`: Minimum probability threshold for generation.
    - Returns: A list of newly created child nodes.
  - `ramify(arg: Optional[Union[str, List[str]]] = None, **kwargs)`
    - Purpose: Expands the node. Given a string, it adds a single text child (as in the reflection example above); given a list of strings, it adds one text child per string; given no argument, it generates children via `make_children`, forwarding the keyword arguments (e.g. `n`, `temp`, `max_tokens`, `min_p`).
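To make the `clip`/`trim` indexing rule above concrete, here is a standalone re-implementation of just the token arithmetic. This is a sketch of the documented behavior using a hypothetical helper name, not the library's own code:

```python
def clip_tokens(tokens, token_limit):
    """Return the tokens kept by clip(): a negative limit keeps
    len(tokens) + token_limit tokens; a positive limit truncates
    to at most token_limit tokens."""
    if token_limit < 0:
        token_limit = len(tokens) + token_limit
    return tokens[:token_limit]

tokens = [101, 102, 103, 104, 105]
print(clip_tokens(tokens, 3))   # [101, 102, 103]
print(clip_tokens(tokens, -2))  # trim(2) keeps all but the last 2: [101, 102, 103]
```

This is why `trim(k)` can be implemented as `clip(-k)`: a negative limit is converted into "keep everything except the last `k` tokens".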