TritPack

Balanced ternary memory compression for LLMs.

TritPack compresses model weights and KV cache to ternary values {-1, 0, +1}, achieving ~3–6× RAM reduction with minimal quality loss — enabling larger models on consumer hardware.

Why TritPack?

Modern LLMs are memory-bound. A 70B parameter model requires ~140 GB in FP16, far exceeding consumer GPU/CPU RAM. TritPack provides:

~3–6× RAM reduction via ternary quantization (1.6 bits/weight vs 16)
Per-block calibration — each 64-element block has its own scale factor
Three-tier memory management — HOT (FP16) → WARM (INT8) → COLD (ternary)
Rust core with SIMD-optimised packing and rayon parallelism
Zero-copy lazy decompression — decompress only what you need

RAM Savings

| Model | FP16 (GB) | TritPack (GB) | Ratio | Fits on | |--------|-----------|---------------|-------|--------------------| | 7B | 14 | ~2.5 | ~5.6× | 8 GB laptop RAM | | 13B | 26 | ~4.6 | ~5.7× | 8 GB laptop RAM | | 34B | 68 | ~12 | ~5.7× | 16 GB desktop RAM | | 70B | 140 | ~25 | ~5.6× | 32 GB desktop RAM |

Ratios are analytical estimates (n/5 bytes for trits + n/64×2 bytes for scales). Actual ratios vary by layer sparsity and block overhead.

Architecture

Input tensor (FP32/FP16)
         │
         ▼
  ┌─────────────────┐
  │  TernaryQuantizer│   Per-block: τ = α × mean(|block|)
  │  (block_size=64) │   trits = sign(x) × [|x| > τ]
  └────────┬────────┘   scale = mean(|x| for |x| > τ)
           │
           ▼
  ┌─────────────────┐
  │   pack_trits()   │   5 trits per byte (3⁵=243 ≤ 256)
  │  (Rust + Python) │   byte = d[0]×81 + d[1]×27 + … + d[4]
  └────────┬────────┘
           │
           ▼
  ┌─────────────────┐
  │  QuantizedTensor │   packed_data + float16 scales per block
  └────────┬────────┘
           │
  ┌────────▼────────┐
  │   TierManager    │   HOT (FP16) ─► WARM (INT8) ─► COLD (ternary)
  │   LRU eviction   │   Promotes on access, demotes on memory pressure
  └─────────────────┘

Quick Start

Install

git clone https://github.com/z-tejani/TritPack
cd TritPack
pip install -e ".[dev]"

# Optional: build Rust extension for 10-100× faster packing
cd rust && maturin develop --release && cd ..

Compress a tensor

import numpy as np
from tritpack import TernaryQuantizer, TernaryDequantizer

# Load your model weights
weights = np.random.randn(4096, 4096).astype(np.float32)  # ~64 MB

# Compress
q = TernaryQuantizer(alpha=0.7)
qt = q.quantize(weights)

print(f"Original: {weights.nbytes / 1e6:.1f} MB")
print(f"Compressed: {qt.size_bytes() / 1e6:.1f} MB")
print(f"Ratio: {qt.compression_ratio():.1f}×")

# Decompress
dq = TernaryDequantizer()
reconstructed = dq.dequantize(qt)

Use TritTensor (lazy decompression)

from tritpack import TritTensor

tt = TritTensor(weights)
print(tt)  # TritTensor(shape=(4096, 4096), compression_ratio=5.73×, sparsity=41.9%)

# Decompress only when needed
data = tt.numpy()

# Quality report
report = tt.quality_report()
print(report)  # {'cos_sim': 0.90, 'snr_db': 7.1, 'rmse': 0.44, ...}

Run the tier manager (simulates LLM inference)

from tritpack import TierManager
import numpy as np

# 8 GB total, 15% HOT, 25% WARM, rest COLD
tm = TierManager(memory_limit_gb=8.0)

# Register model layers
for i in range(32):
    layer = np.random.randn(4096, 4096).astype(np.float32)
    tm.register_layer(f"layer_{i}", layer)

# Access layers — automatic promotion/demotion
for i in range(32):
    weights = tm.access(f"layer_{i}")
    # ... run forward pass with weights ...

print(tm.memory_usage())
# {'hot_gb': 1.2, 'warm_gb': 2.0, 'cold_gb': 4.7, 'total_gb': 7.9}

Calibrate per-layer alpha

from tritpack.model.calibration import ThresholdCalibrator

cal = ThresholdCalibrator()
layers = {"q_proj": weights_q, "v_proj": weights_v}

# Find highest alpha with cos_sim >= 0.90
alphas = cal.calibrate_model_layers(layers, target_cosine_sim=0.90)
# {'q_proj': 0.58, 'v_proj': 0.61}

Local Model Inference

TritPack integrates with the two most common local-inference stacks. Both paths store weights in compressed form and decompress on demand.

Path A — llama.cpp / Ollama (via GGUF reconstruction)

Works with any GGUF model: llama.cpp, llama-cpp-python, Ollama, LM Studio, etc.

pip install gguf llama-cpp-python

Step 1 — compress once (≈5 min for a 7B model, saved permanently):

from integration.gguf_patcher import GGUFPatcher

patcher = GGUFPatcher(alpha=0.7)
report = patcher.patch("llama-7b.gguf", "llama-7b.tritpack/")
print(f"Compressed {report.ratio:.1f}× → {report.compressed_gb:.1f} GB")
# Compressed 5.6× → 2.5 GB

Step 2 — load for inference (decompresses to a temp GGUF on first run, reused on subsequent runs):

from integration.llamacpp_shim import load_from_tritpack

model = load_from_tritpack(
    tritpack_dir="llama-7b.tritpack/",
    temp_gguf="/tmp/llama-7b-reconstructed.gguf",
    reuse_temp=True,      # skips reconstruction after first run
    verbose=True,
    n_ctx=2048,           # llama_cpp.Llama kwargs pass through
)

output = model("The meaning of life is", max_tokens=80)
print(output["choices"][0]["text"])

Or use the LlamaCppTritShim class directly when you want to load a .tritpack/ directory through the full tier-managed interface:

from integration.llamacpp_shim import LlamaCppTritShim

shim = LlamaCppTritShim(
    model_path="llama-7b.tritpack/",
    memory_limit_gb=8.0,
    temp_gguf="/tmp/llama-7b-reconstructed.gguf",
)
model = shim.load(verbose=True)

Disk vs RAM tradeoffs:

| Stage | Disk | RAM | |-------|------|-----| | Original GGUF | 14 GB | 14 GB | | .tritpack/ compressed | 2.5 GB | — | | Reconstructed GGUF (temp) | 14 GB | 14 GB |

The temp GGUF is full FP32 quality (slightly below original due to ternary quantization, cosine similarity ≈ 0.90). The main saving here is storage: your model library stays 5× smaller on disk.

Path B — HuggingFace transformers (in-process RAM reduction)

This path actually reduces RAM during inference: weights stay ternary-compressed in memory and each layer is decompressed only when its forward() is called.

pip install torch transformers

One-liner:

from integration.transformers_shim import load_compressed_model

model, tokenizer = load_compressed_model(
    "meta-llama/Llama-2-7b-hf",
    alpha=0.7,
    verbose=True,
)

inputs = tokenizer("Hello, I am a language model", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Or compress an already-loaded model:

from transformers import AutoModelForCausalLM, AutoTokenizer
from integration.transformers_shim import compress_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Replace all nn.Linear layers with ternary-compressed equivalents
model = compress_model(model, alpha=0.7, verbose=True)
# Compressed 147 Linear layers.

# Use exactly like the original model
inputs = tokenizer("Once upon a time", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(output[0], skip_special_tokens=True))

RAM comparison (Llama-2 7B):

| Mode | Weight RAM | |------|-----------| | from_pretrained (FP32) | ~28 GB | | from_pretrained (FP16) | ~14 GB | | compress_model(alpha=0.7) | ~2.5 GB |

Inference is slightly slower than native (decompression overhead per layer). Use cache_weights=True to cache decompressed tensors after first use — faster repeated inference at the cost of higher RAM.

Estimate savings before committing

from integration.ollama_proxy import OllamaMemoryEstimator

est = OllamaMemoryEstimator("llama3:8b")
print(est.estimate())
# {'model': 'llama3:8b', 'params_b': 8.0, 'fp16_gb': 14.9,
#  'tritpack_gb': 2.7, 'ratio': 5.5}

Benchmarks

Run the benchmark suite:

python -m tritpack.benchmark bench-all
python -m tritpack.benchmark bench-pack
python -m tritpack.benchmark bench-quality --output quality.json

Sample results (Python reference implementation, 8-core x86):

| Operation | Size | Throughput | |----------------------|--------|------------| | pack_trits (Python) | 100 MB | ~400 MB/s | | pack_trits (Rust) | 100 MB | ~8 GB/s | | quantize (Python) | 64 MB | ~150 MB/s | | TierManager access | HOT | <0.1 ms | | TierManager access | COLD | <5 ms |

Fill in after running: pytest tests/benchmarks/ --benchmark-json=bench.json

Quality / Compression Tradeoff

The alpha parameter controls the tradeoff:

| alpha | Sparsity | cos_sim | SNR | Ratio | |-------|----------|---------|--------|-------| | 0.3 | ~19% | ~0.86 | ~12 dB | ~4.8× | | 0.5 | ~31% | ~0.88 | ~9 dB | ~5.2× | | 0.7 | ~42% | ~0.90 | ~7 dB | ~5.7× | | 0.9 | ~53% | ~0.89 | ~6 dB | ~6.2× | | 1.0 | ~58% | ~0.87 | ~5 dB | ~6.5× |

Cosine similarity peaks near alpha≈0.55–0.70 for standard Gaussian weights.

For fine-grained control, use ThresholdCalibrator to find the optimal alpha per layer while satisfying a cosine similarity target.

Encoding Details

5 trits are packed into 1 byte (3⁵ = 243 ≤ 256):

byte = d[0]×81 + d[1]×27 + d[2]×9 + d[3]×3 + d[4]
where d[i] = trit[i] + 1  (shift: −1→0, 0→1, 1→2)

The 8-byte header encodes the original trit count (little-endian uint64) for exact padding removal on unpack.

Running Tests

# All tests
pytest tests/ -x --tb=short

# With coverage
pytest tests/ --cov=tritpack --cov-report=term-missing

# Unit only (fast)
pytest tests/unit/

TritPack

Install / Use

README

TritPack

Why TritPack?

RAM Savings

Architecture

Quick Start

Install

Compress a tensor

Use TritTensor (lazy decompression)

Run the tier manager (simulates LLM inference)

Calibrate per-layer alpha

Local Model Inference

Path A — llama.cpp / Ollama (via GGUF reconstruction)

Path B — HuggingFace transformers (in-process RAM reduction)

Estimate savings before committing

Benchmarks

Quality / Compression Tradeoff

Encoding Details

Running Tests