# Zse

Zyora Server Inference Engine for LLMs.
Ultra memory-efficient LLM inference engine with native INT4 CUDA kernels.
Run 32B models on 24GB GPUs. Run 7B models on 8GB GPUs. Fast cold starts, single-file deployment.
## 🆕 v1.4.1: Model Hub + Pull Commands
Train 7B models on 8GB GPUs. Train 70B models on 48GB GPUs.
```python
from zse.format import load_zse_model
from zse.training import LoRAConfig, add_lora_to_model, save_lora_adapter

# Load INT4 model (uses ~6GB for 7B)
model, tokenizer, info = load_zse_model("model.zse", device="cuda")

# Add LoRA adapters (~1% trainable params)
config = LoRAConfig(rank=16, alpha=32)
model = add_lora_to_model(model, config)

# Train as usual with PyTorch/HuggingFace
# ... your training loop ...

# Save adapter (tiny file, ~25MB for 7B)
save_lora_adapter(model, "my_adapter.safetensors")
```
Install training dependencies: `pip install zllm-zse[training]`
## 🚀 Benchmarks (Verified, March 2026)
### ZSE Custom Kernel (Default)

| Model | File Size | VRAM | Speed | Cold Start | GPU |
|-------|-----------|------|-------|------------|-----|
| Qwen 7B | 5.57 GB | 5.67 GB | 37.2 tok/s | 5.7s | H200 |
| Qwen 14B | 9.95 GB | 10.08 GB | 20.8 tok/s | 10.5s | H200 |
| Qwen 32B | 19.23 GB | 19.47 GB | 10.9 tok/s | 20.4s | H200 |
| Qwen 72B | 41.21 GB | 41.54 GB | 6.3 tok/s | 51.8s | H200 |
### ZSE bnb Backend (Alternate)

| Model | VRAM | Speed | Cold Start |
|-------|------|-------|------------|
| Qwen 7B | 6.57 GB | 45.6 tok/s | 6.0s |
| Qwen 14B | 11.39 GB | 27.6 tok/s | 7.1s |
| Qwen 32B | 22.27 GB | 20.4 tok/s | 20.8s |
| Qwen 72B | 47.05 GB | 16.4 tok/s | 53.0s |
### VRAM Comparison

| Model | Custom Kernel | bnb Backend | Savings |
|-------|---------------|-------------|---------|
| 7B | 5.67 GB | 6.57 GB | -0.90 GB (14%) |
| 14B | 10.08 GB | 11.39 GB | -1.31 GB (12%) |
| 32B | 19.47 GB | 22.27 GB | -2.80 GB (13%) |
| 72B | 41.54 GB | 47.05 GB | -5.51 GB (12%) |
### GPU Compatibility

| GPU | VRAM | Max Model |
|-----|------|-----------|
| RTX 3070/4070 | 8GB | 7B |
| RTX 3080 | 12GB | 14B |
| RTX 3090/4090 | 24GB | 32B |
| A100-40GB | 40GB | 32B |
| A100-80GB / H200 | 80-141GB | 72B |
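The compatibility table can be read as a simple lookup: pick the largest model whose custom-kernel VRAM footprint fits the card. A small illustrative helper (not part of zse, using the VRAM figures from the benchmark above):

```python
# Illustrative helper (not part of zse): pick the largest model that fits,
# using the custom-kernel VRAM figures from the benchmark tables above.
VRAM_REQUIRED_GB = {"7B": 5.67, "14B": 10.08, "32B": 19.47, "72B": 41.54}

def max_model_for(vram_gb):
    """Return the largest model whose INT4 weights fit in vram_gb, or None."""
    fitting = [(need, name) for name, need in VRAM_REQUIRED_GB.items() if need <= vram_gb]
    return max(fitting)[1] if fitting else None

print(max_model_for(24))  # RTX 3090/4090 -> 32B
```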
## Key Features
- 📦 Single .zse File: Model + tokenizer + config in one file
- 🚫 No Network Calls: Everything embedded, works offline
- ⚡ ZSE Custom Kernel: Native INT4 inference with maximum VRAM efficiency
- 🧠 Memory Efficient: 72B in 41GB, 32B in 19GB, 7B in 5.7GB VRAM
- 🏃 Fast Cold Start: 5.7s for 7B, 20s for 32B, 52s for 72B
- 🎯 Dual Backend: Custom kernel (default) or bnb backend (alternate)
- 🔥 QLoRA Training: Fine-tune INT4 models with LoRA adapters (NEW in v1.4.0)
- 📄 Built-in RAG with .zpf: 25% fewer LLM tokens at 100% accuracy vs plain chunking
## Installation

```bash
pip install zllm-zse
```
Requirements:
- Python 3.11+
- CUDA GPU (8GB+ VRAM recommended)
- bitsandbytes (auto-installed)
## Quick Start

### 1. Pull a Pre-Converted Model (Fastest)
```bash
# Download ready-to-use .zse model (no GPU needed for conversion)
zse pull qwen-7b      # 5.18 GB download
zse pull mistral-7b   # 3.86 GB download
zse pull qwen-0.5b    # 0.69 GB download

# Browse available models
zse list
zse list --cached
```
### 2. Or Convert a Model to .zse Format (One-Time)

```bash
# Convert any HuggingFace model
zse convert Qwen/Qwen2.5-7B-Instruct -o qwen7b.zse
zse convert Qwen/Qwen2.5-32B-Instruct -o qwen32b.zse
```

Or in Python:

```python
from zse.format.writer import convert_model

convert_model("Qwen/Qwen2.5-7B-Instruct", "qwen7b.zse", quantization="int4")
```
### 3. Load and Run

```python
from zse.format.reader_v2 import load_zse_model

# Load model (auto-detects optimal settings)
model, tokenizer, info = load_zse_model("qwen7b.zse")

# Generate
inputs = tokenizer("Write a poem about AI:", return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
### 4. Start Server (OpenAI-Compatible)

```bash
zse serve qwen7b.zse --port 8000
```

```python
import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="zse")
response = client.chat.completions.create(
    model="qwen7b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```
## .zse Format Benefits

| Feature | HuggingFace | .zse Format |
|---------|-------------|-------------|
| Cold start (7B) | 45s | 9s |
| Cold start (32B) | 120s | 24s |
| Network calls on load | Yes | No |
| Files to manage | Many | One |
| Quantization time | Runtime | Pre-done |
## .zpf: Built-in RAG with Token Cost Reduction
.zpf delivers 100% retrieval accuracy at 25% lower LLM token cost compared to plain chunking — tested on real noisy web content.
.zpf (Z Packed Format) is ZSE's built-in semantic document format for RAG. It compresses documents at write-time, stripping noise (cookie banners, nav, boilerplate, filler prose) while preserving what the LLM needs.
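The write-time noise stripping can be sketched with a toy filter. This is purely illustrative (the patterns below are assumptions for demonstration, not .zpf's actual rules, which run 10 compression layers):

```python
import re

# Toy write-time noise filter (illustrative only; not .zpf's real pipeline).
NOISE_PATTERNS = [
    re.compile(r"accept (all )?cookies", re.I),                 # cookie banners
    re.compile(r"^(home|news|sports)(\s*[|>]\s*\w+)*$", re.I),  # nav breadcrumbs
    re.compile(r"sign up for our newsletter", re.I),            # promo boilerplate
]

def strip_noise(lines):
    """Drop lines matching any noise pattern; keep substantive prose."""
    return [ln for ln in lines if not any(p.search(ln) for p in NOISE_PATTERNS)]

page = [
    "Home | News | Tech",
    "We use cookies. Accept all cookies to continue.",
    "Batch normalization stabilizes training by normalizing activations.",
    "Sign up for our newsletter!",
]
print(strip_noise(page))  # only the substantive sentence survives
```

The real format does this once at ingest, so every later query pays for fewer tokens.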
### Cost Benchmark (CNN noisy web article)

| Metric | .zpf | Plain Chunking |
|--------|------|----------------|
| Correct answers | 10/10 | 10/10 |
| Tokens sent to LLM (avg/query) | 943 | 1,257 |
| Token reduction | 25% | — |
| Cost per 1M queries (GPT-4o) | $2,357 | $3,143 |
| Annual savings at 1M queries | $786 | — |
| Break-even | 1 query per document | — |
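The cost rows follow directly from the token counts. Assuming GPT-4o input pricing of $2.50 per 1M tokens (an assumption; pricing changes over time), the arithmetic checks out:

```python
# Reproduce the cost rows above, assuming $2.50 per 1M input tokens (GPT-4o).
PRICE_PER_TOKEN = 2.50 / 1_000_000
ZPF_TOKENS, PLAIN_TOKENS = 943, 1_257
QUERIES = 1_000_000

zpf_cost = ZPF_TOKENS * QUERIES * PRICE_PER_TOKEN      # $2,357.50
plain_cost = PLAIN_TOKENS * QUERIES * PRICE_PER_TOKEN  # $3,142.50
reduction = 1 - ZPF_TOKENS / PLAIN_TOKENS              # ~25%

print(f"${zpf_cost:,.2f} vs ${plain_cost:,.2f}, {reduction:.0%} fewer tokens")
```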
### Quick Start
```bash
# Ingest a document into the RAG store
zse rag add paper.pdf --title "ML Paper"

# Semantic search
zse rag search "What is batch normalization?" -k 5

# Get token-budgeted LLM context
zse rag context "What is batch normalization?" --max-tokens 500

# List all documents
zse rag list

# Export to open formats (zero vendor lock-in)
zse rag export paper.zpf -f markdown -o paper.md

# Inspect .zpf file metadata
zse rag inspect paper.zpf

# Show store stats
zse rag stats

# Re-embed with a different model
zse rag reindex --model all-MiniLM-L6-v2

# Remove a document
zse rag remove <doc_id>
```
```python
from zse.core.zrag.pipeline import RAGPipeline

pipeline = RAGPipeline(store_dir="./my_store")
pipeline.ingest("noisy_article.md")

# Get LLM-ready context (token-budgeted)
context = pipeline.get_context("What is transfer learning?", max_tokens=500)
# ~25% fewer tokens than plain chunking, same answer quality
```
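Token-budgeted context assembly generally means packing the highest-scoring blocks until the budget runs out. A generic sketch (illustrative, not zse's internals; the whitespace token count stands in for a real tokenizer):

```python
# Greedy token-budget packing (illustrative sketch, not zse internals).
def pack_context(blocks, max_tokens):
    """blocks: list of (score, text); whitespace split approximates tokens."""
    chosen, used = [], 0
    for score, text in sorted(blocks, key=lambda b: -b[0]):
        cost = len(text.split())  # crude token estimate
        if used + cost <= max_tokens:
            chosen.append(text)
            used += cost
    return "\n\n".join(chosen)

blocks = [
    (0.91, "Transfer learning reuses weights pretrained on a large corpus."),
    (0.40, "Unrelated filler paragraph about cookies and newsletters."),
    (0.75, "Fine-tuning adapts the reused weights to the target task."),
]
ctx = pack_context(blocks, max_tokens=18)
print(ctx)  # the two relevant blocks; the filler does not fit the budget
```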
### How It Works
- Semantic chunking — splits by content type (11 block types: CODE, TABLE, DEFINITION, PROCEDURE, etc.)
- 10-layer compression — strips filler phrases, verbose patterns, redundant qualifiers, noise lines
- Contextual embedding — each block embeds with parent section hierarchy for cross-section retrieval
- Hybrid retrieval — 0.6× embedding similarity + 0.4× BM25, with size normalization, block-type boosting, and identifier-aware scoring
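The 0.6/0.4 weighting in the retrieval step can be sketched directly. The toy vectors and BM25 scores below are made-up illustrations; the real pipeline adds the size normalization, block-type boosts, and identifier-aware scoring mentioned above:

```python
import math

# Hybrid scoring sketch: 0.6 x embedding cosine + 0.4 x (normalized) BM25.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def hybrid_score(query_vec, block_vec, bm25, bm25_max):
    return 0.6 * cosine(query_vec, block_vec) + 0.4 * (bm25 / bm25_max)

query = [1.0, 0.0, 1.0]
blocks = {"code_block": [0.9, 0.1, 0.8], "nav_block": [0.1, 1.0, 0.0]}
bm25 = {"code_block": 7.2, "nav_block": 1.1}

best = max(blocks, key=lambda k: hybrid_score(query, blocks[k], bm25[k], max(bm25.values())))
print(best)  # -> code_block
```

Normalizing BM25 by its max keeps both signals in [0, 1], so the 0.6/0.4 weights behave as intended.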
## Multi-GPU: Tensor Parallelism & Pipeline Parallelism
Run models across multiple GPUs to reduce per-GPU VRAM or serve larger models.
```bash
# Tensor Parallelism — shard weights across GPUs (NCCL all-reduce)
zse serve qwen-32b -tp 2 --port 8000

# Pipeline Parallelism — split layers across GPUs (NCCL send/recv)
zse serve qwen-72b -pp 4 --port 8000

# Hybrid TP+PP — 2D process grid
zse serve qwen-72b -tp 2 -pp 2 --port 8000

# Auto-detect — let ZSE pick optimal layout
zse serve qwen-72b --multi-gpu --port 8000
```
### Multi-GPU Benchmarks (Qwen2.5-7B INT4, A10G GPUs)

| Config | Speed | VRAM/GPU | vs Single GPU |
|--------|-------|----------|---------------|
| Single GPU | 23.1 tok/s | 5.22 GB | baseline |
| TP=2 | 18.6 tok/s | 3.76 GB | -28% VRAM |
| PP=2 | 23.1 tok/s | 2.67 GB | -49% VRAM |
| TP=2 × PP=2 | 20.6 tok/s | 1.6-2.1 GB | -65% VRAM |
When to use what:
- TP: Fastest inference, splits every layer. Best when GPUs are connected via NVLink.
- PP: Best VRAM reduction, near-perfect memory split. Works over PCIe.
- TP+PP: Maximum scale — run 72B on 4× consumer GPUs.
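Why TP needs an all-reduce can be shown with a toy matmul (a pure-Python sketch with two simulated "GPUs", no NCCL): each device holds half the rows of the weight matrix and the matching half of the input, and summing the partial products reproduces the full result.

```python
# Toy tensor-parallel matmul across 2 simulated "GPUs" (illustrative sketch).
def matmul(x, W):
    """Row vector x times matrix W (list of rows)."""
    return [sum(xi * wij for xi, wij in zip(x, col)) for col in zip(*W)]

x = [1.0, 2.0, 3.0, 4.0]
W = [[1, 0], [0, 1], [1, 1], [2, 0]]  # 4x2 weight matrix

# Shard rows of W (and the matching entries of x) across two devices.
x0, x1 = x[:2], x[2:]
W0, W1 = W[:2], W[2:]
partial0 = matmul(x0, W0)  # computed on "GPU 0"
partial1 = matmul(x1, W1)  # computed on "GPU 1"

# The all-reduce (sum) recombines the partial products into x @ W.
allreduced = [a + b for a, b in zip(partial0, partial1)]
assert allreduced == matmul(x, W)
print(allreduced)  # [12.0, 5.0]
```

Each "GPU" stores only half the weights, which is where the per-GPU VRAM savings in the benchmark come from; the all-reduce per layer is the communication cost that favors NVLink.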
## Advanced Usage

### QLoRA Fine-Tuning (v1.4.0+)
Train any model with QLoRA: LoRA adapters on a quantized INT4 base model.
```bash
# Install training dependencies
pip install zllm-zse[training]
```
```python
from zse.format import load_zse_model, convert_model
from zse.training import (
    LoRAConfig,
    add_lora_to_model,
    save_lora_adapter,
    load_lora_adapter,
)
import torch

# 1. Convert model to .zse (one-time)
convert_model("meta-llama/Llama-3-8B", "llama8b.zse", quantization="int4")

# 2. Load INT4 model
model, tokenizer, info = load_zse_model("llama8b.zse", device="cuda")

# 3. Add LoRA adapters
config = LoRAConfig(
    rank=16,        # LoRA rank (higher = more capacity)
    alpha=32,       # LoRA alpha (scaling factor)
    dropout=0.05,   # Dropout for regularization
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # Which layers
)
model = add_lora_to_model(model, config)
```
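For intuition on the "~1% trainable params" claim, the LoRA parameter count works out on the back of an envelope. The shapes below (hidden size 4096, 32 layers, 4 target projections per layer) are assumptions roughly matching a 7B/8B model, not figures from zse:

```python
# Back-of-envelope LoRA parameter count (assumed 7B-ish shapes:
# hidden=4096, 32 layers, 4 target projections per layer, rank=16).
hidden, layers, targets, rank = 4096, 32, 4, 16

base_per_proj = hidden * hidden   # one frozen (quantized) weight matrix
lora_per_proj = 2 * hidden * rank # adapter A (r x d) plus B (d x r)

trainable = layers * targets * lora_per_proj  # ~16.8M params
frozen = layers * targets * base_per_proj     # ~2.1B params (attn projections)
print(f"trainable fraction of target weights: {trainable / frozen:.2%}")
```

At rank 16 this gives roughly 0.8% of the targeted weights as trainable, which is why the saved adapter is only tens of megabytes.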
