# Zse

Zyora Server Inference Engine for LLMs.
Ultra memory-efficient LLM inference engine with native INT4 CUDA kernels.
Run 32B models on 24GB GPUs. Run 7B models on 8GB GPUs. Fast cold starts, single-file deployment.
## 🆕 v1.4.1: Model Hub + Pull Commands
Train 7B models on 8GB GPUs. Train 70B models on 48GB GPUs.
```python
from zse.format import load_zse_model
from zse.training import LoRAConfig, add_lora_to_model, save_lora_adapter

# Load INT4 model (uses ~6GB for 7B)
model, tokenizer, info = load_zse_model("model.zse", device="cuda")

# Add LoRA adapters (~1% trainable params)
config = LoRAConfig(rank=16, alpha=32)
model = add_lora_to_model(model, config)

# Train as usual with PyTorch/HuggingFace
# ... your training loop ...

# Save adapter (tiny file, ~25MB for 7B)
save_lora_adapter(model, "my_adapter.safetensors")
```
Install training dependencies: `pip install zllm-zse[training]`
## 🚀 Benchmarks (Verified, March 2026)
### ZSE Custom Kernel (Default)

| Model | File Size | VRAM | Speed | Cold Start | GPU |
|-------|-----------|------|-------|------------|-----|
| Qwen 7B | 5.57 GB | 5.67 GB | 37.2 tok/s | 5.7s | H200 |
| Qwen 14B | 9.95 GB | 10.08 GB | 20.8 tok/s | 10.5s | H200 |
| Qwen 32B | 19.23 GB | 19.47 GB | 10.9 tok/s | 20.4s | H200 |
| Qwen 72B | 41.21 GB | 41.54 GB | 6.3 tok/s | 51.8s | H200 |
### ZSE bnb Backend (Alternate)

| Model | VRAM | Speed | Cold Start |
|-------|------|-------|------------|
| Qwen 7B | 6.57 GB | 45.6 tok/s | 6.0s |
| Qwen 14B | 11.39 GB | 27.6 tok/s | 7.1s |
| Qwen 32B | 22.27 GB | 20.4 tok/s | 20.8s |
| Qwen 72B | 47.05 GB | 16.4 tok/s | 53.0s |
### VRAM Comparison

| Model | Custom Kernel | bnb Backend | Savings |
|-------|---------------|-------------|---------|
| 7B | 5.67 GB | 6.57 GB | -0.90 GB (14%) |
| 14B | 10.08 GB | 11.39 GB | -1.31 GB (12%) |
| 32B | 19.47 GB | 22.27 GB | -2.80 GB (13%) |
| 72B | 41.54 GB | 47.05 GB | -5.51 GB (12%) |
### GPU Compatibility

| GPU | VRAM | Max Model |
|-----|------|-----------|
| RTX 3070/4070 | 8GB | 7B |
| RTX 3080 | 12GB | 14B |
| RTX 3090/4090 | 24GB | 32B |
| A100-40GB | 40GB | 32B |
| A100-80GB / H200 | 80-141GB | 72B |
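The compatibility table can be read as a simple lookup: pick the largest model whose custom-kernel VRAM footprint fits the card. A small illustrative helper (not part of zse, using the VRAM figures from the benchmark above):

```python
# Illustrative helper (not part of zse): pick the largest model that fits,
# using the custom-kernel VRAM figures from the benchmark tables above.
VRAM_REQUIRED_GB = {"7B": 5.67, "14B": 10.08, "32B": 19.47, "72B": 41.54}

def max_model_for(vram_gb):
    """Return the largest model whose INT4 weights fit in vram_gb, or None."""
    fitting = [(need, name) for name, need in VRAM_REQUIRED_GB.items() if need <= vram_gb]
    return max(fitting)[1] if fitting else None

print(max_model_for(24))  # RTX 3090/4090 -> 32B
```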
## Key Features
- 📦 Single .zse File: Model + tokenizer + config in one file
- 🚫 No Network Calls: Everything embedded, works offline
- ⚡ ZSE Custom Kernel: Native INT4 inference with maximum VRAM efficiency
- 🧠 Memory Efficient: 72B in 41GB, 32B in 19GB, 7B in 5.7GB VRAM
- 🏃 Fast Cold Start: 5.7s for 7B, 20s for 32B, 52s for 72B
- 🎯 Dual Backend: Custom kernel (default) or bnb backend (alternate)
- 🔥 QLoRA Training: Fine-tune INT4 models with LoRA adapters (NEW in v1.4.0)
- 📄 Built-in RAG with .zpf: 25% fewer LLM tokens at 100% accuracy vs plain chunking
## Installation

```bash
pip install zllm-zse
```
Requirements:
- Python 3.11+
- CUDA GPU (8GB+ VRAM recommended)
- bitsandbytes (auto-installed)
## Quick Start

### 1. Pull a Pre-Converted Model (Fastest)
```bash
# Download ready-to-use .zse model (no GPU needed for conversion)
zse pull qwen-7b      # 5.18 GB download
zse pull mistral-7b   # 3.86 GB download
zse pull qwen-0.5b    # 0.69 GB download

# Browse available models
zse list
zse list --cached
```
### 2. Or Convert a Model to .zse Format (One-Time)

```bash
# Convert any HuggingFace model
zse convert Qwen/Qwen2.5-7B-Instruct -o qwen7b.zse
zse convert Qwen/Qwen2.5-32B-Instruct -o qwen32b.zse
```

Or in Python:

```python
from zse.format.writer import convert_model

convert_model("Qwen/Qwen2.5-7B-Instruct", "qwen7b.zse", quantization="int4")
```
### 3. Load and Run

```python
from zse.format.reader_v2 import load_zse_model

# Load model (auto-detects optimal settings)
model, tokenizer, info = load_zse_model("qwen7b.zse")

# Generate
inputs = tokenizer("Write a poem about AI:", return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
### 4. Start Server (OpenAI-Compatible)

```bash
zse serve qwen7b.zse --port 8000
```

```python
import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="zse")
response = client.chat.completions.create(
    model="qwen7b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```
## .zse Format Benefits

| Feature | HuggingFace | .zse Format |
|---------|-------------|-------------|
| Cold start (7B) | 45s | 9s |
| Cold start (32B) | 120s | 24s |
| Network calls on load | Yes | No |
| Files to manage | Many | One |
| Quantization time | Runtime | Pre-done |
## .zpf: Built-in RAG with Token Cost Reduction
.zpf delivers 100% retrieval accuracy at 25% lower LLM token cost compared to plain chunking — tested on real noisy web content.
.zpf (Z Packed Format) is ZSE's built-in semantic document format for RAG. It compresses documents at write-time, stripping noise (cookie banners, nav, boilerplate, filler prose) while preserving what the LLM needs.
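The write-time noise stripping can be sketched with a toy filter. This is purely illustrative (the patterns below are assumptions for demonstration, not .zpf's actual rules, which run 10 compression layers):

```python
import re

# Toy write-time noise filter (illustrative only; not .zpf's real pipeline).
NOISE_PATTERNS = [
    re.compile(r"accept (all )?cookies", re.I),                 # cookie banners
    re.compile(r"^(home|news|sports)(\s*[|>]\s*\w+)*$", re.I),  # nav breadcrumbs
    re.compile(r"sign up for our newsletter", re.I),            # promo boilerplate
]

def strip_noise(lines):
    """Drop lines matching any noise pattern; keep substantive prose."""
    return [ln for ln in lines if not any(p.search(ln) for p in NOISE_PATTERNS)]

page = [
    "Home | News | Tech",
    "We use cookies. Accept all cookies to continue.",
    "Batch normalization stabilizes training by normalizing activations.",
    "Sign up for our newsletter!",
]
print(strip_noise(page))  # only the substantive sentence survives
```

The real format does this once at ingest, so every later query pays for fewer tokens.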
### Cost Benchmark (CNN noisy web article)

| Metric | .zpf | Plain Chunking |
|--------|------|----------------|
| Correct answers | 10/10 | 10/10 |
| Tokens sent to LLM (avg/query) | 943 | 1,257 |
| Token reduction | 25% | — |
| Cost per 1M queries (GPT-4o) | $2,357 | $3,143 |
| Annual savings at 1M queries | $786 | — |
| Break-even | 1 query per document | — |
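The cost rows follow directly from the token counts. Assuming GPT-4o input pricing of $2.50 per 1M tokens (an assumption; pricing changes over time), the arithmetic checks out:

```python
# Reproduce the cost rows above, assuming $2.50 per 1M input tokens (GPT-4o).
PRICE_PER_TOKEN = 2.50 / 1_000_000
ZPF_TOKENS, PLAIN_TOKENS = 943, 1_257
QUERIES = 1_000_000

zpf_cost = ZPF_TOKENS * QUERIES * PRICE_PER_TOKEN      # $2,357.50
plain_cost = PLAIN_TOKENS * QUERIES * PRICE_PER_TOKEN  # $3,142.50
reduction = 1 - ZPF_TOKENS / PLAIN_TOKENS              # ~25%

print(f"${zpf_cost:,.2f} vs ${plain_cost:,.2f}, {reduction:.0%} fewer tokens")
```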
### Quick Start
```bash
# Ingest a document into the RAG store
zse rag add paper.pdf --title "ML Paper"

# Semantic search
zse rag search "What is batch normalization?" -k 5

# Get token-budgeted LLM context
zse rag context "What is batch normalization?" --max-tokens 500

# List all documents
zse rag list

# Export to open formats (zero vendor lock-in)
zse rag export paper.zpf -f markdown -o paper.md

# Inspect .zpf file metadata
zse rag inspect paper.zpf

# Show store stats
zse rag stats

# Re-embed with a different model
zse rag reindex --model all-MiniLM-L6-v2

# Remove a document
zse rag remove <doc_id>
```
```python
from zse.core.zrag.pipeline import RAGPipeline

pipeline = RAGPipeline(store_dir="./my_store")
pipeline.ingest("noisy_article.md")

# Get LLM-ready context (token-budgeted)
context = pipeline.get_context("What is transfer learning?", max_tokens=500)
# ~25% fewer tokens than plain chunking, same answer quality
```
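Token-budgeted context assembly generally means packing the highest-scoring blocks until the budget runs out. A generic sketch (illustrative, not zse's internals; the whitespace token count stands in for a real tokenizer):

```python
# Greedy token-budget packing (illustrative sketch, not zse internals).
def pack_context(blocks, max_tokens):
    """blocks: list of (score, text); whitespace split approximates tokens."""
    chosen, used = [], 0
    for score, text in sorted(blocks, key=lambda b: -b[0]):
        cost = len(text.split())  # crude token estimate
        if used + cost <= max_tokens:
            chosen.append(text)
            used += cost
    return "\n\n".join(chosen)

blocks = [
    (0.91, "Transfer learning reuses weights pretrained on a large corpus."),
    (0.40, "Unrelated filler paragraph about cookies and newsletters."),
    (0.75, "Fine-tuning adapts the reused weights to the target task."),
]
ctx = pack_context(blocks, max_tokens=18)
print(ctx)  # the two relevant blocks; the filler does not fit the budget
```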
### How It Works
- Semantic chunking — splits by content type (11 block types: CODE, TABLE, DEFINITION, PROCEDURE, etc.)
- 10-layer compression — strips filler phrases, verbose patterns, redundant qualifiers, noise lines
- Contextual embedding — each block embeds with parent section hierarchy for cross-section retrieval
- Hybrid retrieval — 0.6× embedding similarity + 0.4× BM25, with size normalization, block-type boosting, and identifier-aware scoring
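The 0.6/0.4 weighting in the retrieval step can be sketched directly. The toy vectors and BM25 scores below are made-up illustrations; the real pipeline adds the size normalization, block-type boosts, and identifier-aware scoring mentioned above:

```python
import math

# Hybrid scoring sketch: 0.6 x embedding cosine + 0.4 x (normalized) BM25.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def hybrid_score(query_vec, block_vec, bm25, bm25_max):
    return 0.6 * cosine(query_vec, block_vec) + 0.4 * (bm25 / bm25_max)

query = [1.0, 0.0, 1.0]
blocks = {"code_block": [0.9, 0.1, 0.8], "nav_block": [0.1, 1.0, 0.0]}
bm25 = {"code_block": 7.2, "nav_block": 1.1}

best = max(blocks, key=lambda k: hybrid_score(query, blocks[k], bm25[k], max(bm25.values())))
print(best)  # -> code_block
```

Normalizing BM25 by its max keeps both signals in [0, 1], so the 0.6/0.4 weights behave as intended.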
## Multi-GPU: Tensor Parallelism & Pipeline Parallelism
Run models across multiple GPUs to reduce per-GPU VRAM or serve larger models.
```bash
# Tensor Parallelism — shard weights across GPUs (NCCL all-reduce)
zse serve qwen-32b -tp 2 --port 8000

# Pipeline Parallelism — split layers across GPUs (NCCL send/recv)
zse serve qwen-72b -pp 4 --port 8000

# Hybrid TP+PP — 2D process grid
zse serve qwen-72b -tp 2 -pp 2 --port 8000

# Auto-detect — let ZSE pick optimal layout
zse serve qwen-72b --multi-gpu --port 8000
```
### Multi-GPU Benchmarks (Qwen2.5-7B INT4, A10G GPUs)

| Config | Speed | VRAM/GPU | vs Single GPU |
|--------|-------|----------|---------------|
| Single GPU | 23.1 tok/s | 5.22 GB | baseline |
| TP=2 | 18.6 tok/s | 3.76 GB | -28% VRAM |
| PP=2 | 23.1 tok/s | 2.67 GB | -49% VRAM |
| TP=2 × PP=2 | 20.6 tok/s | 1.6-2.1 GB | -65% VRAM |
When to use what:
- TP: Fastest inference, splits every layer. Best when GPUs are connected via NVLink.
- PP: Best VRAM reduction, near-perfect memory split. Works over PCIe.
- TP+PP: Maximum scale — run 72B on 4× consumer GPUs.
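Why TP needs an all-reduce can be shown with a toy matmul (a pure-Python sketch with two simulated "GPUs", no NCCL): each device holds half the rows of the weight matrix and the matching half of the input, and summing the partial products reproduces the full result.

```python
# Toy tensor-parallel matmul across 2 simulated "GPUs" (illustrative sketch).
def matmul(x, W):
    """Row vector x times matrix W (list of rows)."""
    return [sum(xi * wij for xi, wij in zip(x, col)) for col in zip(*W)]

x = [1.0, 2.0, 3.0, 4.0]
W = [[1, 0], [0, 1], [1, 1], [2, 0]]  # 4x2 weight matrix

# Shard rows of W (and the matching entries of x) across two devices.
x0, x1 = x[:2], x[2:]
W0, W1 = W[:2], W[2:]
partial0 = matmul(x0, W0)  # computed on "GPU 0"
partial1 = matmul(x1, W1)  # computed on "GPU 1"

# The all-reduce (sum) recombines the partial products into x @ W.
allreduced = [a + b for a, b in zip(partial0, partial1)]
assert allreduced == matmul(x, W)
print(allreduced)  # [12.0, 5.0]
```

Each "GPU" stores only half the weights, which is where the per-GPU VRAM savings in the benchmark come from; the all-reduce per layer is the communication cost that favors NVLink.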
## Advanced Usage

### QLoRA Fine-Tuning (v1.4.0+)
Train any model with QLoRA: LoRA adapters on a quantized INT4 base model.
```bash
# Install training dependencies
pip install zllm-zse[training]
```
```python
from zse.format import load_zse_model, convert_model
from zse.training import (
    LoRAConfig,
    add_lora_to_model,
    save_lora_adapter,
    load_lora_adapter,
)
import torch

# 1. Convert model to .zse (one-time)
convert_model("meta-llama/Llama-3-8B", "llama8b.zse", quantization="int4")

# 2. Load INT4 model
model, tokenizer, info = load_zse_model("llama8b.zse", device="cuda")

# 3. Add LoRA adapters
config = LoRAConfig(
    rank=16,        # LoRA rank (higher = more capacity)
    alpha=32,       # LoRA alpha (scaling factor)
    dropout=0.05,   # Dropout for regularization
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # Which layers
)
model = add_lora_to_model(model, config)
```
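For intuition on the "~1% trainable params" claim, the LoRA parameter count works out on the back of an envelope. The shapes below (hidden size 4096, 32 layers, 4 target projections per layer) are assumptions roughly matching a 7B/8B model, not figures from zse:

```python
# Back-of-envelope LoRA parameter count (assumed 7B-ish shapes:
# hidden=4096, 32 layers, 4 target projections per layer, rank=16).
hidden, layers, targets, rank = 4096, 32, 4, 16

base_per_proj = hidden * hidden   # one frozen (quantized) weight matrix
lora_per_proj = 2 * hidden * rank # adapter A (r x d) plus B (d x r)

trainable = layers * targets * lora_per_proj  # ~16.8M params
frozen = layers * targets * base_per_proj     # ~2.1B params (attn projections)
print(f"trainable fraction of target weights: {trainable / frozen:.2%}")
```

At rank 16 this gives roughly 0.8% of the targeted weights as trainable, which is why the saved adapter is only tens of megabytes.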
