Cyllama
A thin cython wrapper around llama.cpp, whisper.cpp and stable-diffusion.cpp
cyllama - Fast, Pythonic AI Inference
cyllama is a comprehensive no-dependencies Python library for local AI inference built on the state-of-the-art .cpp ecosystem:
- llama.cpp - Text generation, chat, embeddings, and text-to-speech
- whisper.cpp - Speech-to-text transcription and translation
- stable-diffusion.cpp - Image and video generation
It combines the performance of compiled Cython wrappers with a simple, high-level Python API for cross-modal AI inference.
Documentation | PyPI | Changelog
Features
- High-level API -- complete(), chat(), and the LLM class for quick prototyping and text generation
- Streaming -- token-by-token output with callbacks
- Batch processing -- process multiple prompts 3-10x faster
- GPU acceleration -- Metal (macOS), CUDA (NVIDIA), ROCm (AMD), Vulkan (cross-platform)
- Speculative decoding -- 2-3x speedup with draft models
- Agent framework -- ReActAgent, ConstrainedAgent, ContractAgent with tool calling
- RAG -- retrieval-augmented generation with local embeddings and SQLite-vector
- Speech recognition -- whisper.cpp transcription and translation
- Image/Video generation -- stable-diffusion.cpp handles image, image-edit and video models.
- OpenAI-compatible servers -- EmbeddedServer (C/Mongoose) and PythonServer
- Framework integrations -- OpenAI API client, LangChain LLM interface
Installation
From PyPI
```sh
pip install cyllama
```
This installs the CPU backend on Linux and Windows. On macOS, the Metal backend is installed by default to take advantage of Apple Silicon.
GPU-Accelerated Variants
GPU variants are available on PyPI as separate packages (dynamically linked, Linux x86_64 only):
```sh
pip install cyllama-cuda12  # NVIDIA GPU (CUDA 12.4)
pip install cyllama-rocm    # AMD GPU (ROCm 6.3, requires glibc >= 2.35)
pip install cyllama-sycl    # Intel GPU (oneAPI SYCL 2025.3)
pip install cyllama-vulkan  # Cross-platform GPU (Vulkan)
```
All variants install the same cyllama Python package -- only the compiled backend differs. Install one at a time (they replace each other). GPU variants require the corresponding driver/runtime installed on your system.
You can verify which backend is active after installation:
```sh
python -m cyllama info
```
You can also query the backend configuration at runtime:
```python
from cyllama import _backend

print(_backend.cuda)   # True if built with CUDA
print(_backend.metal)  # True if built with Metal
```
Build from source with a specific backend
```sh
GGML_CUDA=1 pip install cyllama --no-binary cyllama
GGML_VULKAN=1 pip install cyllama --no-binary cyllama
```
Quick Start
```python
from cyllama import complete

# One line is all you need
response = complete(
    "Explain quantum computing in simple terms",
    model_path="models/llama.gguf",
    temperature=0.7,
    max_tokens=200
)
print(response)
```
Key Features
Simple by Default, Powerful When Needed
High-Level API - Get started in seconds:
```python
from cyllama import complete, chat, LLM

# One-shot completion
response = complete("What is Python?", model_path="model.gguf")

# Multi-turn chat
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is machine learning?"}
]
response = chat(messages, model_path="model.gguf")

# Reusable LLM instance (faster for multiple prompts)
llm = LLM("model.gguf")
response1 = llm("Question 1")
response2 = llm("Question 2")  # Model stays loaded!
```
Streaming Support - Real-time token-by-token output:
```python
for chunk in complete("Tell me a story", model_path="model.gguf", stream=True):
    print(chunk, end="", flush=True)
```
Performance Optimized
Batch Processing - Process multiple prompts 3-10x faster:
```python
from cyllama import batch_generate

prompts = ["What is 2+2?", "What is 3+3?", "What is 4+4?"]
responses = batch_generate(prompts, model_path="model.gguf")
```
Speculative Decoding - 2-3x speedup with draft models:
```python
from cyllama.llama.llama_cpp import Speculative, SpeculativeParams

params = SpeculativeParams(n_max=16, p_min=0.75)
spec = Speculative(params, ctx_target)  # ctx_target: an existing llama context
draft_tokens = spec.draft(prompt_tokens, last_token)
```
Memory Optimization - Smart GPU layer allocation:
```python
from cyllama import estimate_gpu_layers

estimate = estimate_gpu_layers(
    model_path="model.gguf",
    available_vram_mb=8000
)
print(f"Recommended GPU layers: {estimate.n_gpu_layers}")
```
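The estimate boils down to simple arithmetic: subtract a fixed overhead from available VRAM, then divide by an approximate per-layer weight size. A minimal sketch of that idea (illustrative only -- the function name, overhead figure, and equal-slice assumption are ours, not cyllama's actual estimator):

```python
def estimate_layers(model_size_mb: int, n_layers: int,
                    available_vram_mb: int, overhead_mb: int = 512) -> int:
    """Rough GPU-layer estimate: treat each layer as an equal slice of the
    model weights and reserve a fixed overhead for KV cache and buffers."""
    per_layer_mb = model_size_mb / n_layers
    usable_mb = max(0, available_vram_mb - overhead_mb)
    return min(n_layers, int(usable_mb // per_layer_mb))

# Hypothetical 7B Q4 model: ~4 GB of weights across 32 layers.
print(estimate_layers(4096, 32, available_vram_mb=8000))  # 32 (everything fits)
print(estimate_layers(4096, 32, available_vram_mb=2560))  # 16 (partial offload)
```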
N-gram Cache - 2-10x speedup for repetitive text:
```python
from cyllama.llama.llama_cpp import NgramCache

cache = NgramCache()
cache.update(tokens, ngram_min=2, ngram_max=4)
draft = cache.draft(input_tokens, n_draft=16)
```
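The idea is easy to demonstrate in plain Python: remember which token followed each n-gram, then draft a continuation by chaining lookups -- repetitive text keeps hitting the same n-grams, so long drafts come almost for free. This is an illustrative toy, not cyllama's implementation:

```python
def build_ngram_index(tokens, n=2):
    # Map each n-gram to the token that most recently followed it.
    index = {}
    for i in range(len(tokens) - n):
        index[tuple(tokens[i:i + n])] = tokens[i + n]
    return index

def draft(index, context, n=2, n_draft=4):
    # Greedily extend the context via n-gram lookups; stop on the first miss.
    out = list(context)
    drafted = []
    for _ in range(n_draft):
        key = tuple(out[-n:])
        if key not in index:
            break
        drafted.append(index[key])
        out.append(index[key])
    return drafted

tokens = [1, 2, 3, 4, 1, 2, 3, 4, 1, 2]
idx = build_ngram_index(tokens, n=2)
print(draft(idx, [1, 2], n=2, n_draft=4))  # → [3, 4, 1, 2]
```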
Response Caching - Cache LLM responses for repeated prompts:
```python
from cyllama import LLM

# Enable caching with 100 entries and a 1-hour TTL
llm = LLM("model.gguf", cache_size=100, cache_ttl=3600, seed=42)
response1 = llm("What is Python?")  # Cache miss - generates response
response2 = llm("What is Python?")  # Cache hit - returns cached response instantly

# Check cache statistics
info = llm.cache_info()  # ResponseCacheInfo(hits=1, misses=1, maxsize=100, currsize=1, ttl=3600)

# Clear cache when needed
llm.cache_clear()
```
Note: Caching requires a fixed seed (seed != -1) since random seeds produce non-deterministic output. Streaming responses are not cached.
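Conceptually the cache is a TTL map keyed by everything that affects the output (prompt, sampling parameters, seed) -- which is why a fixed seed is required, since a random seed would make the same key map to different outputs. A minimal sketch of the idea (illustrative; cyllama's internal cache may differ):

```python
import time

class ResponseCache:
    """Tiny TTL cache keyed by (prompt, params, seed). Illustrative only."""

    def __init__(self, maxsize=100, ttl=3600):
        self.maxsize, self.ttl = maxsize, ttl
        self._store = {}  # key -> (value, insertion time)
        self.hits = self.misses = 0

    def get(self, key):
        entry = self._store.get(key)
        if entry is not None and time.monotonic() - entry[1] < self.ttl:
            self.hits += 1
            return entry[0]
        self.misses += 1
        return None

    def put(self, key, value):
        if len(self._store) >= self.maxsize:
            self._store.pop(next(iter(self._store)))  # evict oldest insertion
        self._store[key] = (value, time.monotonic())

cache = ResponseCache(maxsize=2, ttl=60)
key = ("What is Python?", 0.7, 42)   # prompt, temperature, seed
assert cache.get(key) is None        # miss
cache.put(key, "Python is ...")
assert cache.get(key) == "Python is ..."  # hit
```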
Framework Integrations
OpenAI-Compatible API - Drop-in replacement:
```python
from cyllama.integrations import OpenAIClient

client = OpenAIClient(model_path="model.gguf")
response = client.chat.completions.create(
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7
)
print(response.choices[0].message.content)
```
LangChain Integration - Seamless ecosystem access:
```python
from cyllama.integrations import CyllamaLLM
from langchain.chains import LLMChain

llm = CyllamaLLM(model_path="model.gguf", temperature=0.7)
chain = LLMChain(llm=llm, prompt=prompt_template)  # prompt_template: a LangChain PromptTemplate
result = chain.run(topic="AI")
```
Agent Framework
Cyllama includes a zero-dependency agent framework with three agent architectures:
ReActAgent - Reasoning + Acting agent with tool calling:
```python
from cyllama import LLM
from cyllama.agents import ReActAgent, tool
from simpleeval import simple_eval

@tool
def calculate(expression: str) -> str:
    """Evaluate a math expression safely."""
    return str(simple_eval(expression))

llm = LLM("model.gguf")
agent = ReActAgent(llm=llm, tools=[calculate])
result = agent.run("What is 25 * 4?")
print(result.answer)
```
ConstrainedAgent - Grammar-enforced tool calling for 100% reliability:
```python
from cyllama.agents import ConstrainedAgent

agent = ConstrainedAgent(llm=llm, tools=[calculate])
result = agent.run("Calculate 100 / 4")  # Guaranteed valid tool calls
```
ContractAgent - Contract-based agent with C++26-inspired pre/post conditions:
```python
from cyllama.agents import ContractAgent, tool, pre, post, ContractPolicy

@tool
@pre(lambda args: args['x'] != 0, "cannot divide by zero")
@post(lambda r: r is not None, "result must not be None")
def divide(a: float, x: float) -> float:
    """Divide a by x."""
    return a / x

agent = ContractAgent(
    llm=llm,
    tools=[divide],
    policy=ContractPolicy.ENFORCE,
    task_precondition=lambda task: len(task) > 10,
    answer_postcondition=lambda ans: len(ans) > 0,
)
result = agent.run("What is 100 divided by 4?")
```
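The @pre/@post pattern above can be sketched with plain decorators. This is an illustrative reimplementation, not cyllama's actual code -- in particular, how violations are reported under each ContractPolicy is our assumption:

```python
import functools
import inspect

def pre(check, message):
    """Validate the bound arguments before the call."""
    def deco(fn):
        sig = inspect.signature(fn)
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            bound = sig.bind(*args, **kwargs)
            bound.apply_defaults()
            if not check(bound.arguments):
                raise ValueError(f"precondition failed: {message}")
            return fn(*args, **kwargs)
        return wrapper
    return deco

def post(check, message):
    """Validate the return value after the call."""
    def deco(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            result = fn(*args, **kwargs)
            if not check(result):
                raise ValueError(f"postcondition failed: {message}")
            return result
        return wrapper
    return deco

@pre(lambda args: args['x'] != 0, "cannot divide by zero")
@post(lambda r: r is not None, "result must not be None")
def divide(a: float, x: float) -> float:
    return a / x

print(divide(100, 4))  # 25.0; divide(1, 0) raises ValueError
```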
See Agents Overview for detailed agent documentation.
Speech Recognition
Whisper Transcription - Transcribe audio files with timestamps:
```python
from cyllama.whisper import WhisperContext, WhisperFullParams
import numpy as np

# Load model and audio
ctx = WhisperContext("models/ggml-base.en.bin")
samples = load_audio_as_16khz_float32("audio.wav")  # Your audio loading function

# Transcribe
params = WhisperFullParams()
ctx.full(samples, params)

# Get results
for i in range(ctx.full_n_segments()):
    start = ctx.full_get_segment_t0(i) / 100.0
    end = ctx.full_get_segment_t1(i) / 100.0
    text = ctx.full_get_segment_text(i)
    print(f"[{start:.2f}s - {end:.2f}s] {text}")
```
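whisper.cpp expects mono float32 samples at 16 kHz in [-1, 1]. One way to fill in the load_audio_as_16khz_float32 placeholder is with the standard-library wave module, assuming the file is already 16 kHz mono 16-bit PCM (files at other rates need resampling first, e.g. via ffmpeg):

```python
import wave
import numpy as np

def load_audio_as_16khz_float32(path: str) -> np.ndarray:
    # Read a 16 kHz mono 16-bit PCM WAV and scale to float32 in [-1, 1].
    # No resampling here: convert other formats beforehand.
    with wave.open(path, "rb") as wf:
        assert wf.getframerate() == 16000, "expected 16 kHz audio"
        assert wf.getnchannels() == 1, "expected mono audio"
        assert wf.getsampwidth() == 2, "expected 16-bit PCM"
        raw = wf.readframes(wf.getnframes())
    return np.frombuffer(raw, dtype=np.int16).astype(np.float32) / 32768.0
```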
See Whisper docs for full documentation.
Stable Diffusion
Image Generation - Generate images from text using stable-diffusion.cpp:
```python
from cyllama.sd import text_to_image

# Simple text-to-image
images = text_to_image(
    model_path="models/sd_xl_turbo_1.0.q8_0.gguf",
    prompt="a photo of a cute cat",
    width=512,
    height=512,
    sample_steps=4,
    cfg_scale=1.0
)
images[0].save("output.png")
```
Advanced Generation - Full control with SDContext:
```python
from cyllama.sd import SDContext, SDContextParams, SampleMethod, Scheduler

params = SDContextParams()
params.model_path = "models/sd_xl_turbo_1.0.q8_0.gguf"
params.n_threads = 4

ctx = SDContext(params)
images = ctx.generate(
    prompt="a beautiful mountain landscape",
    negative_prompt="blurry, ugly",
    width=512,
    height=512,
    sample_method=SampleMethod.EULER,
    scheduler=Scheduler.DISCRETE
)
```
CLI Tool - Command-line interface:
```sh
# Text to image
python -m cyllama.sd txt2img \
    --model models/sd_xl_turbo_1.0.q8_0.gguf \
    --prompt "a beautiful sunset" \
    --output sunset.png

# Image to image
python -m cyllama.sd img2img \
    --model models/sd-v1-5.gguf \
    --init-img input.png \
    --prompt "oil painting style" \
    --strength 0.7

# Show system info
python -m cyllama.sd info
```
Supports SD 1.x/2.x, SDXL, SD3, FLUX, FLUX2, z-image-turbo, video generation (Wan/CogVideoX), LoRA, ControlNet, inpainting, and ESRGAN upscaling. See [Stable Diffusion docs](docs/stable
