Llmedge
Android-native AI inference library bringing GGUF model and Stable Diffusion inference to Android devices, powered by llama.cpp and stable-diffusion.cpp
llmedge is a lightweight Android library for running GGUF language models fully on-device, powered by llama.cpp.
See the examples repository for sample usage.
Acknowledgments to Shubham Panchal and upstream projects are listed in CREDITS.md.
> [!NOTE]
> This library is in early development and may change significantly.
> [!IMPORTANT]
> API maturity is uneven by feature area. LLMEdge, text inference, speech inference, and model management are the most stable entry points today. OCR via `edge.vision.extractText(...)` is also reliable. Vision/VLM analysis, RAG, and some image/video-generation flows are available and tested, but should still be treated as evolving APIs.
Features
- LLM Inference: Run GGUF models directly on Android using llama.cpp (JNI)
- Model Downloads: Download and cache models from Hugging Face Hub
- Optimized Inference: Native KV cache reuse for compact chats, default batched blocking and streaming text generation, separate prompt vs. generation thread tuning, and Kotlin-managed `ChatSession` replay for reasoning-heavy models
- Speech-to-Text (STT): Whisper.cpp integration with timestamp support, language detection, streaming transcription, and SRT generation
- Text-to-Speech (TTS): Bark.cpp integration with ARM optimizations
- Image Generation: Stable Diffusion with EasyCache and LoRA support
- Video Generation: Wan 2.1 models (4-64 frames) with sequential loading
- On-device RAG: PDF indexing, embeddings, vector search, Q&A
- OCR: Google ML Kit text extraction
- Memory Metrics: Built-in RAM usage monitoring
- Vision Models: Architecture prepared for LLaVA-style models (requires specific model formats)
- GPU Acceleration: Optional Android GPU backends for text, Whisper, and image/video with experimental OpenCL preferred first, Vulkan fallback second, and CPU fallback last
Installation
> [!WARNING]
> For development, Linux is strongly recommended for GPU-enabled builds. The Vulkan shader-generation path used by Stable Diffusion is still unreliable on Windows cross-builds.
Clone the repository along with the llama.cpp and stable-diffusion.cpp submodules:
```shell
git clone --depth=1 https://github.com/Aatricks/llmedge
cd llmedge
git submodule update --init --recursive
```
Open the project in Android Studio. If it does not build automatically, use Build > Rebuild Project.
Usage
Quick Start
The recommended entry point is the instance-based LLMEdge facade. It exposes domain clients for text, speech, image generation, vision, and RAG while keeping model resolution and resource ownership explicit.
```kotlin
val edge = LLMEdge.create(
    context = context,
    scope = viewModelScope,
)

viewModelScope.launch {
    val reply = edge.text.generate(
        prompt = "Summarize on-device LLMs in one sentence.",
    )
    outputView.text = reply
}
```
Low-level wrappers like SmolLM, StableDiffusion, Whisper, and BarkTTS remain available for expert workflows, but new code should prefer LLMEdge.
By default, `edge.text.generate(...)` uses batched native decoding for lower JNI overhead, while `edge.text.stream(...)` uses smaller batched chunks so UI updates stay responsive without paying a JNI crossing per token.
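The trade-off can be illustrated with a toy sketch in plain Kotlin (illustrative only; this is not llmedge's native decode path, and `chunkTokens` is a hypothetical helper):

```kotlin
// Toy illustration of chunked streaming: group decoded tokens so the UI
// pays one callback per chunk instead of one per token. Illustrative only;
// not llmedge's actual native code.
fun chunkTokens(tokens: List<String>, chunkSize: Int): List<String> =
    tokens.chunked(chunkSize).map { it.joinToString("") }
```

With a chunk size of 4, sixteen decoded tokens become four UI updates; the blocking `generate` path takes the same idea further by returning the whole completion in a single crossing.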
Downloading Models
llmedge can resolve and cache model weights independently of inference:
```kotlin
val edge = LLMEdge.create(context, viewModelScope)

val modelFile = edge.models.prefetch(
    ModelSpec.huggingFace(
        repoId = "unsloth/Qwen3-0.6B-GGUF",
        filename = "Qwen3-0.6B-Q4_K_M.gguf",
    ),
)
Log.d("llmedge", "Cached ${modelFile.name} at ${modelFile.parent}")
```
Key points:
- `ModelManager.prefetch()` downloads (if needed) without coupling the file to one inference class.
- Supports progress callbacks and private repositories via token through `ModelSpec.huggingFace(...)`.
- Requests to old mirrors automatically resolve to up-to-date Hugging Face repos.
- Automatically uses the model's declared context window (minimum 1K tokens) and caps it to a heap-aware limit (2K–8K). Override with `InferenceParams(contextSize = …)` if needed.
- Large downloads use Android's DownloadManager when `preferSystemDownloader = true` to keep transfers out of the Dalvik heap.
- Advanced users can still call `HuggingFaceHub.ensureModelOnDisk()` directly when they want full control.
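The context-window rule above can be written out as a small pure function. This is a sketch of the stated policy, not the library's implementation; `resolveContextSize` and `heapAwareCap` are hypothetical names:

```kotlin
// Sketch of the documented policy: honor the model's declared context
// window (at least 1K tokens), but never exceed a heap-aware cap that is
// itself clamped to the 2K-8K range. Not the library's actual code.
fun resolveContextSize(declaredContext: Int, heapAwareCap: Int): Int {
    val cap = heapAwareCap.coerceIn(2048, 8192)
    return declaredContext.coerceAtLeast(1024).coerceAtMost(cap)
}
```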
Reasoning Controls
SmolLM lets you disable or re-enable "thinking" traces produced by reasoning-aware models through the `ThinkingMode` enum and the optional `reasoningBudget` parameter. The default configuration keeps thinking enabled (`ThinkingMode.DEFAULT`, reasoning budget `-1`). To start a session with thinking disabled (equivalent to passing `--no-think` or `--reasoning-budget 0`), specify it when loading the model:
```kotlin
val smol = SmolLM()
val params = SmolLM.InferenceParams(
    thinkingMode = SmolLM.ThinkingMode.DISABLED,
    reasoningBudget = 0, // explicit override, optional when the mode is DISABLED
)
smol.load(modelPath, params)
```
At runtime you can flip the behaviour without reloading the model:
```kotlin
smol.setThinkingEnabled(true)          // restore the default
smol.setReasoningBudget(0)             // force-disable thoughts again
val budget = smol.getReasoningBudget() // inspect the current budget
val mode = smol.getThinkingMode()      // inspect the current mode
```
Setting the budget to `0` always disables thinking, while `-1` leaves it unrestricted. If you omit `reasoningBudget`, the library chooses `0` when the mode is `DISABLED` and `-1` otherwise. The API also injects the `/no_think` tag automatically when thinking is disabled, so you do not need to modify prompts manually.
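The defaulting rule reads naturally as a tiny pure function. This is illustrative only: a trimmed-down two-value `ThinkingMode` is assumed here, whereas the real enum lives on `SmolLM` and may carry more values, and `resolveReasoningBudget` is a hypothetical helper:

```kotlin
// Illustrative helper mirroring the documented defaulting rule: an explicit
// budget always wins; otherwise DISABLED maps to 0 (thinking off) and every
// other mode maps to -1 (unrestricted). Simplified two-value enum.
enum class ThinkingMode { DEFAULT, DISABLED }

fun resolveReasoningBudget(mode: ThinkingMode, explicitBudget: Int? = null): Int =
    explicitBudget ?: if (mode == ThinkingMode.DISABLED) 0 else -1
```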
Managed Chat Sessions
Use `edge.text.session(...)` when you want bounded multi-turn chat without exposing native `storeChats` state to application code.
```kotlin
val edge = LLMEdge.create(context, viewModelScope)

val session = edge.text.session(
    memory = ConversationWindow(
        maxTurns = 6,
        maxTokens = 4096,
        stripThinkTags = true,
    ),
    systemPrompt = "You are a concise assistant.",
)

viewModelScope.launch {
    session.prepare()
    val reply = session.reply("Explain why context windows fill up.")

    session.stream("Now summarize that in 3 bullets.").collect { event ->
        when (event) {
            is TextStreamEvent.Chunk -> print(event.value)
            is TextStreamEvent.Completed -> println(event.fullText)
            else -> Unit
        }
    }
}
```
The session API keeps transcript state in Kotlin, applies sliding-window trimming, and strips replayed `<think>...</think>` blocks by default so reasoning-heavy models do not exhaust the context window as quickly.
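Tag stripping amounts to deleting `<think>...</think>` spans before a turn is replayed into the context. A minimal sketch of that idea (not the library's actual trimming code; `stripThinkTags` is a hypothetical name):

```kotlin
// Minimal sketch: delete <think>...</think> spans from a turn before it is
// replayed, so reasoning traces do not consume context tokens. (?s) lets
// '.' match newlines; the lazy quantifier stops at the first closing tag.
val thinkTagRegex = Regex("(?s)<think>.*?</think>")

fun stripThinkTags(turn: String): String =
    thinkTagRegex.replace(turn, "").trim()
```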
Tool Calling
Use `edge.text.toolAgent(...)` when you want the model to call app-defined tools. Read-only tools execute automatically; action tools require an explicit policy decision.
```kotlin
val edge = LLMEdge.create(context, viewModelScope)
val factory = DeviceToolFactory(context)

val agent = edge.text.toolAgent(
    tools = factory.createDefaultTools(),
    systemPrompt = "Be concise and only use tools when needed.",
    policy = ToolPolicies.ALLOW_ALL, // or keep the default to deny action tools
)

viewModelScope.launch {
    val result = agent.reply("What time is it and how much battery is left?")
    println(result.text)

    agent.stream("Open https://example.com").collect { event ->
        when (event) {
            is ToolAgentEvent.ToolCallRequested -> println("Tool: ${event.call.tool}")
            is ToolAgentEvent.TextChunk -> print(event.value)
            is ToolAgentEvent.Completed -> println("\nDone: ${event.result.finishReason}")
            else -> Unit
        }
    }
}
```
Tool calls use a structured JSON envelope internally: `{"tool":"name","arguments":{...}}`. The parser also accepts the legacy `tool_name` field for robustness, but new prompts only emit the `tool` shape.
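The fallback order can be sketched with simple string matching. This is illustrative only: a real implementation would use a proper JSON parser, and `extractToolName` is a hypothetical helper, not llmedge API:

```kotlin
// Simplified sketch of the tolerant envelope handling described above:
// prefer the "tool" field, fall back to the legacy "tool_name" field.
// Regex matching stands in for real JSON parsing here.
fun extractToolName(envelope: String): String? {
    val primary = Regex("\"tool\"\\s*:\\s*\"([^\"]+)\"").find(envelope)
    val legacy = Regex("\"tool_name\"\\s*:\\s*\"([^\"]+)\"").find(envelope)
    return (primary ?: legacy)?.groupValues?.get(1)
}
```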
Text Generation Performance Tuning
The text stack now separates prompt/batch processing from single-token generation so you can tune the two phases independently:
```kotlin
val edge = LLMEdge.create(
    context = context,
    scope = viewModelScope,
    config = LLMEdgeConfig(
        defaultTextThreads = 6,           // prompt/batch phase
        defaultTextGenerationThreads = 2, // token-by-token phase
        defaultTextBatchSize = 8,
        defaultTextStreamBatchSize = 4,
        textCacheMemoryMb = 1536,
    ),
)

val reply = edge.text.generate(
    prompt = "Explain speculative decoding.",
    options = TextModelOptions(numThreads = 8, generationThreads = 3),
    batchSize = 12,
)
```
Practical defaults:
- `defaultTextThreads`: prompt/batch decode threads
- `defaultTextGenerationThreads`: single-token generation threads
- `defaultTextBatchSize`: blocking text batch size (default `8`)
- `defaultTextStreamBatchSize`: streaming batch size (default `4`)
- `textCacheMemoryMb`: memory budget (in MB) for the text model cache