
Llmedge

Android-native AI inference library, bringing GGUF model and Stable Diffusion inference to Android devices, powered by llama.cpp and stable-diffusion.cpp.

Install / Use

/learn @Aatricks/Llmedge

README

llmedge

llmedge is a lightweight Android library for running GGUF language models fully on-device, powered by llama.cpp.

See the examples repository for sample usage.

Acknowledgments to Shubham Panchal and upstream projects are listed in CREDITS.md.

[!NOTE] This library is in early development and may change significantly.

[!IMPORTANT] API maturity is uneven by feature area. LLMEdge, text inference, speech inference, and model management are the most stable entry points today. OCR via edge.vision.extractText(...) is also reliable. Vision/VLM analysis, RAG, and some image/video-generation flows are available and tested, but should still be treated as evolving APIs.


Features

  • LLM Inference: Run GGUF models directly on Android using llama.cpp (JNI)
  • Model Downloads: Download and cache models from Hugging Face Hub
  • Optimized Inference: Native KV-cache reuse for compact chats, batched blocking and streaming text generation by default, separate prompt vs. generation thread tuning, and Kotlin-managed ChatSession replay for reasoning-heavy models
  • Speech-to-Text (STT): Whisper.cpp integration with timestamp support, language detection, streaming transcription, and SRT generation
  • Text-to-Speech (TTS): Bark.cpp integration with ARM optimizations
  • Image Generation: Stable Diffusion with EasyCache and LoRA support
  • Video Generation: Wan 2.1 models (4-64 frames) with sequential loading
  • On-device RAG: PDF indexing, embeddings, vector search, Q&A
  • OCR: Google ML Kit text extraction
  • Memory Metrics: Built-in RAM usage monitoring
  • Vision Models: Architecture prepared for LLaVA-style models (requires specific model formats)
  • GPU Acceleration: Optional Android GPU backends for text, Whisper, and image/video, tried in order: experimental OpenCL first, then Vulkan, then a CPU fallback

Table of Contents

  1. Installation
  2. Usage
  3. Building
  4. Architecture
  5. Technologies
  6. Memory Metrics
  7. Notes
  8. Testing

Installation

[!WARNING] For development, Linux is strongly recommended for GPU-enabled builds. The Vulkan shader-generation path used by Stable Diffusion is still unreliable on Windows cross-builds.

Clone the repository along with the llama.cpp and stable-diffusion.cpp submodules:

git clone --depth=1 https://github.com/Aatricks/llmedge
cd llmedge
git submodule update --init --recursive

Open the project in Android Studio. If it does not build automatically, use Build > Rebuild Project.

Usage

Quick Start

The recommended entry point is the instance-based LLMEdge facade. It exposes domain clients for text, speech, image generation, vision, and RAG while keeping model resolution and resource ownership explicit.

val edge = LLMEdge.create(
    context = context,
    scope = viewModelScope,
)

viewModelScope.launch {
    val reply = edge.text.generate(
        prompt = "Summarize on-device LLMs in one sentence.",
    )
    outputView.text = reply
}

Low-level wrappers like SmolLM, StableDiffusion, Whisper, and BarkTTS remain available for expert workflows, but new code should prefer LLMEdge.

By default, edge.text.generate(...) uses batched native decoding for lower JNI overhead, while edge.text.stream(...) uses smaller batched chunks so UI updates stay responsive without paying a JNI crossing per token.
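The chunking idea behind batched streaming can be illustrated with plain Kotlin. This is a conceptual sketch only: batchTokens is a hypothetical helper, not part of the llmedge API, and it stands in for grouping that actually happens on the native side.

```kotlin
// Illustrative sketch of batched streaming: instead of surfacing one token per
// native (JNI) crossing, tokens are grouped into small chunks before being
// emitted to the UI layer. Not part of the llmedge API.
fun batchTokens(tokens: Sequence<String>, streamBatchSize: Int = 4): Sequence<String> =
    tokens
        .chunked(streamBatchSize)        // group tokens into fixed-size chunks
        .map { it.joinToString("") }     // emit each chunk as one string
```

With a stream batch size of 4 (the library's documented default), a 40-token reply costs roughly 10 emissions instead of 40, which is the trade-off the paragraph above describes.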

Downloading Models

llmedge can resolve and cache model weights independently of inference:

val edge = LLMEdge.create(context, viewModelScope)

val modelFile = edge.models.prefetch(
    ModelSpec.huggingFace(
        repoId = "unsloth/Qwen3-0.6B-GGUF",
        filename = "Qwen3-0.6B-Q4_K_M.gguf",
    ),
)

Log.d("llmedge", "Cached ${modelFile.name} at ${modelFile.parent}")

Key points:

  • ModelManager.prefetch() downloads (if needed) without coupling the file to one inference class.

  • Supports progress callbacks and private repositories via token through ModelSpec.huggingFace(...).

  • Requests to old mirrors automatically resolve to up-to-date Hugging Face repos.

  • Automatically uses the model's declared context window (minimum 1K tokens) and caps it to a heap-aware limit (2K–8K). Override with InferenceParams(contextSize = …) if needed.

  • Large downloads use Android's DownloadManager when preferSystemDownloader = true to keep transfers out of the Dalvik heap.

  • Advanced users can still call HuggingFaceHub.ensureModelOnDisk() directly when they want full control.
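The context-window rule described above (declared window, 1K minimum, heap-aware 2K–8K cap, explicit override wins) can be sketched as pure logic. The function and parameter names here are illustrative, not the library's actual internals:

```kotlin
// Sketch of the context-size resolution rule described above. Illustrative
// only; llmedge's real heap-aware logic may differ in detail.
fun resolveContextSize(
    declaredContext: Int,   // context window declared in the GGUF metadata
    heapAwareCap: Int,      // cap derived from available heap
    override: Int? = null,  // explicit InferenceParams(contextSize = ...) wins
): Int {
    if (override != null) return override
    val floor = maxOf(declaredContext, 1024)     // minimum 1K tokens
    val cap = heapAwareCap.coerceIn(2048, 8192)  // heap-aware cap stays in 2K-8K
    return minOf(floor, cap)
}
```

So a model declaring a 32K context on a device whose heap-aware cap is 4K ends up with a 4K window unless the caller overrides it.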

Reasoning Controls

SmolLM lets you disable or re-enable "thinking" traces produced by reasoning-aware models through the ThinkingMode enum and the optional reasoningBudget parameter. The default configuration keeps thinking enabled (ThinkingMode.DEFAULT, reasoning budget -1). To start a session with thinking disabled (equivalent to passing --no-think or --reasoning-budget 0), specify it when loading the model:

val smol = SmolLM()

val params = SmolLM.InferenceParams(
    thinkingMode = SmolLM.ThinkingMode.DISABLED,
    reasoningBudget = 0, // explicit override, optional when the mode is DISABLED
)
smol.load(modelPath, params)

At runtime you can flip the behaviour without reloading the model:

smol.setThinkingEnabled(true)              // restore the default
smol.setReasoningBudget(0)                 // force-disable thoughts again
val budget = smol.getReasoningBudget()     // inspect the current budget
val mode = smol.getThinkingMode()          // inspect the current mode

Setting the budget to 0 always disables thinking, while -1 leaves it unrestricted. If you omit reasoningBudget, the library chooses 0 when the mode is DISABLED and -1 otherwise. The API also injects the /no_think tag automatically when thinking is disabled, so you do not need to modify prompts manually.
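The budget-resolution rule above can be written out explicitly. This is a standalone sketch: the enum mirrors SmolLM.ThinkingMode, and resolveReasoningBudget is a hypothetical helper, not a library function.

```kotlin
// Standalone sketch of the reasoning-budget defaults described above.
// Mirrors SmolLM.ThinkingMode for illustration; not the library's own code.
enum class ThinkingMode { DEFAULT, DISABLED }

fun resolveReasoningBudget(mode: ThinkingMode, explicitBudget: Int? = null): Int =
    explicitBudget                                        // an explicit budget always wins
        ?: if (mode == ThinkingMode.DISABLED) 0 else -1   // 0 disables, -1 unrestricted

fun isThinkingEnabled(budget: Int): Boolean = budget != 0
```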

Managed Chat Sessions

Use edge.text.session(...) when you want bounded multi-turn chat without exposing native storeChats state to application code.

val edge = LLMEdge.create(context, viewModelScope)

val session = edge.text.session(
    memory = ConversationWindow(
        maxTurns = 6,
        maxTokens = 4096,
        stripThinkTags = true,
    ),
    systemPrompt = "You are a concise assistant.",
)

viewModelScope.launch {
    session.prepare()
    val reply = session.reply("Explain why context windows fill up.")
    session.stream("Now summarize that in 3 bullets.").collect { event ->
        when (event) {
            is TextStreamEvent.Chunk -> print(event.value)
            is TextStreamEvent.Completed -> println(event.fullText)
            else -> Unit
        }
    }
}

The new session API keeps transcript state in Kotlin, applies sliding-window trimming, and strips replayed <think>...</think> blocks by default so reasoning-heavy models do not exhaust the context window as quickly.
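The think-tag stripping applied during replay can be sketched with a regex. This is a minimal standalone version of the idea behind stripThinkTags = true, not the library's actual trimming code:

```kotlin
// Minimal sketch of replay-time <think> stripping, as enabled by
// stripThinkTags = true above. The library's trimming may be more involved.
private val thinkBlock = Regex("(?s)<think>.*?</think>\\s*")

fun stripThinkTags(turn: String): String = thinkBlock.replace(turn, "").trim()
```

Dropping the `<think>...</think>` blocks before a turn is replayed is what keeps reasoning-heavy models from filling the context window with their own chain-of-thought.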

Tool Calling

Use edge.text.toolAgent(...) when you want the model to call app-defined tools. Read-only tools execute automatically; action tools require an explicit policy decision.

val edge = LLMEdge.create(context, viewModelScope)
val factory = DeviceToolFactory(context)

val agent = edge.text.toolAgent(
    tools = factory.createDefaultTools(),
    systemPrompt = "Be concise and only use tools when needed.",
    policy = ToolPolicies.ALLOW_ALL, // or keep the default to deny action tools
)

viewModelScope.launch {
    val result = agent.reply("What time is it and how much battery is left?")
    println(result.text)

    agent.stream("Open https://example.com").collect { event ->
        when (event) {
            is ToolAgentEvent.ToolCallRequested -> println("Tool: ${event.call.tool}")
            is ToolAgentEvent.TextChunk -> print(event.value)
            is ToolAgentEvent.Completed -> println("\nDone: ${event.result.finishReason}")
            else -> Unit
        }
    }
}

Tool calls use a structured JSON envelope internally: {"tool":"name","arguments":{...}}. The parser also accepts the legacy tool_name field for robustness, but new prompts only emit the tool shape.
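The dual-field acceptance can be illustrated with a simplified parser. The real parser handles full JSON envelopes; this regex-only sketch (parseToolName is a hypothetical helper) just shows the tool / legacy tool_name fallback:

```kotlin
// Simplified sketch of tool-call envelope parsing: accepts both the current
// "tool" field and the legacy "tool_name" field. Illustration only; the real
// parser processes the full JSON envelope including "arguments".
private val toolField = Regex("\"(?:tool|tool_name)\"\\s*:\\s*\"([^\"]+)\"")

fun parseToolName(envelope: String): String? =
    toolField.find(envelope)?.groupValues?.get(1)
```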

Text Generation Performance Tuning

The text stack now separates prompt/batch processing from single-token generation so you can tune the two phases independently:

val edge = LLMEdge.create(
    context = context,
    scope = viewModelScope,
    config = LLMEdgeConfig(
        defaultTextThreads = 6,            // prompt/batch phase
        defaultTextGenerationThreads = 2,  // token-by-token phase
        defaultTextBatchSize = 8,
        defaultTextStreamBatchSize = 4,
        textCacheMemoryMb = 1536,
    ),
)

val reply = edge.text.generate(
    prompt = "Explain speculative decoding.",
    options = TextModelOptions(numThreads = 8, generationThreads = 3),
    batchSize = 12,
)

Practical defaults:

  • defaultTextThreads: prompt/batch decode threads
  • defaultTextGenerationThreads: single-token generation threads
  • defaultTextBatchSize: blocking text batch size (default 8)
  • defaultTextStreamBatchSize: streaming batch size (default 4)
  • textCacheMemoryMb: memory budget (MB) for the text model cache
