
Llmedge

Android-native AI inference library, bringing GGUF model and Stable Diffusion inference to Android devices, powered by llama.cpp and stable-diffusion.cpp.

Install / Use

/learn @Aatricks/Llmedge

README

llmedge

llmedge is a lightweight Android library for running GGUF language models fully on-device, powered by llama.cpp.

See the examples repository for sample usage.

Acknowledgments to Shubham Panchal and upstream projects are listed in CREDITS.md.

[!NOTE] This library is in early development and may change significantly.

[!IMPORTANT] API maturity is uneven by feature area. LLMEdge, text inference, speech inference, and model management are the most stable entry points today. OCR via edge.vision.extractText(...) is also reliable. Vision/VLM analysis, RAG, and some image/video-generation flows are available and tested, but should still be treated as evolving APIs.


Features

  • LLM Inference: Run GGUF models directly on Android using llama.cpp (JNI)
  • Model Downloads: Download and cache models from Hugging Face Hub
  • Optimized Inference: Native KV-cache reuse for compact chats, batched blocking and streaming text generation by default, separate prompt vs. generation thread tuning, and Kotlin-managed ChatSession replay for reasoning-heavy models
  • Speech-to-Text (STT): Whisper.cpp integration with timestamp support, language detection, streaming transcription, and SRT generation
  • Text-to-Speech (TTS): Bark.cpp integration with ARM optimizations
  • Image Generation: Stable Diffusion with EasyCache and LoRA support
  • Video Generation: Wan 2.1 models (4-64 frames) with sequential loading
  • On-device RAG: PDF indexing, embeddings, vector search, Q&A
  • OCR: Google ML Kit text extraction
  • Memory Metrics: Built-in RAM usage monitoring
  • Vision Models: Architecture prepared for LLaVA-style models (requires specific model formats)
  • GPU Acceleration: Optional Android GPU backends for text, Whisper, and image/video, tried in order: experimental OpenCL first, then Vulkan, then a CPU fallback

Table of Contents

  1. Installation
  2. Usage
  3. Building
  4. Architecture
  5. Technologies
  6. Memory Metrics
  7. Notes
  8. Testing

Installation

[!WARNING] For development, Linux is strongly recommended for GPU-enabled builds. The Vulkan shader-generation path used by Stable Diffusion is still unreliable on Windows cross-builds.

Clone the repository along with the llama.cpp and stable-diffusion.cpp submodules:

git clone --depth=1 https://github.com/Aatricks/llmedge
cd llmedge
git submodule update --init --recursive

Open the project in Android Studio. If it does not build automatically, use Build > Rebuild Project.

Usage

Quick Start

The recommended entry point is the instance-based LLMEdge facade. It exposes domain clients for text, speech, image generation, vision, and RAG while keeping model resolution and resource ownership explicit.

val edge = LLMEdge.create(
    context = context,
    scope = viewModelScope,
)

viewModelScope.launch {
    val reply = edge.text.generate(
        prompt = "Summarize on-device LLMs in one sentence.",
    )
    outputView.text = reply
}

Low-level wrappers like SmolLM, StableDiffusion, Whisper, and BarkTTS remain available for expert workflows, but new code should prefer LLMEdge.

By default, edge.text.generate(...) uses batched native decoding for lower JNI overhead, while edge.text.stream(...) uses smaller batched chunks so UI updates stay responsive without paying a JNI crossing per token.
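The chunking idea behind batched streaming can be illustrated with plain Kotlin. This is a conceptual sketch only: batchTokens is a hypothetical helper, not part of the llmedge API, and it stands in for grouping that actually happens on the native side.

```kotlin
// Illustrative sketch of batched streaming: instead of surfacing one token per
// native (JNI) crossing, tokens are grouped into small chunks before being
// emitted to the UI layer. Not part of the llmedge API.
fun batchTokens(tokens: Sequence<String>, streamBatchSize: Int = 4): Sequence<String> =
    tokens
        .chunked(streamBatchSize)        // group tokens into fixed-size chunks
        .map { it.joinToString("") }     // emit each chunk as one string
```

With a stream batch size of 4 (the library's documented default), a 40-token reply costs roughly 10 emissions instead of 40, which is the trade-off the paragraph above describes.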

Downloading Models

llmedge can resolve and cache model weights independently of inference:

val edge = LLMEdge.create(context, viewModelScope)

val modelFile = edge.models.prefetch(
    ModelSpec.huggingFace(
        repoId = "unsloth/Qwen3-0.6B-GGUF",
        filename = "Qwen3-0.6B-Q4_K_M.gguf",
    ),
)

Log.d("llmedge", "Cached ${modelFile.name} at ${modelFile.parent}")

Key points:

  • ModelManager.prefetch() downloads (if needed) without coupling the file to one inference class.

  • Supports progress callbacks and private repositories via token through ModelSpec.huggingFace(...).

  • Requests to old mirrors automatically resolve to up-to-date Hugging Face repos.

  • Automatically uses the model's declared context window (minimum 1K tokens) and caps it to a heap-aware limit (2K–8K). Override with InferenceParams(contextSize = …) if needed.

  • Large downloads use Android's DownloadManager when preferSystemDownloader = true to keep transfers out of the Dalvik heap.

  • Advanced users can still call HuggingFaceHub.ensureModelOnDisk() directly when they want full control.
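The context-window rule described above (declared window, 1K minimum, heap-aware 2K–8K cap, explicit override wins) can be sketched as pure logic. The function and parameter names here are illustrative, not the library's actual internals:

```kotlin
// Sketch of the context-size resolution rule described above. Illustrative
// only; llmedge's real heap-aware logic may differ in detail.
fun resolveContextSize(
    declaredContext: Int,   // context window declared in the GGUF metadata
    heapAwareCap: Int,      // cap derived from available heap
    override: Int? = null,  // explicit InferenceParams(contextSize = ...) wins
): Int {
    if (override != null) return override
    val floor = maxOf(declaredContext, 1024)     // minimum 1K tokens
    val cap = heapAwareCap.coerceIn(2048, 8192)  // heap-aware cap stays in 2K-8K
    return minOf(floor, cap)
}
```

So a model declaring a 32K context on a device whose heap-aware cap is 4K ends up with a 4K window unless the caller overrides it.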

Reasoning Controls

SmolLM lets you disable or re-enable "thinking" traces produced by reasoning-aware models through the ThinkingMode enum and the optional reasoningBudget parameter. The default configuration keeps thinking enabled (ThinkingMode.DEFAULT, reasoning budget -1). To start a session with thinking disabled (equivalent to passing --no-think or --reasoning-budget 0), specify it when loading the model:

val smol = SmolLM()

val params = SmolLM.InferenceParams(
    thinkingMode = SmolLM.ThinkingMode.DISABLED,
    reasoningBudget = 0, // explicit override, optional when the mode is DISABLED
)
smol.load(modelPath, params)

At runtime you can flip the behaviour without reloading the model:

smol.setThinkingEnabled(true)              // restore the default
smol.setReasoningBudget(0)                 // force-disable thoughts again
val budget = smol.getReasoningBudget()     // inspect the current budget
val mode = smol.getThinkingMode()          // inspect the current mode

Setting the budget to 0 always disables thinking, while -1 leaves it unrestricted. If you omit reasoningBudget, the library chooses 0 when the mode is DISABLED and -1 otherwise. The API also injects the /no_think tag automatically when thinking is disabled, so you do not need to modify prompts manually.
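The budget-resolution rule above can be written out explicitly. This is a standalone sketch: the enum mirrors SmolLM.ThinkingMode, and resolveReasoningBudget is a hypothetical helper, not a library function.

```kotlin
// Standalone sketch of the reasoning-budget defaults described above.
// Mirrors SmolLM.ThinkingMode for illustration; not the library's own code.
enum class ThinkingMode { DEFAULT, DISABLED }

fun resolveReasoningBudget(mode: ThinkingMode, explicitBudget: Int? = null): Int =
    explicitBudget                                        // an explicit budget always wins
        ?: if (mode == ThinkingMode.DISABLED) 0 else -1   // 0 disables, -1 unrestricted

fun isThinkingEnabled(budget: Int): Boolean = budget != 0
```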

Managed Chat Sessions

Use edge.text.session(...) when you want bounded multi-turn chat without exposing native storeChats state to application code.

val edge = LLMEdge.create(context, viewModelScope)

val session = edge.text.session(
    memory = ConversationWindow(
        maxTurns = 6,
        maxTokens = 4096,
        stripThinkTags = true,
    ),
    systemPrompt = "You are a concise assistant.",
)

viewModelScope.launch {
    session.prepare()
    val reply = session.reply("Explain why context windows fill up.")
    session.stream("Now summarize that in 3 bullets.").collect { event ->
        when (event) {
            is TextStreamEvent.Chunk -> print(event.value)
            is TextStreamEvent.Completed -> println(event.fullText)
            else -> Unit
        }
    }
}

The new session API keeps transcript state in Kotlin, applies sliding-window trimming, and strips replayed <think>...</think> blocks by default so reasoning-heavy models do not exhaust the context window as quickly.
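The think-tag stripping applied during replay can be sketched with a regex. This is a minimal standalone version of the idea behind stripThinkTags = true, not the library's actual trimming code:

```kotlin
// Minimal sketch of replay-time <think> stripping, as enabled by
// stripThinkTags = true above. The library's trimming may be more involved.
private val thinkBlock = Regex("(?s)<think>.*?</think>\\s*")

fun stripThinkTags(turn: String): String = thinkBlock.replace(turn, "").trim()
```

Dropping the `<think>...</think>` blocks before a turn is replayed is what keeps reasoning-heavy models from filling the context window with their own chain-of-thought.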

Tool Calling

Use edge.text.toolAgent(...) when you want the model to call app-defined tools. Read-only tools execute automatically; action tools require an explicit policy decision.

val edge = LLMEdge.create(context, viewModelScope)
val factory = DeviceToolFactory(context)

val agent = edge.text.toolAgent(
    tools = factory.createDefaultTools(),
    systemPrompt = "Be concise and only use tools when needed.",
    policy = ToolPolicies.ALLOW_ALL, // or keep the default to deny action tools
)

viewModelScope.launch {
    val result = agent.reply("What time is it and how much battery is left?")
    println(result.text)

    agent.stream("Open https://example.com").collect { event ->
        when (event) {
            is ToolAgentEvent.ToolCallRequested -> println("Tool: ${event.call.tool}")
            is ToolAgentEvent.TextChunk -> print(event.value)
            is ToolAgentEvent.Completed -> println("\nDone: ${event.result.finishReason}")
            else -> Unit
        }
    }
}

Tool calls use a structured JSON envelope internally: {"tool":"name","arguments":{...}}. The parser also accepts the legacy tool_name field for robustness, but new prompts only emit the tool shape.
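The dual-field acceptance can be illustrated with a simplified parser. The real parser handles full JSON envelopes; this regex-only sketch (parseToolName is a hypothetical helper) just shows the tool / legacy tool_name fallback:

```kotlin
// Simplified sketch of tool-call envelope parsing: accepts both the current
// "tool" field and the legacy "tool_name" field. Illustration only; the real
// parser processes the full JSON envelope including "arguments".
private val toolField = Regex("\"(?:tool|tool_name)\"\\s*:\\s*\"([^\"]+)\"")

fun parseToolName(envelope: String): String? =
    toolField.find(envelope)?.groupValues?.get(1)
```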

Text Generation Performance Tuning

The text stack now separates prompt/batch processing from single-token generation so you can tune the two phases independently:

val edge = LLMEdge.create(
    context = context,
    scope = viewModelScope,
    config = LLMEdgeConfig(
        defaultTextThreads = 6,            // prompt/batch phase
        defaultTextGenerationThreads = 2,  // token-by-token phase
        defaultTextBatchSize = 8,
        defaultTextStreamBatchSize = 4,
        textCacheMemoryMb = 1536,
    ),
)

val reply = edge.text.generate(
    prompt = "Explain speculative decoding.",
    options = TextModelOptions(numThreads = 8, generationThreads = 3),
    batchSize = 12,
)

Practical defaults:

  • defaultTextThreads: prompt/batch decode threads
  • defaultTextGenerationThreads: single-token generation threads
  • defaultTextBatchSize: blocking text batch size (default 8)
  • defaultTextStreamBatchSize: streaming batch size (default 4)
  • textCacheMemoryMb: memory budget (MB) for the text model cache
