llama.rn


React Native binding of llama.cpp - LLM inference in C/C++

Key Features:

  • GPU/NPU Acceleration: Metal (iOS), Hexagon NPU (Android, Experimental) for on-device inference
  • Multimodal Support: Vision and audio understanding models via mmproj projector integration
  • Parallel Decoding: Slot-based concurrent request processing with automatic queue management
  • Tool Calling: Universal function calling support via Jinja templates
  • Grammar Sampling: GBNF and JSON schema support for structured, constrained output generation
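As a sketch of the grammar-sampling feature above: a helper that builds completion params constraining output to a strict yes/no answer via a GBNF grammar. The `grammar` parameter mirrors llama.cpp's server API; `buildYesNoParams` is a hypothetical helper, not part of llama.rn.

```typescript
// Hypothetical helper: completion params that constrain generation to a
// strict yes/no answer using a GBNF grammar string.
function buildYesNoParams(prompt: string) {
  const grammar = 'root ::= "yes" | "no"' // GBNF: output must be exactly "yes" or "no"
  return { prompt, n_predict: 4, grammar }
}

// Usage with an initialized context (sketch):
// const { text } = await context.completion(buildYesNoParams('Is water wet? Answer: '))
```

For structured JSON output, the `json_schema` / `response_format` route described above works the same way: the schema is compiled to a grammar that constrains sampling token by token.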

[!IMPORTANT] Starting with v0.10, llama.rn requires React Native's New Architecture.

For Old Architecture support or documentation for v0.9.x, please refer to the v0.9 branch.

Installation

npm install llama.rn

iOS

Run npx pod-install.

By default, llama.rn will use pre-built rnllama.xcframework for iOS. If you want to build from source, please set RNLLAMA_BUILD_FROM_SOURCE to 1 in your Podfile.

Android

Add a ProGuard rule if ProGuard is enabled in your project (android/app/proguard-rules.pro):

# llama.rn
-keep class com.rnllama.** { *; }

By default, llama.rn will use pre-built libraries for Android. If you want to build from source, please set rnllamaBuildFromSource to true in android/gradle.properties.

OpenCL (GPU acceleration)
  • Confirm the target device exposes an OpenCL-capable GPU (Qualcomm Adreno 700+ devices are currently supported & tested).
  • Add <uses-native-library android:name="libOpenCL.so" android:required="false" /> to your app manifest so the library can be loaded at runtime.
  • Configure n_gpu_layers (> 0) when calling initLlama to offload layers to the GPU. The native result exposes gpu, reasonNoGPU, devices, so you can confirm runtime behaviour.
Hexagon (NPU acceleration) (Experimental)
  • Confirm the target device has an HTP (Hexagon Tensor Processor); Qualcomm SM8450+ (8 Gen 1 or newer) devices are currently supported & tested.
  • Add <uses-native-library android:name="libcdsprpc.so" android:required="false" /> to your app manifest so the library can be loaded at runtime.
  • Add the param devices: ['HTP0'] (or 'HTP*' for all HTP sessions) to use HTP devices.
  • Configure n_gpu_layers (> 0) when calling initLlama to offload layers to the NPU. The native result exposes gpu, reasonNoGPU, devices, so you can confirm runtime behaviour.
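The bullets above can be sketched as follows. `describeAcceleration` is a hypothetical helper; the `gpu` and `reasonNoGPU` field names come from the text, and the exact context shape may vary between versions:

```typescript
// Hypothetical helper: summarize the acceleration fields exposed on the
// native init result (gpu, reasonNoGPU).
function describeAcceleration(res: { gpu: boolean; reasonNoGPU?: string }): string {
  return res.gpu
    ? 'GPU/NPU offload active'
    : `CPU only: ${res.reasonNoGPU ?? 'unknown reason'}`
}

// Usage (sketch):
// const context = await initLlama({
//   model: modelPath,
//   n_ctx: 2048,
//   n_gpu_layers: 99,
//   // devices: ['HTP0'], // target the Hexagon NPU instead of OpenCL
// })
// console.log(describeAcceleration(context))
```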

Expo

For use with the Expo framework and CNG builds, you will need expo-build-properties to utilize iOS and OpenCL features. Simply add the following to your app.json/app.config.js file:

module.exports = {
  expo: {
    // ...
    plugins: [
      // ...
      [
        'llama.rn',
        // optional fields, below are the default values
        {
          enableEntitlements: true,
          entitlementsProfile: 'production',
          forceCxx20: true,
          enableOpenCL: true,
        },
      ],
    ],
  },
}

Obtain the model

You can search HuggingFace for available models (Keyword: GGUF).

To get a GGUF model or quantize one manually, see the quantize documentation in llama.cpp.

Usage

💡 You can find complete examples in the example project.

Load model info only:

import { loadLlamaModelInfo } from 'llama.rn'

const modelPath = 'file://<path to gguf model>'
console.log('Model Info:', await loadLlamaModelInfo(modelPath))

Initialize a Llama context & do completion:

import { initLlama } from 'llama.rn'

// Initialize a Llama context with the model (may take a while)
const context = await initLlama({
  model: modelPath,
  use_mlock: true,
  n_ctx: 2048,
  n_gpu_layers: 99, // number of layers to store in GPU memory (Metal/OpenCL)
  // embedding: true, // use embedding
})

const stopWords = ['</s>', '<|end|>', '<|eot_id|>', '<|end_of_text|>', '<|im_end|>', '<|EOT|>', '<|END_OF_TURN_TOKEN|>', '<|end_of_turn|>', '<|endoftext|>']

// Do chat completion
const msgResult = await context.completion(
  {
    messages: [
      {
        role: 'system',
        content: 'This is a conversation between user and assistant, a friendly chatbot.',
      },
      {
        role: 'user',
        content: 'Hello!',
      },
    ],
    n_predict: 100,
    stop: stopWords,
    // ...other params
  },
  (data) => {
    // This is a partial completion callback
    const { token } = data
  },
)
console.log('Result:', msgResult.text)
console.log('Timings:', msgResult.timings)

// Or do text completion
const textResult = await context.completion(
  {
    prompt: 'This is a conversation between user and llama, a friendly chatbot. respond in simple markdown.\n\nUser: Hello!\nLlama:',
    n_predict: 100,
    stop: [...stopWords, 'Llama:', 'User:'],
    // ...other params
  },
  (data) => {
    // This is a partial completion callback
    const { token } = data
  },
)
console.log('Result:', textResult.text)
console.log('Timings:', textResult.timings)
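When you are done with a context (or need to abort a stream), llama.rn exposes cleanup methods; a hedged sketch follows — the method names `stopCompletion`, `release`, and `releaseAllLlama` are as in recent llama.rn versions, so verify them against your installed version:

```typescript
// Sketch: abort an in-flight completion, then free native resources.
// (Method names per recent llama.rn versions; verify against your install.)
async function teardown(context: {
  stopCompletion: () => Promise<void>
  release: () => Promise<void>
}) {
  await context.stopCompletion() // cancel any streaming completion
  await context.release() // free this context's native memory
}

// To release every context at once (e.g. on app shutdown):
// import { releaseAllLlama } from 'llama.rn'
// await releaseAllLlama()
```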

The binding's design is inspired by the server.cpp example in llama.cpp:

  • /completion and /chat/completions: context.completion(params, partialCompletionCallback)
  • /tokenize: context.tokenize(content)
  • /detokenize: context.detokenize(tokens)
  • /embedding: context.embedding(content)
  • /rerank: context.rerank(query, documents, params)
  • ... Other methods

Please visit the Documentation for more details.

You can also visit the example to see how to use it.
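As an illustration of the embedding method listed above, a minimal sketch that compares two texts by cosine similarity. The `cosineSimilarity` helper is hypothetical; it assumes `context.embedding` resolves to an object with a numeric `embedding` array and that the context was initialized with `embedding: true`:

```typescript
// Hypothetical helper: cosine similarity between two embedding vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0
  let na = 0
  let nb = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i] // accumulate dot product
    na += a[i] * a[i] // accumulate squared norms
    nb += b[i] * b[i]
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb))
}

// Usage with a context initialized with `embedding: true` (sketch):
// const { embedding: e1 } = await context.embedding('The cat sat on the mat')
// const { embedding: e2 } = await context.embedding('A feline rested on a rug')
// console.log('similarity:', cosineSimilarity(e1, e2))
```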

Multimodal (Vision & Audio)

llama.rn supports multimodal capabilities including vision (images) and audio processing. This allows you to interact with models that can understand both text and media content.

Supported Media Formats

Images (Vision):

  • JPEG, PNG, BMP, GIF, TGA, HDR, PIC, PNM
  • Base64 encoded images (data URLs)
  • Local file paths
  • HTTP URLs are not supported yet

Audio:

  • WAV, MP3 formats
  • Base64 encoded audio (data URLs)
  • Local file paths
  • HTTP URLs are not supported yet

Setup

First, you need a multimodal model and its corresponding multimodal projector (mmproj) file, see how to obtain mmproj for more details.

Initialize Multimodal Support

import { initLlama } from 'llama.rn'

// First initialize the model context
const context = await initLlama({
  model: 'path/to/your/multimodal-model.gguf',
  n_ctx: 4096,
  n_gpu_layers: 99, // Recommended for multimodal models
  // Important: Disable context shifting for multimodal
  ctx_shift: false,
})

// Initialize multimodal support with mmproj file
const success = await context.initMultimodal({
  path: 'path/to/your/mmproj-model.gguf',
  use_gpu: true, // Recommended for better performance
})

// Check if multimodal is enabled
console.log('Multimodal enabled:', await context.isMultimodalEnabled())

if (success) {
  console.log('Multimodal support initialized!')

  // Check what modalities are supported
  const support = await context.getMultimodalSupport()
  console.log('Vision support:', support.vision)
  console.log('Audio support:', support.audio)
} else {
  console.log('Failed to initialize multimodal support')
}

// Release multimodal context
await context.releaseMultimodal()

Usage Examples

Vision (Image Processing)

const result = await context.completion({
  messages: [
    {
      role: 'user',
      content: [
        {
          type: 'text',
          text: 'What do you see in this image?',
        },
        {
          type: 'image_url',
          image_url: {
            url: 'file:///path/to/image.jpg',
            // or base64: 'data:image/jpeg;base64,/9j/4AAQSkZJRgABAQEAYABgAAD...'
          },
        },
      ],
    },
  ],
  n_predict: 100,
  temperature: 0.1,
})

console.log('AI Response:', result.text)

Audio Processing

// Method 1: Using structured message content (Recommended)
const result = await context.completion({
  messages: [
    {
      role: 'user',
      content: [
        {
          type: 'text',
          text: 'Transcribe or describe this audio:',
        },
        {
          type: 'input_audio',
          input_audio: {
            data: 'data:audio/wav;base64,UklGRiQAAABXQVZFZm10...',
            // or url: 'file:///path/to/audio.wav',
            format: 'wav', // or 'mp3'
          },
        },
      ],
    },
  ],
  n_predict: 200,
})

console.log('Transcription:', result.text)

Tokenization with Media

// Tokenize text with media
const tokenizeResult = await context.tokenize(
  'Describe this image: <__media__>',
  {
    media_paths: ['file:///path/to/image.jpg']
  }
)

console.log('Tokens:', tokenizeResult.tokens)
console.log('Has media:', tokenizeResult.has_media)
console.log('Media positions:', tokenizeResult.chunk_pos_media)

Notes

  • Context Shifting: Multimodal models require ctx_shift: false to maintain media token positioning
  • Memory: Multimodal models require more memory; use adequate n_ctx and consider GPU offloading
  • Media Markers: The system automatically handles <__media__> markers in prompts. When using structured message content, media items are automatically replaced with this marker
  • Model Compatibility: Ensure your model supports the media type you're trying to process