# Llamaj.cpp

A port of https://github.com/ggml-org/llama.cpp on the JVM using jextract.
Llamaj.cpp is a Java and JVM port of llama.cpp using jextract, enabling local large language model (LLM) inference through native foreign function & memory API interop. Natively supports macOS M-series and Linux x86_64 with GPU acceleration. Platform and hardware support (Windows, ARM, CUDA, etc.) can be extended through custom builds.
## Keywords
llama.cpp · java · jvm · llm · large language models · inference · ai · native interop · foreign function & memory api · jextract
## Requirements

- Java 25
- Maven (`mvn`)
- macOS M-series / Linux x86_64 (CPU) — see the last section if your platform is not listed
## How to use

Include the dependency in your `pom.xml`:

```xml
<dependencies>
  ...
  <dependency>
    <groupId>io.gravitee.llama.cpp</groupId>
    <artifactId>llamaj.cpp</artifactId>
    <version>x.x.x</version>
  </dependency>
</dependencies>
```
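Since the library requires Java 25 (see Requirements), your own build must compile for it as well. One way to do that, shown here as an illustrative fragment rather than a required configuration, is the standard Maven compiler release property:

```xml
<properties>
  <!-- compile against the Java 25 APIs the library requires -->
  <maven.compiler.release>25</maven.compiler.release>
</properties>
```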
Note: All examples below use `LlamaVocab` to handle tokenization. It is obtained from a loaded `LlamaModel` and is essential for converting between tokens and text representations.
### Example 1: Basic Conversation
```java
import io.gravitee.llama.cpp.*;

import java.lang.foreign.Arena;
import java.nio.file.Path;

public class BasicExample {
    public static void main(String[] args) {
        var arena = Arena.ofConfined();

        // Initialize runtime
        LlamaRuntime.llama_backend_init();

        // Load model
        var modelParams = new LlamaModelParams(arena);
        var model = new LlamaModel(arena, Path.of("models/model.gguf"), modelParams);

        // Create context
        var contextParams = new LlamaContextParams(arena).nCtx(2048).nBatch(512);
        var context = new LlamaContext(model, contextParams);

        // Set up tokenizer and sampler
        var vocab = new LlamaVocab(model);
        var tokenizer = new LlamaTokenizer(vocab, context);
        var sampler = new LlamaSampler(arena)
            .temperature(0.7f)
            .topK(40)
            .topP(0.9f, 1)
            .seed(42);

        // Create conversation state
        var state = ConversationState.create(arena, context, tokenizer, sampler, 0)
            .setMaxTokens(100)
            .initialize("What is the capital of France?");

        // Generate response
        var iterator = new DefaultLlamaIterator(state);
        while (iterator.hasNext()) {
            var output = iterator.next();
            System.out.print(output.text());
        }

        // Cleanup
        context.free();
        sampler.free();
        model.free();
        LlamaRuntime.llama_backend_free();
    }
}
```
### Example 2: Log Probabilities

Enable log-probability collection to inspect the model's confidence at each token position.
Set `topLogprobs` to the number of top alternative tokens you want alongside the sampled one (0 = disabled, no overhead):
```java
import io.gravitee.llama.cpp.*;

import java.lang.foreign.Arena;
import java.nio.file.Path;

public class LogprobsExample {
    public static void main(String[] args) {
        var arena = Arena.ofConfined();
        LlamaRuntime.llama_backend_init();

        var model = new LlamaModel(arena, Path.of("models/model.gguf"), new LlamaModelParams(arena));
        var contextParams = new LlamaContextParams(arena).nCtx(2048).nBatch(512);
        var context = new LlamaContext(arena, model, contextParams);

        var vocab = new LlamaVocab(model);
        var tokenizer = new LlamaTokenizer(vocab, context);
        var sampler = new LlamaSampler(arena).temperature(0.7f).seed(42);

        var state = ConversationState.create(arena, context, tokenizer, sampler)
            .setMaxTokens(50)
            .setTopLogprobs(5) // return top-5 alternatives at every token position
            .initialize("What is the capital of France?");

        var iterator = new DefaultLlamaIterator(state);
        while (iterator.hasNext()) {
            var output = iterator.next();
            System.out.print(output.text());

            Logprobs lp = output.logprobs();
            System.out.printf("%n  chosen: \"%s\" logprob=%.4f%n",
                lp.chosenToken().token(), lp.chosenToken().logprob());
            lp.topLogprobs().forEach(t ->
                System.out.printf("  alt: \"%s\" logprob=%.4f%n", t.token(), t.logprob()));
        }

        context.free();
        sampler.free();
        model.free();
        LlamaRuntime.llama_backend_free();
    }
}
```
Each `LlamaOutput` carries a `Logprobs` object with:

- `chosenToken()`: the token that was sampled, with its text, vocabulary ID, log-probability, and raw UTF-8 bytes
- `topLogprobs()`: up to N alternatives sorted by descending log-probability; the chosen token is always included

When `topLogprobs` is 0 (the default), `output.logprobs()` is `null` and no logit processing is done.
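The log-probabilities are natural logarithms, so a token's probability is recovered with `exp`. A stdlib-only sketch (the token strings and values below are made up for illustration, not produced by the library) of turning `topLogprobs()`-style entries back into probabilities:

```java
import java.util.List;
import java.util.Map;

public class LogprobToProb {
    // Convert a natural-log probability back to a plain probability in [0, 1]
    static double toProb(double logprob) {
        return Math.exp(logprob);
    }

    public static void main(String[] args) {
        // Hypothetical top alternatives at one token position
        List<Map.Entry<String, Double>> top = List.of(
            Map.entry("Paris", -0.105),
            Map.entry("Lyon", -2.995),
            Map.entry("Nice", -3.912));

        for (var e : top) {
            System.out.printf("%-6s logprob=%.3f prob=%.3f%n",
                e.getKey(), e.getValue(), toProb(e.getValue()));
        }
    }
}
```

A logprob near 0 means near-certainty (exp(0) = 1), while large negative values decay exponentially toward 0.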
### Example 3: Parallel Conversations

Process multiple conversations simultaneously in a single batch:
```java
import io.gravitee.llama.cpp.*;

import java.lang.foreign.Arena;
import java.nio.file.Path;

public class ParallelExample {
    public static void main(String[] args) {
        var arena = Arena.ofConfined();

        // Initialize runtime
        LlamaRuntime.llama_backend_init();

        // Load model
        var modelParams = new LlamaModelParams(arena);
        var model = new LlamaModel(arena, Path.of("models/model.gguf"), modelParams);

        // Create context with multi-sequence support
        var contextParams = new LlamaContextParams(arena)
            .nCtx(2048)
            .nBatch(512)
            .nSeqMax(4); // Support up to 4 parallel conversations
        var context = new LlamaContext(model, contextParams);

        // Set up shared tokenizer and sampler
        var vocab = new LlamaVocab(model);
        var tokenizer = new LlamaTokenizer(vocab, context);
        var sampler = new LlamaSampler(arena).temperature(0.7f).seed(42);

        // Create multiple conversation states with unique sequence IDs
        var state1 = ConversationState.create(arena, context, tokenizer, sampler, 0)
            .setMaxTokens(100).initialize("What is the capital of France?");
        var state2 = ConversationState.create(arena, context, tokenizer, sampler, 1)
            .setMaxTokens(100).initialize("What is the capital of England?");
        var state3 = ConversationState.create(arena, context, tokenizer, sampler, 2)
            .setMaxTokens(100).initialize("What is the capital of Poland?");

        // Create parallel iterator - prompts are auto-processed when states are added
        var parallel = new BatchIterator(arena, context, 512, 4)
            .addState(state1)
            .addState(state2)
            .addState(state3);

        // Generate tokens in parallel
        System.out.println("=== Parallel Generation ===");
        while (parallel.hasNext()) {
            // Each hasNext() generates tokens for all active conversations.
            // Get all outputs from this batch (one per active conversation).
            var outputs = parallel.getOutputs();
            for (var output : outputs) {
                System.out.println("Seq " + output.sequenceId() + ": " + output.text());
            }
        }
        System.out.println();

        // Print results
        System.out.println("Conversation 1: " + state1.getAnswer());
        System.out.println("  Tokens: " + state1.getAnswerTokens());
        System.out.println("Conversation 2: " + state2.getAnswer());
        System.out.println("  Tokens: " + state2.getAnswerTokens());
        System.out.println("Conversation 3: " + state3.getAnswer());
        System.out.println("  Tokens: " + state3.getAnswerTokens());

        // Cleanup
        parallel.free();
        context.free();
        sampler.free();
        model.free();
        LlamaRuntime.llama_backend_free();
    }
}
```
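To build intuition for the interleaving above, here is a toy, stdlib-only sketch of round-robin batch scheduling. It performs no real inference and is not the library's `BatchIterator` implementation, just an illustration of the idea: each pass advances every active sequence by one step and drops sequences whose token budget is exhausted.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

public class RoundRobinSketch {
    // Toy stand-in for a conversation state: a sequence id plus a remaining-token budget
    record Seq(int id, int budget) {}

    // Returns the order in which sequences are advanced, one entry per "generated" token
    static List<Integer> schedule(List<Seq> seqs) {
        Queue<int[]> active = new ArrayDeque<>();          // each entry: {id, remaining}
        for (Seq s : seqs) active.add(new int[]{s.id(), s.budget()});
        List<Integer> order = new ArrayList<>();
        while (!active.isEmpty()) {
            int n = active.size();                         // one "batch" per pass
            for (int i = 0; i < n; i++) {
                int[] s = active.poll();
                order.add(s[0]);                           // advance this sequence by one token
                if (--s[1] > 0) active.add(s);             // drop exhausted sequences
            }
        }
        return order;
    }

    public static void main(String[] args) {
        // Three sequences with budgets 3, 2, and 1 token
        System.out.println(schedule(List.of(new Seq(0, 3), new Seq(1, 2), new Seq(2, 1))));
    }
}
```

Short sequences finish early and leave the batch, so later passes contain fewer sequences; in the example the schedule is `[0, 1, 2, 0, 1, 0]`.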
### Example 4: Distributed Inference with RPC

Offload model weights and KV cache to remote machines using the RPC backend.
When using `--rpc`, weights are loaded exclusively on the remote servers: the local GPU is not used.
Start RPC server nodes first (see `containers/README.md`):

```shell
# On the remote machine (or another terminal)
./scripts/start-rpc-server.sh
```
Then connect from Java:
```java
import io.gravitee.llama.cpp.*;
import io.gravitee.llama.cpp.nativelib.LlamaLibLoader;

import java.lang.foreign.Arena;
import java.nio.file.Path;

public class RpcExample {
    public static void main(String[] args) {
        var arena = Arena.ofConfined();

        // Initialize runtime
        String libPath = LlamaLibLoader.load();
        LlamaRuntime.llama_backend_init();

        // Register remote RPC servers: returns their device handles
        var rpcDevices = BackendRegistry.addRpcServer(arena, "127.0.0.1:50052");

        // Print all discovered backends and devices
        BackendRegistry.printSummary();

        // Load model, restricting offloading to only the RPC devices
        var modelParams = new LlamaModelParams(arena)
            .devices(arena, rpcDevices)
            .nGpuLayers(999);
        var model = new LlamaModel(arena, Path.of("models/model.gguf"), modelParams);

        // Everything else works exactly the same as local inference:
        // create the context, tokenizer, sampler, and conversation state as in Example 1.
        model.free();
        LlamaRuntime.llama_backend_free();
    }
}
```