Rllama
Ruby FFI bindings for llama.cpp to run open-source LLMs such as GPT-OSS, Qwen 3, Gemma 3, and Llama 3 locally with Ruby.
Ruby bindings for llama.cpp to run open-source language models locally. Run models like GPT-OSS, Qwen 3, Gemma 3, Llama 3, and many others directly in your Ruby application code.
Installation
Add this line to your application's Gemfile:
gem 'rllama'
And then execute:
bundle install
Or install it yourself as:
gem install rllama
CLI Chat
The rllama command-line utility provides an interactive chat interface for conversing with language models. After installing the gem, you can start chatting immediately:
rllama
When you run rllama without arguments, it will display:
- Downloaded models: Any models you've already downloaded to ~/.rllama/models/
- Popular models: A curated list of popular models available for download, including:
- Gemma 3 1B
- Llama 3.2 3B
- Phi-4
- Qwen3 30B
- GPT-OSS
Simply enter the number of the model you want to use. If you select a model that hasn't been downloaded yet, it will be automatically downloaded from Hugging Face.
You can also specify a model path or URL directly:
rllama path/to/your/model.gguf
rllama https://huggingface.co/microsoft/phi-4-gguf/resolve/main/phi-4-Q3_K_S.gguf
Once the model has loaded, you can start chatting.
Usage
Text Generation
Generate text completions using local language models:
require 'rllama'
# Load a model
model = Rllama.load_model('lmstudio-community/gemma-3-1B-it-QAT-GGUF/gemma-3-1B-it-QAT-Q4_0.gguf')
# Generate text
result = model.generate('What is the capital of France?')
puts result.text
# => "The capital of France is Paris."
# Access generation statistics
puts "Tokens generated: #{result.stats[:tokens_generated]}"
puts "Tokens per second: #{result.stats[:tps]}"
puts "Duration: #{result.stats[:duration]} seconds"
# Don't forget to close the model when done
model.close
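If generation raises an error, an explicit `model.close` at the end of a script is never reached. One way to guard against that is an `ensure` block. The sketch below is only an illustration: `with_closing` and `StubModel` are hypothetical names that are not part of Rllama, and the stub stands in for a loaded model so the example runs without downloading anything.

```ruby
# Hypothetical helper (not part of Rllama): yields the resource and
# always calls #close afterwards, even if the block raises.
def with_closing(resource)
  yield resource
ensure
  resource.close
end

# Stand-in for a loaded model so the sketch runs without a download.
# With the real gem you would pass the object returned by
# Rllama.load_model(...) instead.
class StubModel
  attr_reader :closed

  def initialize
    @closed = false
  end

  def generate(prompt)
    "echo: #{prompt}"
  end

  def close
    @closed = true
  end
end

stub = StubModel.new
text = with_closing(stub) { |m| m.generate('Hello') }
puts text         # prints "echo: Hello"
puts stub.closed  # prints "true"
```

The helper returns the value of the block, so the generated result is still available after the model has been closed.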
Generation parameters
Adjust the generation with parameters:
result = model.generate(
'Write a short poem about Ruby programming',
max_tokens: 2024,
temperature: 0.8,
top_k: 40,
top_p: 0.95,
min_p: 0.05
)
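To build intuition for what these parameters do, here is a conceptual sketch of temperature scaling and the top-k, top-p, and min-p filters as they are commonly defined. It is not Rllama's or llama.cpp's implementation: the order of the steps and details such as renormalization may differ, and the vocabulary and scores are made up.

```ruby
# Conceptual sketch only, not Rllama's implementation.

# Convert raw scores (logits) to probabilities. Lower temperatures
# sharpen the distribution, higher temperatures flatten it.
def softmax(logits, temperature: 1.0)
  scaled = logits.transform_values { |v| v / temperature }
  max = scaled.values.max
  exps = scaled.transform_values { |v| Math.exp(v - max) }
  total = exps.values.sum
  exps.transform_values { |v| v / total }
end

# top_k: keep only the k most probable tokens.
def top_k(probs, k)
  probs.sort_by { |_, pr| -pr }.first(k).to_h
end

# top_p: keep the smallest set of top tokens whose cumulative
# probability reaches the threshold.
def top_p(probs, threshold)
  kept = {}
  cumulative = 0.0
  probs.sort_by { |_, pr| -pr }.each do |token, pr|
    kept[token] = pr
    cumulative += pr
    break if cumulative >= threshold
  end
  kept
end

# min_p: drop tokens less probable than ratio * (top probability).
def min_p(probs, ratio)
  cutoff = probs.values.max * ratio
  probs.select { |_, pr| pr >= cutoff }
end

# Hypothetical toy vocabulary with made-up logits.
logits = { 'Ruby' => 3.0, 'Python' => 2.0, 'Perl' => 0.5, 'COBOL' => -2.0 }

probs = softmax(logits, temperature: 0.8)
candidates = min_p(top_p(top_k(probs, 3), 0.95), 0.05)
puts candidates.keys.inspect
```

In this toy example the filters leave only the two most likely tokens as candidates, and the next token would be sampled from those.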
Streaming generation
Stream generated text token-by-token:
model.generate('Explain quantum computing') do |token|
print token
end
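If you want the complete text as well as live output, one option is to append each token to a buffer inside the block. In the sketch below, `fake_generate` is a hypothetical stand-in that yields tokens to a block in the same way the streaming call above does, so the example runs without a model.

```ruby
# Hypothetical stand-in for the block form of model.generate, so this
# sketch runs without a model. It yields one token at a time.
def fake_generate(_prompt)
  %w[Quantum computers use qubits].each_with_index do |word, i|
    yield(i.zero? ? word : " #{word}")
  end
end

buffer = +''  # unfrozen string that accumulates the streamed tokens
fake_generate('Explain quantum computing') do |token|
  print token      # show output as it arrives
  $stdout.flush    # push partial output to the terminal right away
  buffer << token  # keep a copy for later use
end
puts

puts "Received #{buffer.length} characters"
```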
System prompt
Include a system prompt to guide model behavior:
result = model.generate(
'What are best practices for Ruby development?',
system: 'You are an expert Ruby developer with 10 years of experience.'
)
Messages list
Pass multiple messages with roles for more complex interactions:
result = model.generate([
{ role: 'system', content: 'You are a helpful assistant.' },
{ role: 'user', content: 'What is the capital of France?' },
{ role: 'assistant', content: 'The capital of France is Paris.' },
{ role: 'user', content: 'What is its population?' }
])
puts result.text
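When the conversation is stored elsewhere (for example in a database), the messages array can be assembled programmatically. The helper below is a minimal sketch: `build_messages` is a hypothetical name, not part of Rllama, and it assumes the turns strictly alternate between user and assistant, starting with the user.

```ruby
# Hypothetical helper (not part of Rllama): builds a messages array in
# the role/content shape shown above from a system prompt and a list of
# alternating user/assistant turns.
def build_messages(system:, turns:)
  messages = [{ role: 'system', content: system }]
  turns.each_with_index do |content, index|
    role = index.even? ? 'user' : 'assistant'
    messages << { role: role, content: content }
  end
  messages
end

messages = build_messages(
  system: 'You are a helpful assistant.',
  turns: [
    'What is the capital of France?',
    'The capital of France is Paris.',
    'What is its population?'
  ]
)

messages.each { |m| puts "#{m[:role]}: #{m[:content]}" }
```

The resulting array has the same shape as the one passed to `model.generate` above.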
Chat
For ongoing conversations, use a context object that maintains the conversation history:
# Initialize a chat context
context = model.init_context
# Send messages and maintain conversation history
response1 = context.message('What is the capital of France?')
puts response1.text
# => "The capital of France is Paris."
response2 = context.message('What is the population of that city?')
puts response2.text
# => "Paris has a population of approximately 2.1 million people..."
response3 = context.message('What was my first message?')
puts response3.text
# => "Your first message was asking about the capital of France."
# The context remembers all previous messages in the conversation
# Close context when done
context.close
Embeddings
Generate vector embeddings for text using embedding models:
require 'rllama'
# Load an embedding model
model = Rllama.load_model('lmstudio-community/embeddinggemma-300m-qat-GGUF/embeddinggemma-300m-qat-Q4_0.gguf')
# Generate embedding for a single text
embedding = model.embed('Hello, world!')
puts embedding.length
# => 768 (depending on your model)
# Generate embeddings for multiple sentences
embeddings = model.embed([
'roses are red',
'violets are blue',
'sugar is sweet'
])
puts embeddings.length
# => 3
puts embeddings[0].length
# => 768
model.close
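A common use for embeddings is comparing texts by cosine similarity. The sketch below uses made-up three-dimensional vectors in place of real `model.embed` output so that it runs without a model. The helper names `dot` and `cosine_similarity` are hypothetical and not part of Rllama.

```ruby
# Dot product of two equal-length vectors.
def dot(a, b)
  a.zip(b).sum { |x, y| x * y }
end

# Cosine similarity: dot product divided by the product of the
# two vector lengths. Values near 1 mean similar direction.
def cosine_similarity(a, b)
  dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)))
end

# Made-up vectors standing in for the arrays returned by model.embed.
roses   = [0.9, 0.1, 0.2]
violets = [0.8, 0.2, 0.3]
sugar   = [0.1, 0.9, 0.4]

puts cosine_similarity(roses, violets).round(3)
puts cosine_similarity(roses, sugar).round(3)
```

For vectors that are already normalized to unit length, the dot product alone gives the same ranking.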
Vector parameters
By default, embedding vectors are normalized. You can disable normalization with normalize: false:
# Generate unnormalized embeddings
embedding = model.embed('Sample text', normalize: false)
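If you request unnormalized vectors and later need unit-length ones, you can normalize them yourself. The sketch below assumes that "normalized" means scaled to unit Euclidean (L2) length, which is the usual convention but is not stated in this README. `l2_normalize` is a hypothetical helper, and the input vector is made up.

```ruby
# Scale a vector to unit Euclidean (L2) length.
def l2_normalize(vector)
  norm = Math.sqrt(vector.sum { |x| x * x })
  return vector.dup if norm.zero?  # avoid dividing by zero

  vector.map { |x| x / norm }
end

raw = [3.0, 4.0]          # made-up unnormalized vector with length 5
unit = l2_normalize(raw)
puts unit.inspect         # => [0.6, 0.8]
```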
Finding Models
You can download GGUF format models from various sources:
- Hugging Face - Search for models with "GGUF" format
License
MIT
Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/docusealco/rllama.
