# oMLX
LLM inference server with continuous batching & SSD caching for Apple Silicon — managed from the macOS menu bar
<p align="center"> <img src="docs/images/omlx_dashboard.png" alt="oMLX Admin Dashboard" width="800"> </p>
Every LLM server I tried made me choose between convenience and control. I wanted to pin everyday models in memory, auto-swap heavier ones on demand, set context limits - and manage it all from a menu bar.
oMLX persists KV cache across a hot in-memory tier and cold SSD tier - even when context changes mid-conversation, all past context stays cached and reusable across requests, making local LLMs practical for real coding work with tools like Claude Code. That's why I built it.
## Install

### macOS App
Download the .dmg from Releases, drag to Applications, done. The app includes in-app auto-update, so future upgrades are just one click.
### Homebrew

```bash
brew tap jundot/omlx https://github.com/jundot/omlx
brew install omlx

# Upgrade to the latest version
brew update && brew upgrade omlx

# Run as a background service (auto-restarts on crash)
brew services start omlx

# Optional: MCP (Model Context Protocol) support
/opt/homebrew/opt/omlx/libexec/bin/pip install mcp
```
### From Source

```bash
git clone https://github.com/jundot/omlx.git
cd omlx
pip install -e .          # Core only
pip install -e ".[mcp]"   # With MCP (Model Context Protocol) support
```
Requires macOS 15.0+ (Sequoia), Python 3.10+, and Apple Silicon (M1/M2/M3/M4).
## Quickstart

### macOS App
Launch oMLX from your Applications folder. The Welcome screen guides you through three steps - model directory, server start, and first model download. That's it. To connect OpenClaw, OpenCode, or Codex, see Integrations.
<p align="center"> <img src="docs/images/Screenshot 2026-02-10 at 00.36.32.png" alt="oMLX Welcome Screen" width="360"> <img src="docs/images/Screenshot 2026-02-10 at 00.34.30.png" alt="oMLX Menubar" width="240"> </p>

### CLI
```bash
omlx serve --model-dir ~/models
```
The server discovers LLMs, VLMs, embedding models, and rerankers from subdirectories automatically. Any OpenAI-compatible client can connect to `http://localhost:8000/v1`. A built-in chat UI is also available at `http://localhost:8000/admin/chat`.
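Since any OpenAI-compatible client works, a minimal stdlib-only sketch looks like this (`my-model` is a placeholder; use any name returned by `GET /v1/models`):

```python
import json
from urllib import request

# Build an OpenAI-style chat completion request body.
# "my-model" is a placeholder model name, not something oMLX ships.
def build_chat_request(prompt, model="my-model"):
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def chat(prompt, model="my-model", base_url="http://localhost:8000/v1"):
    req = request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_chat_request(prompt, model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Requires a running server: omlx serve --model-dir ~/models
# print(chat("Hello!"))
```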
### Homebrew Service
If you installed via Homebrew, you can run oMLX as a managed background service:
```bash
brew services start omlx     # Start (auto-restarts on crash)
brew services stop omlx      # Stop
brew services restart omlx   # Restart
brew services info omlx      # Check status
```
The service runs `omlx serve` with zero-config defaults (`~/.omlx/models`, port 8000). To customize, either set environment variables (`OMLX_MODEL_DIR`, `OMLX_PORT`, etc.) or run `omlx serve --model-dir /your/path` once to persist settings to `~/.omlx/settings.json`.
Logs are written to two locations:

- Service log: `$(brew --prefix)/var/log/omlx.log` (stdout/stderr)
- Server log: `~/.omlx/logs/server.log` (structured application log)
## Features
Supports text LLMs, vision-language models (VLM), OCR models, embeddings, and rerankers on Apple Silicon.
### Admin Dashboard

Web UI at `/admin` for real-time monitoring, model management, chat, benchmarking, and per-model settings. Supports English, Korean, Japanese, and Chinese. All CDN dependencies are vendored for fully offline operation.
### Vision-Language Models
Run VLMs with the same continuous batching and tiered KV cache stack as text LLMs. Supports multi-image chat, base64/URL/file image inputs, and tool calling with vision context. OCR models (DeepSeek-OCR, DOTS-OCR, GLM-OCR) are auto-detected with optimized prompts.
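For the base64 input path, an image is embedded as a data URL inside an OpenAI-style content part. A minimal sketch (the helper names are ours, not oMLX API):

```python
import base64

# Encode a local image file as an OpenAI-style image_url content part
# using a base64 data URL, one of the documented input forms.
def image_part(path, mime="image/png"):
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}}

# Combine text and one or more images into a single user message.
def vision_message(text, *image_paths):
    content = [{"type": "text", "text": text}]
    content += [image_part(p) for p in image_paths]
    return {"role": "user", "content": content}
```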
### Tiered KV Cache (Hot + Cold)
Block-based KV cache management inspired by vLLM, with prefix sharing and Copy-on-Write. The cache operates across two tiers:
- Hot tier (RAM): Frequently accessed blocks stay in memory for fast access.
- Cold tier (SSD): When the hot cache fills up, blocks are offloaded to SSD in safetensors format. On the next request with a matching prefix, they're restored from disk instead of recomputed from scratch - even after a server restart.
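The two-tier idea can be sketched in a few lines. This is a toy illustration of the hot/cold mechanism, not oMLX's internals: oMLX stores cold blocks as safetensors and tracks prefixes per block, while here `pickle` and plain string keys stand in to keep the example dependency-free.

```python
import os, pickle, tempfile
from collections import OrderedDict

# Toy two-tier cache: an LRU-bounded hot tier in RAM, with evicted
# blocks offloaded to disk and restored on the next matching lookup.
class TieredCache:
    def __init__(self, hot_capacity, cold_dir=None):
        self.hot = OrderedDict()          # block key -> block data (LRU order)
        self.hot_capacity = hot_capacity
        self.cold_dir = cold_dir or tempfile.mkdtemp()

    def _cold_path(self, key):
        return os.path.join(self.cold_dir, f"{key}.bin")

    def put(self, key, block):
        self.hot[key] = block
        self.hot.move_to_end(key)
        while len(self.hot) > self.hot_capacity:
            old_key, old_block = self.hot.popitem(last=False)
            with open(self._cold_path(old_key), "wb") as f:
                f.write(pickle.dumps(old_block))    # offload to SSD

    def get(self, key):
        if key in self.hot:
            self.hot.move_to_end(key)
            return self.hot[key]
        path = self._cold_path(key)
        if os.path.exists(path):                    # restore instead of recompute
            with open(path, "rb") as f:
                block = pickle.loads(f.read())
            self.put(key, block)
            return block
        return None                                 # true miss: must recompute
```

Because the cold tier lives on disk, a restored block survives even a fresh process, which is what makes the cache reusable across server restarts.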
### Continuous Batching

Handles concurrent requests through mlx-lm's `BatchGenerator`. Prefill and completion batch sizes are configurable.
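The scheduling idea behind continuous batching, reduced to a toy loop: new requests join the running batch as soon as a slot frees up, instead of waiting for the whole batch to drain. (A sketch of the concept, not `BatchGenerator` itself.)

```python
from collections import deque

# Simulate continuous batching: each request is (id, tokens_to_generate),
# max_batch is the number of concurrent decode slots.
def continuous_batch(requests, max_batch):
    waiting = deque(requests)
    running = {}              # request id -> tokens remaining
    finished_order = []
    while waiting or running:
        # Admit waiting requests into any free slots (the "continuous" part).
        while waiting and len(running) < max_batch:
            rid, n = waiting.popleft()
            running[rid] = n
        # One decode step for every running request.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]
                finished_order.append(rid)
    return finished_order
```

With `max_batch=2` and requests `a` (1 token), `b` (3), `c` (1), request `c` starts the moment `a` finishes, while `b` is still generating.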
### Claude Code Optimization

Context scaling lets smaller-context models work with Claude Code: reported token counts are scaled so that auto-compact triggers at the right time, and SSE keep-alives prevent read timeouts during long prefill.
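One plausible form of the scaling (an assumed formula for illustration; the README does not spell out oMLX's exact arithmetic): report usage multiplied by the ratio of the client's assumed window to the model's real one, so the client's compact threshold maps onto the model's capacity.

```python
# Assumed formula, not oMLX's exact code: scale reported usage so that
# a client assuming a large context window (e.g. Claude Code) triggers
# auto-compact when the local model's smaller window is nearly full.
def scaled_usage(actual_tokens, model_ctx, client_ctx):
    return int(actual_tokens * client_ctx / model_ctx)

# Model with a 32k window, client assuming 200k: at 16k real tokens
# (half the model's window) we report 100k (half the client's window).
```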
### Multi-Model Serving
Load LLMs, VLMs, embedding models, and rerankers within the same server. Models are managed through a combination of automatic and manual controls:
- LRU eviction: Least-recently-used models are evicted automatically when memory runs low.
- Manual load/unload: Interactive status badges in the admin panel let you load or unload models on demand.
- Model pinning: Pin frequently used models to keep them always loaded.
- Per-model TTL: Set an idle timeout per model to auto-unload after a period of inactivity.
- Process memory enforcement: Total memory limit (default: system RAM - 8GB) prevents system-wide OOM.
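The interaction of LRU eviction, pinning, and TTL can be sketched as follows. This is a toy model manager with arbitrary memory units, not oMLX's implementation:

```python
import time
from collections import OrderedDict

# Toy manager: LRU eviction that never evicts pinned models, plus a
# per-model idle TTL reaper. Sizes are arbitrary units for illustration.
class ModelManager:
    def __init__(self, mem_limit):
        self.mem_limit = mem_limit
        self.loaded = OrderedDict()   # name -> (size, pinned, ttl, last_used)

    def used(self):
        return sum(size for size, *_ in self.loaded.values())

    def load(self, name, size, pinned=False, ttl=None):
        # Evict least-recently-used *unpinned* models until the new one fits.
        for victim in list(self.loaded):
            if self.used() + size <= self.mem_limit:
                break
            if not self.loaded[victim][1]:        # skip pinned models
                del self.loaded[victim]
        self.loaded[name] = (size, pinned, ttl, time.monotonic())

    def touch(self, name):
        size, pinned, ttl, _ = self.loaded[name]
        self.loaded[name] = (size, pinned, ttl, time.monotonic())
        self.loaded.move_to_end(name)

    def reap_idle(self, now=None):
        now = time.monotonic() if now is None else now
        for name in list(self.loaded):
            size, pinned, ttl, last = self.loaded[name]
            if ttl is not None and not pinned and now - last > ttl:
                del self.loaded[name]             # idle timeout: auto-unload
```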
### Per-Model Settings
Configure sampling parameters, chat template kwargs, TTL, model alias, model type override, and more per model directly from the admin panel. Changes apply immediately without server restart.
- Model alias: set a custom API-visible name. `/v1/models` returns the alias, and requests accept both the alias and the directory name.
- Model type override: manually set a model as LLM or VLM regardless of auto-detection.
### Built-in Chat
Chat directly with any loaded model from the admin panel. Supports conversation history, model switching, dark mode, reasoning model output, and image upload for VLM/OCR models.
<p align="center"> <img src="docs/images/ScreenShot_2026-03-14_104350_610.png" alt="oMLX Chat" width="720"> </p>

### Model Downloader
Search and download MLX models from HuggingFace directly in the admin dashboard. Browse model cards, check file sizes, and download with one click.
<p align="center"> <img src="docs/images/downloader_omlx.png" alt="oMLX Model Downloader" width="720"> </p>

### Integrations
Set up OpenClaw, OpenCode, and Codex directly from the admin dashboard with a single click. No manual config editing required.
<p align="center"> <img src="docs/images/omlx_integrations.png" alt="oMLX Integrations" width="720"> </p>

### Performance Benchmark
One-click benchmarking from the admin panel. Measures prefill (PP) and text generation (TG) tokens per second, with partial prefix cache hit testing for realistic performance numbers.
<p align="center"> <img src="docs/images/benchmark_omlx.png" alt="oMLX Benchmark Tool" width="720"> </p>

### macOS Menubar App
Native PyObjC menubar app (not Electron). Start, stop, and monitor the server without opening a terminal. Includes persistent serving stats (survives restarts), auto-restart on crash, and in-app auto-update.
<p align="center"> <img src="docs/images/Screenshot 2026-02-10 at 00.51.54.png" alt="oMLX Menubar Stats" width="400"> </p>

### API Compatibility
Drop-in replacement for OpenAI and Anthropic APIs. Supports streaming usage stats (`stream_options.include_usage`), Anthropic adaptive thinking, and vision inputs (base64, URL).
| Endpoint | Description |
|----------|-------------|
| `POST /v1/chat/completions` | Chat completions (streaming) |
| `POST /v1/completions` | Text completions (streaming) |
| `POST /v1/messages` | Anthropic Messages API |
| `POST /v1/embeddings` | Text embeddings |
| `POST /v1/rerank` | Document reranking |
| `GET /v1/models` | List available models |
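For streaming with usage stats, the request body follows the OpenAI convention: with `stream_options.include_usage` set, the final SSE chunk carries a `usage` object. A minimal request builder:

```python
# Build a streaming chat request that asks for usage stats in the
# final SSE chunk, per the OpenAI stream_options convention.
def build_streaming_request(model, messages):
    return {
        "model": model,
        "messages": messages,
        "stream": True,
        "stream_options": {"include_usage": True},
    }
```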
### Tool Calling & Structured Output

Supports all function calling formats available in mlx-lm, JSON schema validation, and MCP tool integration. Tool calling requires the model's chat template to support the `tools` parameter. The following model families are auto-detected via mlx-lm's built-in tool parsers:
| Model Family | Format |
|--------------|--------|
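Whichever parser handles the model's output, requests carry tool definitions in the OpenAI function-calling shape. A minimal sketch (`get_weather` is a made-up example function, not part of oMLX):

```python
# An OpenAI-format tool definition: a JSON-schema description of a
# function the model may call. get_weather is purely illustrative.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

# Attach the tool list to an ordinary chat completion request.
def build_tool_request(model, prompt):
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [weather_tool],
    }
```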
