
<p align="center"> <picture> <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/jjang-ai/vmlx/main/assets/logo-wide-dark.png"> <source media="(prefers-color-scheme: light)" srcset="https://raw.githubusercontent.com/jjang-ai/vmlx/main/assets/logo-wide-light.png"> <img alt="vMLX" src="https://raw.githubusercontent.com/jjang-ai/vmlx/main/assets/logo-wide-light.png" width="400"> </picture> </p> <h3 align="center">Local AI Engine for Apple Silicon</h3> <p align="center"> Run LLMs, VLMs, and image generation models entirely on your Mac.<br> OpenAI + Anthropic + Ollama compatible API. No cloud. No API keys. No data leaving your machine. </p> <p align="center"> <a href="https://pypi.org/project/vmlx/"><img src="https://img.shields.io/pypi/v/vmlx?color=%234B8BBE&label=PyPI&logo=python&logoColor=white" alt="PyPI" /></a> <a href="https://github.com/jjang-ai/vmlx/blob/main/LICENSE"><img src="https://img.shields.io/badge/License-Apache_2.0-green?logo=apache" alt="License" /></a> <a href="https://github.com/jjang-ai/vmlx"><img src="https://img.shields.io/github/stars/jjang-ai/vmlx?style=social" alt="Stars" /></a> <img src="https://img.shields.io/badge/Apple_Silicon-M1%2FM2%2FM3%2FM4-black?logo=apple" alt="Apple Silicon" /> <img src="https://img.shields.io/badge/Python-3.10+-3776AB?logo=python&logoColor=white" alt="Python" /> <img src="https://img.shields.io/badge/Electron-28-47848F?logo=electron&logoColor=white" alt="Electron" /> <a href="https://ko-fi.com/jangml"><img src="https://img.shields.io/badge/Support-Ko--fi-FF5E5B?logo=ko-fi&logoColor=white" alt="Ko-fi" /></a> </p> <p align="center"> <a href="#quickstart">Quickstart</a> &bull; <a href="#model-support">Models</a> &bull; <a href="#features">Features</a> &bull; <a href="#image-generation--editing">Image Gen</a> &bull; <a href="#api-reference">API</a> &bull; <a href="#desktop-app">Desktop App</a> &bull; <a href="#advanced-quantization">JANG</a> &bull; <a href="#cli-commands">CLI</a> 
&bull; <a href="#configuration">Config</a> &bull; <a href="#contributing">Contributing</a> &bull; <a href="#한국어-korean">한국어</a> </p>

JANG 2-bit destroys MLX 4-bit on MiniMax M2.5:

| Quantization | MMLU (200q) | Size |
|---|---|---|
| JANG_2L (2-bit) | 74% | 89 GB |
| MLX 4-bit | 26.5% | 120 GB |
| MLX 3-bit | 24.5% | 93 GB |
| MLX 2-bit | 25% | 68 GB |

Adaptive mixed-precision keeps critical layers at higher precision. Scores at jangq.ai. Models at JANGQ-AI.

<table align="center"> <tr> <td align="center"><img src="https://raw.githubusercontent.com/jjang-ai/vmlx/main/assets/chat-tab.png" width="500" alt="Chat interface" /></td> <td align="center"><img src="https://raw.githubusercontent.com/jjang-ai/vmlx/main/assets/agentic-chat.png" width="500" alt="Agentic coding chat" /></td> </tr> <tr> <td align="center"><em>Chat with any MLX model -- thinking mode, streaming, and syntax highlighting</em></td> <td align="center"><em>Agentic chat with full coding capabilities -- tool use and structured output</em></td> </tr> </table>

Quickstart

Install from PyPI

Published on PyPI as vmlx -- install and run in one command:

# Recommended: uv (fast, no venv hassle)
brew install uv
uv tool install vmlx
vmlx serve mlx-community/Qwen3-8B-4bit

# Or: pipx (isolates from system Python)
brew install pipx
pipx install vmlx
vmlx serve mlx-community/Qwen3-8B-4bit

# Or: pip in a virtual environment
python3 -m venv ~/.vmlx-env && source ~/.vmlx-env/bin/activate
pip install vmlx
vmlx serve mlx-community/Qwen3-8B-4bit

Note: With Homebrew or system Python on recent macOS (PEP 668), a bare `pip install` fails with an "externally-managed-environment" error. Use uv, pipx, or a venv.

Your local AI server is now running at http://localhost:8000 (bound to 0.0.0.0) with an OpenAI + Anthropic compatible API. Works with any model from mlx-community -- thousands of models ready to go.

Or download the desktop app

Get MLX Studio -- a native macOS app with chat UI, model management, image generation, and developer tools. No terminal required. Just download the DMG and drag to Applications.

Use with OpenAI SDK

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="", flush=True)

Use with Anthropic SDK

import anthropic

client = anthropic.Anthropic(base_url="http://localhost:8000/v1", api_key="not-needed")
message = client.messages.create(
    model="local",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello!"}],
)
print(message.content[0].text)

Use with curl

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
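With `"stream": true`, the server emits OpenAI-style server-sent events: `data: {json}` lines ending with `data: [DONE]`. A minimal sketch of parsing those lines, assuming the standard OpenAI chunk shape (`choices[0].delta.content`):

```python
import json

def parse_sse_chunks(lines):
    """Yield the delta text from OpenAI-style SSE lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        if delta.get("content"):
            yield delta["content"]

# Two synthetic chunks for illustration:
sample = [
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    'data: [DONE]',
]
print("".join(parse_sse_chunks(sample)))  # Hello
```

In practice the OpenAI SDK (shown above) does this parsing for you; raw SSE handling is only needed with plain HTTP clients.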

Model Support

vMLX runs any MLX model. Point it at a HuggingFace repo or local path and go.

| Type | Models |
|------|--------|
| Text LLMs | Qwen 2/2.5/3/3.5, Llama 3/3.1/3.2/3.3/4, Mistral/Mixtral, Gemma 3, Phi-4, DeepSeek, GLM-4, MiniMax, Nemotron, StepFun, and any mlx-lm model |
| Vision LLMs | Qwen-VL, Qwen3.5-VL, Pixtral, InternVL, LLaVA, Gemma 3n |
| MoE Models | Qwen 3.5 MoE (A3B/A10B), Mixtral, DeepSeek V2/V3, MiniMax M2.5, Llama 4 |
| Hybrid SSM | Nemotron-H, Jamba, GatedDeltaNet (Mamba + Attention) |
| Image Gen | Flux Schnell/Dev, Z-Image Turbo (via mflux) |
| Image Edit | Qwen Image Edit (via mflux) |
| Embeddings | Any mlx-lm compatible embedding model |
| Reranking | Cross-encoder reranking models |
| Audio | Kokoro TTS, Whisper STT (via mlx-audio) |


Features

Inference Engine

| Feature | Description |
|---------|-------------|
| Continuous Batching | Handle multiple concurrent requests efficiently |
| Prefix Cache | Reuse KV states for repeated prompts -- makes follow-up messages instant |
| Paged KV Cache | Block-based caching with content-addressable deduplication |
| KV Cache Quantization | Compress cached states to q4/q8 for 2-4x memory savings |
| Disk Cache (L2) | Persist prompt caches to SSD -- survives server restarts |
| Block Disk Cache | Per-block persistent cache paired with paged KV cache |
| Speculative Decoding | Small draft model proposes tokens for 20-90% speedup |
| Prompt Lookup Decoding | No draft model needed -- reuses n-gram matches from the prompt/context. Best for structured or repetitive output (code, JSON, schemas). Enable with `--enable-pld`. |
| JIT Compilation | `mx.compile` Metal kernel fusion (experimental) |
| Hybrid SSM Support | Mamba/GatedDeltaNet layers handled correctly alongside attention |
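To see where the 2-4x figure for KV cache quantization comes from, here is the back-of-envelope arithmetic, using an illustrative (hypothetical) 8B-class config rather than any specific model:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem):
    # K and V each store layers * kv_heads * head_dim values per token
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical config: 32 layers, 8 KV heads (GQA), head_dim 128, 32k context
fp16 = kv_cache_bytes(32, 8, 128, 32_768, 2)    # float16: 2 bytes/value
q4   = kv_cache_bytes(32, 8, 128, 32_768, 0.5)  # ~4 bits/value

print(f"fp16: {fp16 / 2**30:.1f} GiB, q4: {q4 / 2**30:.1f} GiB")
# fp16: 4.0 GiB, q4: 1.0 GiB
```

Real q4 storage also keeps per-group scales and biases, so actual savings land slightly under the ideal 4x (and ~2x for q8).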

5-Layer Cache Architecture

Request -> Tokens
    |
L1: Memory-Aware Prefix Cache (or Paged Cache)
    | miss
L2: Disk Cache (or Block Disk Store)
    | miss
Inference -> float16 KV states
    |
KV Quantization -> q4/q8 for storage
    |
Store back into L1 + L2
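Conceptually, the L1 prefix cache keys on token prefixes and returns the longest cached match, so only the uncached tail of a prompt needs prefill. A toy sketch of that lookup (illustrative only, not vMLX's actual data structure):

```python
class PrefixCache:
    """Toy longest-prefix KV cache keyed on token tuples."""

    def __init__(self):
        self.store = {}  # token-prefix tuple -> opaque KV state

    def put(self, tokens, kv_state):
        self.store[tuple(tokens)] = kv_state

    def longest_match(self, tokens):
        # Walk from the full prompt down to the shortest prefix
        for end in range(len(tokens), 0, -1):
            key = tuple(tokens[:end])
            if key in self.store:
                return key, self.store[key]
        return (), None

cache = PrefixCache()
cache.put([1, 2, 3], "kv-for-123")
prefix, state = cache.longest_match([1, 2, 3, 4, 5])
# Cache hit on (1, 2, 3): only tokens [4, 5] still need prefill
```

A production paged cache does this per fixed-size block with content hashing instead of scanning whole prefixes, which is what makes deduplication across requests cheap.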

Tool Calling

Auto-detected parsers for every major model family:

qwen - llama - mistral - hermes - deepseek - glm47 - minimax - nemotron - granite - functionary - xlam - kimi - step3p5
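For illustration, the Qwen-family convention wraps each call in `<tool_call>` tags containing a JSON object; a minimal extraction sketch (the tag format is assumed from Qwen's chat template, not taken from vMLX's parser code):

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def extract_tool_calls(text):
    """Return the JSON payload of every <tool_call> block in the output."""
    return [json.loads(m) for m in TOOL_CALL_RE.findall(text)]

output = (
    "Sure.\n<tool_call>\n"
    '{"name": "get_weather", "arguments": {"city": "Seoul"}}\n'
    "</tool_call>"
)
calls = extract_tool_calls(output)
print(calls[0]["name"])  # get_weather
```

Each model family uses a different wrapper (XML tags, special tokens, raw JSON), which is why the server ships one parser per family and auto-detects from the model.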

Reasoning / Thinking Mode

Auto-detected reasoning parsers that extract <think> blocks:

qwen3 (Qwen3, QwQ, MiniMax, StepFun) - deepseek_r1 (DeepSeek R1, Gemma 3, GLM, Phi-4) - openai_gptoss (GLM Flash, GPT-OSS)
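The core of such a parser is separating the `<think>` span from the visible answer; a minimal sketch (a simplification -- real parsers also handle streaming and unclosed tags):

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(text):
    """Separate <think> reasoning from the visible answer."""
    reasoning = "\n".join(m.strip() for m in THINK_RE.findall(text))
    answer = THINK_RE.sub("", text).strip()
    return reasoning, answer

raw = "<think>2+2 is 4</think>The answer is 4."
reasoning, answer = split_reasoning(raw)
# reasoning -> "2+2 is 4", answer -> "The answer is 4."
```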

Audio

| Feature | Description |
|---------|-------------|
| Text-to-Speech | Kokoro TTS via mlx-audio -- multiple voices, streaming output |
| Speech-to-Text | Whisper STT via mlx-audio -- transcription and translation |


Image Generation & Editing

Generate and edit images locally with Flux models via mflux.

pip install "vmlx[image]"

# Image generation
vmlx serve schnell                    # or dev, z-image-turbo
vmlx serve ~/.mlxstudio/models/image/flux1-schnell-4bit

# Image editing
vmlx serve qwen-image-edit            # instruction-based editing

Generation API

curl http://localhost:8000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "schnell",
    "prompt": "A cat astronaut floating in space with Earth in the background",
    "size": "1024x1024",
    "n": 1
  }'

# Python (OpenAI SDK)
response = client.images.generate(
    model="schnell",
    prompt="A cat astronaut floating in space",
    size="1024x1024",
    n=1,
)

Editing API

# Edit an image with a text prompt (Flux Kontext / Qwen Image Edit)
curl http://localhost:8000/v1/images/edits \
  -H "Content-Type: application/json" \
  -d '{
    "model": "flux-kontext",
    "prompt": "Change the background to a sunset",
    "image": "<base64-encoded-image>",
    "size": "1024x1024",
    "strength": 0.8
  }'

# Python
import base64
import requests
with open("source.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = requests.post("http://localhost:8000/v1/images/edits", json={
    "model": "flux-kontext",
    "prompt": "Make the sky purple",
    "image": image_b64,
    "size": "1024x1024",
    "strength": 0.8,
})
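Assuming the response body mirrors OpenAI's images API shape (`{"data": [{"b64_json": "..."}]}` -- an assumption, check your server's actual response), saving the result is a one-liner:

```python
import base64

def save_image(b64_payload, path):
    """Decode a base64 image payload and write it to disk."""
    with open(path, "wb") as f:
        f.write(base64.b64decode(b64_payload))

# Hypothetical usage, assuming an OpenAI-style response body:
#   save_image(response.json()["data"][0]["b64_json"], "edited.png")
```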

Supported Image Models

Generation Models:

| Model | Steps | Speed | Memory |
|-------|-------|-------|--------|
| Flux Schnell | 4 | Fastest | ~6-24 GB |
| Z-Image Turbo | 4 | Fast | ~6-24 GB |
| Flux Dev | 20 | Slow | ~6-24 GB |

Editing Models:

| Model | Steps | Type | Memory |
|-------|-------|------|--------|
