# SMG
Engine-agnostic LLM gateway in Rust. Full OpenAI & Anthropic API compatibility across SGLang, vLLM, TRT-LLM, OpenAI, Gemini & more. Industry-first gRPC pipeline, KV cache-aware routing, chat history, tokenization caching, Responses API, embeddings, WASM plugins, MCP, and multi-tenant auth.
High-performance model-routing gateway for large-scale LLM deployments. Centralizes worker lifecycle management, balances traffic across HTTP/gRPC/OpenAI-compatible backends, and provides enterprise-ready control over history storage, MCP tooling, and privacy-sensitive workflows.
<p align="center"> <img src="https://raw.githubusercontent.com/lightseekorg/smg/main/docs/assets/images/architecture-animated.svg" alt="SMG Architecture" width="100%"> </p>

## Why SMG?
| | |
|:--------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 🚀 **Maximize GPU Utilization** | Cache-aware routing understands your inference engine's KV cache state—whether SGLang, vLLM, or TensorRT-LLM—to reuse prefixes and reduce redundant computation. |
| 🔌 **One API, Any Backend** | Route to self-hosted models (SGLang, vLLM, TensorRT-LLM) or cloud providers (OpenAI, Anthropic, Gemini, Bedrock, and more) through a single unified endpoint. |
| ⚡ **Built for Speed** | Native Rust with gRPC pipelines, sub-millisecond routing decisions, and zero-copy tokenization. Circuit breakers and automatic failover keep things running. |
| 🔒 **Enterprise Control** | Multi-tenant rate limiting with OIDC, WebAssembly plugins for custom logic, and a privacy boundary that keeps conversation history within your infrastructure. |
| 📊 **Full Observability** | 40+ Prometheus metrics, OpenTelemetry tracing, and structured JSON logs with request correlation—know exactly what's happening at every layer. |
**API Coverage:** OpenAI Chat/Completions/Embeddings, Responses API for agents, Anthropic Messages, and MCP tool execution.
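Because Anthropic Messages compatibility is exposed alongside the OpenAI endpoints, a request in Anthropic's wire format can be sent to the same gateway. A minimal sketch, assuming the gateway's default port 30000 and the standard Anthropic `/v1/messages` route (the exact path SMG mounts may differ):

```shell
# Anthropic Messages-style request; field names follow the standard
# Anthropic API, and the /v1/messages route is an assumption here
curl http://localhost:30000/v1/messages \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3", "max_tokens": 256, "messages": [{"role": "user", "content": "Hello!"}]}'
```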
## Quick Start
**Install** — pick your preferred method:

```shell
# Docker
docker pull lightseekorg/smg:latest

# Python
pip install smg

# Rust
cargo install smg
```
**Run** — point SMG at your inference workers:

```shell
# Single worker
smg --worker-urls http://localhost:8000

# Multiple workers with cache-aware routing
smg --worker-urls http://gpu1:8000 http://gpu2:8000 --policy cache_aware

# With high-availability mesh
smg --worker-urls http://gpu1:8000 --ha-mesh --seeds 10.0.0.2:30001,10.0.0.3:30001
```
**Use** — send requests to the gateway:

```shell
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3", "messages": [{"role": "user", "content": "Hello!"}]}'
```
That's it. SMG is now load-balancing requests across your workers.
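Since the endpoint is OpenAI-compatible, streaming follows the standard OpenAI convention: set `"stream": true` and read Server-Sent Events from the response. A sketch, assuming the same default port:

```shell
# Stream tokens as they are generated; with the OpenAI-style protocol,
# each line is a "data: {...}" chunk and the stream ends with "data: [DONE]".
# -N disables curl's output buffering so chunks appear immediately.
curl -N http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3", "stream": true, "messages": [{"role": "user", "content": "Hello!"}]}'
```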
## Supported Backends
| Self-Hosted | Cloud Providers |
|-------------|-----------------|
| vLLM | OpenAI |
| SGLang | Anthropic |
| TensorRT-LLM | Google Gemini |
| Ollama | AWS Bedrock |
| Any OpenAI-compatible server | Azure OpenAI |
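Any of the self-hosted backends above is registered the same way as in the Quick Start. For example, Ollama serves an OpenAI-compatible API under its `/v1` path, so it can in principle be added as a worker; a sketch, assuming Ollama's default port 11434 (whether the `/v1` suffix is required depends on how SMG builds request paths):

```shell
# Register a local Ollama instance (default port 11434) as a worker;
# the /v1 suffix targets Ollama's OpenAI-compatible endpoint
smg --worker-urls http://localhost:11434/v1
```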
## Features
| Feature | Description |
|---------|-------------|
| 8 Routing Policies | `cache_aware`, `round_robin`, `power_of_two`, `consistent_hashing`, `prefix_hash`, `manual`, `random`, `bucket` |
| gRPC Pipeline | Native gRPC with streaming, reasoning extraction, and tool call parsing |
| MCP Integration | Connect external tool servers via Model Context Protocol |
| High Availability | Mesh networking with SWIM protocol for multi-node deployments |
| Chat History | Pluggable storage: PostgreSQL, Oracle, Redis, or in-memory |
| WASM Plugins | Extend with custom WebAssembly logic |
| Resilience | Circuit breakers, retries with backoff, rate limiting |
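Each routing policy listed above is selected with the `--policy` flag shown in the Quick Start. A sketch switching from the default cache-aware behavior to consistent hashing, which keeps requests with the same hash key pinned to the same worker:

```shell
# Use consistent hashing instead of cache-aware routing, so a given
# request key maps stably to one of the two workers
smg --worker-urls http://gpu1:8000 http://gpu2:8000 --policy consistent_hashing
```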
## Documentation
| | |
|:--|:--|
| Getting Started | Installation and first steps |
| Architecture | How SMG works |
| Configuration | CLI reference and options |
| API Reference | OpenAI-compatible endpoints |
| Kubernetes Setup | In-cluster discovery and production setup |
## Contributing
We welcome contributions! See Contributing Guide for details.
