
Ullm

Lightweight LLM inference engine inspired by nano-vllm, with a radix-tree based prefix cache, tensor & pipeline parallelism, CUDA graphs, an OpenAI-compatible API, async scheduling, and more.

<p align="center">
  <picture>
    <source media="(prefers-color-scheme: dark)" srcset="docs/assets/logo-dark.png">
    <img alt="vLLM" src="docs/assets/logo-light.png" width="400">
  </picture>
</p>
<p align="center">
  μLLM: A lightweight vLLM-like LLM inference engine with radix-tree based KV cache, and more.
</p>

This project is inspired by nano-vllm and provides an LLM inference framework built from scratch.

The architecture follows an organization similar to vLLM v1, but unlike vLLM, the KV cache system uses SGLang's Radix Cache design.

Features

  • Lightweight but complete code implementation
  • FCFS scheduling and continuous batching
  • OpenAI-compatible API server
  • Radix-tree based prefix caching (see the sketch after this list)
  • Flash Attention kernel support (FlashInfer implementation)
  • Tensor parallelism
  • Pipeline parallelism
  • CUDA Graph support (decoding phase only)
  • Incremental scheduling and stateful workers
  • Torch Profiler support
  • Fully async scheduling
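
Prefix caching keeps previously computed token sequences in a radix tree so that a new request can reuse the KV cache of its longest already-computed prefix instead of recomputing it. The following is a minimal illustrative sketch using a plain token-level trie; names like PrefixCache and kv_handle are made up for illustration, and the real implementation follows SGLang's RadixCache, which compresses token runs into edges and also handles eviction and reference counting.

class TrieNode:
    def __init__(self):
        self.children = {}     # token id -> TrieNode
        self.kv_handle = None  # reference to the cached KV blocks for this prefix

class PrefixCache:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, token_ids, kv_handle):
        # Record that the KV cache for this token sequence is available for reuse.
        node = self.root
        for t in token_ids:
            node = node.children.setdefault(t, TrieNode())
        node.kv_handle = kv_handle

    def match_prefix(self, token_ids):
        # Return (matched length, kv_handle of the longest cached prefix).
        node, matched, handle = self.root, 0, None
        for t in token_ids:
            if t not in node.children:
                break
            node = node.children[t]
            matched += 1
            if node.kv_handle is not None:
                handle = node.kv_handle
        return matched, handle

# A finished request caches its prompt; a new request sharing the first 4 tokens
# only needs to prefill the remaining ones.
cache = PrefixCache()
cache.insert([1, 2, 3, 4], kv_handle="blocks-A")
matched, handle = cache.match_prefix([1, 2, 3, 4, 9, 8])  # matched == 4, handle == "blocks-A"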

Requirements

torch>=2.8.0
numpy
triton>=3.0.0
transformers>=4.51.0,<=4.57.3
fastapi>=0.95.0
uvicorn
flashinfer-python>=0.2.7
psutil
pyzmq>=25.0.0
msgspec
cloudpickle
tqdm

NOTE: I found that with NCCL versions < 2.27.3, distributed communication and environment teardown can hang. It is recommended to upgrade nvidia-nccl-cu12 to 2.27.3 or later; the dependencies of torch 2.8.0 already pull in an NCCL at or above this version. If you must use an older torch, manually upgrade NCCL to >= 2.27.3.
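
To check which NCCL version your environment is using and upgrade it if needed, the commands below are a general suggestion (not part of this project):

python -c "import torch; print(torch.cuda.nccl.version())"   # e.g. (2, 27, 3)
pip install -U "nvidia-nccl-cu12>=2.27.3"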

Installation

git clone https://github.com/izmttk/ullm && cd ullm
pip install -e .

Alternatively, you can skip the install and run the server directly from the project root.

Quick Start

Start the API server

example:
python -m ullm.entrypoints.api_server --model Qwen3-0.6B --gpu-memory-utilization 0.9 --tp-size 2 --pp-size 2 --context-len 4096 --host 0.0.0.0 --port 8000

usage: api_server.py [-h] [--host HOST] [--port PORT] --model MODEL [--gpu-memory-utilization GPU_MEMORY_UTILIZATION] [--max-bs MAX_BS] [--tp-size TP_SIZE] [--pp-size PP_SIZE]
                     [--nccl-port NCCL_PORT] [--device-ids DEVICE_IDS] [--context-len CONTEXT_LEN] [--enforce-eager] [--log-level LOG_LEVEL] [--profile] [--profile-dir PROFILE_DIR]
                     [--disable-async-scheduling]

LLM Distributed OpenAI-Compatible API Server

options:
  -h, --help            show this help message and exit
  --host HOST           Host name
  --port PORT           Port number
  --model MODEL         Model name
  --gpu-memory-utilization GPU_MEMORY_UTILIZATION
                        GPU memory utilization
  --max-bs MAX_BS       Maximum batch size
  --tp-size TP_SIZE     Tensor parallel size
  --pp-size PP_SIZE     Pipeline parallel size
  --nccl-port NCCL_PORT
                        NCCL port for distributed run
  --device-ids DEVICE_IDS
                        Comma-separated list of GPU device IDs to use
  --context-len CONTEXT_LEN
                        Max context length of the model
  --enforce-eager       Enforce eager execution, disable CUDA graph
  --log-level LOG_LEVEL
                        Log level for the engine
  --profile             Enable profiling support
  --profile-dir PROFILE_DIR
                        Directory to save profiling results
  --disable-async-scheduling
                        Disable asynchronous scheduling
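
Once the server is up, any OpenAI-compatible client can talk to it. Below is a minimal sketch using the official openai Python package; it assumes the server exposes the standard /v1/chat/completions route, that the model name matches the --model value passed to the server, and that the API key is not validated (assumptions based on the "OpenAI-compatible" claim, not verified against this project's code).

from openai import OpenAI

# Point the client at the local server; the api_key value is a placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen3-0.6B",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=64,
    temperature=0.6,
)
print(resp.choices[0].message.content)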

Offline Inference

from ullm import LLM, SamplingParams, EngineConfig
import asyncio

async def main():
    config = EngineConfig(
        model="Qwen3-0.6B",
        gpu_memory_utilization=0.9,
        tp_size=2,
        pp_size=2,
        context_len=4096,
        enforce_eager=False,
    )
    llm = LLM(config)

    prompt = "Once upon a time"
    async for token in llm.generate(
        prompt,
        SamplingParams(
            max_new_tokens=50,
            temperature=0.6,
            top_p=0.95,
            top_k=20,
        )
    ):
        print(token, end="", flush=True)

if __name__ == "__main__":
    asyncio.run(main())
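
llm.generate is an async generator that yields pieces of output as they are produced. If you only need the final text, you can collect the stream instead of printing it; this assumes, as the print call above suggests, that the yielded items are decoded strings.

# Inside an async function such as main() above:
pieces = [piece async for piece in llm.generate(prompt, SamplingParams(max_new_tokens=50))]
text = "".join(pieces)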

Benchmarks

Experiment Environment:

  • GPU: A100 40GB
  • Model: Qwen3-0.6B
  • Number of Requests: 256
  • Prompt Length: random 100 ~ 1024
  • Generation Length: random 100 ~ 1024
  • Script: bench.py

Results:

| Inference Engine | Output Tokens | Time (s) | Throughput (tokens/s) |
|------------------|---------------|----------|-----------------------|
| vLLM v0.11.0     | 133966        | 18.24    | 7343.96               |
| ours             | 133966        | 14.83    | 9032.37               |
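
The throughput column is simply output tokens divided by wall-clock time (133966 / 14.83 s ≈ 9.0k tokens/s), which works out to roughly 23% higher throughput than vLLM v0.11.0 in this particular setup.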

TODO

  • [x] Graceful Shutdown
  • [x] Better Logging System
  • [x] Multi-process communication optimizations
  • [x] More Configurable Options
  • [x] Profiling
  • [x] Incremental Batch Scheduling and Stateful Workers
  • [x] Fully Async Scheduling (Overlap Scheduling/Zero-Overhead Scheduling)
  • [ ] Further Improvements based on Profiling
  • [ ] Benchmark Metrics on API Server

Further development is still ongoing.
