Lsglang: GPU + NUMA Dual Parallel [中文]
Lsglang is a special extension of sglang that fully utilizes CPU and GPU computing resources with an efficient GPU parallel + NUMA parallel architecture, suitable for MoE model hybrid inference.
System Features
- GPU + NUMA Dual Parallel: Supports CPU-GPU hybrid decoding, CPU-GPU hybrid prefill, and GPU prefill computing modes
- VRAM + Memory Load Balancing: total model capacity = VRAM + system memory, so VRAM and RAM together hold the model ("1 + 1 = 2") while keeping VRAM at 100% utilization <sup>Note 1</sup>
- GPU Prefill Optimization: GPU prefill runs in parallel with CPU-GPU hybrid decoding, achieving nearly 100% GPU utilization
- NUMA Thread Optimization: cross-node communication as low as 3%, L3 cache hit rate above 50%; the decoding phase can drive GPU load to 33% to 50%
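Before tuning thread counts or thread binding, it helps to know the machine's NUMA layout. A quick look with standard Linux tools (`lscpu`, `nproc`) might look like this; the exact output depends on your hardware:

```shell
# Show socket and NUMA-node layout (useful when choosing LK_THREADS and thread binding)
lscpu | grep -i -e 'numa' -e 'socket' || echo "no NUMA info reported"
# Total logical CPUs available for LK_THREADS / OMP_NUM_THREADS
nproc
```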
Relationship with sglang
Lsglang tracks the latest sglang source code; the MoE model hybrid inference module is redesigned and reimplemented while remaining fully compatible with sglang <sup>Note 1</sup>.
Note 1: supported on x86 CPUs with the AVX2 instruction set or newer, together with NVIDIA GPUs.
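Per the note above, the CPU path needs AVX2. One way to verify this is to check the flags the kernel reports in /proc/cpuinfo:

```shell
# Lsglang's CPU kernels require AVX2 (Note 1); check the flags reported by the kernel
if grep -qm1 avx2 /proc/cpuinfo; then
  echo "AVX2: supported"
else
  echo "AVX2: missing - Lsglang's CPU path will not run on this machine"
fi
```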
Usage Guide [中文]
- Version Changes
- How to Run Qwen3.5-122B-A10B
- How to Run Qwen3.5-397B-A17B
- How to Run MiniMax-M2.5
- How to Run GLM-5.1-FP8
- How to Run Kimi K2.5
- How to Run Qwen3-Coder-Next-FP8
- Supported Models
- Performance Reference
- Configuration Parameters
- Installation Steps
- Update
- Optimization
Version Changes
2026-04-06: Lsglang-v1.2.0 - improve the LK_POWER_SAVING=1 power-saving mode, support FP8 + BF16 + AWQ 4-bit hybrid MoE-layer inference
2026-04-03: Lsglang-v1.1.4 - support local compilation of sgl-kernel, fix known issues
2026-03-11: Lsglang-v1.1.3 - FP8 and AWQ 4-bit MoE models gain GPU prefill acceleration with no additional memory use; FP8 MoE models drop the TO_DTYPE runtime type conversion; KEEP models do not yet support GPU prefill
Note: 30-series graphics cards can enable GPU prefill acceleration for FP8 models by removing the LVLLM_GPU_RESIDENT_MOE_LAYERS parameter.
2026-03-05: Lsglang-v1.1.0 - support GPU prefill, update corresponding commands (FP8 models cannot enable GPU prefill on GPU architectures older than the 3090)
2026-02-25: Lsglang-v1.0.6 - fix known issues, support new models
2026-02-10: Lsglang-v1.0.0 - ported from the LvLLM project (https://github.com/guqiong96/Lvllm); verified with BF16/F16 original models, FP8 original models, and AWQ 4-bit symmetric quantized models.
How to Run Qwen3.5-122B-A10B
pip uninstall transformers -y
pip install transformers==5.3.0
PYTORCH_ALLOC_CONF=expandable_segments:True \
SGLANG_FORCE_FP8_MARLIN=1 \
SGLANG_ENABLE_JIT_DEEPGEMM=0 \
NCCL_SOCKET_IFNAME=lo \
NCCL_IB_DISABLE=1 \
GLOO_SOCKET_IFNAME=lo \
NCCL_SOCKET_TIMEOUT=600000 \
LVLLM_MOE_NUMA_ENABLED=1 \
LK_THREAD_BINDING=CPU_CORE \
LK_THREADS=44 \
OMP_NUM_THREADS=44 \
LVLLM_MOE_USE_WEIGHT=INT4 \
LVLLM_ENABLE_NUMA_INTERLEAVE=1 \
LVLLM_MOE_QUANT_ON_GPU=1 \
LVLLM_GPU_PREFETCH_WINDOW=1 \
LVLLM_GPU_PREFILL_MIN_BATCH_SIZE=2048 \
python -m sglang.launch_server \
--model /home/guqiong/Models/Qwen3.5-122B-A10B \
--served-model-name Qwen3.5-122B-A10B \
--host 0.0.0.0 \
--port 8070 \
--trust-remote-code \
--tensor-parallel-size 2 \
--max-running-requests 2 \
--enable-p2p-check \
--chunked-prefill-size 32000 \
--max-total-tokens 66000 \
--mem-fraction-static 0.90 \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3 \
--attention-backend triton \
--fp8-gemm-backend triton \
--kv-cache-dtype bf16 \
--disable-piecewise-cuda-graph
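Once the server comes up on port 8070 (as configured above), a quick sanity check can hit the server's routes. This sketch assumes the standard sglang `/health` route and the OpenAI-compatible `/v1/models` listing:

```shell
# Liveness probe against the launch above (port 8070)
curl -s --max-time 5 http://127.0.0.1:8070/health || echo "server not reachable"
# The served model name should appear in the OpenAI-compatible model list
curl -s --max-time 5 http://127.0.0.1:8070/v1/models || echo "server not reachable"
```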
How to Run Qwen3.5-397B-A17B
pip uninstall transformers -y
pip install transformers==5.3.0
# Drop the page cache so the model weights load into fresh memory, then confirm free memory
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
free -h
PYTORCH_ALLOC_CONF=expandable_segments:True \
SGLANG_FORCE_FP8_MARLIN=1 \
SGLANG_ENABLE_JIT_DEEPGEMM=0 \
NCCL_SOCKET_IFNAME=lo \
NCCL_IB_DISABLE=1 \
GLOO_SOCKET_IFNAME=lo \
NCCL_SOCKET_TIMEOUT=600000 \
LVLLM_MOE_NUMA_ENABLED=1 \
LK_THREAD_BINDING=CPU_CORE \
LK_THREADS=44 \
OMP_NUM_THREADS=44 \
LVLLM_MOE_USE_WEIGHT=INT4 \
LVLLM_ENABLE_NUMA_INTERLEAVE=1 \
LVLLM_MOE_QUANT_ON_GPU=1 \
LVLLM_GPU_PREFETCH_WINDOW=1 \
LVLLM_GPU_PREFILL_MIN_BATCH_SIZE=2048 \
python -m sglang.launch_server \
--model "/home/guqiong/Models/Qwen3.5-397B-A17B" \
--served-model-name "Qwen3.5-397B-A17B" \
--host 0.0.0.0 \
--port 8070 \
--trust-remote-code \
--tensor-parallel-size 2 \
--max-running-requests 2 \
--enable-p2p-check \
--chunked-prefill-size 32000 \
--max-total-tokens 66000 \
--mem-fraction-static 0.90 \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3 \
--attention-backend triton \
--fp8-gemm-backend triton \
--kv-cache-dtype bf16 \
--disable-piecewise-cuda-graph
# Multi-Token Prediction (MTP)
# --reasoning-parser qwen3 \
# --speculative-algo NEXTN \
# --speculative-num-steps 3 \
# --speculative-eagle-topk 1 \
# --speculative-num-draft-tokens 4 \
# Processing Ultra-Long Texts
# --json-model-override-args '{"text_config": {"rope_parameters": {"mrope_interleaved": true, "mrope_section": [11, 11, 10], "rope_type": "yarn", "rope_theta": 10000000, "partial_rotary_factor": 0.25, "factor": 4.0, "original_max_position_embeddings": 262144}}}'
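The string passed to --json-model-override-args must be valid JSON, and a quoting mistake only surfaces at launch. It can be checked beforehand; shown here with an abbreviated version of the yarn rope override above:

```shell
# Sanity-check the override JSON before handing it to --json-model-override-args
# (abbreviated version of the rope override shown above)
OVERRIDE='{"text_config": {"rope_parameters": {"rope_type": "yarn", "factor": 4.0, "original_max_position_embeddings": 262144}}}'
echo "$OVERRIDE" | python3 -m json.tool > /dev/null && echo "override JSON OK"
```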
How to Run MiniMax-M2.5
pip uninstall transformers -y
pip install transformers==5.3.0
PYTORCH_ALLOC_CONF=expandable_segments:True \
SGLANG_FORCE_FP8_MARLIN=1 \
SGLANG_ENABLE_JIT_DEEPGEMM=0 \
NCCL_SOCKET_IFNAME=lo \
NCCL_IB_DISABLE=1 \
GLOO_SOCKET_IFNAME=lo \
NCCL_SOCKET_TIMEOUT=600000 \
LVLLM_MOE_NUMA_ENABLED=1 \
LK_THREAD_BINDING=CPU_CORE \
LK_THREADS=44 \
OMP_NUM_THREADS=44 \
LVLLM_MOE_USE_WEIGHT=INT4 \
LVLLM_ENABLE_NUMA_INTERLEAVE=1 \
LVLLM_MOE_QUANT_ON_GPU=1 \
LVLLM_GPU_PREFETCH_WINDOW=1 \
LVLLM_GPU_PREFILL_MIN_BATCH_SIZE=2048 \
python -m sglang.launch_server \
--model "/home/guqiong/Models/MiniMax-M2.5" \
--served-model-name MiniMax-M2.5 \
--host 0.0.0.0 \
--port 8070 \
--trust-remote-code \
--tensor-parallel-size 2 \
--max-running-requests 2 \
--enable-p2p-check \
--chunked-prefill-size 32000 \
--max-total-tokens 66000 \
--mem-fraction-static 0.90 \
--tool-call-parser minimax-m2 \
--reasoning-parser minimax-append-think \
--attention-backend triton \
--fp8-gemm-backend triton \
--kv-cache-dtype bf16 \
--disable-piecewise-cuda-graph
# If performance is poor, try binding threads to NUMA nodes and reducing the thread count
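One way to act on the note above is to restrict the whole server process to a single NUMA node. numactl (if installed) is the usual tool; taskset (util-linux) pins to an explicit CPU list. Both commands below are generic Linux tools, not Lsglang features:

```shell
# Pin the server to NUMA node 0's CPUs and memory (requires the numactl package):
#   numactl --cpunodebind=0 --membind=0 python -m sglang.launch_server ...
# taskset pins to an explicit CPU list; a trivial demonstration on CPU 0:
taskset -c 0 echo "pinned to CPU 0"
```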
How to Run GLM-5.1-FP8
pip uninstall transformers -y
pip install transformers==5.3.0
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
free -h
PYTORCH_ALLOC_CONF=expandable_segments:True \
SGLANG_FORCE_FP8_MARLIN=1 \
SGLANG_ENABLE_JIT_DEEPGEMM=0 \
NCCL_SOCKET_IFNAME=lo \
NCCL_IB_DISABLE=1 \
GLOO_SOCKET_IFNAME=lo \
NCCL_SOCKET_TIMEOUT=600000 \
LVLLM_MOE_NUMA_ENABLED=1 \
LK_THREAD_BINDING=CPU_CORE \
LK_THREADS=44 \
OMP_NUM_THREADS=44 \
LVLLM_MOE_USE_WEIGHT=INT4 \
LVLLM_ENABLE_NUMA_INTERLEAVE=1 \
LVLLM_MOE_QUANT_ON_GPU=1 \
LVLLM_GPU_PREFETCH_WINDOW=1 \
LVLLM_GPU_PREFILL_MIN_BATCH_SIZE=2048 \
python -m sglang.launch_server \
--model "/home/guqiong/Models/GLM-5.1-FP8" \
--served-model-name "GLM-5.1-FP8" \
--host "0.0.0.0" \
--port "8070" \
--trust-remote-code \
--tensor-parallel-size 2 \
--enable-p2p-check \
--max-running-requests 2 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--chunked-prefill-size 32000 \
--max-total-tokens 32768 \
--mem-fraction-static 0.90 \
--attention-backend triton \
--fp8-gemm-backend triton \
--kv-cache-dtype bf16 \
--disable-piecewise-cuda-graph \
--disable-shared-experts-fusion
# --nsa-prefill-backend "tilelang" \
# --nsa-decode-backend "tilelang" \
How to Run Kimi K2.5
pip uninstall transformers -y
pip install transformers==5.3.0
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
free -h
PYTORCH_ALLOC_CONF=expandable_segments:True \
SGLANG_FORCE_FP8_MARLIN=1 \
SGLANG_ENABLE_JIT_DEEPGEMM=0 \
NCCL_SOCKET_IFNAME=lo \
NCCL_IB_DISABLE=1 \
GLOO_SOCKET_IFNAME=lo \
NCCL_SOCKET_TIMEOUT=600000 \
LVLLM_MOE_NUMA_ENABLED=1 \
LK_THREAD_BINDING=CPU_CORE \
LK_THREADS=44 \
OMP_NUM_THREADS=44 \
LVLLM_MOE_USE_WEIGHT=INT4 \
LVLLM_ENABLE_NUMA_INTERLEAVE=1 \
LVLLM_MOE_QUANT_ON_GPU=1 \
LVLLM_GPU_PREFETCH_WINDOW=1 \
LVLLM_GPU_PREFILL_MIN_BATCH_SIZE=2048 \
python -m sglang.launch_server \
--model "/home/guqiong/Models/Kimi-K2.5" \
--served-model-name "Kimi-K2.5" \
--host "0.0.0.0" \
--port "8070" \
--trust-remote-code \
--tensor-parallel-size 2 \
--enable-p2p-check \
--max-running-requests 2 \
--tool-call-parser kimi_k2 \
--reasoning-parser kimi_k2 \
--chunked-prefill-size 32000 \
--max-total-tokens 32768 \
--mem-fraction-static 0.90 \
--attention-backend triton \
--fp8-gemm-backend triton \
--kv-cache-dtype bf16 \
--disable-piecewise-cuda-graph
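To verify end-to-end generation after any of the launches above, a small chat request can be sent to the OpenAI-compatible endpoint. The "model" field must match --served-model-name (Kimi-K2.5 from this launch is shown):

```shell
# Minimal end-to-end generation test against the OpenAI-compatible API on port 8070
curl -s --max-time 60 http://127.0.0.1:8070/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Kimi-K2.5", "messages": [{"role": "user", "content": "Say hello."}], "max_tokens": 32}' \
  || echo "server not reachable"
```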
How to Run Qwen3-Coder-Next-FP8
pip uninstall transformers -y
pip install transformers==5.3.0
PYTORCH_ALLOC_CONF=expandable_segments:True \
SGLANG_FORCE_FP8_MARLIN=1 \
SGLANG_ENABLE_JIT_DEEPGEMM=0 \
NCCL_SOCKET_IFNAME=lo \
NCCL_IB_DISABLE=1 \
GLOO_SOCKET_IFNAME=lo \
NCCL_SOCKET_TIMEOUT=600000 \
LVLLM_MOE_NUMA_ENABLED=1 \
LK_THREAD_BINDING=CPU_CORE \
LK_THREADS=44 \
OMP_NUM_THREADS=44 \
LVLLM_MOE_USE_WEIGHT=INT4 \
LVLLM_ENABLE_NUMA_INTERLEAVE=1 \
LVLLM_MOE_QUANT_ON_GPU=1 \
LVLLM_GPU_PREFETCH_WINDOW=1 \
LVLLM_GPU_PREFILL_M
