Lsglang: GPU + NUMA Dual Parallel [中文]
Lsglang is a special extension of sglang that fully utilizes CPU and GPU computing resources with an efficient GPU parallel + NUMA parallel architecture, suitable for MoE model hybrid inference.
System Features
- GPU + NUMA Dual Parallel: Supports CPU-GPU hybrid decoding, CPU-GPU hybrid prefill, and GPU prefill computing modes
- VRAM + Memory Load Balancing: total model capacity = VRAM + system memory, so VRAM and RAM together hold the model ("1 + 1 = 2") while keeping VRAM at 100% utilization <sup>Note 1</sup>
- GPU Prefill Optimization: GPU prefill runs in parallel with CPU-GPU hybrid decoding, achieving nearly 100% GPU utilization
- NUMA Thread Optimization: cross-node communication as low as 3%, L3 cache hit rate above 50%; the decoding phase can drive GPU load to 33% to 50%
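Before tuning thread counts or thread binding, it helps to know the machine's NUMA layout. A quick look with standard Linux tools (`lscpu`, `nproc`) might look like this; the exact output depends on your hardware:

```shell
# Show socket and NUMA-node layout (useful when choosing LK_THREADS and thread binding)
lscpu | grep -i -e 'numa' -e 'socket' || echo "no NUMA info reported"
# Total logical CPUs available for LK_THREADS / OMP_NUM_THREADS
nproc
```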
Relationship with sglang
Lsglang tracks the latest sglang source code; the MoE model hybrid inference module is redesigned and reimplemented while remaining fully compatible with sglang <sup>Note 1</sup>.
Note 1: supported on x86 CPUs with the AVX2 instruction set or newer, together with NVIDIA GPUs.
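Per the note above, the CPU path needs AVX2. One way to verify this is to check the flags the kernel reports in /proc/cpuinfo:

```shell
# Lsglang's CPU kernels require AVX2 (Note 1); check the flags reported by the kernel
if grep -qm1 avx2 /proc/cpuinfo; then
  echo "AVX2: supported"
else
  echo "AVX2: missing - Lsglang's CPU path will not run on this machine"
fi
```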
Usage Guide [中文]
- Version Changes
- How to Run Qwen3.5-122B-A10B
- How to Run Qwen3.5-397B-A17B
- How to Run MiniMax-M2.5
- How to Run GLM-5.1-FP8
- How to Run Kimi K2.5
- How to Run Qwen3-Coder-Next-FP8
- Supported Models
- Performance Reference
- Configuration Parameters
- Installation Steps
- Update
- Optimization
Version Changes
2026-04-06: Lsglang-v1.2.0 - improve the LK_POWER_SAVING=1 power-saving mode, support FP8 + BF16 + AWQ 4-bit hybrid MoE-layer inference
2026-04-03: Lsglang-v1.1.4 - support local compilation of sgl-kernel, fix known issues
2026-03-11: Lsglang-v1.1.3 - FP8 and AWQ 4-bit MoE models gain GPU prefill acceleration with no additional memory use; FP8 MoE models drop the TO_DTYPE runtime type conversion; KEEP models do not yet support GPU prefill
Note: 30-series graphics cards can enable GPU prefill acceleration for FP8 models by removing the LVLLM_GPU_RESIDENT_MOE_LAYERS parameter.
2026-03-05: Lsglang-v1.1.0 - support GPU prefill, update corresponding commands (FP8 models cannot enable GPU prefill on GPU architectures older than the 3090)
2026-02-25: Lsglang-v1.0.6 - fix known issues, support new models
2026-02-10: Lsglang-v1.0.0 - ported from the LvLLM project (https://github.com/guqiong96/Lvllm); verified with BF16/F16 original models, FP8 original models, and AWQ 4-bit symmetric quantized models.
How to Run Qwen3.5-122B-A10B
pip uninstall transformers -y
pip install transformers==5.3.0
PYTORCH_ALLOC_CONF=expandable_segments:True \
SGLANG_FORCE_FP8_MARLIN=1 \
SGLANG_ENABLE_JIT_DEEPGEMM=0 \
NCCL_SOCKET_IFNAME=lo \
NCCL_IB_DISABLE=1 \
GLOO_SOCKET_IFNAME=lo \
NCCL_SOCKET_TIMEOUT=600000 \
LVLLM_MOE_NUMA_ENABLED=1 \
LK_THREAD_BINDING=CPU_CORE \
LK_THREADS=44 \
OMP_NUM_THREADS=44 \
LVLLM_MOE_USE_WEIGHT=INT4 \
LVLLM_ENABLE_NUMA_INTERLEAVE=1 \
LVLLM_MOE_QUANT_ON_GPU=1 \
LVLLM_GPU_PREFETCH_WINDOW=1 \
LVLLM_GPU_PREFILL_MIN_BATCH_SIZE=2048 \
python -m sglang.launch_server \
--model /home/guqiong/Models/Qwen3.5-122B-A10B \
--served-model-name Qwen3.5-122B-A10B \
--host 0.0.0.0 \
--port 8070 \
--trust-remote-code \
--tensor-parallel-size 2 \
--max-running-requests 2 \
--enable-p2p-check \
--chunked-prefill-size 32000 \
--max-total-tokens 66000 \
--mem-fraction-static 0.90 \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3 \
--attention-backend triton \
--fp8-gemm-backend triton \
--kv-cache-dtype bf16 \
--disable-piecewise-cuda-graph
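Once the server comes up on port 8070 (as configured above), a quick sanity check can hit the server's routes. This sketch assumes the standard sglang `/health` route and the OpenAI-compatible `/v1/models` listing:

```shell
# Liveness probe against the launch above (port 8070)
curl -s --max-time 5 http://127.0.0.1:8070/health || echo "server not reachable"
# The served model name should appear in the OpenAI-compatible model list
curl -s --max-time 5 http://127.0.0.1:8070/v1/models || echo "server not reachable"
```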
How to Run Qwen3.5-397B-A17B
pip uninstall transformers -y
pip install transformers==5.3.0
# Drop the page cache so the model weights load into fresh memory, then confirm free memory
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
free -h
PYTORCH_ALLOC_CONF=expandable_segments:True \
SGLANG_FORCE_FP8_MARLIN=1 \
SGLANG_ENABLE_JIT_DEEPGEMM=0 \
NCCL_SOCKET_IFNAME=lo \
NCCL_IB_DISABLE=1 \
GLOO_SOCKET_IFNAME=lo \
NCCL_SOCKET_TIMEOUT=600000 \
LVLLM_MOE_NUMA_ENABLED=1 \
LK_THREAD_BINDING=CPU_CORE \
LK_THREADS=44 \
OMP_NUM_THREADS=44 \
LVLLM_MOE_USE_WEIGHT=INT4 \
LVLLM_ENABLE_NUMA_INTERLEAVE=1 \
LVLLM_MOE_QUANT_ON_GPU=1 \
LVLLM_GPU_PREFETCH_WINDOW=1 \
LVLLM_GPU_PREFILL_MIN_BATCH_SIZE=2048 \
python -m sglang.launch_server \
--model "/home/guqiong/Models/Qwen3.5-397B-A17B" \
--served-model-name "Qwen3.5-397B-A17B" \
--host 0.0.0.0 \
--port 8070 \
--trust-remote-code \
--tensor-parallel-size 2 \
--max-running-requests 2 \
--enable-p2p-check \
--chunked-prefill-size 32000 \
--max-total-tokens 66000 \
--mem-fraction-static 0.90 \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3 \
--attention-backend triton \
--fp8-gemm-backend triton \
--kv-cache-dtype bf16 \
--disable-piecewise-cuda-graph
# Multi-Token Prediction (MTP)
# --reasoning-parser qwen3 \
# --speculative-algo NEXTN \
# --speculative-num-steps 3 \
# --speculative-eagle-topk 1 \
# --speculative-num-draft-tokens 4 \
# Processing Ultra-Long Texts
# --json-model-override-args '{"text_config": {"rope_parameters": {"mrope_interleaved": true, "mrope_section": [11, 11, 10], "rope_type": "yarn", "rope_theta": 10000000, "partial_rotary_factor": 0.25, "factor": 4.0, "original_max_position_embeddings": 262144}}}'
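The string passed to --json-model-override-args must be valid JSON, and a quoting mistake only surfaces at launch. It can be checked beforehand; shown here with an abbreviated version of the yarn rope override above:

```shell
# Sanity-check the override JSON before handing it to --json-model-override-args
# (abbreviated version of the rope override shown above)
OVERRIDE='{"text_config": {"rope_parameters": {"rope_type": "yarn", "factor": 4.0, "original_max_position_embeddings": 262144}}}'
echo "$OVERRIDE" | python3 -m json.tool > /dev/null && echo "override JSON OK"
```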
How to Run MiniMax-M2.5
pip uninstall transformers -y
pip install transformers==5.3.0
PYTORCH_ALLOC_CONF=expandable_segments:True \
SGLANG_FORCE_FP8_MARLIN=1 \
SGLANG_ENABLE_JIT_DEEPGEMM=0 \
NCCL_SOCKET_IFNAME=lo \
NCCL_IB_DISABLE=1 \
GLOO_SOCKET_IFNAME=lo \
NCCL_SOCKET_TIMEOUT=600000 \
LVLLM_MOE_NUMA_ENABLED=1 \
LK_THREAD_BINDING=CPU_CORE \
LK_THREADS=44 \
OMP_NUM_THREADS=44 \
LVLLM_MOE_USE_WEIGHT=INT4 \
LVLLM_ENABLE_NUMA_INTERLEAVE=1 \
LVLLM_MOE_QUANT_ON_GPU=1 \
LVLLM_GPU_PREFETCH_WINDOW=1 \
LVLLM_GPU_PREFILL_MIN_BATCH_SIZE=2048 \
python -m sglang.launch_server \
--model "/home/guqiong/Models/MiniMax-M2.5" \
--served-model-name MiniMax-M2.5 \
--host 0.0.0.0 \
--port 8070 \
--trust-remote-code \
--tensor-parallel-size 2 \
--max-running-requests 2 \
--enable-p2p-check \
--chunked-prefill-size 32000 \
--max-total-tokens 66000 \
--mem-fraction-static 0.90 \
--tool-call-parser minimax-m2 \
--reasoning-parser minimax-append-think \
--attention-backend triton \
--fp8-gemm-backend triton \
--kv-cache-dtype bf16 \
--disable-piecewise-cuda-graph
# If performance is poor, try binding threads to NUMA nodes and reducing the thread count
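One way to act on the note above is to restrict the whole server process to a single NUMA node. numactl (if installed) is the usual tool; taskset (util-linux) pins to an explicit CPU list. Both commands below are generic Linux tools, not Lsglang features:

```shell
# Pin the server to NUMA node 0's CPUs and memory (requires the numactl package):
#   numactl --cpunodebind=0 --membind=0 python -m sglang.launch_server ...
# taskset pins to an explicit CPU list; a trivial demonstration on CPU 0:
taskset -c 0 echo "pinned to CPU 0"
```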
How to Run GLM-5.1-FP8
pip uninstall transformers -y
pip install transformers==5.3.0
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
free -h
PYTORCH_ALLOC_CONF=expandable_segments:True \
SGLANG_FORCE_FP8_MARLIN=1 \
SGLANG_ENABLE_JIT_DEEPGEMM=0 \
NCCL_SOCKET_IFNAME=lo \
NCCL_IB_DISABLE=1 \
GLOO_SOCKET_IFNAME=lo \
NCCL_SOCKET_TIMEOUT=600000 \
LVLLM_MOE_NUMA_ENABLED=1 \
LK_THREAD_BINDING=CPU_CORE \
LK_THREADS=44 \
OMP_NUM_THREADS=44 \
LVLLM_MOE_USE_WEIGHT=INT4 \
LVLLM_ENABLE_NUMA_INTERLEAVE=1 \
LVLLM_MOE_QUANT_ON_GPU=1 \
LVLLM_GPU_PREFETCH_WINDOW=1 \
LVLLM_GPU_PREFILL_MIN_BATCH_SIZE=2048 \
python -m sglang.launch_server \
--model "/home/guqiong/Models/GLM-5.1-FP8" \
--served-model-name "GLM-5.1-FP8" \
--host "0.0.0.0" \
--port "8070" \
--trust-remote-code \
--tensor-parallel-size 2 \
--enable-p2p-check \
--max-running-requests 2 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--chunked-prefill-size 32000 \
--max-total-tokens 32768 \
--mem-fraction-static 0.90 \
--attention-backend triton \
--fp8-gemm-backend triton \
--kv-cache-dtype bf16 \
--disable-piecewise-cuda-graph \
--disable-shared-experts-fusion
# --nsa-prefill-backend "tilelang" \
# --nsa-decode-backend "tilelang" \
How to Run Kimi K2.5
pip uninstall transformers -y
pip install transformers==5.3.0
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
free -h
PYTORCH_ALLOC_CONF=expandable_segments:True \
SGLANG_FORCE_FP8_MARLIN=1 \
SGLANG_ENABLE_JIT_DEEPGEMM=0 \
NCCL_SOCKET_IFNAME=lo \
NCCL_IB_DISABLE=1 \
GLOO_SOCKET_IFNAME=lo \
NCCL_SOCKET_TIMEOUT=600000 \
LVLLM_MOE_NUMA_ENABLED=1 \
LK_THREAD_BINDING=CPU_CORE \
LK_THREADS=44 \
OMP_NUM_THREADS=44 \
LVLLM_MOE_USE_WEIGHT=INT4 \
LVLLM_ENABLE_NUMA_INTERLEAVE=1 \
LVLLM_MOE_QUANT_ON_GPU=1 \
LVLLM_GPU_PREFETCH_WINDOW=1 \
LVLLM_GPU_PREFILL_MIN_BATCH_SIZE=2048 \
python -m sglang.launch_server \
--model "/home/guqiong/Models/Kimi-K2.5" \
--served-model-name "Kimi-K2.5" \
--host "0.0.0.0" \
--port "8070" \
--trust-remote-code \
--tensor-parallel-size 2 \
--enable-p2p-check \
--max-running-requests 2 \
--tool-call-parser kimi_k2 \
--reasoning-parser kimi_k2 \
--chunked-prefill-size 32000 \
--max-total-tokens 32768 \
--mem-fraction-static 0.90 \
--attention-backend triton \
--fp8-gemm-backend triton \
--kv-cache-dtype bf16 \
--disable-piecewise-cuda-graph
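To verify end-to-end generation after any of the launches above, a small chat request can be sent to the OpenAI-compatible endpoint. The "model" field must match --served-model-name (Kimi-K2.5 from this launch is shown):

```shell
# Minimal end-to-end generation test against the OpenAI-compatible API on port 8070
curl -s --max-time 60 http://127.0.0.1:8070/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Kimi-K2.5", "messages": [{"role": "user", "content": "Say hello."}], "max_tokens": 32}' \
  || echo "server not reachable"
```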
How to Run Qwen3-Coder-Next-FP8
pip uninstall transformers -y
pip install transformers==5.3.0
PYTORCH_ALLOC_CONF=expandable_segments:True \
SGLANG_FORCE_FP8_MARLIN=1 \
SGLANG_ENABLE_JIT_DEEPGEMM=0 \
NCCL_SOCKET_IFNAME=lo \
NCCL_IB_DISABLE=1 \
GLOO_SOCKET_IFNAME=lo \
NCCL_SOCKET_TIMEOUT=600000 \
LVLLM_MOE_NUMA_ENABLED=1 \
LK_THREAD_BINDING=CPU_CORE \
LK_THREADS=44 \
OMP_NUM_THREADS=44 \
LVLLM_MOE_USE_WEIGHT=INT4 \
LVLLM_ENABLE_NUMA_INTERLEAVE=1 \
LVLLM_MOE_QUANT_ON_GPU=1 \
LVLLM_GPU_PREFETCH_WINDOW=1 \
LVLLM_GPU_PREFILL_M
