# LvLLM GPU and NUMA Dual Parallelism [Chinese documentation]

LvLLM is an extension of vLLM that fully utilizes CPU and GPU computing resources through an efficient GPU-parallel + NUMA-parallel architecture, suited to hybrid inference of MoE models.
## System Features

- GPU + NUMA Dual Parallelism: supports three computing modes: CPU-GPU hybrid decoding, CPU-GPU hybrid prefill, and GPU prefill
- VRAM + Memory Load Balancing: the model footprint spans VRAM plus system memory (1 + 1 = 2), with 100% VRAM utilization <sup>Note 1</sup>
- GPU Prefill Optimization: GPU prefill runs in parallel with CPU-GPU hybrid decoding, achieving nearly 100% GPU utilization
- NUMA Thread Optimization: cross-node communication reduced to as low as 3%, L3 cache hit rate above 50%, GPU load of 33%-50% during decoding
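The launch examples in this document set `LK_THREADS` (and `OMP_NUM_THREADS`) to 44 on dual-socket EPYC machines. The right value depends on your NUMA topology; one plausible starting heuristic is "physical cores per NUMA node, minus a small reserve for serving threads". The helper below is a hypothetical sketch of that heuristic, not part of LvLLM:

```python
# Hypothetical helper (not part of LvLLM): suggest a starting value for
# LK_THREADS / OMP_NUM_THREADS from the machine's NUMA topology.
def suggest_lk_threads(physical_cores: int, numa_nodes: int, reserve: int = 4) -> int:
    """Physical cores per NUMA node, minus a small reserve for the
    serving/GPU-side threads. Treat the result as a starting point
    and tune empirically."""
    per_node = physical_cores // numa_nodes
    return max(1, per_node - reserve)

# e.g. a dual-socket machine with 96 physical cores split across 2 NUMA nodes
print(suggest_lk_threads(96, 2))  # -> 44
```

Inspect your actual topology with `numactl --hardware` or `lscpu` before settling on a value.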
## Relationship with vLLM

LvLLM tracks the latest vLLM source code and redesigns and reimplements the MoE hybrid-inference modules, while remaining fully compatible with vLLM<sup>Note 1</sup>.

Note 1: Requires an x86 CPU with the AVX2 (or newer) instruction set and an Nvidia GPU.
## Usage Instructions [Chinese documentation]
- Version Changes
- Supported Models
- Performance Reference
- How to Run gemma-4-26B-A4B-it
- How to Run NVIDIA-Nemotron-3-Super-120B-A12B-BF16
- How to Run Qwen3.5-122B-A10B
- How to Run Qwen3.5-397B-A17B
- How to Run MiniMax-M2.5
- How to Run Kimi-K2.5
- How to Run GLM-4.7-FP8
- Configuration Parameters
- Installation Steps
- Update
- Optimization Tips
## Version Changes
- 2026-03-22: lvllm-v2.0.0 - FP8 MoE models support layer-wise loading when quantizing INT4 experts, reducing peak memory usage (`LVLLM_ENABLE_MOE_LAYERWISEISE_LOAD=1`)
- 2026-03-19: lvllm-v1.9.10 - Fixed known issues; supports a new MoE model type without gate_proj, e.g. NVIDIA-Nemotron-3-Super-120B-A12B-BF16
- 2026-03-11: lvllm-v1.9.2 - FP8 and AWQ 4-bit MoE models enable GPU prefill acceleration without additional memory; FP8 MoE models drop the TO_DTYPE runtime type conversion; the KEEP mode does not yet support GPU prefill
- 2026-03-05: lvllm-v1.9.0 - Optimized GPU prefill and regular prefill to ensure output quality
- 2026-03-01: lvllm-v1.8.10 - Fixed known issues; added support for new models
- 2026-02-02: lvllm-v1.7.0 - Added expert-parallel (EP) support; running MiniMax-M2.1 on 8 GPUs requires `--enable_expert_parallel`
- 2026-01-26: lvllm-v1.6.1 - FP8 models support FP8 + INT4 inference and GPU prefill acceleration (high memory usage!)
- 2026-01-25: lvllm-v1.6.0 - FP8 models support GPU prefill acceleration (high memory usage!)
- 2026-01-24: lvllm-v1.5.8 - AWQ 4-bit symmetric quantized models support GPU prefill acceleration
- 2026-01-21: lvllm-v1.5.7 - Fixed numerical stability issues in the MiniMax-M2.1 model
- 2026-01-08: lvllm-v1.5.1 - For long-context scenarios, supports separating prefill from decoding; GPU prefill runs in parallel with CPU-GPU hybrid decoding
- 2026-01-04: lvllm-v1.4.0 - Optimized decode speed
- 2025-12-28: Optimized inference speed for bfloat16 and AWQ 4-bit; optimized NUMA data access for multi-GPU; enabled NUMA nodes for multi-GPU for best performance; removed GGUF model support
- 2025-12-16: lvllm-v1.2.0 - Synchronized upstream vLLM code to latest; optimized lk_moe to reduce memory usage
- 2025-12-14: lvllm-v1.1.2 - Added AWQ 4-bit symmetric quantized model inference support
- 2025-12-09: Added the `LVLLM_MOE_USE_WEIGHT` environment variable, letting MoE modules choose between two modes for FP8 model inference
- 2025-11-01: Supports tensor parallelism and pipeline multi-card inference https://b23.tv/xzHieMs
- 2025-10-30: Supports Qwen3-series GGUF hybrid inference (excluding Qwen3-Coder-30B-A3B-Instruct GGUF); check the new parameters in config.yaml
- 2025-10-19: FP8 GPU+NUMA hybrid inference for MoE models (FP8 precision in VRAM, FP16 precision in memory); verified with GLM-4.5-Air-FP8
- 2025-10-14: Enabled CUDA graph; decode speed doubled and output quality improved
- 2025-09-30: Verified: Qwen3-Next-80B-A3B-Instruct, Qwen3-Coder-30B-A3B-Instruct
## Supported Models

Most original MoE models already verified by vLLM are supported:

| Model Name | Status |
|---------|------|
| gemma-4-26B-A4B-it | ✅ Tested |
| NVIDIA-Nemotron-3-Super-120B-A12B-BF16 | ✅ Tested |
| Qwen3.5-35B-A3B | ✅ Tested |
| Qwen3.5-122B-A10B | ✅ Tested |
| Qwen3.5-397B-A17B | ✅ Tested |
| Qwen3-Coder-Next | ✅ Tested |
| Qwen3-Next-80B-A3B-Instruct | ✅ Tested |
| Qwen3-Coder-30B-A3B-Instruct | ✅ Tested |
| Qwen3-VL-30B-A3B-Instruct | ✅ Tested |
| MiniMax-M2.5 | ✅ Tested |
| MiniMax-M2.1 | ✅ Tested |
| GLM-4.7 | ✅ Tested |
| GLM-4.7-Flash | ✅ Tested |
| GLM-4.6V | ✅ Tested |
| Kimi k2.5 | ✅ Tested |

Unlisted original MoE models from the Qwen3, GLM, and MiniMax series are expected to work but have not yet been tested.
## Unsupported Models

| Model Name | Status |
|---------|------|
| DeepSeek-V3.2 | Pending |
## Supported Model Weight Formats and Runtime Formats

| Model File | Runtime Format |
|---------|------------|
| bfloat16 | bfloat16/float16 |
| float16 | bfloat16/float16 |
| fp8 model | fp8, fp8+int4 |
| awq 4-bit symmetric quantized model <sup>Note 1</sup> | int4 |

Note 1: https://hf-mirror.com/cyankiwi provides AWQ 4-bit symmetric quantized models.
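The weight-format table above can be expressed as a small lookup. This is purely an illustrative sketch of the table's contents, not an LvLLM API:

```python
# Illustrative sketch (not an LvLLM API): weight format -> runtime formats,
# as listed in the table above.
RUNTIME_FORMATS = {
    "bfloat16": ["bfloat16", "float16"],
    "float16": ["bfloat16", "float16"],
    "fp8": ["fp8", "fp8+int4"],
    "awq-4bit-symmetric": ["int4"],
}

def runtime_options(weight_format: str) -> list[str]:
    """Return the runtime formats available for a given weight format."""
    try:
        return RUNTIME_FORMATS[weight_format]
    except KeyError:
        raise ValueError(f"unsupported weight format: {weight_format}") from None

print(runtime_options("fp8"))  # -> ['fp8', 'fp8+int4']
```

Note that GGUF would raise here: GGUF support was removed in the 2025-12-28 release.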
## Performance Reference

| Model | Runtime Format | Prefill Speed (tokens/s) | Decode Speed (tokens/s) | CPU | GPU | Memory Speed |
|------|----------|---------------------|-------------------|----------|---------|---------|
| Qwen3-Next-80B-A3B-Instruct Original | bfloat16 | 15000 <sup>Note 1</sup> | 90 | Dual EPYC 9555ES | Single Nvidia RTX Pro 6000 | 6400MT/s |
| MiniMax-M2.1 Original | fp8+bfloat16 | 5000 <sup>Note 1</sup> | 29 | Dual EPYC 9684x | Single Nvidia RTX 5090 | 4800MT/s |
Note 1: Enabling GPU Prefill, Input Length 32K-64K
## How to Run gemma-4-26B-A4B-it
```shell
# This model needs transformers 5.5.0
pip uninstall transformers -y
pip install transformers==5.5.0

VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 \
VLLM_TEST_FORCE_FP8_MARLIN=1 \
NCCL_SOCKET_IFNAME=lo \
NCCL_IB_DISABLE=1 \
GLOO_SOCKET_IFNAME=lo \
NCCL_SOCKET_TIMEOUT=600000 \
VLLM_SKIP_P2P_CHECK=1 \
LVLLM_MOE_NUMA_ENABLED=1 \
LK_THREAD_BINDING=CPU_CORE \
LK_THREADS=44 \
OMP_NUM_THREADS=44 \
LVLLM_MOE_USE_WEIGHT=INT4 \
LVLLM_GPU_PREFETCH_WINDOW=1 \
LVLLM_GPU_PREFILL_MIN_BATCH_SIZE=2048 \
LVLLM_ENABLE_NUMA_INTERLEAVE=1 \
LVLLM_MOE_QUANT_ON_GPU=1 \
vllm serve \
--model /home/guqiong/Models/gemma-4-26B-A4B-it \
--host 0.0.0.0 \
--port 8070 \
--tensor-parallel-size 2 \
--max-model-len 160000 \
--gpu-memory-utilization 0.9046 \
--trust-remote-code \
--tokenizer-mode auto \
--served-model-name gemma-4-26B-A4B-it \
--compilation_config.cudagraph_mode FULL_DECODE_ONLY \
--enable-prefix-caching \
--enable-chunked-prefill \
--max-num-batched-tokens 32000 \
--max-num-seqs 2 \
--compilation_config.mode VLLM_COMPILE \
--enable-auto-tool-choice \
--reasoning-parser gemma4 \
--tool-call-parser gemma4
```
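Once the server is up it exposes vLLM's standard OpenAI-compatible API on the configured host and port. A minimal stdlib-only client sketch, using the model name and port from the command above (`build_chat_payload` and `post_chat` are hypothetical helpers, not part of LvLLM):

```python
import json
import urllib.request

# Build an OpenAI-style chat completion payload for the server started above.
def build_chat_payload(model: str, prompt: str, max_tokens: int = 256) -> dict:
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

# Hypothetical helper: POST the payload to the OpenAI-compatible endpoint.
def post_chat(payload: dict, base_url: str = "http://localhost:8070") -> dict:
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_chat_payload("gemma-4-26B-A4B-it", "Hello!")
print(json.dumps(payload, indent=2))
# To actually send (requires the server to be running):
# reply = post_chat(payload)
# print(reply["choices"][0]["message"]["content"])
```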
## How to Run NVIDIA-Nemotron-3-Super-120B-A12B-BF16
```shell
# This model needs transformers 4.57.6
pip uninstall transformers -y
pip install transformers==4.57.6

VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 \
VLLM_TEST_FORCE_FP8_MARLIN=1 \
NCCL_SOCKET_IFNAME=lo \
NCCL_IB_DISABLE=1 \
GLOO_SOCKET_IFNAME=lo \
NCCL_SOCKET_TIMEOUT=600000 \
VLLM_SKIP_P2P_CHECK=1 \
LVLLM_MOE_NUMA_ENABLED=1 \
LK_THREAD_BINDING=CPU_CORE \
LK_THREADS=44 \
OMP_NUM_THREADS=44 \
LVLLM_MOE_USE_WEIGHT=INT4 \
LVLLM_GPU_PREFETCH_WINDOW=1 \
LVLLM_GPU_PREFILL_MIN_BATCH_SIZE=2048 \
LVLLM_ENABLE_NUMA_INTERLEAVE=1 \
LVLLM_MOE_QUANT_ON_GPU=1 \
vllm serve \
--model /home/guqiong/Models/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 \
--host 0.0.0.0 \
--port 8070 \
--tensor-parallel-size 2 \
--max-model-len 52000 \
--gpu-memory-utilization 0.9046 \
--trust-remote-code \
--tokenizer-mode auto \
--served-model-name NVIDIA-Nemotron-3-Super-120B-A12B-BF16 \
--compilation_config.cudagraph_mode FULL_DECODE_ONLY \
--enable-prefix-caching \
--enable-chunked-prefill \
--max-num-batched-tokens 32000 \
--max-num-seqs 2 \
--compilation_config.mode VLLM_COMPILE \
--enable-auto-tool-choice \
--reasoning-parser qwen3 \
--tool-call-parser qwen3_coder
```
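All launch commands in this document export `LK_THREADS` and `OMP_NUM_THREADS` with the same value (44 on these machines). A tiny pre-flight check, sketched below as a hypothetical helper (not part of LvLLM), can catch a mismatch before launching:

```python
# Hypothetical pre-flight helper (not part of LvLLM): verify that the thread
# counts exported for the launch command agree with each other, since the
# documented launch commands always set them to the same value.
def check_thread_env(env: dict) -> list[str]:
    problems = []
    lk = env.get("LK_THREADS")
    omp = env.get("OMP_NUM_THREADS")
    if lk is None or omp is None:
        problems.append("LK_THREADS and OMP_NUM_THREADS must both be set")
    elif lk != omp:
        problems.append(f"LK_THREADS={lk} does not match OMP_NUM_THREADS={omp}")
    return problems

print(check_thread_env({"LK_THREADS": "44", "OMP_NUM_THREADS": "44"}))  # -> []
print(check_thread_env({"LK_THREADS": "44", "OMP_NUM_THREADS": "32"}))
```

In a launcher script you would pass `os.environ` instead of a literal dict.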
## How to Run Qwen3.5-122B-A10B
```shell
# Free the OS page cache and check available memory before loading the model
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
free -h

# This model needs transformers 4.57.6
pip uninstall transformers -y
pip install transformers==4.57.6

VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 \
VLLM_TEST_FORCE_FP8_MARLIN=1 \
NCCL_SOCKET_IFNAME=lo \
NCCL_IB_DISABLE=1 \
GLOO_SOCKET_IFNAME=lo \
NCCL_SOCKET_TIMEOUT=600000 \
VLLM_SKIP_P2P_CHECK=1 \
LVLLM_MOE_NUMA_ENABLED=1 \
LK_THREAD_BINDING=CPU_CORE \
LK_THREADS=44 \
OMP_NUM_THREADS=44 \
LVLLM_MOE_USE_WEIGHT=INT4 \
LVLLM_GPU_PREFETCH_WINDOW=1 \
LVLLM_GPU_PREFILL_MIN_BATCH_SIZE=2048 \
LVLLM_ENABLE_NUMA_INTERLEAVE=1 \
LVLLM_MOE_QUANT_ON_GPU=1 \
LVLLM_ENABLE_MOE_LAYERWISEISE_LOAD=1 \
vllm serve \
--model /home/guqiong/Models/Qwen3.5-122B-A10B \
--host 0.0.0.0 \
--port 8070 \
--tensor-parallel-size 2 \
--max-model-len 40000 \
--gpu-memory-utilization 0.9046 \
--trust-remote-code \
--tokenizer-mode auto \
--served-model-name Qwen3.5-122B-A10B \
--compilation_config.cudagraph_mode FULL_DECODE_ONLY \
--enable-prefix-caching \
--enable-chunked-prefill \
--max-num-batched-tokens 16384 \
--max-num-seqs 2 \
--compilation_config.mode VLLM_COMPILE \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3
```
## How to Run Qwen3.5-397B-A17B

```shell
# Free the OS page cache and check available memory before loading the model
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
free -h

pip uninstall transformers -y
```
