# LvLLM GPU and NUMA Dual Parallelism [Chinese documentation]

LvLLM is an extension of vLLM that fully utilizes CPU and GPU computing resources through an efficient GPU-parallel + NUMA-parallel architecture, suited to hybrid inference of MoE models.
## System Features

- GPU + NUMA Dual Parallelism: supports three computing modes: CPU-GPU hybrid decoding, CPU-GPU hybrid prefill, and GPU prefill
- VRAM + Memory Load Balancing: the model footprint spans VRAM plus system memory (1 + 1 = 2), with 100% VRAM utilization <sup>Note 1</sup>
- GPU Prefill Optimization: GPU prefill runs in parallel with CPU-GPU hybrid decoding, achieving nearly 100% GPU utilization
- NUMA Thread Optimization: cross-node communication reduced to as low as 3%, L3 cache hit rate above 50%, GPU load of 33%-50% during decoding
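The launch examples in this document set `LK_THREADS` (and `OMP_NUM_THREADS`) to 44 on dual-socket EPYC machines. The right value depends on your NUMA topology; one plausible starting heuristic is "physical cores per NUMA node, minus a small reserve for serving threads". The helper below is a hypothetical sketch of that heuristic, not part of LvLLM:

```python
# Hypothetical helper (not part of LvLLM): suggest a starting value for
# LK_THREADS / OMP_NUM_THREADS from the machine's NUMA topology.
def suggest_lk_threads(physical_cores: int, numa_nodes: int, reserve: int = 4) -> int:
    """Physical cores per NUMA node, minus a small reserve for the
    serving/GPU-side threads. Treat the result as a starting point
    and tune empirically."""
    per_node = physical_cores // numa_nodes
    return max(1, per_node - reserve)

# e.g. a dual-socket machine with 96 physical cores split across 2 NUMA nodes
print(suggest_lk_threads(96, 2))  # -> 44
```

Inspect your actual topology with `numactl --hardware` or `lscpu` before settling on a value.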
## Relationship with vLLM

LvLLM tracks the latest vLLM source code and redesigns and reimplements the MoE hybrid-inference modules, while remaining fully compatible with vLLM<sup>Note 1</sup>.

Note 1: Requires an x86 CPU with the AVX2 (or newer) instruction set and an Nvidia GPU.
## Usage Instructions [Chinese documentation]
- Version Changes
- Supported Models
- Performance Reference
- How to Run gemma-4-26B-A4B-it
- How to Run NVIDIA-Nemotron-3-Super-120B-A12B-BF16
- How to Run Qwen3.5-122B-A10B
- How to Run Qwen3.5-397B-A17B
- How to Run MiniMax-M2.5
- How to Run Kimi-K2.5
- How to Run GLM-4.7-FP8
- Configuration Parameters
- Installation Steps
- Update
- Optimization Tips
## Version Changes
- 2026-03-22: lvllm-v2.0.0 - FP8 MoE models support layer-wise loading when quantizing INT4 experts, reducing peak memory usage (`LVLLM_ENABLE_MOE_LAYERWISEISE_LOAD=1`)
- 2026-03-19: lvllm-v1.9.10 - Fixed known issues; supports a new MoE model type without gate_proj, e.g. NVIDIA-Nemotron-3-Super-120B-A12B-BF16
- 2026-03-11: lvllm-v1.9.2 - FP8 and AWQ 4-bit MoE models enable GPU prefill acceleration without additional memory; FP8 MoE models drop the TO_DTYPE runtime type conversion; the KEEP mode does not yet support GPU prefill
- 2026-03-05: lvllm-v1.9.0 - Optimized GPU prefill and regular prefill to ensure output quality
- 2026-03-01: lvllm-v1.8.10 - Fixed known issues; added support for new models
- 2026-02-02: lvllm-v1.7.0 - Added expert-parallel (EP) support; running MiniMax-M2.1 on 8 GPUs requires `--enable_expert_parallel`
- 2026-01-26: lvllm-v1.6.1 - FP8 models support FP8 + INT4 inference and GPU prefill acceleration (high memory usage!)
- 2026-01-25: lvllm-v1.6.0 - FP8 models support GPU prefill acceleration (high memory usage!)
- 2026-01-24: lvllm-v1.5.8 - AWQ 4-bit symmetric quantized models support GPU prefill acceleration
- 2026-01-21: lvllm-v1.5.7 - Fixed numerical stability issues in the MiniMax-M2.1 model
- 2026-01-08: lvllm-v1.5.1 - For long-context scenarios, supports separating prefill from decoding; GPU prefill runs in parallel with CPU-GPU hybrid decoding
- 2026-01-04: lvllm-v1.4.0 - Optimized decode speed
- 2025-12-28: Optimized inference speed for bfloat16 and AWQ 4-bit; optimized NUMA data access for multi-GPU; enabled NUMA nodes for multi-GPU for best performance; removed GGUF model support
- 2025-12-16: lvllm-v1.2.0 - Synchronized upstream vLLM code to latest; optimized lk_moe to reduce memory usage
- 2025-12-14: lvllm-v1.1.2 - Added AWQ 4-bit symmetric quantized model inference support
- 2025-12-09: Added the `LVLLM_MOE_USE_WEIGHT` environment variable, letting MoE modules choose between two modes for FP8 model inference
- 2025-11-01: Supports tensor parallelism and pipeline multi-card inference https://b23.tv/xzHieMs
- 2025-10-30: Supports Qwen3-series GGUF hybrid inference (excluding Qwen3-Coder-30B-A3B-Instruct GGUF); check the new parameters in config.yaml
- 2025-10-19: FP8 GPU+NUMA hybrid inference for MoE models (FP8 precision in VRAM, FP16 precision in memory); verified with GLM-4.5-Air-FP8
- 2025-10-14: Enabled CUDA graph; decode speed doubled and output quality improved
- 2025-09-30: Verified: Qwen3-Next-80B-A3B-Instruct, Qwen3-Coder-30B-A3B-Instruct
## Supported Models

Most original MoE models already verified by vLLM are supported:

| Model Name | Status |
|---------|------|
| gemma-4-26B-A4B-it | ✅ Tested |
| NVIDIA-Nemotron-3-Super-120B-A12B-BF16 | ✅ Tested |
| Qwen3.5-35B-A3B | ✅ Tested |
| Qwen3.5-122B-A10B | ✅ Tested |
| Qwen3.5-397B-A17B | ✅ Tested |
| Qwen3-Coder-Next | ✅ Tested |
| Qwen3-Next-80B-A3B-Instruct | ✅ Tested |
| Qwen3-Coder-30B-A3B-Instruct | ✅ Tested |
| Qwen3-VL-30B-A3B-Instruct | ✅ Tested |
| MiniMax-M2.5 | ✅ Tested |
| MiniMax-M2.1 | ✅ Tested |
| GLM-4.7 | ✅ Tested |
| GLM-4.7-Flash | ✅ Tested |
| GLM-4.6V | ✅ Tested |
| Kimi k2.5 | ✅ Tested |

Unlisted original MoE models from the Qwen3, GLM, and MiniMax series are expected to work but have not yet been tested.
## Unsupported Models

| Model Name | Status |
|---------|------|
| DeepSeek-V3.2 | Pending |
## Supported Model Weight Formats and Runtime Formats

| Model File | Runtime Format |
|---------|------------|
| bfloat16 | bfloat16/float16 |
| float16 | bfloat16/float16 |
| fp8 model | fp8, fp8+int4 |
| awq 4-bit symmetric quantized model <sup>Note 1</sup> | int4 |

Note 1: https://hf-mirror.com/cyankiwi provides AWQ 4-bit symmetric quantized models.
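The weight-format table above can be expressed as a small lookup. This is purely an illustrative sketch of the table's contents, not an LvLLM API:

```python
# Illustrative sketch (not an LvLLM API): weight format -> runtime formats,
# as listed in the table above.
RUNTIME_FORMATS = {
    "bfloat16": ["bfloat16", "float16"],
    "float16": ["bfloat16", "float16"],
    "fp8": ["fp8", "fp8+int4"],
    "awq-4bit-symmetric": ["int4"],
}

def runtime_options(weight_format: str) -> list[str]:
    """Return the runtime formats available for a given weight format."""
    try:
        return RUNTIME_FORMATS[weight_format]
    except KeyError:
        raise ValueError(f"unsupported weight format: {weight_format}") from None

print(runtime_options("fp8"))  # -> ['fp8', 'fp8+int4']
```

Note that GGUF would raise here: GGUF support was removed in the 2025-12-28 release.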
## Performance Reference

| Model | Runtime Format | Prefill Speed (tokens/s) | Decode Speed (tokens/s) | CPU | GPU | Memory Speed |
|------|----------|---------------------|-------------------|----------|---------|---------|
| Qwen3-Next-80B-A3B-Instruct Original | bfloat16 | 15000 <sup>Note 1</sup> | 90 | Dual EPYC 9555ES | Single Nvidia RTX Pro 6000 | 6400MT/s |
| MiniMax-M2.1 Original | fp8+bfloat16 | 5000 <sup>Note 1</sup> | 29 | Dual EPYC 9684x | Single Nvidia RTX 5090 | 4800MT/s |
Note 1: Enabling GPU Prefill, Input Length 32K-64K
## How to Run gemma-4-26B-A4B-it
```shell
# This model needs transformers 5.5.0
pip uninstall transformers -y
pip install transformers==5.5.0

VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 \
VLLM_TEST_FORCE_FP8_MARLIN=1 \
NCCL_SOCKET_IFNAME=lo \
NCCL_IB_DISABLE=1 \
GLOO_SOCKET_IFNAME=lo \
NCCL_SOCKET_TIMEOUT=600000 \
VLLM_SKIP_P2P_CHECK=1 \
LVLLM_MOE_NUMA_ENABLED=1 \
LK_THREAD_BINDING=CPU_CORE \
LK_THREADS=44 \
OMP_NUM_THREADS=44 \
LVLLM_MOE_USE_WEIGHT=INT4 \
LVLLM_GPU_PREFETCH_WINDOW=1 \
LVLLM_GPU_PREFILL_MIN_BATCH_SIZE=2048 \
LVLLM_ENABLE_NUMA_INTERLEAVE=1 \
LVLLM_MOE_QUANT_ON_GPU=1 \
vllm serve \
--model /home/guqiong/Models/gemma-4-26B-A4B-it \
--host 0.0.0.0 \
--port 8070 \
--tensor-parallel-size 2 \
--max-model-len 160000 \
--gpu-memory-utilization 0.9046 \
--trust-remote-code \
--tokenizer-mode auto \
--served-model-name gemma-4-26B-A4B-it \
--compilation_config.cudagraph_mode FULL_DECODE_ONLY \
--enable-prefix-caching \
--enable-chunked-prefill \
--max-num-batched-tokens 32000 \
--max-num-seqs 2 \
--compilation_config.mode VLLM_COMPILE \
--enable-auto-tool-choice \
--reasoning-parser gemma4 \
--tool-call-parser gemma4
```
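Once the server is up it exposes vLLM's standard OpenAI-compatible API on the configured host and port. A minimal stdlib-only client sketch, using the model name and port from the command above (`build_chat_payload` and `post_chat` are hypothetical helpers, not part of LvLLM):

```python
import json
import urllib.request

# Build an OpenAI-style chat completion payload for the server started above.
def build_chat_payload(model: str, prompt: str, max_tokens: int = 256) -> dict:
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

# Hypothetical helper: POST the payload to the OpenAI-compatible endpoint.
def post_chat(payload: dict, base_url: str = "http://localhost:8070") -> dict:
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_chat_payload("gemma-4-26B-A4B-it", "Hello!")
print(json.dumps(payload, indent=2))
# To actually send (requires the server to be running):
# reply = post_chat(payload)
# print(reply["choices"][0]["message"]["content"])
```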
## How to Run NVIDIA-Nemotron-3-Super-120B-A12B-BF16
```shell
# This model needs transformers 4.57.6
pip uninstall transformers -y
pip install transformers==4.57.6

VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 \
VLLM_TEST_FORCE_FP8_MARLIN=1 \
NCCL_SOCKET_IFNAME=lo \
NCCL_IB_DISABLE=1 \
GLOO_SOCKET_IFNAME=lo \
NCCL_SOCKET_TIMEOUT=600000 \
VLLM_SKIP_P2P_CHECK=1 \
LVLLM_MOE_NUMA_ENABLED=1 \
LK_THREAD_BINDING=CPU_CORE \
LK_THREADS=44 \
OMP_NUM_THREADS=44 \
LVLLM_MOE_USE_WEIGHT=INT4 \
LVLLM_GPU_PREFETCH_WINDOW=1 \
LVLLM_GPU_PREFILL_MIN_BATCH_SIZE=2048 \
LVLLM_ENABLE_NUMA_INTERLEAVE=1 \
LVLLM_MOE_QUANT_ON_GPU=1 \
vllm serve \
--model /home/guqiong/Models/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 \
--host 0.0.0.0 \
--port 8070 \
--tensor-parallel-size 2 \
--max-model-len 52000 \
--gpu-memory-utilization 0.9046 \
--trust-remote-code \
--tokenizer-mode auto \
--served-model-name NVIDIA-Nemotron-3-Super-120B-A12B-BF16 \
--compilation_config.cudagraph_mode FULL_DECODE_ONLY \
--enable-prefix-caching \
--enable-chunked-prefill \
--max-num-batched-tokens 32000 \
--max-num-seqs 2 \
--compilation_config.mode VLLM_COMPILE \
--enable-auto-tool-choice \
--reasoning-parser qwen3 \
--tool-call-parser qwen3_coder
```
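All launch commands in this document export `LK_THREADS` and `OMP_NUM_THREADS` with the same value (44 on these machines). A tiny pre-flight check, sketched below as a hypothetical helper (not part of LvLLM), can catch a mismatch before launching:

```python
# Hypothetical pre-flight helper (not part of LvLLM): verify that the thread
# counts exported for the launch command agree with each other, since the
# documented launch commands always set them to the same value.
def check_thread_env(env: dict) -> list[str]:
    problems = []
    lk = env.get("LK_THREADS")
    omp = env.get("OMP_NUM_THREADS")
    if lk is None or omp is None:
        problems.append("LK_THREADS and OMP_NUM_THREADS must both be set")
    elif lk != omp:
        problems.append(f"LK_THREADS={lk} does not match OMP_NUM_THREADS={omp}")
    return problems

print(check_thread_env({"LK_THREADS": "44", "OMP_NUM_THREADS": "44"}))  # -> []
print(check_thread_env({"LK_THREADS": "44", "OMP_NUM_THREADS": "32"}))
```

In a launcher script you would pass `os.environ` instead of a literal dict.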
## How to Run Qwen3.5-122B-A10B
```shell
# Free the OS page cache and check available memory before loading the model
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
free -h

# This model needs transformers 4.57.6
pip uninstall transformers -y
pip install transformers==4.57.6

VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 \
VLLM_TEST_FORCE_FP8_MARLIN=1 \
NCCL_SOCKET_IFNAME=lo \
NCCL_IB_DISABLE=1 \
GLOO_SOCKET_IFNAME=lo \
NCCL_SOCKET_TIMEOUT=600000 \
VLLM_SKIP_P2P_CHECK=1 \
LVLLM_MOE_NUMA_ENABLED=1 \
LK_THREAD_BINDING=CPU_CORE \
LK_THREADS=44 \
OMP_NUM_THREADS=44 \
LVLLM_MOE_USE_WEIGHT=INT4 \
LVLLM_GPU_PREFETCH_WINDOW=1 \
LVLLM_GPU_PREFILL_MIN_BATCH_SIZE=2048 \
LVLLM_ENABLE_NUMA_INTERLEAVE=1 \
LVLLM_MOE_QUANT_ON_GPU=1 \
LVLLM_ENABLE_MOE_LAYERWISEISE_LOAD=1 \
vllm serve \
--model /home/guqiong/Models/Qwen3.5-122B-A10B \
--host 0.0.0.0 \
--port 8070 \
--tensor-parallel-size 2 \
--max-model-len 40000 \
--gpu-memory-utilization 0.9046 \
--trust-remote-code \
--tokenizer-mode auto \
--served-model-name Qwen3.5-122B-A10B \
--compilation_config.cudagraph_mode FULL_DECODE_ONLY \
--enable-prefix-caching \
--enable-chunked-prefill \
--max-num-batched-tokens 16384 \
--max-num-seqs 2 \
--compilation_config.mode VLLM_COMPILE \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3
```
## How to Run Qwen3.5-397B-A17B

```shell
# Free the OS page cache and check available memory before loading the model
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
free -h

pip uninstall transformers -y
```
