
LvLLM

LvLLM is a NUMA-aware extension of vLLM that makes full use of CPU and memory resources, reduces GPU memory requirements, and features an efficient GPU-parallel plus NUMA-parallel architecture, supporting hybrid inference for large MoE models.

Install / Use

/learn @guqiong96/Lvllm
Supported Platforms: Universal

README

LvLLM: GPU and NUMA Dual Parallelism [Chinese README]

LvLLM is a special extension of vLLM that fully utilizes CPU and GPU computing resources with an efficient GPU-parallel + NUMA-parallel architecture, suited to hybrid inference of MoE models.

System Features

  • GPU + NUMA Dual Parallelism: three computing modes: CPU-GPU hybrid decode, CPU-GPU hybrid prefill, and GPU prefill
  • VRAM + Memory Load Balancing: the model's total footprint spans VRAM plus system memory ("1 + 1 = 2"), with 100% VRAM utilization <sup>Note 1</sup>
  • GPU Prefill Optimization: GPU prefill runs in parallel with CPU-GPU hybrid decoding, achieving nearly 100% GPU utilization
  • NUMA Thread Optimization: cross-node communication reduced to as low as 3%, L3 cache hit rate above 50%, and 33%-50% GPU load during decoding
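The NUMA-parallel side can be pictured as one worker pool per NUMA node, with each worker pinned to that node's cores (what `LK_THREAD_BINDING=CPU_CORE` suggests), so a worker's memory traffic stays node-local. A minimal, hypothetical sketch of such pinning — `partition_cpus` and `bind_worker` are illustrative names, not LvLLM APIs:

```python
import os

def partition_cpus(cpu_ids, num_nodes):
    # Split the core list into one contiguous block per NUMA node
    # (real code would read the topology from /sys or libnuma).
    per_node = len(cpu_ids) // num_nodes
    return [cpu_ids[i * per_node:(i + 1) * per_node] for i in range(num_nodes)]

def bind_worker(node_cpus):
    # Pin the calling process/thread to one node's cores so its
    # allocations and accesses stay on that node (Linux-only call).
    os.sched_setaffinity(0, set(node_cpus))

nodes = partition_cpus(list(range(8)), 2)
print(nodes)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

Keeping each worker on one node is what drives the low cross-node communication figure above.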

Relationship with vLLM

LvLLM tracks the latest vLLM source code and redesigns and reimplements the MoE hybrid-inference modules, while remaining fully compatible with vLLM<sup>Note 1</sup>.

Note 1: Requires an x86 CPU with the AVX2 (or newer) instruction set and an Nvidia GPU.
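On Linux, the Note 1 CPU requirement can be checked by grepping the flags line of /proc/cpuinfo; the sample string below stands in for a real machine's flags:

```shell
# Check for the AVX2 flag required by Note 1. The sample line stands in
# for a real /proc/cpuinfo; swap it out on actual hardware.
sample_flags="flags : fpu sse4_2 avx avx2 fma"
if echo "$sample_flags" | grep -q avx2; then
  echo "AVX2 supported"
fi
# On a real machine: grep -q avx2 /proc/cpuinfo && echo "AVX2 supported"
```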

Usage Instructions [Chinese README]

Version Changes

2026-03-22: lvllm-v2.0.0 - FP8 MoE models support layer-wise loading when quantizing INT4 experts, reducing peak memory usage; enable with LVLLM_ENABLE_MOE_LAYERWISEISE_LOAD=1
2026-03-19: lvllm-v1.9.10 - Fixed known issues; supports the new MoE model type without gate_proj, e.g. NVIDIA-Nemotron-3-Super-120B-A12B-BF16
2026-03-11: lvllm-v1.9.2 - FP8 and AWQ 4-bit MoE models enable GPU Prefill acceleration without additional memory use; FP8 MoE models drop the TO_DTYPE runtime type conversion; the KEEP mode temporarily does not support GPU Prefill
2026-03-05: lvllm-v1.9.0 - Optimized GPU prefill and regular prefill to ensure output quality
2026-03-01: lvllm-v1.8.10 - Fixed known issues; supports new models
2026-02-02: lvllm-v1.7.0 - Support for expert parallelism (EP); running the MiniMax-M2.1 model on 8 GPUs requires --enable_expert_parallel
2026-01-26: lvllm-v1.6.1 - FP8 models support FP8 + INT4 inference and GPU Prefill acceleration (high memory usage!)
2026-01-25: lvllm-v1.6.0 - FP8 models support GPU Prefill acceleration (high memory usage!)
2026-01-24: lvllm-v1.5.8 - AWQ 4-bit symmetric quantized models support GPU Prefill acceleration
2026-01-21: lvllm-v1.5.7 - Fixed numerical stability issues in the MiniMax-M2.1 model
2026-01-08: lvllm-v1.5.1 - For long-context scenarios, supports separating prefill from decoding; GPU prefill runs in parallel with CPU-GPU hybrid decoding
2026-01-04: lvllm-v1.4.0 - Optimized decode speed
2025-12-28: Optimized inference speed for bfloat16 and AWQ 4-bit; optimized NUMA data access for multi-GPU setups (enable NUMA nodes for best multi-GPU performance); removed GGUF model support
2025-12-16: lvllm-v1.2.0 - Synchronized with the latest upstream vLLM code; optimized lk_moe to reduce memory usage
2025-12-14: lvllm-v1.1.2 - Added AWQ 4-bit symmetric quantized model inference support
2025-12-09: Added the LVLLM_MOE_USE_WEIGHT environment variable, letting MoE modules choose between two modes for FP8 model inference
2025-11-01: Supports tensor parallelism and pipeline multi-GPU inference https://b23.tv/xzHieMs
2025-10-30: Supports Qwen3-series GGUF hybrid inference (excluding Qwen3-Coder-30B-A3B-Instruct GGUF) [check the new parameters in config.yaml]
2025-10-19: FP8 supports GPU + NUMA hybrid inference for MoE models [VRAM at FP8 precision, memory at FP16 precision]; verified with GLM-4.5-Air-FP8
2025-10-14: Enabled CUDA graphs: decode speed doubled and output quality improved
2025-09-30: Verified Qwen3-Next-80B-A3B-Instruct and Qwen3-Coder-30B-A3B-Instruct

Supported Models

Most of the original MoE models already verified by vLLM.

| Model Name | Status |
|---------|------|
| gemma-4-26B-A4B-it | ✅ Tested |
| NVIDIA-Nemotron-3-Super-120B-A12B-BF16 | ✅ Tested |
| Qwen3.5-35B-A3B | ✅ Tested |
| Qwen3.5-122B-A10B | ✅ Tested |
| Qwen3.5-397B-A17B | ✅ Tested |
| Qwen3-Coder-Next | ✅ Tested |
| Qwen3-Next-80B-A3B-Instruct | ✅ Tested |
| Qwen3-Coder-30B-A3B-Instruct | ✅ Tested |
| Qwen3-VL-30B-A3B-Instruct | ✅ Tested |
| MiniMax-M2.5 | ✅ Tested |
| MiniMax-M2.1 | ✅ Tested |
| GLM-4.7 | ✅ Tested |
| GLM-4.7-Flash | ✅ Tested |
| GLM-4.6V | ✅ Tested |
| Kimi k2.5 | ✅ Tested |

Unlisted original MoE models from the Qwen3, GLM, and MiniMax series are theoretically supported, pending actual testing.

Unsupported Models

| Model Name | Status |
|---------|------|
| DeepSeek-V3.2 | Pending |

Supported Model Weight Formats and Runtime Formats

| Model File | Runtime Format |
|---------|------------|
| bfloat16 | bfloat16/float16 |
| float16 | bfloat16/float16 |
| fp8 model | fp8, fp8+int4 |
| awq 4bit symmetric quantized model <sup>Note 1</sup> | int4 |

Note 1: https://hf-mirror.com/cyankiwi provides AWQ 4bit symmetric quantized models
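The int4 runtime format in the table corresponds to symmetric 4-bit quantization: one scale per weight group maps values into the range [-8, 7] with no zero-point. A toy per-group sketch of the idea (not LvLLM's actual kernel):

```python
def quantize_int4_symmetric(weights):
    # Symmetric INT4: a single scale maps max |w| onto 7; the
    # representable range is [-8, 7] with zero-point 0.
    scale = max(abs(w) for w in weights) / 7.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    # Recover approximate float weights from the INT4 codes.
    return [v * scale for v in q]

w = [0.3, -0.7, 0.1, 0.0]
q, s = quantize_int4_symmetric(w)
print(q)  # [3, -7, 1, 0]
```

Because the scheme is symmetric, dequantization is a single multiply per weight, which is what makes it cheap enough for the CPU side of hybrid inference.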

Performance Reference

| Model | Runtime Format | Prefill Speed (tokens/s) | Decode Speed (tokens/s) | CPU | GPU | Memory Speed |
|------|----------|---------------------|-------------------|----------|---------|---------|
| Qwen3-Next-80B-A3B-Instruct (original) | bfloat16 | 15000 <sup>Note 1</sup> | 90 | Dual EPYC 9555ES | Single Nvidia RTX Pro 6000 | 6400 MT/s |
| MiniMax-M2.1 (original) | fp8+bfloat16 | 5000 <sup>Note 1</sup> | 29 | Dual EPYC 9684x | Single Nvidia RTX 5090 | 4800 MT/s |

Note 1: With GPU Prefill enabled and an input length of 32K-64K.
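To put the table's figures in time terms, a back-of-envelope estimate for the Qwen3-Next-80B-A3B-Instruct row (15000 tok/s prefill, 90 tok/s decode) on a 32K-token prompt:

```python
# Rough request latency from the Qwen3-Next-80B-A3B-Instruct row:
# 15000 tok/s GPU prefill, 90 tok/s CPU-GPU hybrid decode.
prompt_tokens, output_tokens = 32_000, 1_000
prefill_s = prompt_tokens / 15_000      # time to ingest the prompt
decode_s = output_tokens / 90           # time to generate the reply
print(round(prefill_s, 1), round(decode_s, 1))  # 2.1 11.1
```

Decode dominates end-to-end latency even at this prompt length, which is why GPU prefill running in parallel with hybrid decoding matters for throughput.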

How to Run gemma-4-26B-A4B-it

```shell
pip uninstall transformers -y
pip install transformers==5.5.0
```

```shell
VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 \
VLLM_TEST_FORCE_FP8_MARLIN=1 \
NCCL_SOCKET_IFNAME=lo \
NCCL_IB_DISABLE=1 \
GLOO_SOCKET_IFNAME=lo \
NCCL_SOCKET_TIMEOUT=600000 \
VLLM_SKIP_P2P_CHECK=1 \
LVLLM_MOE_NUMA_ENABLED=1 \
LK_THREAD_BINDING=CPU_CORE \
LK_THREADS=44 \
OMP_NUM_THREADS=44 \
LVLLM_MOE_USE_WEIGHT=INT4 \
LVLLM_GPU_PREFETCH_WINDOW=1 \
LVLLM_GPU_PREFILL_MIN_BATCH_SIZE=2048 \
LVLLM_ENABLE_NUMA_INTERLEAVE=1 \
LVLLM_MOE_QUANT_ON_GPU=1 \
vllm serve \
    --model /home/guqiong/Models/gemma-4-26B-A4B-it \
    --host 0.0.0.0 \
    --port 8070 \
    --tensor-parallel-size 2 \
    --max-model-len 160000 \
    --gpu-memory-utilization 0.9046 \
    --trust-remote-code \
    --tokenizer-mode auto \
    --served-model-name gemma-4-26B-A4B-it \
    --compilation_config.cudagraph_mode FULL_DECODE_ONLY \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --max-num-batched-tokens 32000 \
    --max-num-seqs 2 \
    --compilation_config.mode VLLM_COMPILE \
    --enable-auto-tool-choice \
    --reasoning-parser gemma4 \
    --tool-call-parser gemma4
```
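Once the server above is up, it speaks vLLM's OpenAI-compatible API on port 8070. A minimal client sketch — the endpoint path and payload shape follow the standard OpenAI chat-completions convention; only the model name and port come from the command above:

```python
import json
from urllib import request

# Chat-completions payload for the server started above.
payload = {
    "model": "gemma-4-26B-A4B-it",  # must match --served-model-name
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 64,
}
body = json.dumps(payload).encode()
print(payload["model"])

# Send it once the server is running:
# req = request.Request(
#     "http://localhost:8070/v1/chat/completions",
#     data=body, headers={"Content-Type": "application/json"})
# print(request.urlopen(req).read().decode())
```

The same client works unchanged for the other serve commands below; only `model` (and the port, if changed) differ.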

How to Run NVIDIA-Nemotron-3-Super-120B-A12B-BF16

```shell
pip uninstall transformers -y
pip install transformers==4.57.6
```

```shell
VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 \
VLLM_TEST_FORCE_FP8_MARLIN=1 \
NCCL_SOCKET_IFNAME=lo \
NCCL_IB_DISABLE=1 \
GLOO_SOCKET_IFNAME=lo \
NCCL_SOCKET_TIMEOUT=600000 \
VLLM_SKIP_P2P_CHECK=1 \
LVLLM_MOE_NUMA_ENABLED=1 \
LK_THREAD_BINDING=CPU_CORE \
LK_THREADS=44 \
OMP_NUM_THREADS=44 \
LVLLM_MOE_USE_WEIGHT=INT4 \
LVLLM_GPU_PREFETCH_WINDOW=1 \
LVLLM_GPU_PREFILL_MIN_BATCH_SIZE=2048 \
LVLLM_ENABLE_NUMA_INTERLEAVE=1 \
LVLLM_MOE_QUANT_ON_GPU=1 \
vllm serve \
    --model /home/guqiong/Models/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 \
    --host 0.0.0.0 \
    --port 8070 \
    --tensor-parallel-size 2 \
    --max-model-len 52000 \
    --gpu-memory-utilization 0.9046 \
    --trust-remote-code \
    --tokenizer-mode auto \
    --served-model-name NVIDIA-Nemotron-3-Super-120B-A12B-BF16 \
    --compilation_config.cudagraph_mode FULL_DECODE_ONLY \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --max-num-batched-tokens 32000 \
    --max-num-seqs 2 \
    --compilation_config.mode VLLM_COMPILE \
    --enable-auto-tool-choice \
    --reasoning-parser qwen3 \
    --tool-call-parser qwen3_coder
```

How to Run Qwen3.5-122B-A10B

```shell
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
free -h
```

```shell
pip uninstall transformers
pip install transformers==4.57.6
```

```shell
VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 \
VLLM_TEST_FORCE_FP8_MARLIN=1 \
NCCL_SOCKET_IFNAME=lo \
NCCL_IB_DISABLE=1 \
GLOO_SOCKET_IFNAME=lo \
NCCL_SOCKET_TIMEOUT=600000 \
VLLM_SKIP_P2P_CHECK=1 \
LVLLM_MOE_NUMA_ENABLED=1 \
LK_THREAD_BINDING=CPU_CORE \
LK_THREADS=44 \
OMP_NUM_THREADS=44 \
LVLLM_MOE_USE_WEIGHT=INT4 \
LVLLM_GPU_PREFETCH_WINDOW=1 \
LVLLM_GPU_PREFILL_MIN_BATCH_SIZE=2048 \
LVLLM_ENABLE_NUMA_INTERLEAVE=1 \
LVLLM_MOE_QUANT_ON_GPU=1 \
LVLLM_ENABLE_MOE_LAYERWISEISE_LOAD=1 \
vllm serve \
    --model /home/guqiong/Models/Qwen3.5-122B-A10B \
    --host 0.0.0.0 \
    --port 8070 \
    --tensor-parallel-size 2 \
    --max-model-len 40000 \
    --gpu-memory-utilization 0.9046 \
    --trust-remote-code \
    --tokenizer-mode auto \
    --served-model-name Qwen3.5-122B-A10B \
    --compilation_config.cudagraph_mode FULL_DECODE_ONLY \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --max-num-batched-tokens 16384 \
    --max-num-seqs 2 \
    --compilation_config.mode VLLM_COMPILE \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3
```

How to Run Qwen3.5-397B-A17B

```shell
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
free -h
```

```shell
pip uninstall transformers
```
