xFasterTransformer

<p align="center"> <a href="./README.md">English</a> | <a href="./README_CN.md">简体中文</a> </p>

xFasterTransformer is an exceptionally optimized solution for large language models (LLM) on the X86 platform, which is similar to FasterTransformer on the GPU platform. xFasterTransformer is able to operate in distributed mode across multiple sockets and nodes to support inference on larger models. Additionally, it provides both C++ and Python APIs, spanning from high-level to low-level interfaces, making it easy to adopt and integrate.


Models overview

Large Language Models (LLMs) are developing rapidly and are used in ever more AI scenarios. xFasterTransformer is an optimized solution for inference with mainstream, popular LLMs on Xeon. It fully leverages the hardware capabilities of Xeon platforms to achieve high performance and high scalability for LLM inference, on a single socket as well as across multiple sockets and nodes.

xFasterTransformer provides a series of C++ and Python APIs that end users can integrate directly into their own solutions or services. Many example programs demonstrate the usage, benchmark code and scripts let users measure performance, and web demos are provided for popular LLMs.

Model support matrix

|       Models       | Framework |     | Distribution |
| :----------------: | :-------: | :-: | :----------: |
|                    |  PyTorch  | C++ |              |
| DeepSeekR1         |     ✔     |  ✔  |      ✔       |
| DeepSeekV3         |     ✔     |  ✔  |      ✔       |
| DeepSeekV2         |     ✔     |  ✔  |      ✔       |
| ChatGLM            |     ✔     |  ✔  |      ✔       |
| ChatGLM2           |     ✔     |  ✔  |      ✔       |
| ChatGLM3           |     ✔     |  ✔  |      ✔       |
| GLM4               |     ✔     |  ✔  |      ✔       |
| Llama              |     ✔     |  ✔  |      ✔       |
| Llama2             |     ✔     |  ✔  |      ✔       |
| Llama3             |     ✔     |  ✔  |      ✔       |
| Baichuan           |     ✔     |  ✔  |      ✔       |
| Baichuan2          |     ✔     |  ✔  |      ✔       |
| QWen               |     ✔     |  ✔  |      ✔       |
| QWen2              |     ✔     |  ✔  |      ✔       |
| QWen3              |     ✔     |  ✔  |      ✔       |
| SecLLM(YaRN-Llama) |     ✔     |  ✔  |      ✔       |
| Opt                |     ✔     |  ✔  |      ✔       |
| Deepseek-coder     |     ✔     |  ✔  |      ✔       |
| gemma              |     ✔     |  ✔  |      ✔       |
| gemma-1.1          |     ✔     |  ✔  |      ✔       |
| codegemma          |     ✔     |  ✔  |      ✔       |
| TeleChat           |     ✔     |  ✔  |      ✔       |
| Mixtral(MoE)       |     ✔     |  ✔  |      ✔       |

DataType support list

  • FP16
  • BF16
  • INT8
  • W8A8
  • INT4
  • NF4
  • BF16_FP16
  • BF16_INT8
  • BF16_W8A8
  • BF16_INT4
  • BF16_NF4
  • W8A8_INT8
  • W8A8_INT4
  • W8A8_NF4
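In the Python API, the data type is selected through the `dtype` string argument (the Python API example further below passes `dtype="bf16"`). A minimal validation sketch, assuming the API accepts the lowercase form of the names above; the `check_dtype` helper is illustrative, not part of xFasterTransformer:

```python
# Hypothetical helper: validate a dtype string before handing it to
# xfastertransformer.AutoModel.from_pretrained(..., dtype=...).
# Lowercase names are an assumption based on the dtype="bf16" usage
# in the Python API example.
SUPPORTED_DTYPES = {
    "fp16", "bf16", "int8", "w8a8", "int4", "nf4",
    "bf16_fp16", "bf16_int8", "bf16_w8a8", "bf16_int4", "bf16_nf4",
    "w8a8_int8", "w8a8_int4", "w8a8_nf4",
}

def check_dtype(name: str) -> str:
    """Normalize a dtype name and reject anything outside the supported list."""
    normalized = name.strip().lower()
    if normalized not in SUPPORTED_DTYPES:
        raise ValueError(f"unsupported dtype: {name!r}")
    return normalized
```

Failing fast on an unsupported name gives a clearer error than whatever the loader would raise internally.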

Documents

xFasterTransformer Documents and Wiki provide the following resources:

  • An introduction to xFasterTransformer.
  • Comprehensive API references for both high-level and low-level interfaces in C++ and PyTorch.
  • Practical API usage examples for xFasterTransformer in both C++ and PyTorch.

Installation

From PyPI

pip install xfastertransformer

Using Docker

docker pull intel/xfastertransformer:latest

Run the docker with the command (Assume model files are in /data/ directory):

docker run -it \
    --name xfastertransformer \
    --privileged \
    --shm-size=16g \
    -v /data/:/data/ \
    -e "http_proxy=$http_proxy" \
    -e "https_proxy=$https_proxy" \
    intel/xfastertransformer:latest

Notice: Please enlarge `--shm-size` if a bus error occurs while running in multi-rank mode. Docker limits the shared memory size to 64MB by default, and our implementation uses shared memory extensively to achieve better performance.
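Whether the current shared-memory allotment is large enough can be checked before launching a multi-rank run. A small sketch using only the standard library (assumes a Linux host, where Docker mounts the shared-memory segment at `/dev/shm`):

```python
import os

def fs_capacity_bytes(path: str) -> int:
    """Total capacity, in bytes, of the filesystem backing `path`."""
    st = os.statvfs(path)
    return st.f_frsize * st.f_blocks

if __name__ == "__main__":
    # Warn if /dev/shm is still at Docker's 64MB default.
    shm = fs_capacity_bytes("/dev/shm")
    if shm <= 64 * 1024 * 1024:
        print(f"/dev/shm is only {shm // (1024 * 1024)}MB; "
              "consider enlarging --shm-size for multi-rank runs")
```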

Built from source

Prepare Environment

Manually
  • PyTorch v2.3 (required when using the PyTorch API; not needed for the C++ API)
    pip install torch --index-url https://download.pytorch.org/whl/cpu
    
Install dependent libraries

Please install libnuma package:

  • CentOS: yum install libnuma-devel
  • Ubuntu: apt-get install libnuma-dev
How to build
  • Using 'CMake'
    # Build xFasterTransformer
    git clone https://github.com/intel/xFasterTransformer.git xFasterTransformer
    cd xFasterTransformer
    git checkout <latest-tag>
    # Please make sure torch is installed when running the Python examples
    mkdir build && cd build
    # Notice: use gcc-13 or higher
    cmake ..
    # If you see the error "numa.h: No such file or directory", install libnuma first, then build with "CPATH=$CONDA_PATH/include/:$CPATH make -j".
    make -j
    
  • Using python setup.py
    # Build xFasterTransformer library and C++ example.
    python setup.py build
    
    # Install xFasterTransformer into pip environment.
    # Notice: Run `python setup.py build` before installation!
    python setup.py install
    

Models Preparation

xFasterTransformer uses a model format that differs from Hugging Face's, but it is compatible with FasterTransformer's format.

  1. First, download the model in Hugging Face format.

  2. Then convert the model into the xFasterTransformer format using the model convert module in xfastertransformer. If no output directory is provided, the converted model will be placed in ${HF_DATASET_DIR}-xft.

    python -c "import xfastertransformer as xft; xft.DeepSeekR1Convert().convert('${HF_DATASET_DIR}', '${OUTPUT_DIR}')"
    

    PS: Due to potential compatibility issues between the model file and the transformers version, please select an appropriate transformers version.

    Supported model convert list:

    • DeepSeekR1Convert
    • DeepSeekV3Convert
    • DeepSeekV2Convert
    • LlamaConvert
    • YiConvert
    • GemmaConvert
    • ChatGLMConvert
    • ChatGLM2Convert
    • ChatGLM3Convert
    • ChatGLM4Convert
    • OPTConvert
    • BaichuanConvert
    • Baichuan2Convert
    • QwenConvert
    • Qwen2Convert
    • Qwen3Convert
    • DeepSeekConvert
    • TelechatConvert
    • MixtralConvert
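The default output location described in step 2 (the input directory name plus an `-xft` suffix) can be sketched with the standard library; `default_output_dir` is an illustrative helper, not an xfastertransformer API:

```python
from pathlib import Path

def default_output_dir(hf_model_dir: str) -> str:
    """Mirror the documented default: place '<input-dir>-xft' next to the input."""
    p = Path(hf_model_dir)
    return str(p.with_name(p.name + "-xft"))
```

For example, converting `/data/chatglm-6b-hf` without an output directory would place the result in `/data/chatglm-6b-hf-xft`.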

API usage

For more details, please see API document and examples.

Python API(PyTorch)

Firstly, please install the dependencies.

  • Python dependencies
    cmake==3.26.1
    sentencepiece==0.2.0
    torch==2.7.0+cpu
    transformers==4.50.0
    accelerate==1.5.1
    protobuf==5.29.3
    tiktoken==0.9.0
    
    PS: Due to potential compatibility issues between the model file and the transformers version, please select an appropriate transformers version.
  • oneCCL (For multi ranks)
    Install oneCCL and setup the environment. Please refer to Prepare Environment.
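In multi-rank runs launched through MPI, it is common to let only the master rank drive tokenization and printing. A hedged sketch of rank detection via environment variables; the variable names (`PMI_RANK` for Intel MPI, `OMPI_COMM_WORLD_RANK` for Open MPI) are assumptions about the launcher, not part of xFasterTransformer:

```python
import os

def mpi_rank() -> int:
    """Best-effort rank lookup; falls back to 0 for single-process runs.
    PMI_RANK (Intel MPI) and OMPI_COMM_WORLD_RANK (Open MPI) are the
    usual sources -- assumptions here, not xFasterTransformer APIs."""
    for var in ("PMI_RANK", "OMPI_COMM_WORLD_RANK"):
        if var in os.environ:
            return int(os.environ[var])
    return 0
```

A driver script can then guard its `print`/streamer calls with `if mpi_rank() == 0:` so that only one copy of the output appears.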

xFasterTransformer's Python API is similar to that of transformers and also supports transformers' streamer to achieve streaming output. In the example, we use transformers to encode the input prompt to token ids.

import xfastertransformer
from transformers import AutoTokenizer, TextStreamer
# Assume huggingface model dir is `/data/chatglm-6b-hf` and converted model dir is `/data/chatglm-6b-xft`.
MODEL_PATH="/data/chatglm-6b-xft"
TOKEN_PATH="/data/chatglm-6b-hf"

INPUT_PROMPT = "Once upon a time, there existed a little girl who liked to have adventures."
tokenizer = AutoTokenizer.from_pretrained(TOKEN_PATH, use_fast=False, padding_side="left", trust_remote_code=True)
streamer = TextStreamer(tokenizer, skip_special_tokens=True, skip_prompt=False)

input_ids = tokenizer(INPUT_PROMPT, return_tensors="pt", padding=False).input_ids
model = xfastertransformer.AutoModel.from_pretrained(MODEL_PATH, dtype="bf16")
generated_ids = model.generate(input_ids, max_length=200, streamer=streamer)

C++ API

SentencePiece can be used to tokenize and detokenize text.

#include <vector>
#include <iostream>
#include "xfastertransformer.h"
// ChatGLM token ids for prompt "Once upon a time, there existed a little girl who liked to have adventures."
std::vector<int> input = {/* prompt token ids elided in this excerpt */};