🔥 FlexLLM: A Composable HLS Library for Rapid LLM Accelerator Design
FlexLLM is a composable High-Level Synthesis (HLS) library for rapidly building hybrid temporal–spatial accelerators for Large Language Models (LLMs).
It provides parameterized module templates, optimized memory-access/dataflow components, and a complete quantization suite, enabling FPGA-based LLM systems to be built with minimal manual engineering effort.
Using FlexLLM, we implemented a full Llama-3.2-1B inference system—including prefill, decode, tokenizer integration, and long-context memory—in under two months with ~1K lines of code.
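To show what composable, parameterized module templates look like in practice, here is a minimal TAPA/HLS sketch in the same spirit. The module names, tile size, and wiring below are assumptions made for this example and are not FlexLLM's actual modules or API.

```cpp
// Illustrative sketch only: names and parameters are hypothetical and do not
// mirror FlexLLM's real module library; they show the style of composing
// parameterized TAPA/HLS modules through streams.
#include <tapa.h>

constexpr int kTile = 256;  // assumed tile size for this toy example

// Load a tile from device memory into a stream.
void load(tapa::mmap<const float> mem, tapa::ostream<float>& out) {
  for (int i = 0; i < kTile; ++i) {
#pragma HLS pipeline II = 1
    out.write(mem[i]);
  }
}

// A parameterized compute stage (here: elementwise scaling).
template <typename T>
void scale(tapa::istream<T>& in, tapa::ostream<T>& out, T factor) {
  for (int i = 0; i < kTile; ++i) {
#pragma HLS pipeline II = 1
    out.write(in.read() * factor);
  }
}

// Write the result stream back to device memory.
void store(tapa::istream<float>& in, tapa::mmap<float> mem) {
  for (int i = 0; i < kTile; ++i) {
#pragma HLS pipeline II = 1
    mem[i] = in.read();
  }
}

// Top-level task: modules are instantiated and wired with streams, so the same
// templates can be reused across prefill- and decode-oriented dataflows.
void Top(tapa::mmap<const float> in, tapa::mmap<float> out, float factor) {
  tapa::stream<float> s0("s0"), s1("s1");
  tapa::task()
      .invoke(load, in, s0)
      .invoke(scale<float>, s0, s1, factor)
      .invoke(store, s1, out);
}
```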
✨ Key Features
- Composable HLS Library for LLM accelerator development
- Hybrid Temporal–Spatial Architecture
- Hardware-Efficient Quantization Suite (see the sketch after this list)
- Hierarchical Memory Transformer (HMT) Plug-In
- FPGA Deployment Ready
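As a concrete illustration of the arithmetic behind hardware-efficient weight quantization: FlexLLM's actual quantization suite (including its SpinQuant flow) defines its own formats and APIs, but the generic per-group symmetric int4 quantizer below shows the kind of computation such schemes perform. The group size, storage layout, and names are assumptions made for this example.

```cpp
// Illustrative only: a generic per-group symmetric int4 weight quantizer,
// not FlexLLM's actual quantization code.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

struct QuantGroup {
  float scale;            // per-group scale factor
  std::vector<int8_t> q;  // int4 codes in [-8, 7], stored one per byte here
};

// Quantize `group_size` consecutive weights with one shared scale.
QuantGroup quantize_group(const float* w, int group_size) {
  float max_abs = 0.f;
  for (int i = 0; i < group_size; ++i) max_abs = std::max(max_abs, std::fabs(w[i]));

  QuantGroup g;
  g.scale = (max_abs > 0.f) ? max_abs / 7.f : 1.f;  // map [-max_abs, max_abs] onto [-7, 7]
  g.q.resize(group_size);
  for (int i = 0; i < group_size; ++i) {
    int v = static_cast<int>(std::lround(w[i] / g.scale));
    g.q[i] = static_cast<int8_t>(std::clamp(v, -8, 7));
  }
  return g;
}

// Dequantization on the accelerator side is then just w_hat[i] = scale * q[i].
```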
📊 Performance Summary
AMD U280 FPGA (16nm) vs. NVIDIA A100 GPU (7nm)
- 1.29× end-to-end speedup
- 1.64× higher decode throughput
- 3.14× better energy efficiency
Projected V80 FPGA (7nm)
- 4.71× end-to-end speedup
- 6.55× higher decode throughput
- 4.13× better energy efficiency
Long-Context (with HMT)
- 23.23× reduction in prefill latency
- 64× longer context window
📁 Repository Layout
FlexLLM/
├─ Modules/                         # Core FlexLLM module library (compute, quant, memory, data movement)
│
├─ SpinQuant_Llama_32_1B_Ins/       # Llama-3.2-1B-Instruct accelerator (SpinQuant)
│  ├─ parameters/                   # Downloaded model parameters
│  ├─ RapidStream_pref_u280/        # Prefill RapidStream config (U280)
│  ├─ RapidStream_dec_u280/         # Decode RapidStream config (U280)
│  ├─ run/                          # Bitstreams, hosts, and test scripts
│  │  ├─ bitstreams/                # FPGA .xclbin files
│  │  ├─ parameters/                # Downloaded parameters
│  │  ├─ llama-3.2-1b-f16.gguf      # Tokenizer (download required)
│  │  ├─ SpinQuant_Prefilling_Decoding_mem_opt
│  │  ├─ SpinQuant_Prefilling_Decoding_mem_opt_demo
│  │  └─ test files (.py/.txt/.csv)
│  └─ TAPA files                    # TAPA HLS kernels, host code, memory configs
│
├─ SpinQuant_Llama_32_1B/           # Llama-3.2-1B accelerator (SpinQuant)
├─ HMT_SpinQuant_Llama_32_1B/       # Llama-3.2-1B-Instruct + SpinQuant + HMT
└─ README.md
📦 Download Required Files
Download the model parameters and the GGUF tokenizer file from:
https://drive.google.com/drive/folders/1KyEL9gC9Wge9l1m5t2lc79uQhK0jYyq8?usp=sharing
Place them in:
FlexLLM/SpinQuant_Llama_32_1B_Ins/parameters/
FlexLLM/SpinQuant_Llama_32_1B_Ins/run/parameters/
FlexLLM/SpinQuant_Llama_32_1B_Ins/run/llama-3.2-1b-f16.gguf
🧰 Requirements
- Ubuntu 20.04 / 22.04
- Xilinx Runtime (XRT) installed
- Vitis 2022.2
- TAPA CLI
- A compatible FPGA board (e.g., AMD Alveo U280)
Check that the FPGA is detected:
xbutil examine
🛠 Build (Host Only)
export FLEXLLM_HOME=/path/to/FlexLLM
export LLAMA_CPP_ROOT=/path/to/llama.cpp
tapa g++ -- SpinQuant_Prefilling_Decoding_mem_opt_demo.cpp \
  -I$FLEXLLM_HOME/Modules \
  -I$LLAMA_CPP_ROOT -I$LLAMA_CPP_ROOT/include \
  -I$LLAMA_CPP_ROOT/ggml/include -I$LLAMA_CPP_ROOT/ggml/include/ggml \
  $LLAMA_CPP_ROOT/build/bin/libllama.so \
  -Wl,-rpath,$LLAMA_CPP_ROOT/build/bin \
  -lpthread -ldl -lm \
  -o run/SpinQuant_Prefilling_Decoding_mem_opt_demo
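For orientation, here is a rough sketch of the shape a TAPA host program built this way typically has: it loads an .xclbin and dispatches a kernel with tapa::invoke. The kernel name PrefillTop, its argument list, and the placeholder data are invented for illustration and do not match the signatures in SpinQuant_Prefilling_Decoding_mem_opt_demo.cpp.

```cpp
// Hypothetical host-side sketch, NOT the repository's actual host code.
#include <cstdint>
#include <string>
#include <vector>

#include <tapa.h>

// Declaration of an assumed prefill kernel top; its definition comes from the
// TAPA kernel sources compiled alongside the host by `tapa g++`.
void PrefillTop(tapa::mmap<const int8_t> weights, tapa::mmap<const int32_t> tokens,
                tapa::mmap<float> kv_cache, int seq_len);

int main(int argc, char* argv[]) {
  const std::string bitstream = argc > 1 ? argv[1] : "";  // .xclbin path; empty -> SW simulation

  // Placeholder buffers; the real demo loads quantized parameters from
  // run/parameters/ and tokenizes the prompt with the llama.cpp GGUF tokenizer.
  std::vector<int8_t> weights(1 << 20, 0);
  std::vector<int32_t> tokens(128, 0);
  std::vector<float> kv_cache(1 << 20, 0.f);

  // tapa::invoke launches the kernel: software simulation if the bitstream
  // path is empty, FPGA execution through XRT if it points to an .xclbin.
  tapa::invoke(PrefillTop, bitstream,
               tapa::read_only_mmap<const int8_t>(weights),
               tapa::read_only_mmap<const int32_t>(tokens),
               tapa::write_only_mmap<float>(kv_cache),
               static_cast<int>(tokens.size()));
  return 0;
}
```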
🚀 Run on U280
./SpinQuant_Prefilling_Decoding_mem_opt_demo \
  --bitstream_pref bitstreams/SpinQuant_Prefilling_mem_opt_xilinx_u280_gen3x16_xdma_1_202211_1.xclbin \
  --bitstream_dec bitstreams/SpinQuant_Decoding_mem_opt_xilinx_u280_gen3x16_xdma_1_202211_1.xclbin \
  llama-3.2-1b-f16.gguf my_prompt.txt my_answer.txt
📝 Notes for V80 Support
The V80 numbers above are projections, not measured results. Full V80 bitstreams will be released soon.
🙏 Acknowledgments
We thank Nicholas Fraser and Michaela Blott of AMD for their support and guidance.