FlexLLMGen

Running large language models on a single GPU for throughput-oriented scenarios.

FlexLLMGen: High-throughput Generative Inference of Large Language Models with a Single GPU [paper]

FlexLLMGen is a high-throughput generation engine for running large language models with limited GPU memory. It achieves high throughput through IO-efficient offloading, compression, and large effective batch sizes.

Motivation

In recent years, large language models (LLMs) have shown great performance across a wide range of tasks. Increasingly, LLMs have been applied not only to interactive applications (such as chat), but also to many "back-of-house" tasks. These tasks include benchmarking, information extraction, data wrangling, and form processing.

One key characteristic of these applications is that they are throughput-oriented: they require running LLM inferences over millions of tokens in batches, e.g., all the private documents in a company's corpus, or all the tasks in the HELM benchmark. These workloads are less sensitive to latency - the user starts up a job and lets it run overnight - but increasing throughput is critical for reducing costs. Throughput is a measure of tokens processed per second over the job's entire runtime (which can be hours). Throughput-oriented workloads provide opportunities to trade off latency for higher throughput, which makes it easier to take advantage of low-cost commodity GPUs.

The goal of FlexLLMGen is to create a high-throughput system to enable new and exciting applications of foundation models to throughput-oriented tasks on low-cost hardware, such as a single commodity GPU instead of expensive systems.

Check out the examples of what you can run on a single commodity GPU with FlexLLMGen, including benchmarking and data wrangling.

❌ Limitation. As an offloading-based system running on weak GPUs, FlexLLMGen has its limitations. It can be significantly slower than a setup with enough powerful GPUs to hold the whole model, especially for small batches. FlexLLMGen is mostly optimized for throughput-oriented batch processing (e.g., classifying or extracting information from many documents in batches) on a single GPU.

This project was made possible thanks to a collaboration with

Installation

Requirements:

PyTorch >= 1.12 (Help)

Method 1: With pip

pip install flexllmgen

Method 2: From source

git clone https://github.com/FMInference/FlexLLMGen.git
cd FlexLLMGen
pip install -e .

Usage and Examples

Get Started with a Single GPU

OPT-1.3B

To get started, you can try a small model like OPT-1.3B first. It fits into a single GPU so no offloading is required. FlexLLMGen will automatically download weights from Hugging Face.

python3 -m flexllmgen.flex_opt --model facebook/opt-1.3b

You should see some text generated by OPT-1.3B and the benchmark results.

OPT-30B

To run large models like OPT-30B, you will need to use CPU offloading. You can try the command below. The --percent argument specifies the offloading strategy for parameters, attention cache, and hidden states separately. The exact meaning of this argument can be found here.

python3 -m flexllmgen.flex_opt --model facebook/opt-30b --percent 0 100 100 0 100 0
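As a rough illustration of how the six `--percent` values might be read, the sketch below assumes they are, in order, the GPU and CPU percentages for weights, attention cache, and hidden states, with any remainder going to disk. This ordering is an assumption inferred from the commands in this README (for example, `0 0 100 0 100 0` is described as offloading all weights to disk). The helper `parse_percent` is hypothetical and not part of FlexLLMGen; the linked documentation remains the authoritative definition.

```python
from typing import Dict

# Assumed order of the three tensor kinds in --percent (not confirmed here).
TENSOR_KINDS = ("weights", "attention_cache", "hidden_states")


def parse_percent(arg: str) -> Dict[str, Dict[str, int]]:
    """Split a --percent string into assumed GPU/CPU/disk shares per tensor kind."""
    values = [int(v) for v in arg.split()]
    if len(values) != 6:
        raise ValueError("expected six integers, got %d" % len(values))

    placement = {}
    for i, kind in enumerate(TENSOR_KINDS):
        gpu, cpu = values[2 * i], values[2 * i + 1]
        if gpu < 0 or cpu < 0 or gpu + cpu > 100:
            raise ValueError("invalid GPU/CPU split for %s: %d, %d" % (kind, gpu, cpu))
        # Whatever is not placed on GPU or CPU is assumed to go to disk.
        placement[kind] = {"gpu": gpu, "cpu": cpu, "disk": 100 - gpu - cpu}
    return placement


if __name__ == "__main__":
    for arg in ("0 100 100 0 100 0", "0 0 100 0 100 0"):
        print(arg, "->", parse_percent(arg))
```

Under this reading, the OPT-30B command above keeps all weights on the CPU while the attention cache and hidden states stay on the GPU.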

OPT-175B

To run OPT-175B, you need to download the weights from metaseq and convert them into Alpa format. You can then try offloading all weights to disk with:

python3 -m flexllmgen.flex_opt --model facebook/opt-175b --percent 0 0 100 0 100 0 --offload-dir YOUR_SSD_FOLDER

Run HELM Benchmark with FlexLLMGen

FlexLLMGen can be integrated into HELM, a language model benchmark framework, as its execution backend. You can use the commands below to run a Massive Multitask Language Understanding (MMLU) scenario with a single T4 (16GB) GPU and 200GB of DRAM.

pip install crfm-helm
python3 -m flexllmgen.apps.helm_run --description mmlu:model=text,subject=abstract_algebra,data_augmentation=canonical --pad-to-seq-len 512 --model facebook/opt-30b --percent 20 80 0 100 0 100 --gpu-batch-size 48 --num-gpu-batches 3 --max-eval-instance 100

Note that only a subset of HELM scenarios is tested. See more tested scenarios here.
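The `--gpu-batch-size` and `--num-gpu-batches` flags in the command above together determine how many sequences are processed per pass. The sketch below assumes the effective batch size is simply their product, and uses the definition of throughput from the Motivation section (tokens processed per second over the whole run). Both function names are hypothetical, and the token count and runtime in the example are made-up numbers, not measurements.

```python
def effective_batch_size(gpu_batch_size: int, num_gpu_batches: int) -> int:
    """Assumed relation: sequences per pass = GPU batch size x number of GPU batches."""
    if gpu_batch_size <= 0 or num_gpu_batches <= 0:
        raise ValueError("both values must be positive")
    return gpu_batch_size * num_gpu_batches


def throughput_tokens_per_s(num_tokens: int, runtime_s: float) -> float:
    """Tokens processed per second over the job's entire runtime."""
    if runtime_s <= 0:
        raise ValueError("runtime must be positive")
    return num_tokens / runtime_s


if __name__ == "__main__":
    # Flags taken from the HELM command above.
    batch = effective_batch_size(gpu_batch_size=48, num_gpu_batches=3)
    print("effective batch size:", batch)  # 144

    # Illustrative only: 32 tokens per sequence and a 60-second run are invented.
    print("tokens/s:", throughput_tokens_per_s(batch * 32, 60.0))
```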

Run Data Wrangling Tasks with FlexLLMGen

You can run the examples in this paper, 'Can Foundation Models Wrangle Your Data?', by following the instructions here.

Scaling to Distributed GPUs

If you have multiple machines with GPUs, FlexLLMGen can combine offloading with pipeline parallelism to allow scaling. For example, if you have 2 GPUs but the aggregated GPU memory is less than the model size, you still need offloading. FlexLLMGen allows you to use pipeline parallelism across these 2 GPUs to accelerate generation. To get performance that scales, however, the GPUs should be on distributed machines. See examples here.

API Example

We demonstrate the usage of the FlexLLMGen API in completion.py. This example shows how to run generation for two sentences. To get the best throughput out of FlexLLMGen, you typically need to batch more sentences.
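Since throughput depends on batching many sentences together, one simple way to prepare a larger workload is to split the prompt list into fixed-size batches before tokenizing each one. The following is a minimal sketch in plain Python; `chunk_prompts` and the batch size of 2 are hypothetical choices for illustration and are not taken from completion.py.

```python
from typing import Iterator, List, Sequence


def chunk_prompts(prompts: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most batch_size prompts."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(prompts), batch_size):
        yield list(prompts[start:start + batch_size])


if __name__ == "__main__":
    prompts = [
        "Question: Where were the 2004 Olympics held?\nAnswer:",
        "Extract the airport codes from this text:",
        "Summarize the following document:",
    ]
    for i, batch in enumerate(chunk_prompts(prompts, batch_size=2)):
        # Each batch would then be tokenized and passed to the generation call.
        print(i, len(batch))
```

In practice the batch size would be chosen to match the offloading policy and available memory rather than fixed in advance.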

Generation API

FlexLLMGen has a generation API following the style of Hugging Face's transformers.

output_ids = model.generate(
	input_ids,
	do_sample=True,
	temperature=0.7,
	max_new_tokens=32,
	stop=stop)

Example Commands

You can use the example commands below. If you do not have enough GPU/CPU memory, see the Handle Out-Of-Memory section.

# Complete with OPT-6.7B. You need at least 15GB of GPU memory.
python3 -m flexllmgen.apps.completion --model facebook/opt-6.7b

# Complete with OPT-30B. You need about 90GB of CPU memory.
python3 -m flexllmgen.apps.completion --model facebook/opt-30b --percent 0 100 100 0 100 0

# Complete with instruction-tuned OPT-IML-MAX-30B. You need about 90GB of CPU memory.
python3 -m flexllmgen.apps.completion --model facebook/opt-iml-max-30b --percent 0 100 100 0 100 0

Frequently Asked Questions

How to set the offloading strategy and `--percent`?

We will release an automatic policy optimizer later, but for now you have to try a few strategies manually. The idea of high-throughput generation is to offload parameters and attention cache as much as possible to the CPU and, if necessary, to disk. You can see the reference strategies in our benchmark here. To avoid out-of-memory errors, you can tune --percent to offload more tensors to the CPU and disk.
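When trying strategies by hand, it can help to list candidate `--percent` values in order of increasing offloading and test them one at a time. The sketch below is one possible heuristic, not the project's policy optimizer: it moves weights from GPU to CPU and then from CPU to disk in fixed steps, while keeping the attention cache and hidden states on the GPU. It assumes the six values are GPU/CPU percentages for weights, attention cache, and hidden states in that order; the function name and step size are hypothetical.

```python
from typing import List


def weight_offload_ladder(step: int = 25) -> List[str]:
    """List --percent strings that move weights GPU -> CPU -> disk in steps."""
    if step <= 0 or 100 % step != 0:
        raise ValueError("step must be a positive divisor of 100")

    # Assumption: attention cache and hidden states stay fully on the GPU.
    rest = "100 0 100 0"
    ladder = []

    # Phase 1: shift weights from GPU to CPU.
    for gpu in range(100, -1, -step):
        ladder.append(f"{gpu} {100 - gpu} {rest}")

    # Phase 2: GPU share stays 0; shift weights from CPU to disk.
    for cpu in range(100 - step, -1, -step):
        ladder.append(f"0 {cpu} {rest}")

    return ladder


if __name__ == "__main__":
    for percent in weight_offload_ladder():
        print("--percent", percent)
```

Later entries use less GPU and CPU memory for weights but are expected to run slower, so a reasonable approach is to stop at the first entry that fits.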

How to handle out-of-memory?

If you do not have enough GPU/CPU memory, here are a few things you can try. They save more memory but run slower.

Do not pin weights by adding --pin-weight 0. This can reduce the weight memory usage on CPU by around 20% or more.
Enable weight compression by adding --compress-weight. This can reduce the weight memory usage by around 70%.
Offload all weights to disk by using --percent 0 0 100 0 100 0. This requires very little CPU and GPU memory.
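To get a feel for what these options mean in absolute terms, a back-of-the-envelope estimate of weight memory can be useful. The sketch below assumes weights are stored in 16-bit precision (2 bytes per parameter) and applies the "around 70%" reduction quoted above for `--compress-weight`. Both are approximations; the estimate covers weights only and ignores the attention cache, hidden states, and any buffers. The function names are hypothetical.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate weight size in GB, assuming 2 bytes per parameter by default."""
    if num_params < 0 or bytes_per_param <= 0:
        raise ValueError("invalid arguments")
    return num_params * bytes_per_param / 1e9


def compressed_weight_memory_gb(num_params: float, reduction: float = 0.70) -> float:
    """Approximate weight size after compression, using the ~70% figure from the text."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return weight_memory_gb(num_params) * (1.0 - reduction)


if __name__ == "__main__":
    params_30b = 30e9  # nominal parameter count for a 30B model
    print("uncompressed (GB):", round(weight_memory_gb(params_30b), 1))
    print("compressed   (GB):", round(compressed_weight_memory_gb(params_30b), 1))
```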

Performance Results

Generation Throughput (token/s)

The corresponding effective batch sizes and lowest offloading devices are in parentheses. Please see here for more details.

| System | OPT-6.7B | OPT-30B | OPT-175B |
| ------ | -------- | ------- | -------- |
| Hugging Face Accelerate | 25.12 (2 on GPU) | 0.62 (8 on CPU) | 0.01 (2 on disk) |
| DeepSpeed ZeRO-Inference | 9.28 (16 on CPU
