Fleetbench
Fleetbench is a benchmarking suite for Google workloads. It's a portmanteau of "fleet" and "benchmark". It is meant for use by chip vendors, compiler researchers, and others interested in making performance optimizations beneficial to workloads similar to Google's. This repository contains the Fleetbench C++ code.
Details on Fleetbench can be found in our paper A Profiling-Based Benchmark Suite for Warehouse-Scale Computers.
NOTE: As this project is evolving, we recommend including the tag/release number when citing it to avoid any confusion.
Overview
Fleetbench is a benchmarking suite that consists of a curated set of microbenchmarks for hot functions across Google's fleet. The data set distributions it uses for executing the benchmarks are derived from data collected in production.
IMPORTANT: We have released v2.0.0, a major milestone that significantly improves the suite's ability to characterize system performance under realistic, concurrent workloads. This release makes the suite runnable on both emulation and real hardware, introduces a multiprocessing framework, adds new gRPC and SIMD benchmarks, and includes substantial improvements and bug fixes across the suite. Please check out the release notes for more information!
This version captures system performance from more angles, helping developers and performance engineers gain granular insight into how important libraries behave in complex, multi-core environments.
For more information, see:
- Workloads coverage for latest suite coverage.
- Versioning on the details of Fleetbench's releases.
- Running Benchmarks for how to run the benchmark.
- Future Work section on all that we plan to add.
Benchmark fidelity
Benchmark fidelity is an important consideration in building this suite. There are 3 levels of fidelity that we consider:
- The suite exercises the same functionality as production.
- The suite's performance counters match production.
- An optimization impact on the suite matches the impact on production.
Versioning
Fleetbench uses semantic versioning for its releases: PATCH versions are used for bug fixes, MINOR for updates to distributions and category coverage, and MAJOR for substantial changes to the benchmarking suite. All releases are tagged, and the suite can be built and run at any tagged version.
If you're starting out, we recommend always using the latest version at HEAD.
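When scripting against tagged releases, it can be handy to split a release tag into its semantic-version components; a minimal sketch (the tag value below is illustrative):

```shell
# Split a Fleetbench release tag into MAJOR/MINOR/PATCH components.
tag="v2.0.0"
IFS=. read -r major minor patch <<< "${tag#v}"   # strip the leading "v", split on "."
echo "MAJOR=$major MINOR=$minor PATCH=$patch"
# → MAJOR=2 MINOR=0 PATCH=0
```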
Workloads coverage
As of Q2'25, Fleetbench provides coverage for several major hot libraries.
Benchmark | Description
----------- | -----------
Proto | Instruction-focused.
Swissmap | Data-focused.
Libc | Data-focused. Benchmarking memcpy, memcmp/bcmp, memset, and memmove.
TCMalloc | Data-focused.
Compression | Data-focused. Covers Snappy, ZSTD, Brotli, and Zlib.
Hashing | Data-focused. Supports algorithms CRC32 and absl::Hash.
STL-Cord | Instruction-focused.
RPC | Instruction-focused with a strong data-driven aspect; built on the gRPC framework.
SIMD | Based on ScaNN LUT16; measures lookup-and-accumulate performance.
Benchmarks are classified by their core characteristics, such as being compute-bound, memory-bound, or sensitive to memory bandwidth vs. latency. For a detailed breakdown, see the benchmark characteristics documentation.
Running Benchmarks
Fleetbench supports running benchmarks in two modes: single-threaded and multi-core. The following command is for a single-threaded run. For multi-core execution, please refer to the parallel run instructions.
Setup
Bazel is the official build system for Fleetbench.
We currently require Bazel version 8.0.0. Consider Bazelisk to automatically manage your Bazel version.
NOTE: Our setup uses LLVM 17.0.1.
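If you use Bazelisk, you can pin the required Bazel version with a `.bazelversion` file at the repository root, which Bazelisk reads automatically; a minimal sketch:

```shell
# Pin the Bazel version for Bazelisk by writing a .bazelversion file
# at the repository root.
echo "8.0.0" > .bazelversion
cat .bazelversion
# → 8.0.0
```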
As an example, to run the Swissmap benchmarks:
bazel run --config=opt fleetbench/swissmap:swissmap_benchmark
Important: Always run benchmarks with --config=opt to apply essential compiler
optimizations.
Run commands
Replace WORK_LOAD and BUILD_TARGET with one of the entries in the table below to build and run a benchmark. The reasons for each build flag are explained in the next few sections.
X86 {.new-tab}
bazel build --config=clang --config=opt \
--config=haswell fleetbench/WORK_LOAD:BUILD_TARGET
GLIBC_TUNABLES=glibc.pthread.rseq=0 bazel-bin/fleetbench/WORK_LOAD/BUILD_TARGET
Or combining build and run together:
GLIBC_TUNABLES=glibc.pthread.rseq=0 bazel run --config=clang --config=opt \
--config=haswell fleetbench/WORK_LOAD:BUILD_TARGET
Arm {.new-tab}
bazel build --config=clang --config=opt \
--config=arm fleetbench/WORK_LOAD:BUILD_TARGET
GLIBC_TUNABLES=glibc.pthread.rseq=0 bazel-bin/fleetbench/WORK_LOAD/BUILD_TARGET
Or combining build and run together:
GLIBC_TUNABLES=glibc.pthread.rseq=0 bazel run --config=clang --config=opt \
--config=arm fleetbench/WORK_LOAD:BUILD_TARGET
Benchmark | WORKLOAD | BUILD_TARGET | Binary run flags
----------- | ----------- | --------------------- | ----------------
Proto | proto | proto_benchmark |
Swissmap | swissmap | swissmap_benchmark |
Libc memory | libc | mem_benchmark | --benchmark_counters_tabular=true
TCMalloc | tcmalloc | empirical_driver | Check --benchmark_filter below.
Compression | compression | compression_benchmark | --benchmark_counters_tabular=true
Hashing | hashing | hashing_benchmark | --benchmark_counters_tabular=true
STL-Cord | stl | cord_benchmark |
RPC | rpc | rpc_benchmark |
SIMD | simd | simd_benchmark |
NOTE: By default, each benchmark only runs a minimal set of tests that we have
selected as the most representative. To see the default lists, you can use the
--benchmark_list_tests flag when running the target. You can add
--benchmark_filter=all to see the exhaustive list.
You can also pass a regex to the --benchmark_filter flag to select a subset of
benchmarks to run
(more info).
The TCMalloc Empirical Driver benchmark can take ~1hr to run all benchmarks, so
running a subset may be advised.
For example, to run only the swissmap sets of 16 and 64 elements:
bazel run --config=opt fleetbench/swissmap:swissmap_benchmark -- \
--benchmark_filter=".*set_size:(16|64).*"
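Before launching a long run, you can sanity-check a filter against candidate benchmark names locally with `grep -E`, which closely approximates the benchmark library's regex matching (the names below are illustrative, not the suite's actual test list):

```shell
# Check which (illustrative) benchmark names a --benchmark_filter regex
# would select.
filter='.*set_size:(16|64).*'
printf '%s\n' \
  'BM_SwissMap_Find/set_size:16' \
  'BM_SwissMap_Find/set_size:64' \
  'BM_SwissMap_Find/set_size:512' |
  grep -E "$filter"
# → BM_SwissMap_Find/set_size:16
# → BM_SwissMap_Find/set_size:64
```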
To extend the runtime of a benchmark, e.g. to collect more profile samples, use
--benchmark_min_time.
bazel run --config=opt fleetbench/proto:proto_benchmark -- --benchmark_min_time=30s
Some benchmarks also provide counter reports after completion. Adding
--benchmark_counters_tabular=true
(doc)
can help print counters as table columns for improved layout.
Ensuring TCMalloc per-CPU Mode
TCMalloc is the underlying memory allocator in this benchmark suite. By default it operates in per-CPU mode.
Note: the Restartable Sequences (RSEQ)
kernel feature is required for per-CPU mode. RSEQ has the limitation that a
given thread can only register a single rseq structure with the kernel. Recent
versions of glibc do this on initialization,
preventing TCMalloc from using
it.
Set the environment variable: GLIBC_TUNABLES=glibc.pthread.rseq=0 to prevent
glibc from doing this registration. This will allow TCMalloc to operate in
per-CPU mode.
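A quick way to confirm that the tunable will actually reach the benchmark binary is to check that a child process inherits it (a sketch; in practice the child would be the benchmark binary rather than `sh -c`):

```shell
# Export the tunable so every process launched from this shell inherits it,
# then verify a child process sees it.
export GLIBC_TUNABLES=glibc.pthread.rseq=0
sh -c 'echo "child sees: $GLIBC_TUNABLES"'
# → child sees: glibc.pthread.rseq=0
```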
Clang Toolchain
For more consistency with Google's build configuration, we suggest using the Clang / LLVM tools. These instructions have been tested with LLVM 14.
These can be installed with the system's package manager, e.g. on Debian:
sudo apt-get install clang llvm lld
Otherwise, see https://releases.llvm.org to obtain these if not present on your system or to find the newest version.
Once installed, specify --config=clang on the bazel command line to use the
clang compiler. We assume clang and lld are in the PATH.
Note: to make this setting the default, add build --config=clang to your
.bazelrc.
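A `.bazelrc` making this README's flags the default might look like the sketch below; the architecture lines are commented out because the right one depends on your hardware (see Architecture-Specific Flags):

```
# .bazelrc: make the flags from this README the default for every build.
build --config=clang
build --config=opt
# Uncomment the line matching your hardware:
# build --config=haswell
# build --config=arm
```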
Architecture-Specific Flags
If running on an x86 Haswell or above machine, we suggest adding
--config=haswell for consistency with our compiler flags.
Use --config=westmere for Westmere-era processors, and --config=arm for ARM
ones.
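A small helper can map `uname -m` output to the matching config flag. Note that the Haswell-vs-Westmere distinction cannot be inferred from `uname` alone, so defaulting x86_64 to `--config=haswell` here is an assumption; use `--config=westmere` explicitly on Westmere-era parts:

```shell
# Map a machine architecture string (as reported by `uname -m`) to the
# corresponding Fleetbench --config flag.
pick_config() {
  case "$1" in
    x86_64)  echo "--config=haswell" ;;   # assumption: Haswell or newer
    aarch64) echo "--config=arm" ;;
    *)       echo "unknown architecture: $1" >&2; return 1 ;;
  esac
}
pick_config x86_64   # in practice: pick_config "$(uname -m)"
# → --config=haswell
```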
Reducing run-to-run variance
It is expected that there will be some variance in the reported CPU times across benchmark executions. The benchmark itself runs the same code, so the causes of the variance are mainly in the environment. Non-exhaustive techniques that can help include pinning the benchmark to fixed cores, fixing the CPU frequency governor, disabling turbo/boost, and repeating runs to average out noise.
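As a first step, simply measuring the spread across repeated runs shows how much variance you are actually dealing with; a sketch using a placeholder in place of the benchmark binary:

```shell
# Time several repetitions of a command to gauge run-to-run variance.
# `sleep 0.1` stands in for the real benchmark invocation, e.g.
# bazel-bin/fleetbench/WORK_LOAD/BUILD_TARGET.
runs=3
for i in $(seq "$runs"); do
  start=$(date +%s%N)                       # nanoseconds (GNU date)
  sleep 0.1                                 # placeholder for the benchmark
  end=$(date +%s%N)
  echo "run $i: $(( (end - start) / 1000000 )) ms"
done
```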