FastLoader
C++ Library to load data from disk in parallel.
Fast Loader (FL) : A High-Performance Accessor for Loading N-Dimensional (tiled) files
An application programming interface to access large files, such as images.
The API presents these images as views (sub-regions) that are streamed as soon as a region is loaded. Algorithms can then process these views in parallel to overlap loading from disk with CPU/GPU computation. FastLoader's approach attempts to maximize bandwidth through its parallelism and caching mechanisms.
Content
- [Installation Instructions]
- [Dependencies]
- [Building Fast Loader]
- [Motivation]
- [Approach]
- [Architecture]
- [Steps to Programming with Fast Loader]
- [Linking Fast Loader]
- [API overview]
- [How to create a Tile Loader ? How to access a specific file ?]
- [Getting started]
- [Credits]
- [Contact Us]
Installation Instructions
Dependencies
- C++20 compiler (tested with gcc 11.1+, clang 10, and MSVC 14.33)
- Hedgehog v3+ (https://github.com/usnistgov/hedgehog)
- LibTIFF (http://www.simplesystems.org/libtiff/) [optional / TIFF support]
- doxygen (www.doxygen.org/) [optional / Documentation]
Building Fast Loader
CMake Options:
TEST_FAST_LOADER - Compiles and runs google unit tests for Fast Loader ('make run-test' to re-run)
:$ cd <FastLoader_Directory>
:<FastLoader_Directory>$ mkdir build && cd build
:<FastLoader_Directory>/build$ ccmake ../ (or cmake-gui)
'Configure' and set up the CMake parameters
'Configure' and 'Build'
:<FastLoader_Directory>/build$ make
Motivation
The hardware landscape for high-performance computing currently features compute nodes with a high degree of parallelism within a node (e.g., 46 cores for the newly-announced Qualcomm Centriq CPU, 32 physical cores for the AMD Epyc CPU, and 24 logical cores for an Intel Xeon Skylake CPU), and this parallelism increases with every new hardware generation. By contrast, the amount of memory available is not increasing in a commensurate manner and may even be decreasing when considered on a per-core basis. Furthermore, while the computational capacity of these systems keeps improving, their programmability remains quite challenging. As such, designing image processing algorithms to minimize memory usage is a key strategy for taking advantage of this parallelism and for supporting concurrent users or the multithreaded processing of large files.
Approach
Fast Loader improves programmer productivity by providing high-level abstractions, View and Tile, along with routines that build on these abstractions to operate across entire files without actually loading them in memory. The library operates on tiles, possibly with a halo of pixels around an individual tile. Fast Loader only loads a small number of tiles to maintain a low memory footprint and manages an in-memory cache. Furthermore, the library takes advantage of multicore computing by offloading tiles to compute threads as soon as they become available. This allows multiple users to maintain high throughput while processing several images or views concurrently.
Key concepts
Tile
The following picture represents an image divided into tiles. Each gray square represents a pixel. A contiguous ND group of pixels is named a tile (surrounded by a black border in the image).
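The tiling described above can be sketched in a few lines. This is a minimal illustration, not FastLoader code: the helper names `tileCounts` and `tileExtent` are hypothetical, and the sketch assumes that tiles on the far edge of a file may hold fewer valid pixels than a full tile.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Number of tiles needed to cover each dimension (ceiling division).
// Hypothetical helper, not part of the FastLoader API.
std::vector<std::size_t> tileCounts(std::vector<std::size_t> const &fullDims,
                                    std::vector<std::size_t> const &tileDims) {
  assert(fullDims.size() == tileDims.size());
  std::vector<std::size_t> counts(fullDims.size());
  for (std::size_t d = 0; d < fullDims.size(); ++d) {
    assert(tileDims[d] > 0);
    counts[d] = (fullDims[d] + tileDims[d] - 1) / tileDims[d];
  }
  return counts;
}

// Number of valid pixels, per dimension, of the tile located at `index`.
// Edge tiles are clipped to the file dimensions.
std::vector<std::size_t> tileExtent(std::vector<std::size_t> const &fullDims,
                                    std::vector<std::size_t> const &tileDims,
                                    std::vector<std::size_t> const &index) {
  assert(fullDims.size() == tileDims.size());
  assert(fullDims.size() == index.size());
  std::vector<std::size_t> extent(fullDims.size());
  for (std::size_t d = 0; d < fullDims.size(); ++d) {
    std::size_t const begin = index[d] * tileDims[d];  // first pixel of the tile
    assert(begin < fullDims[d]);
    extent[d] = std::min(tileDims[d], fullDims[d] - begin);
  }
  return extent;
}
```

For a 10x10 image cut into 4x4 tiles, `tileCounts` gives 3 tiles per dimension, and the tile at index (2, 0) has an extent of 2x4 because the last row of tiles is clipped.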

View
A View is a tile augmented by a halo of pixels around it in every dimension. A View is therefore composed of a central tile (in red) and ghost values (in green and brown). The green values need to be created: by default, FastLoader uses the values that are already in the buffer. Another available Border Creator fills these values with a constant (an API exists to create user-defined Border Creators). The brown values are taken from the surrounding tiles.
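To make the composition of a View concrete, here is a minimal 2D sketch under stated assumptions. It is not FastLoader's implementation: the function `buildView2D` is hypothetical, it reads from a whole image held in memory rather than from tiles, it assumes square tiles and a single radius, and it assumes the values that need to be created are those falling outside the image, filled here with a constant.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Build the view around tile (tileRow, tileCol) of a row-major 2D image.
// Pixels inside the image are copied (central tile and neighbouring halo);
// pixels outside the image are set to `fill` (constant border).
// Hypothetical helper, not part of the FastLoader API.
std::vector<int> buildView2D(std::vector<int> const &image,
                             std::size_t height, std::size_t width,
                             std::size_t tileSize,
                             std::size_t tileRow, std::size_t tileCol,
                             std::size_t radius, int fill) {
  assert(image.size() == height * width);
  std::size_t const viewSize = tileSize + 2 * radius;
  std::vector<int> view(viewSize * viewSize, fill);

  // Image coordinates of the view's top-left corner (may be negative).
  auto const row0 = static_cast<std::ptrdiff_t>(tileRow * tileSize) -
                    static_cast<std::ptrdiff_t>(radius);
  auto const col0 = static_cast<std::ptrdiff_t>(tileCol * tileSize) -
                    static_cast<std::ptrdiff_t>(radius);

  for (std::size_t r = 0; r < viewSize; ++r) {
    std::ptrdiff_t const imgRow = row0 + static_cast<std::ptrdiff_t>(r);
    if (imgRow < 0 || imgRow >= static_cast<std::ptrdiff_t>(height)) {
      continue;  // whole row is outside the image: keep the fill value
    }
    for (std::size_t c = 0; c < viewSize; ++c) {
      std::ptrdiff_t const imgCol = col0 + static_cast<std::ptrdiff_t>(c);
      if (imgCol < 0 || imgCol >= static_cast<std::ptrdiff_t>(width)) {
        continue;  // outside the image: keep the fill value
      }
      view[r * viewSize + c] =
          image[static_cast<std::size_t>(imgRow) * width +
                static_cast<std::size_t>(imgCol)];
    }
  }
  return view;
}
```

With a 2x2 tile and a radius of 1, the resulting view is 4x4: the 2x2 centre is the tile itself, and the surrounding ring holds either neighbouring pixels or the fill value.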
View memory layout
Images and views have their dimensions ordered from the least to the most dense.
If the image uses multiple channels (RGB images, for example), the channel number is considered a dimension.
In the following image the dimensions, in order, are depth (number of layers), height (number of rows), and width (number of columns).
By default, the radii (halo size) are applied to all dimensions (including the depth and the channels).
It is possible to define the radius per dimension.
This image shows the memory layout for two views that have the same height and width but different depths.

In memory, the data is stored contiguously from the least to the most dense dimension; in this example, depth, then height, then width.

In terms of performance, if the last dimension has a size of 1 (the channel number in a grayscale image, for example), it is better not to declare this dimension. Performance-wise, a planar grayscale image is better treated as a 2D image than as a 3D image. The size of the last dimension is the unit size of the copies between buffers that FastLoader performs to create the tiles and views.
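The layout above corresponds to row-major indexing, where the last (most dense) dimension has a stride of 1. The following sketch shows how a position maps to an offset in the flat buffer; `strides` and `flatIndex` are hypothetical helper names used only for illustration, not FastLoader functions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stride (in elements) of each dimension for a buffer stored from the least
// to the most dense dimension. The last dimension always has stride 1.
// Hypothetical helper, not part of the FastLoader API.
std::vector<std::size_t> strides(std::vector<std::size_t> const &dims) {
  std::vector<std::size_t> result(dims.size(), 1);
  for (std::size_t d = dims.size(); d-- > 1;) {
    result[d - 1] = result[d] * dims[d];
  }
  return result;
}

// Offset of the element at `position` in the flat buffer.
std::size_t flatIndex(std::vector<std::size_t> const &dims,
                      std::vector<std::size_t> const &position) {
  assert(dims.size() == position.size());
  std::vector<std::size_t> const s = strides(dims);
  std::size_t offset = 0;
  for (std::size_t d = 0; d < dims.size(); ++d) {
    assert(position[d] < dims[d]);
    offset += position[d] * s[d];
  }
  return offset;
}
```

For dimensions (depth, height, width) = (2, 3, 4), the strides are (12, 4, 1): a run of 4 consecutive elements is one row, which is why a last dimension of size 1 would reduce every contiguous run to a single element.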
Architecture
The Fast Loader architecture shows how the system works and interacts with the algorithm. First, it works asynchronously from the algorithm. Second, each part of Fast Loader runs on different threads. Finally, the memory manager guarantees that the amount of memory used stays within the requested limit.
When an algorithm asks for n views through View Requests, Fast Loader uses each View Request to build a view and makes it available as soon as possible. The algorithm can then use the view and release it. In the meantime, if enough memory is available, another view is created.
View creation goes through three steps:
- View Loader: Requests memory from the memory manager, splits the View Request into n tile requests, and sends them to the Tile Loader.
- Tile Loader: Specific to the file format Fast Loader accesses. It asks the Tile Cache for the tile; if the tile is not available, it is loaded from the file, then cast and copied to the Tile Cache. From the cache, only the relevant part of the tile is copied to the View.
- View Counter: Waits until the view is fully loaded from the file's parts, then builds the ghost region if needed and sends the complete view to the algorithm.
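The splitting step performed by the View Loader can be sketched as follows. This is an illustration under assumptions, not FastLoader's internal code: `overlappedTiles` and `tileRequests` are hypothetical names, the same radius is applied to every dimension, and the view is clipped to the file dimensions before the overlapped tiles are enumerated.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct TileRange {
  std::size_t first;  // first overlapped tile index (inclusive)
  std::size_t last;   // last overlapped tile index (inclusive)
};

// Tiles overlapped, along one dimension, by the view centred on `tileIndex`.
// Hypothetical helper, not part of the FastLoader API.
TileRange overlappedTiles(std::size_t tileIndex, std::size_t tileDim,
                          std::size_t radius, std::size_t fullDim) {
  std::size_t const begin = tileIndex * tileDim;
  assert(tileDim > 0 && begin < fullDim);
  std::size_t const end = std::min(begin + tileDim, fullDim);
  std::size_t const viewBegin = begin >= radius ? begin - radius : 0;
  std::size_t const viewEnd = std::min(end + radius, fullDim);
  return {viewBegin / tileDim, (viewEnd - 1) / tileDim};
}

// Indices of every tile needed to fill the view centred on `centralTile`.
// The last dimension varies fastest in the returned list.
std::vector<std::vector<std::size_t>> tileRequests(
    std::vector<std::size_t> const &centralTile,
    std::vector<std::size_t> const &tileDims,
    std::vector<std::size_t> const &fullDims, std::size_t radius) {
  std::size_t const n = centralTile.size();
  assert(tileDims.size() == n && fullDims.size() == n);
  std::vector<std::vector<std::size_t>> requests;
  if (n == 0) { return requests; }

  std::vector<TileRange> ranges;
  std::vector<std::size_t> current;
  for (std::size_t d = 0; d < n; ++d) {
    ranges.push_back(
        overlappedTiles(centralTile[d], tileDims[d], radius, fullDims[d]));
    current.push_back(ranges[d].first);
  }
  while (true) {
    requests.push_back(current);
    // Advance `current` like an odometer, last dimension first.
    std::size_t d = n;
    while (d > 0) {
      --d;
      if (current[d] < ranges[d].last) { ++current[d]; break; }
      current[d] = ranges[d].first;
      if (d == 0) { return requests; }
    }
  }
}
```

For a 12x12 file with 4x4 tiles, a view around tile (1, 1) with a radius of 1 touches all nine tiles, whereas a radius of 0 touches only the central one.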
The number of Views created at a given point in time by FastLoader is managed by the ViewWaiter. It is attached to a memory manager, which throttles the number of views produced.
The following figure shows the Fast Loader graph and its interaction with the algorithm:
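The throttling idea can be illustrated with a small counting budget. This is only a sketch of the general technique, not the memory manager used by FastLoader or Hedgehog: the class `ViewBudget` and its methods are hypothetical, and the budget is counted in views rather than in bytes.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>

// Hypothetical stand-in for a memory manager: at most `capacity` views can be
// checked out at any time. Producers wait until the algorithm releases one.
class ViewBudget {
 public:
  explicit ViewBudget(std::size_t capacity) : available_(capacity) {
    assert(capacity > 0);
  }

  // Blocks until a view slot is free, then takes it.
  void acquire() {
    std::unique_lock<std::mutex> lock(mutex_);
    freed_.wait(lock, [this] { return available_ > 0; });
    --available_;
  }

  // Non-blocking variant: returns false when the budget is exhausted.
  bool tryAcquire() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (available_ == 0) { return false; }
    --available_;
    return true;
  }

  // Called when the algorithm releases a view; wakes one waiting producer.
  void release() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      ++available_;
    }
    freed_.notify_one();
  }

 private:
  std::mutex mutex_;
  std::condition_variable freed_;
  std::size_t available_;
};
```

With a budget of two views, a third request is refused (or blocks, with `acquire`) until one of the first two views is released, which bounds the memory held by views regardless of the file size.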

The Adaptive FastLoader graph follows the same logic as the FastLoader graph while providing an added feature. The file has a physical structure: its tiling is fixed. The Adaptive FastLoader allows Views to be requested from the file using a different tile size, called the logical size. Another cache stores logical tiles temporarily, which avoids unnecessary extra accesses to the file.
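The relationship between logical and physical tiles can be sketched along a single dimension. This is an assumption-laden illustration rather than the Adaptive FastLoader's code: `physicalTilesFor` and `TileSpan` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

struct TileSpan {
  std::size_t first;  // first physical tile index (inclusive)
  std::size_t last;   // last physical tile index (inclusive)
};

// Physical tiles that must be read to build one logical tile, along one
// dimension. Hypothetical helper, not part of the FastLoader API.
TileSpan physicalTilesFor(std::size_t logicalIndex, std::size_t logicalSize,
                          std::size_t physicalSize, std::size_t fullSize) {
  assert(logicalSize > 0 && physicalSize > 0);
  std::size_t const begin = logicalIndex * logicalSize;  // first pixel covered
  assert(begin < fullSize);
  std::size_t const end = std::min(begin + logicalSize, fullSize);  // one past
  return {begin / physicalSize, (end - 1) / physicalSize};
}
```

With a physical tile size of 16 and a logical tile size of 10, logical tiles 0 and 1 both need physical tile 0. Keeping recently used tiles in a cache lets the second request reuse data that was already read, which is the kind of repeated file access the extra cache is meant to avoid.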
Steps to Programming with Fast Loader
Linking Fast Loader
Fast Loader can be easily linked to any C++20 compliant code using CMake. Add the path to the folder FastImageDirectory/cmake-modules/ to the CMAKE_MODULE_PATH variable in your CMakeLists.txt. Then add the following lines in your CMakeLists.txt:
find_package(FastLoader REQUIRED)
target_link_libraries(TARGET ${FastLoader_LIBRARIES} ${Hedgehog_LIBRARIES})
target_include_directories(TARGET PUBLIC ${FastLoader_INCLUDE_DIR} ${Hedgehog_INCLUDE_DIR})
API overview
Three APIs exist in Fast Loader:
- The [Adaptive]FastLoaderGraph object to access views of an image
- The View object to access pixel/data in the View
- Tile Loader
How to create a Tile Loader ? How to access a specific file ?
To access a new file format, a specific Tile Loader is needed. A specific Tile Loader class inherits from the class AbstractTileLoader.
The following methods need to be implemented:
// Constructor
AbstractTileLoader(std::string const &name, std::filesystem::path filePath, size_t const nbThreads = 1)
// Copy function to duplicate the Tile Loader into n threads
virtual std::shared_ptr<AbstractTileLoader> copyTileLoader() = 0;
// Basic file information getter
[[nodiscard]] virtual size_t nbDims() const = 0;
[[nodiscard]] virtual size_t nbPyramidLevels() const = 0;
[[nodiscard]] virtual std::vector<std::string> const &dimNames() const = 0;
[[nodiscard]] virtual std::vector<size_t> const &fullDims([[maybe_unused]] std::size_t level) const = 0;
[[nodiscard]] virtual std::vector<size_t> const &tileDims([[maybe_unused]] std::size_t level) const = 0;
float downScaleFactor([[maybe_unused]] uint32_t level) [optional]
// Load a specific tile from the file; the tile buffer has already been allocated.
virtual void loadTileFromFile(std::shared_ptr<std::vector<DataType>> tile, std::vector<size_t> const &index, size_t level) = 0;
Here is an example Tile Loader for Grayscale Tiled 2D Tiff:
#include <tiffio.h>
#include "fast_loader/fast_loader.h"
template<class DataType>
class GrayscaleTiffTileLoader : public fl::AbstractTileLoader<fl::DefaultView<DataType>> {
TIFF *
tiff_ = nullptr; ///< Tiff file pointer
std::vector<size_t>
fullDims_{}, ///< File dimensions to the least to most dense
tileDims_{}; ///< Tile dimensions to the least to most dense
std::vector<std::string>
dimNames_{}; ///< Dimensions name to the least to most dense
short
sampleFormat_ = 0, ///< Sample format as defined by libtiff
bitsPerSample_ = 0; ///< Bit Per Sample as defined by libtiff
public:
/// @brief GrayscaleTiffTileLoader unique constructor
/// @param numbe
