FastChwHwcConverter

A high-performance, header-only C++ library for converting image tensors between HWC and CHW layouts. It leverages OpenMP (or C++ threads / oneTBB) parallel processing for multi-core CPU performance, plus optimized CUDA/HIP implementations for NVIDIA and AMD GPUs, delivering high speed and scalability across diverse hardware platforms.


Overview

Multi-Core CPU Implementation (C++ Thread / OpenMP / oneTBB)

FastChwHwcConverter.hpp is a high-performance, multi-threaded, header-only C++ library for converting image data formats between HWC (Height, Width, Channels) and CHW (Channels, Height, Width). It leverages C++ STL Thread / OpenMP / Intel oneTBB for parallel processing, utilizing all CPU cores for maximum performance.

Note: If the build environment cannot find OpenMP, or you set USE_OPENMP to OFF, the library falls back to C++ thread mode.

GPU Acceleration (NVIDIA CUDA)

FastChwHwcConverterCuda.hpp is a high-performance, GPU-accelerated library for converting image data formats between HWC and CHW, supporting CUDA 10.0 and later. It requires no CUDA SDK installation, header files, or static linking: the library automatically searches the system path for CUDA's dynamic link libraries, loads them at runtime, and resolves the functions it needs.

Note: If your operating environment does not support CUDA or does not meet the conditions for CUDA acceleration, processing automatically falls back to the CPU (OpenMP / C++ Thread / Intel oneTBB). The functions accept both CUDA device memory and host memory pointers.

GPU Acceleration (AMD ROCm)

FastChwHwcConverterROCm.hpp is a high-performance, GPU-accelerated library for converting image data formats between HWC and CHW, supporting ROCm 5.0 and later. Like the CUDA library, it requires no ROCm (HIP) SDK, header files, or static linking, and dynamically loads ROCm libraries from the system path.

Note: If your operating environment does not support ROCm or does not meet the conditions for ROCm acceleration, processing automatically falls back to the CPU (OpenMP / C++ Thread / Intel oneTBB). The functions accept both ROCm device memory and host memory pointers.

Most similar conversion code you will find in other projects on GitHub only achieves performance close to single-threaded execution speed.

The difference between CHW and HWC

Let's consider a 2x2 image with three channels (RGB).

  • Example Image Data:
    Pixel 1 (R, G, B)    Pixel 2 (R, G, B)
    Pixel 3 (R, G, B)    Pixel 4 (R, G, B)
    
    We can store this image data in two different formats: CHW (Channel-Height-Width) and HWC (Height-Width-Channel).

CHW Format

CHW Format: In this format, the data is stored channel by channel. First, all the red channel data, then all the green channel data, and finally all the blue channel data.

For example (2x2 RGB Image):

RRRRGGGGBBBB

Mapping to the actual pixel positions:

R1, R2, R3, R4, G1, G2, G3, G4, B1, B2, B3, B4

HWC Format

HWC Format: In this format, the data is stored by each pixel's channels in sequence. So, the RGB data for each pixel is stored together.

For example (2x2 RGB Image):

RGBRGBRGBRGB

Mapping to the actual pixel positions:

(R1, G1, B1), (R2, G2, B2), (R3, G3, B3), (R4, G4, B4)

Why Convert Between HWC and CHW Formats?

The conversion between HWC (Height-Width-Channel) and CHW (Channel-Height-Width) formats is crucial for optimizing image processing tasks. Different machine learning frameworks and libraries have varying data format preferences. For instance, many deep learning frameworks, such as PyTorch, prefer the CHW format, while libraries like OpenCV often use the HWC format. By converting between these formats, we ensure compatibility and efficient data handling, enabling seamless transitions between different processing pipelines and maximizing performance for specific tasks. This flexibility enhances the overall efficiency and effectiveness of image processing and machine learning workflows.

Features

  • High-Performance: Utilizes C++ Thread / OpenMP / Intel oneTBB for parallel processing, making full use of all CPU cores.
  • GPU Optimization: Fully leverages NVIDIA CUDA and AMD ROCm technologies to harness the computational power of GPUs, accelerating intensive workloads.
  • Header-Only: Include only a single header file. Easy to integrate into your C/C++ project.
  • Flexible: Supports scaling, clamping, and normalization of image data, for any data type.
  • Lightweight & SDK-Free: No dependency on external SDKs such as the CUDA SDK or HIP SDK. The project requires no additional header files or static library linkage, making it clean and easy to deploy.

Installation

for CPU (C++ Thread)

Simply include the header file FastChwHwcConverter.hpp in your project. To build the benchmarks with the pure C++ thread backend:

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DUSE_OPENMP=OFF -DUSE_TBB=OFF -DBUILD_BENCHMARK=ON -DBUILD_CUDA_BENCHMARK=OFF -DBUILD_ROCM_BENCHMARK=OFF -DBUILD_EXAMPLE=OFF -DBUILD_EXAMPLE_OPENCV=OFF

cmake --build build --config Release

for CPU (OpenMP)

OpenMP is an API that supports multi-platform shared-memory multiprocessing programming on many platforms, instruction-set architectures, and operating systems. OpenMP uses a portable, scalable model that gives programmers a simple and flexible interface for developing parallel applications for platforms ranging from the standard desktop computer to the supercomputer.

  • Option 1:

    cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DUSE_OPENMP=ON -DUSE_TBB=OFF -DBUILD_BENCHMARK=ON -DBUILD_CUDA_BENCHMARK=OFF -DBUILD_ROCM_BENCHMARK=OFF -DBUILD_EXAMPLE=OFF -DBUILD_EXAMPLE_OPENCV=OFF
    
    cmake --build build --config Release
    
  • Option 2:

    Simply include the header file FastChwHwcConverter.hpp in your project. Before including it, define the macro #define USE_OPENMP 1.

for CPU (oneTBB)

Intel oneTBB (Intel® oneAPI Threading Building Blocks) is an advanced threading and memory-management template library that simplifies parallelism. It is part of the Intel® oneAPI Base Toolkit.

  • Option 1:

    cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DUSE_OPENMP=OFF -DUSE_TBB=ON -DTBB_DIR=D:/extlibs/oneAPI/tbb/2021.13/lib/cmake/tbb -DBUILD_BENCHMARK=ON -DBUILD_CUDA_BENCHMARK=OFF -DBUILD_ROCM_BENCHMARK=OFF -DBUILD_EXAMPLE=OFF -DBUILD_EXAMPLE_OPENCV=OFF
    
    cmake --build build --config Release
    
  • Option 2:

    Simply include the header file FastChwHwcConverter.hpp in your project. Before including it, define the macro #define USE_TBB 1.

for GPU (CUDA or ROCm)

NVIDIA CUDA Official Website

AMD ROCm Official Website

  • Option 1:

    cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DUSE_OPENMP=OFF -DUSE_TBB=OFF -DBUILD_BENCHMARK=ON -DBUILD_CUDA_BENCHMARK=ON -DBUILD_ROCM_BENCHMARK=ON -DBUILD_EXAMPLE=ON -DBUILD_EXAMPLE_OPENCV=ON
    
    cmake --build build --config Release
    
  • Option 2:

    Simply include the header file FastChwHwcConverterCuda.hpp or FastChwHwcConverterROCm.hpp in your project:

    #include "FastChwHwcConverterCuda.hpp"
    
    #include "FastChwHwcConverterROCm.hpp"
    

Usually you also need to copy the runtime-compiler libraries from the CUDA/ROCm runtime SDK into the executable's directory, or set the CUDA/ROCm SDK home as a system environment variable: nvrtc64_***_0.dll and nvrtc-builtins64_*** (Windows CUDA); hiprtc****.dll, hiprtc-builtins****.dll, and amd_comgr****.dll (Windows ROCm); libnvrtc.so (Linux CUDA); or libhiprtc.so (Linux ROCm).

In addition, you need to download and install the latest driver from the NVIDIA drivers website or the AMD drivers website, because this project dynamically loads the driver library: nvcuda.dll (Windows CUDA), amdhip64_6.dll (Windows ROCm), libcuda.so (Linux CUDA), or libamdhip64.so (Linux ROCm).

Requirements

  • C++17 or later
  • OpenMP support (optional, set USE_OPENMP to ON for high performance)
  • oneTBB support (optional, set USE_TBB to ON and set valid TBB_LIBS for Intel oneTBB's high performance)
  • CMake v3.10 or later (optional)
  • OpenCV v4.0 or later (optional, if BUILD_EXAMPLE_OPENCV is ON)
  • CUDA 11.2+ driver (optional, if you want to use CUDA acceleration, and NVIDIA GPU's compute capability >
