FastChwHwcConverter

A high-performance, header-only C++ library for converting image tensors between HWC and CHW layouts. It leverages OpenMP (or C++ threads / oneTBB) parallel processing for multi-core CPU performance, plus optimized CUDA/HIP implementations for NVIDIA and AMD GPUs, delivering high speed and scalability across diverse hardware platforms.


Overview

Multi-Core CPU Implementation (C++ Thread / OpenMP / oneTBB)

FastChwHwcConverter.hpp is a high-performance, multi-threaded, header-only C++ library for converting image data formats between HWC (Height, Width, Channels) and CHW (Channels, Height, Width). It leverages C++ STL Thread / OpenMP / Intel oneTBB for parallel processing, utilizing all CPU cores for maximum performance.

Note: If the build environment cannot find OpenMP, or you set USE_OPENMP to OFF, the library falls back to C++ thread mode.

GPU Acceleration (NVIDIA CUDA)

FastChwHwcConverterCuda.hpp is a high-performance, GPU-accelerated library for converting image data formats between HWC and CHW, supporting CUDA 10.0 and later. It requires no CUDA SDK installation, header files, or static linking: the library automatically searches the system path for CUDA's dynamic link libraries, loads them at runtime, and resolves the functions it needs.

Note: If your operating environment does not support CUDA or does not meet the conditions for CUDA acceleration, processing automatically falls back to the CPU (OpenMP / C++ Thread / Intel oneTBB). The functions accept both CUDA device memory and host memory pointers.

GPU Acceleration (AMD ROCm)

FastChwHwcConverterROCm.hpp is a high-performance, GPU-accelerated library for converting image data formats between HWC and CHW, supporting ROCm 5.0 and later. Like the CUDA library, it requires no ROCm (HIP) SDK, header files, or static linking, and dynamically loads ROCm libraries from the system path.

Note: If your operating environment does not support ROCm or does not meet the conditions for ROCm acceleration, processing automatically falls back to the CPU (OpenMP / C++ Thread / Intel oneTBB). The functions accept both ROCm device memory and host memory pointers.

Most similar conversion code you will find in other projects on GitHub only achieves performance close to single-threaded execution speed.

The difference between CHW and HWC

Let's consider a 2x2 image with three channels (RGB).

  • Example Image Data:
    Pixel 1 (R, G, B)    Pixel 2 (R, G, B)
    Pixel 3 (R, G, B)    Pixel 4 (R, G, B)
    
    We can store this image data in two different formats: CHW (Channel-Height-Width) and HWC (Height-Width-Channel).

CHW Format

CHW Format: In this format, the data is stored channel by channel. First, all the red channel data, then all the green channel data, and finally all the blue channel data.

For example (2x2 RGB Image):

RRRRGGGGBBBB

Mapping to the actual pixel positions:

R1, R2, R3, R4, G1, G2, G3, G4, B1, B2, B3, B4

HWC Format

HWC Format: In this format, the data is stored by each pixel's channels in sequence. So, the RGB data for each pixel is stored together.

For example (2x2 RGB Image):

RGBRGBRGBRGB

Mapping to the actual pixel positions:

(R1, G1, B1), (R2, G2, B2), (R3, G3, B3), (R4, G4, B4)

Why Convert Between HWC and CHW Formats?

The conversion between HWC (Height-Width-Channel) and CHW (Channel-Height-Width) formats is crucial for optimizing image processing tasks. Different machine learning frameworks and libraries have varying data format preferences. For instance, many deep learning frameworks, such as PyTorch, prefer the CHW format, while libraries like OpenCV often use the HWC format. By converting between these formats, we ensure compatibility and efficient data handling, enabling seamless transitions between different processing pipelines and maximizing performance for specific tasks. This flexibility enhances the overall efficiency and effectiveness of image processing and machine learning workflows.

Features

  • High-Performance: Utilizes C++ Thread / OpenMP / Intel oneTBB for parallel processing, making full use of all CPU cores.
  • GPU Optimization: Fully leverages NVIDIA CUDA and AMD ROCm technologies to harness the computational power of GPUs, accelerating intensive workloads.
  • Header-Only: Include only a single header file. Easy to integrate into your C/C++ project.
  • Flexible: Supports scaling, clamping, and normalization of image data, for any data type.
  • Lightweight & SDK-Free: No dependency on external SDKs such as the CUDA SDK or HIP SDK. The project requires no additional header files or static library linkage, making it clean and easy to deploy.

Installation

for CPU (C++ Thread)

Simply include the header file FastChwHwcConverter.hpp in your project. To build the benchmarks with the pure C++ thread backend:

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DUSE_OPENMP=OFF -DUSE_TBB=OFF -DBUILD_BENCHMARK=ON -DBUILD_CUDA_BENCHMARK=OFF -DBUILD_ROCM_BENCHMARK=OFF -DBUILD_EXAMPLE=OFF -DBUILD_EXAMPLE_OPENCV=OFF

cmake --build build --config Release

for CPU (OpenMP)

OpenMP is an API that supports multi-platform shared-memory multiprocessing programming on many platforms, instruction-set architectures, and operating systems. OpenMP uses a portable, scalable model that gives programmers a simple and flexible interface for developing parallel applications for platforms ranging from the standard desktop computer to the supercomputer.

  • Option 1:

    cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DUSE_OPENMP=ON -DUSE_TBB=OFF -DBUILD_BENCHMARK=ON -DBUILD_CUDA_BENCHMARK=OFF -DBUILD_ROCM_BENCHMARK=OFF -DBUILD_EXAMPLE=OFF -DBUILD_EXAMPLE_OPENCV=OFF
    
    cmake --build build --config Release
    
  • Option 2:

    Simply include the header file FastChwHwcConverter.hpp in your project. Before including it, define the macro #define USE_OPENMP 1.

for CPU (oneTBB)

Intel oneTBB (Intel® oneAPI Threading Building Blocks) is an advanced threading and memory-management template library that simplifies parallelism. It is part of the Intel® oneAPI Base Toolkit.

  • Option 1:

    cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DUSE_OPENMP=OFF -DUSE_TBB=ON -DTBB_DIR=D:/extlibs/oneAPI/tbb/2021.13/lib/cmake/tbb -DBUILD_BENCHMARK=ON -DBUILD_CUDA_BENCHMARK=OFF -DBUILD_ROCM_BENCHMARK=OFF -DBUILD_EXAMPLE=OFF -DBUILD_EXAMPLE_OPENCV=OFF
    
    cmake --build build --config Release
    
  • Option 2:

    Simply include the header file FastChwHwcConverter.hpp in your project. Before including it, define the macro #define USE_TBB 1.

for GPU (CUDA or ROCm)

NVIDIA CUDA Official Website

AMD ROCm Official Website

  • Option 1:

    cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DUSE_OPENMP=OFF -DUSE_TBB=OFF -DBUILD_BENCHMARK=ON -DBUILD_CUDA_BENCHMARK=ON -DBUILD_ROCM_BENCHMARK=ON -DBUILD_EXAMPLE=ON -DBUILD_EXAMPLE_OPENCV=ON
    
    cmake --build build --config Release
    
  • Option 2:

    Simply include the header file FastChwHwcConverterCuda.hpp or FastChwHwcConverterROCm.hpp in your project:

    #include "FastChwHwcConverterCuda.hpp"
    
    #include "FastChwHwcConverterROCm.hpp"
    

Usually you also need to copy the runtime-compiler libraries from the CUDA/ROCm runtime SDK into the executable's directory, or set the CUDA/ROCm SDK home as a system environment variable: nvrtc64_***_0.dll and nvrtc-builtins64_*** (Windows CUDA); hiprtc****.dll, hiprtc-builtins****.dll, and amd_comgr****.dll (Windows ROCm); libnvrtc.so (Linux CUDA); or libhiprtc.so (Linux ROCm).

In addition, you need to download and install the latest driver from the NVIDIA drivers website or the AMD drivers website, because this project dynamically loads the driver library: nvcuda.dll (Windows CUDA), amdhip64_6.dll (Windows ROCm), libcuda.so (Linux CUDA), or libamdhip64.so (Linux ROCm).

Requirements

  • C++17 or later
  • OpenMP support (optional, set USE_OPENMP to ON for high performance)
  • oneTBB support (optional, set USE_TBB to ON and set valid TBB_LIBS for Intel oneTBB's high performance)
  • CMake v3.10 or later (optional)
  • OpenCV v4.0 or later (optional, if BUILD_EXAMPLE_OPENCV is ON)
  • CUDA 11.2+ driver (optional, if you want to use CUDA acceleration, and NVIDIA GPU's compute capability >
