Comb

Comb is a communication performance benchmarking tool.

Comb v0.3.1

Comb is a communication performance benchmarking tool. It is used to determine performance tradeoffs in implementing communication patterns on high-performance computing (HPC) platforms. At its core, Comb runs combinations of communication patterns, execution patterns, and memory spaces in order to find efficient combinations. The current set of capabilities Comb provides includes:

Configurable structured mesh halo exchange communication.
A variety of communication patterns based on grouping messages.
A variety of execution patterns including serial, openmp threading, cuda, and cuda fused kernels.
A variety of memory spaces including default system allocated memory, pinned host memory, cuda device memory, and cuda managed memory with different cuda memory advice.

Comb is very much a work in progress. Additional features will appear in future releases.

Quick Start

The Comb code lives in a GitHub repository. To clone the repo, use the command:

git clone --recursive https://github.com/llnl/comb.git

On an LC system, you can build Comb using the provided cmake scripts and host-configs.

./scripts/lc-builds/blueos_nvcc_gcc.sh 10.1.243 sm_70 8.3.1
cd build_lc_blueos-nvcc10.1.243-sm_70-gcc8.3.1
make

You can also create your own script and host-config provided you have a C++ compiler that supports the C++11 standard, an MPI library with compiler wrapper, and optionally an install of cuda 9.0 or later.

./scripts/my-builds/compiler_version.sh
cd build_my_compiler_version
make

To run the basic tests, create a run directory and symlink the comb executable and the scripts into it. The scripts expect a symlink to comb to exist in the run directory. With the arguments shown below, the run_tests.bash script runs the basic_tests.bash script on 2^3 = 8 processes.

ln -s /path/to/comb/build_my_compiler_version/bin/comb .
ln -s /path/to/comb/scripts/* .
./run_tests.bash 2 basic_tests.bash

User Documentation

Minimal documentation is available.

Comb runs every enabled combination of execution pattern and memory space. Each rank prints its results to stdout. The sep_out.bash script may be used to simplify data collection by piping the output of each rank into a different file. The combine_output.lua Lua script may be used to simplify data aggregation from multiple files.
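As a rough illustration of what "every combination" means, the sketch below enumerates pairs of execution pattern and memory space. It is not Comb's code: the function enumerate_combinations and the two example lists are hypothetical, the names in the lists are simply taken from the -exec and -memory option lists, and any filtering Comb applies to incompatible pairs is ignored.

```python
from itertools import product

# Example selections only; what is actually enabled depends on the
# cmake configuration and the -exec / -memory command line options.
enabled_exec = ["seq", "omp", "cuda"]
enabled_memory = ["host", "cuda_hostpinned", "cuda_device"]


def enumerate_combinations(exec_patterns, memory_spaces):
    """Return every (execution pattern, memory space) pair, in order."""
    return list(product(exec_patterns, memory_spaces))


if __name__ == "__main__":
    for exec_name, mem_name in enumerate_combinations(enabled_exec, enabled_memory):
        print(f"exec={exec_name} memory={mem_name}")
```

The number of pairs grows as the product of the two list lengths, which is why disabling unneeded execution patterns or memory spaces shortens a run.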

Comb uses a variety of manual packing/unpacking execution techniques such as sequential, openmp, and cuda. Comb also uses MPI_Pack/MPI_Unpack with MPI derived datatypes for packing/unpacking. (Note: tests using cuda managed memory and MPI datatypes are disabled because they sometimes produce incorrect results.)
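To make "manual packing/unpacking" concrete, here is a minimal sequential sketch of the general technique: gather a box of a 3D array into a contiguous buffer, then scatter a buffer into another box. It is an illustration under assumptions, not Comb's implementation. The helper names (box_indices, pack, unpack), the row-major layout, and the single-process periodic copy in the usage section are all hypothetical.

```python
def box_indices(dims, lo, hi):
    """Flat row-major indices of the box lo <= (i, j, k) < hi in a dims-sized array."""
    nx, ny, nz = dims
    return [
        (i * ny + j) * nz + k
        for i in range(lo[0], hi[0])
        for j in range(lo[1], hi[1])
        for k in range(lo[2], hi[2])
    ]


def pack(data, indices):
    """Gather the listed elements into a contiguous message buffer."""
    return [data[idx] for idx in indices]


def unpack(data, indices, buffer):
    """Scatter a received buffer back into the listed elements."""
    for idx, value in zip(indices, buffer):
        data[idx] = value


if __name__ == "__main__":
    ghost, n = 1, 4                      # halo width and interior extent
    dims = (n + 2 * ghost,) * 3          # allocated extent including ghost zones
    data = [float(v) for v in range(dims[0] * dims[1] * dims[2])]
    # Last interior layer in x is copied into the low-x ghost layer,
    # standing in for a periodic exchange with a neighbor.
    send = box_indices(dims, (n, ghost, ghost), (n + ghost, n + ghost, n + ghost))
    recv = box_indices(dims, (0, ghost, ghost), (ghost, n + ghost, n + ghost))
    buffer = pack(data, send)
    unpack(data, recv, buffer)
    print(len(buffer), "elements exchanged")
```

In a real exchange the buffer would be sent to a neighboring rank between pack and unpack. The gather and scatter loops are the parts that an openmp or cuda execution pattern would run in parallel.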

Comb creates a different MPI communicator for each test. This communicator is assigned a generic name unless MPI datatypes are used for packing and unpacking. When MPI datatypes are used, the name of the memory allocator is appended to the communicator name.

Configure Options

The cmake configuration options change which execution patterns and memory spaces are enabled.

ENABLE_MPI      Allow use of mpi and enable test combinations using mpi
ENABLE_OPENMP   Allow use of openmp and enable test combinations using openmp
ENABLE_CUDA     Allow use of cuda and enable test combinations using cuda
ENABLE_RAJA     Allow use of RAJA performance portability library
ENABLE_CALIPER  Allow use of the Caliper performance profiling library
ENABLE_ADIAK    Allow use of the Adiak library for recording program metadata

Runtime Options

The runtime options change the properties of the grid and its decomposition, as well as the communication pattern used.

#_#_# Grid size in each dimension (Required)
-divide #_#_# Number of subgrids in each dimension (Required)
-periodic #_#_# Periodicity in each dimension
-ghost #_#_# The halo width or number of ghost zones in each dimension
-vars # The number of grid variables
-comm option Communication options
- cutoff # Number of elements cutoff between large and small message packing kernels
- enable|disable option Enable or disable specific message passing execution policies
  - all all message passing execution patterns
  - mock mock message passing execution pattern (do not communicate)
  - mpi mpi message passing execution pattern
  - gdsync libgdsync message passing execution pattern (experimental)
  - gpump libgpump message passing execution pattern
  - mp libmp message passing execution pattern (experimental)
  - umr umr message passing execution pattern (experimental)
- post_recv option Communication post receive (MPI_Irecv) options
  - wait_any Post recvs one-by-one
  - wait_some Post recvs in groups
  - wait_all Post all recvs
  - test_any Post recvs one-by-one
  - test_some Post recvs in groups
  - test_all Post all recvs
- post_send option Communication post send (MPI_Isend) options
  - wait_any pack and send messages one-by-one
  - wait_some pack messages then send them in groups
  - wait_all pack all messages then send them all
  - test_any pack messages asynchronously and send when ready
  - test_some pack multiple messages asynchronously and send when ready
  - test_all pack all messages asynchronously and send when ready
- wait_recv option Communication wait to recv and unpack (MPI_Wait, MPI_Test) options
  - wait_any recv and unpack messages one-by-one (MPI_Waitany)
  - wait_some recv messages then unpack them in groups (MPI_Waitsome)
  - wait_all recv all messages then unpack them all (MPI_Waitall)
  - test_any recv and unpack messages one-by-one (MPI_Testany)
  - test_some recv messages then unpack them in groups (MPI_Testsome)
  - test_all recv all messages then unpack them all (MPI_Testall)
- wait_send option Communication wait on sends (MPI_Wait, MPI_Test) options
  - wait_any Wait for each send to complete one-by-one (MPI_Waitany)
  - wait_some Wait for all sends to complete in groups (MPI_Waitsome)
  - wait_all Wait for all sends to complete (MPI_Waitall)
  - test_any Wait for each send to complete one-by-one by polling (MPI_Testany)
  - test_some Wait for all sends to complete in groups by polling (MPI_Testsome)
  - test_all Wait for all sends to complete by polling (MPI_Testall)
- allow|disallow option Allow or disallow specific communications options
  - per_message_pack_fusing Combine packing/unpacking kernels for boundaries communicated in the same message
  - message_group_pack_fusing Fuse packing/unpacking kernels across messages (and variables) in the same message group
-cycles # Number of times the communication pattern is tested
-omp_threads # Number of openmp threads requested
-exec option Execution options
- enable|disable option Enable or disable specific execution patterns
  - all all execution patterns
  - seq sequential CPU execution pattern
  - omp openmp threaded CPU execution pattern
  - cuda cuda GPU execution pattern
  - cuda_graph cuda GPU batched via cuda graph API execution pattern
  - hip hip GPU execution pattern
  - raja_seq RAJA sequential CPU execution pattern
  - raja_omp RAJA openmp threaded CPU execution pattern
  - raja_cuda RAJA cuda GPU execution pattern
  - raja_hip RAJA hip GPU execution pattern
  - mpi_type MPI datatypes MPI implementation execution pattern
-memory option Memory space options
- UseType enable|disable Optional UseType modifier for enable|disable, default is all. UseType specifies what uses to enable|disable, for example "-memory buffer disable cuda_pinned" disables cuda_pinned buffer allocations.
  - all all use types
  - mesh mesh use type
  - buffer buffer use type
- enable|disable option Enable or disable specific memory spaces for UseType allocations
  - all all memory spaces
  - host host CPU memory space
  - cuda_hostpinned cuda pinned memory space (pooled)
  - cuda_device cuda device memory space (pooled)
  - cuda_managed cuda managed memory space (pooled)
  - cuda_managed_host_preferred cuda managed with host preferred advice memory space (pooled)
  - cuda_managed_host_preferred_device_accessed cuda managed with host preferred and device accessed advice memory space (pooled)
  - cuda_managed_device_preferred cuda managed with device preferred advice memory space (pooled)
  - cuda_managed_device_preferred_host_accessed cuda managed with device preferred and host accessed advice memory space (pooled)
  - hip_hostpinned hip pinned memory space (pooled)
  - hip_hostpinned_coarse hip coarse grained (non-coherent) pinned memory space (pooled)
  - hip_device hip device memory space (pooled)
  - hip_device_fine hip fine grained device memory space (pooled)
  - hip_managed hip managed memory space (pooled)
-cuda_aware_mpi Assert that you are using a cuda aware mpi implementation and enable tests that pass cuda device or managed memory to M
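
The grid, -divide, and -ghost arguments together determine how much data each exchange moves. The sketch below works through that arithmetic for the #_#_# argument format listed above. It is a back-of-the-envelope aid, not Comb's logic: parse_triple and decompose are hypothetical helpers, one subgrid per rank is assumed, and only evenly divisible grids are handled because how Comb distributes a remainder is not described here.

```python
from math import prod


def parse_triple(text):
    """Parse a '#_#_#' argument such as '100_100_100' into three ints."""
    parts = tuple(int(p) for p in text.split("_"))
    if len(parts) != 3:
        raise ValueError(f"expected #_#_#, got {text!r}")
    return parts


def decompose(grid, divide, ghost):
    """Return (subgrid count, subgrid extent, elements per face halo).

    Assumes one subgrid per rank and that every dimension divides evenly.
    """
    sizes = parse_triple(grid)
    parts = parse_triple(divide)
    widths = parse_triple(ghost)
    if any(s % p for s, p in zip(sizes, parts)):
        raise ValueError("this sketch only handles evenly divisible grids")
    local = tuple(s // p for s, p in zip(sizes, parts))
    # A face halo in dimension d is ghost[d] layers of the other two extents.
    faces = tuple(
        widths[d] * prod(local[e] for e in range(3) if e != d) for d in range(3)
    )
    return prod(parts), local, faces


if __name__ == "__main__":
    count, local, faces = decompose("100_100_100", "2_2_2", "1_1_1")
    print(count, local, faces)
```

For a 100_100_100 grid with -divide 2_2_2 and -ghost 1_1_1 this gives 8 subgrids of 50 x 50 x 50 elements, with 2500 elements in each face halo per grid variable. Multiplying by the -vars count gives the elements per face across all variables.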
