DagFlow


DAG-flow runtime.

A self-contained, minimal runtime for parallel task execution in C++20.
Designed for workloads where you know the dependency graph in advance, need predictable scheduling, and want full control over affinity, priorities, and back-pressure — without the complexity of a full TBB.

Mini-runtime for parallel tasks in C++20:

  • Work-stealing pool (Chase–Lev deques, central MPMC queues).

  • Lock-free MPMC queue (Michael–Scott) with a QSBR reclamation path.

  • A local small_function for cheap closures.

  • A lightweight DAG graph with cancellation, concurrency limits, and back-pressure.

  • High-level API in the spirit of TBB: submit/then/when_all/parallel_for via TaskScope.

Design

  • Scheduler: local deques (Chase–Lev) + "central" MPMC queues for external submissions; workers drain their own deque first, then attempt to steal from neighbors.

  • MPMC: Michael–Scott with a QSBR path (quiescent-state-based reclamation) for queue tails in the pool.

  • Synchronization: explicit memory_order (acquire/release, with seq_cst only where full ordering is needed).

  • Graph safety: workers hold the Core via shared_ptr, so there is no use-after-free even if the TaskGraph is destroyed before wait().
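The acquire/release point above can be sketched in plain C++ (an illustrative stand-alone demo, not DagFlow code): the release store publishes the plain write, and the acquire load synchronizes with it.

```cpp
#include <atomic>
#include <thread>

// The release store on `ready` publishes the earlier plain write to `value`;
// the acquire load that observes it guarantees the reader then sees 7.
int acquire_release_demo() {
  std::atomic<bool> ready{false};
  int value = 0;
  std::thread producer([&] {
    value = 7;                                     // plain write
    ready.store(true, std::memory_order_release);  // publish
  });
  while (!ready.load(std::memory_order_acquire)) {
    // spin until the producer's release store becomes visible
  }
  int seen = value;  // guaranteed to observe 7 after the acquire load
  producer.join();
  return seen;
}
```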

By default, nodes store callables in small_function<void(), 128>. If a callable is too large to fit, either increase the SSO buffer (e.g. to 256) or pack the captures into a single std::shared_ptr block.
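The shared_ptr trick can be sketched as follows (std::function stands in for small_function here, and `BigPayload` is a made-up type for illustration): a large capture goes behind one shared_ptr, so the closure itself stays pointer-sized.

```cpp
#include <array>
#include <functional>
#include <memory>

// A capture far larger than any small-buffer budget (~4 KB here).
struct BigPayload {
  std::array<int, 1024> data{};
};

std::function<int()> make_small_closure() {
  auto payload = std::make_shared<BigPayload>();
  payload->data[0] = 42;
  // The lambda captures only the shared_ptr (typically 16 bytes), so it
  // fits a small-buffer-optimized callable without an extra heap allocation.
  return [payload] { return payload->data[0]; };
}
```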

For "heavy" stages, set concurrency > 1 in ScheduleOptions/NodeOptions.

Tune Config for your CPU/NUMA topology; pin_threads = true helps cache locality and run-to-run stability.

In large pipelines, turn on back-pressure (capacity, Overflow) at the bottleneck stages.
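A back-pressure policy at a bottleneck stage might behave like the sketch below. This is a hypothetical illustration, not DagFlow's implementation: `Overflow` and the capacity parameter mirror the option names mentioned above, but the semantics here are invented for the example.

```cpp
#include <cstddef>
#include <deque>

// When the bounded buffer is full, either reject the new item (caller
// retries later) or shed the oldest queued item to make room.
enum class Overflow { Reject, DropOldest };

class BoundedStage {
 public:
  BoundedStage(std::size_t capacity, Overflow policy)
      : capacity_(capacity), policy_(policy) {}

  bool push(int item) {
    if (queue_.size() == capacity_) {
      if (policy_ == Overflow::Reject) return false;  // back-pressure signal
      queue_.pop_front();                             // drop the oldest item
    }
    queue_.push_back(item);
    return true;
  }

  std::size_t size() const { return queue_.size(); }

 private:
  std::size_t capacity_;
  Overflow policy_;
  std::deque<int> queue_;
};
```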

Benchmarks

CPU: Ryzen 7 6800H, OS: Windows 11, Build mode: Release, Compiler: MSVC x64 19.44.35214, Config: pin_threads = false

| Benchmark | Runs | Mean Time | Min Time | Max Time | Per-task cost (approx.) |
|----------------------------------------|------|------------|-----------|-----------|------------------------------------|
| Dependent chain (1000 tasks) | 5 | 0.00579 s | 0.00491 s | 0.00765 s | ~3–4 μs per node |
| Independent tasks (1000) | 5 | 1.5283 s | 1.41794 s | 1.64484 s | — |
| Independent batched (1000, batch=10) | 5 | 1.41718 s | 1.39467 s | 1.44458 s | ~7–20% faster than unbatched |
| parallel_for (1e6 elements) | 5 | 0.54729 s | 0.53645 s | 0.57182 s | — |
| Workflow (width=10, depth=5) | 5 | 0.00161 s | 0.00147 s | 0.00179 s | ~30–35 μs for the whole DAG |
| Noop tasks (1,000,000) | 5 | 4.411 s* | 4.37711 s | 4.48551 s | ~4.7–4.8 μs per task |

  • Numbers may vary with CPU governor, background load, and configuration.

Compare with TBB

| Benchmark | Problem size | TP mean (s) | TBB mean (s) | Ratio (TP / TBB, lower is better) | TP throughput | TBB throughput |
| ------------------- | --------------: | ---------------: | -----------: | -----------------: | ----------------: | ------------------: |
| Dependent chain | 1000 tasks | 0.00579 | 0.00288156 | 2.01× | ~172,712 tasks/s | ~347,034 tasks/s |
| Independent tasks | 1000 tasks | 1.52830 | 1.25051 | 1.22× | ~654.3 tasks/s | ~799.7 tasks/s |
| Independent batched | 1000 (batch=10) | 1.41718 | 1.30121 | 1.09× | ~705.6 tasks/s | ~768.5 tasks/s |
| parallel_for (old) | 1,000,000 elems | 0.54729 | 0.0268362 | 20.39× | ~1.83M elems/s | ~37.26M elems/s |
| for_each_ws (new) | 1,000,000 elems | 0.00957 | 0.0268362 | 0.36×* | ~104.5M elems/s | ~37.26M elems/s |
| Workflow (w=10, d=5) | ~50 stage ops | 0.00161 | 0.00029248 | 5.50× | — | — |
| Noop tasks | 1,000,000 tasks | 4.411 | 0.225568 | 20.75× | ~213,630 tasks/s | ~4,433,253 tasks/s |

\* A ratio below 1× means TP is faster; here for_each_ws is ~2.8× faster than TBB.

for_each versions

| for_each version | Runs | Mean Time | Min Time | Max Time | Notes / Speedup |
| ---------------------------------- | ---- | --------- | --------- | --------- | ---------------------------- |
| for_each (static chunking) | 5 | 0.01761 s | 0.01456 s | 0.02074 s | baseline |
| for_each_ws (range stealing) | 5 | 0.00957 s | 0.00855 s | 0.01058 s | ~1.84× faster vs static |

for_each_ws (range stealing) — lazy binary range partitioning where thieves steal the "upper" halves of ranges (help-first). It keeps threads uniformly loaded under uneven work and shortens the "tail" at the end of a loop. Compatible only with random-access iterators.
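The splitting idea can be illustrated with a stand-alone sketch (NOT DagFlow's scheduler: a real work-stealing pool would push the upper half onto a deque for thieves, whereas this toy version simply spawns a helper thread per split):

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Lazy binary splitting: the worker keeps the lower half of its range
// (help-first) and hands off the upper half, recursing until ranges
// shrink below `grain`, then runs the body sequentially.
template <class F>
void split_for_each(std::size_t lo, std::size_t hi, std::size_t grain, F f) {
  std::vector<std::thread> helpers;
  while (hi - lo > grain) {
    std::size_t mid = lo + (hi - lo) / 2;
    helpers.emplace_back([=] { split_for_each(mid, hi, grain, f); });
    hi = mid;  // keep working on the lower half
  }
  for (std::size_t i = lo; i < hi; ++i) f(i);
  for (auto& t : helpers) t.join();
}
```

Every index in `[lo, hi)` is visited exactly once, and under uneven per-element cost the later (upper) halves are processed concurrently rather than forming a sequential tail.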

Visualization

(Visualization images — First, Second, Third — are available in the repository.)

Build and usage

git clone https://github.com/cpp20120/DagFlow.git
cd DagFlow
cmake -B build -DTP_BUILD_EXAMPLES=ON
cmake --build build --config Release

How to use in your CMake project

  1. Via add_subdirectory:
add_subdirectory(external/DagFlow)

add_executable(my_app main.cpp)
target_link_libraries(my_app PRIVATE DagFlow)
  2. Via find_package:
find_package(DagFlow REQUIRED)

add_executable(my_app main.cpp)
target_link_libraries(my_app PRIVATE DagFlow::DagFlow)

find_package works only after the package has been installed with cmake --install.

On Windows with shared builds, also copy the DLL next to your executable:

if (WIN32 AND TP_BUILD_SHARED)
  add_custom_command(TARGET my_app POST_BUILD
    COMMAND ${CMAKE_COMMAND} -E copy_if_different
      $<TARGET_FILE:DagFlow>
      $<TARGET_FILE_DIR:my_app>
  )
endif()

Examples of usage: see the examples in the repository.

Minimal example

#include <dagflow.hpp>

int main() {
  dagflow::Pool pool;
  dagflow::TaskScope scope(pool);

  auto a = scope.submit([] { /* … */ });
  auto b = scope.then(a, [] { /* … */ });
  auto c = scope.when_all({a, b}, [] { /* runs after a and b */ });

  scope.run_and_wait();
}

Note: the pool is not yet optimized for the noop and non-work-stealing for_each microbenchmarks — it lacks chunk/range stealing there, and the queue implementation is not ideal for such patterns. The focus is DAG execution, affinity control, and back-pressure.
