LibAtsSim

Code for SIGGRAPH 2025 conference paper "Automated Task Scheduling for Cloth and Deformable Body Simulations in Heterogeneous Computing Environments".

Authors: Chengzhu He, Zhendong Wang, Zhaorui Meng, Junfeng Yao, Shihui Guo, Huamin Wang.

<!-- [Download Paper](https://chengzhuuwu.github.io/files/ats_camera_ready.pdf) --> <!-- <video src="movie.mp4.mp4" controls="controls" width="500" height="300"></video> -->

We provide several examples to demonstrate our scheduler and the asynchronous iteration pipeline.

Example 1: Simplest HEFT case

This example shows how the HEFT algorithm performs scheduling. We use the original case from the paper "Performance-effective and low-complexity task scheduling for heterogeneous computing" (IEEE TPDS 2002, H. Topcuoglu et al.).

Given the DAG (representing the dependencies between tasks) and the computation matrix of each task on each device, how do we build a schedule that assigns each task to a device?

HEFT 2002 DAG and Matrix

The HEFT algorithm quantifies the dependency of each task ($rank_u$) and sorts the tasks accordingly. Based on the sorted task list, the scheduler then assigns each task to the device with the earliest finish time (EFT). The scheduling result should be (a) in the following figure (figure (b) is produced by another scheduling algorithm, "CPOP"):

HEFT 2002 scheduling result

The final makespan (the finish time of the latest task across all devices, $n_{10}$) should be 91.

Speedup to proc 0 = 39.56% (from 127.00 to 91.00)

Speedup to proc 1 = 42.86% (from 130.00 to 91.00)

Speedup to proc 2 = 57.14% (from 143.00 to 91.00)
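These percentages measure the reduction relative to the HEFT makespan of 91. A quick sanity check (variable names are illustrative, not from the repository):

```python
# Single-processor makespans reported above, i.e., running every task on one processor.
single_proc_makespans = {0: 127.0, 1: 130.0, 2: 143.0}
heft_makespan = 91.0  # makespan of the HEFT schedule

for proc, baseline in single_proc_makespans.items():
    # Speedup = relative reduction of the baseline makespan w.r.t. the HEFT makespan.
    speedup = (baseline - heft_makespan) / heft_makespan * 100.0
    print(f"Speedup to proc {proc} = {speedup:.2f}%")
# prints 39.56%, 42.86%, 57.14% for procs 0, 1, 2
```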

Our HEFT implementation is based on a Python version of HEFT, the source code of the paper "Performant, Multi-Objective Scheduling of Highly Interleaved Task Graphs on Heterogeneous System on Chip Devices" (IEEE TPDS 2022, Joshua Mack et al.).

HEFT scheduling algorithm

  1. Represent the relationships between tasks as a DAG (Directed Acyclic Graph): each node represents a task, and each edge represents a dependency (e.g., broad-phase collision detection must complete before narrow-phase collision detection; if two dependent tasks run on different devices, we may need to copy the relevant data to the target device).
  • In practice, we can store the edge information in each node's data structure as two lists: the predecessors and the successors of each node.
  2. Topological sorting: by performing a topological sort, we obtain an ordered list that respects the dependencies between tasks (executing tasks in this order guarantees that every task's dependencies are satisfied).
  • We can use DFS-based or BFS-based topological sorting. Both satisfy the dependencies, but we recommend the DFS-based (depth-first traversal) variant, which tends to group tightly related tasks together, making later task merging and other operations easier. In addition, DFS-based topological sorting also makes it easy to detect cycles in the DAG.
  3. (Optional) We can execute the sorted list in sequence several times to profile the computation matrix for each task and the communication matrix between devices. For UMA (Unified Memory Architecture), the communication matrix is roughly constant (about 0.2 ms across devices).
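A minimal sketch of the DFS-based topological sort with cycle detection described above (the `successors` adjacency structure is an assumption, not the repository's data layout):

```python
# DFS-based topological sort: reverse post-order of a DFS is a topological order.
# `successors` maps each task id to the tasks that depend on it.
def topo_sort_dfs(successors):
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited / on the current DFS path / done
    color = {n: WHITE for n in successors}
    order = []

    def visit(n):
        color[n] = GRAY
        for m in successors[n]:
            if color[m] == GRAY:
                # A back edge to a node on the current path means the "DAG" has a cycle.
                raise ValueError("cycle detected in DAG")
            if color[m] == WHITE:
                visit(m)
        color[n] = BLACK
        order.append(n)  # post-order

    for n in successors:
        if color[n] == WHITE:
            visit(n)
    return order[::-1]  # reverse post-order = topological order

# Toy DAG: 0 -> {1, 2} -> 3
print(topo_sort_dfs({0: [1, 2], 1: [3], 2: [3], 3: []}))  # → [0, 2, 1, 3]
```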

  4. $rank_u$ calculation: we traverse the topologically sorted list in reverse order and, for each node $i$, compute its $rank_u$ value as (assuming $m$ processors):

$$ rank_u [i] = \frac{1}{m} \sum_{p \in \text{processors}} \text{comp}[i][p] + \max_{j \in \text{successors}} (rank_u[j] + \frac{1}{m} \sum_{p \in \text{processors}} \text{comm}[i][j][p] ) $$

  • In other words, we compute $rank_u$ from the last task to the first. $rank_u$ estimates the time span from the current task to the last task and quantifies how much other work depends on each task: a task with more subsequent tasks has a higher degree of dependency, and therefore a higher $rank_u$ value.
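A minimal Python sketch of this recurrence (the names `comp`, `comm`, and `successors` are illustrative, not the repository's API; `comm[(i, j)]` holds the already-averaged communication cost of edge $(i, j)$):

```python
# Compute rank_u for every task, traversing the topological order in reverse.
def compute_rank_u(topo_order, successors, comp, comm):
    m = len(next(iter(comp.values())))  # number of processors
    rank_u = {}
    for i in reversed(topo_order):  # last task first, so successors are ready
        avg_comp = sum(comp[i]) / m  # mean computation cost across processors
        # Critical-path tail: the most expensive successor, plus the edge cost.
        tail = max((comm[(i, j)] + rank_u[j] for j in successors[i]), default=0.0)
        rank_u[i] = avg_comp + tail
    return rank_u

# Toy chain 0 -> 1 on two processors.
ranks = compute_rank_u([0, 1], {0: [1], 1: []},
                       {0: [2, 4], 1: [6, 2]}, {(0, 1): 1.0})
print(ranks[0], ranks[1])  # → 8.0 4.0
```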
  5. EFT calculation and task assignment: we traverse the task list sorted in decreasing order of $rank_u$, compute the EFT (Earliest Finish Time) of each task on each device, and assign the task to the device with the smallest EFT.

    • Since the dependency relationships between tasks must be satisfied, a task's earliest start time cannot be earlier than the finish times of its predecessors (and if a predecessor executes on another device, the inter-device communication cost must also be accounted for).

    • How to calculate EFT: a naive approach divides time into discrete timestamps (e.g., 100 or 1000), traverses them, and finds the earliest available slot; this is inefficient and sensitive to the timestamp granularity. We instead follow the implementation from the Python version of HEFT, which only needs to consider the following cases (given the earliest start time `ready_time`):

    1. Empty schedule → Task starts at ready_time.
    2. Before the first task → If there’s enough gap, start at ready_time.
    3. Between two tasks → If a gap is large enough, start after the finish time of the first task.
    4. After the last task → Start after the finish time of the last task.
    5. Fallback → Default to starting at ready_time.
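The gap search above can be sketched as follows, assuming each device's schedule is kept as a list of `(start, end)` tuples sorted by start time (cases 4 and 5 collapse into the final return):

```python
# Find the earliest feasible start time for a task of length `duration`
# on a device whose current jobs are sorted, non-overlapping (start, end) tuples.
def earliest_start(jobs, ready_time, duration):
    if not jobs:                                    # 1. empty schedule
        return ready_time
    if ready_time + duration <= jobs[0][0]:         # 2. gap before the first job
        return ready_time
    for (s0, e0), (s1, e1) in zip(jobs, jobs[1:]):  # 3. gap between two jobs
        start = max(ready_time, e0)
        if start + duration <= s1:
            return start
    return max(ready_time, jobs[-1][1])             # 4./5. after the last job / fallback

jobs = [(0, 3), (5, 9)]
print(earliest_start(jobs, ready_time=1, duration=2))  # → 3 (fits in the gap [3, 5))
```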

You can also refer to our SIGGRAPH pre-presentation for more details.

Example 2: Asynchronous iteration on VBD

This example shows the difference between the original iteration pipeline and our asynchronous iteration pipeline. Suppose we have allocated the iteration tasks (different clusters in Vertex Block Descent, SIGGRAPH 2024, Anka He Chen et al.) onto 2 devices; how do we then arrange the data transfers?

This example only considers the simplest case: each task costs a constant $t_c$ and the communication delay is exactly the same value $t_c$. We then arrange the data transfers as follows:

Example 2 iter 1

Different lines represent different devices: the upper line shows tasks on the GPU, the lower line tasks on the CPU.

If we consider overlapping across iterations (e.g., 10 iterations):

Example 2 iter 10

The figures visualizing the scheduling result are generated by "documents/example2_scheduling_result.py".

We use a mass-spring stretching model and a quadratic bending model; this can also be extended to other linear or non-linear energy models.

The convergence rates of sync-based and async-based simulation are nearly identical (in some cases, the async-based version converges faster, especially when collision energy is considered). Moreover, we obtain about a 90% speedup over the single-device implementation (from 10.30 ms to 5.40 ms):

Example 2 iter 100

The figures visualizing the convergence rates of async-based and sync-based simulation (the convergence of the 30th frame, with 100 iterations per frame and timestep h = 1/60 s) are generated by "documents/example2_async_convergence.py".

The simulation result in the 30th frame:

| Async_Result | Sync_Result |
|--------------|-------------|
| Image 1      | Image 2     |

Note that:

  • In this paper, we summarize the potential communication scenarios between devices. In practice, we derive the communication rules directly from the scheduling result; this is a forward process, not a reverse one:

```
# Step 0: Make scheduling
task_schedules = {}
proc_schedules = {}
task_schedules, proc_schedules = make_scheduling()
size(task_schedules) == num_tasks  # the scheduling result for each task
size(proc_schedules) == num_procs  # num_procs == 2 for CPU/GPU
size(proc_schedules[0]) + size(proc_schedules[1]) == num_tasks

# Step 1: Find the shortest connection of each task between devices
list_in  = {{} for each task}
list_out = {{} for each task}
for proc in procs:
    for schedule in proc_schedules[proc]:
        tid = schedule.task_id
        if not is_constraint_task[tid]:
            continue

        for another_schedule in proc_schedules[another_proc, another_proc != proc]:
            another_tid = another_schedule.task_id
            comm_time = communication_cost_matrix[proc][another_proc]
            if is_constraint_task[another_tid] and schedule.end + comm_time <= another_schedule.start:
                list_in[tid] = {another_tid}
                list_out[another_tid] = list_out[another_tid] ∪ {tid}
```
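As a toy illustration of the feasibility test in Step 1 (the data layout and numbers below are made up for this sketch): a cross-device edge is only legal if the producer finishes early enough for the transferred data to arrive before the consumer starts.

```python
from collections import namedtuple

# One scheduled job: which task it runs, and its start/end times (ms).
Job = namedtuple("Job", "task_id start end")
comm_time = 0.2  # the roughly constant UMA transfer cost mentioned earlier (ms)

gpu = [Job(0, 0.0, 1.0), Job(2, 1.0, 2.0)]
cpu = [Job(1, 1.5, 2.5)]

# Edge g -> c is feasible iff g finishes and the data arrives before c starts.
edges = [(g.task_id, c.task_id)
         for g in gpu for c in cpu
         if g.end + comm_time <= c.start]
print(edges)  # → [(0, 1)]: task 2 finishes too late to feed task 1
```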

Then we merge redundant waits between tasks:

```
# Step 2: Merge redundant connections
# Step 2.1: For non-main devices (e.g., CPUs)
for proc in procs where not is_main_device[proc]:
    for task in proc_schedules[proc]:
        inputs = list_in[task]

        if |inputs| > 1:
            keep = inputs.last
            for input in inputs[0 : end-1]:
                remove_edge(input → task)   # drop redundant connections
            list_in[task] = {keep}

        if |inputs| > 0:
            connect(inputs.last, task)

# Step 2.2: For main device (e.g., GPU)
for proc in procs where is_main_device[proc]:
    for task in proc_schedules[proc]:
        inputs = list_in[task]
```
