FlashSparse
FlashSparse significantly reduces the computation redundancy for unstructured sparsity (for SpMM and SDDMM) on Tensor Cores through a Swap-and-Transpose mapping strategy. FlashSparse is accepted by PPoPP 2025.
Install / Use
/learn @ParCIS/FlashSparseREADME
FlashSparse: Minimizing Computation Redundancy for Fast Sparse Matrix Multiplications on Tensor Cores
FlashSparse is accepted by PPoPP 2025. See the published version of the paper. FlashSparse significantly reduces the computation redundancy for unstructured sparsity (for SpMM and SDDMM) on Tensor Cores through a Swap-and-Transpose mapping strategy.
![]()
1. Clone this project.
git clone --recursive https://github.com/ParCIS/FlashSparse.git
- Requirements:
Ubuntu 16.04+cmake >= 3.29CUDA >= 11.8- one NVIDIA RTX4090 GPU and one H100 PCIe GPU.
2. Environment Setup.
Conda environments need to be set up on machines with H100 PCIe and RTX4090 GPUs following the steps below.
2.1 Install via Conda.
- 2.1.1 Install
condaon system. (Toturial). - 2.1.2 Create a
condaenvironment:
conda create -n env_name python=3.9
- 2.1.3 Install
PyTorch(Toturial):
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
3. Install FlashSparse.
cd FlashSparse/
bash compile.sh
4. Download datasets. (optional)
cat dataset.txt
The sparse matrices can be downloaded from the SuiteSparse repository, while GNN-related datasets are available on the official websites of DGL and PyG. After downloading, the matrices can be partitioned and preprocessed using FS_Block_gpu.preprocess_gpu_fs(). The preprocessed results are then used for computation with FlashSparse.(total 515 sparse matrices).
5. Install Sparse-Kernel Baselines.
Install RoDe, Sputnik, GNNAdvisor, GE-SpMM, cuSPARSE, DTC-SpMM and TC-GNN:
cd Baseline/
bash compile.sh
6. Install GNNs-Framework Baselines.
6.1 Install Deep Graph Library (DGL) (Toturial):
pip install dgl -f https://data.dgl.ai/wheels/torch-2.1/cu118/repo.html
6.2 Install Pytorch-Geometric (PyG) (Toturial):
pip install torch_geometric
7. Running FlashSparse on H100 PCIe and RTX4090 GPUs.
7.1 SpMM test
- Go to project
eva/kernel/spmm/directory.bash ./test_spmm_shell.shto run all SpMM experiments. (about 200 minutes)- Check the results in
result/FlashSparse/spmm/*.csv.
7.2 SDDMM test
- Go to project
eva/kernel/sddmm/directory.bash ./test_sddmm_shell.shto run all SDDMM experiments. (about 100 minutes)- Check the results in
result/FlashSparse/sddmm/*.csv.
7.3 GCN and AGNN tests
- Go to project
eva/end2end/gcn/directory.python eva_gcn_fs.pyto run GCN experiments.python eva_gcn_baseline.pyto run GCN experiments.- Check the results in
result/FlashSparse/gcn/fs_gcn_128.csv. (about 5 minutes)- Check the results in
result/Baseline/agnn/baseline_gcn_128.csv. (about 15 minutes)
- Go to project
eva/end2end/agnn/directory.python eva_agnn_fs.pyto run AGNN experiments.python eva_agnn_baseline.pyto run AGNN experiments.- Check the results in
result/FlashSparse/agnn/fs_agnn_32.csv. (about 5 minutes)- Check the results in
result/Baseline/agnn/baseline_agnn_32.csv. (about 15 minutes)
8. Running Baselines on H100 PCIe and RTX4090 GPUs (Optional).
8.1 Evaluating RoDe, Sputnik and cuSPARSE.
- Go to project
Baseline/RoDe/script/directory.bash download.shto download the same 515 matices in a specific format for RoDe. (optional)bash test_spmm_shell.shto run all SpMM experiments. (about 300 minutes)bash test_sddmm_shell.shto run all SDDMM experiments. (about 300 minutes)- Check the results in
result/Baseline/spmm/rode*.csvandresult/Baseline/sddmm/rode*.csv.
8.2 Evaluating DTC-SpMM.
- Go to project
Baseline/DTC-SpMM/directory.bash test_spmm_shell.shto run all SpMM experiments. (about 20 minutes)- Check the results in
result/Baseline/spmm/dtc*.csv.
8.3 Evaluating GNNAdvisor, TCGNN, GE-SpMM.
- Go to project
eva/kernel/spmm/directory.bash test_spmm_shell_base.shto run all SpMM experiments. (about 100 minutes)- Check the results in
result/Baseline/spmm/base*.csv.
- Go to project
eva/kernel/sddmm/directory.bash test_sddmm_shell_base.shto run all SDDMM experiments. (about 20 minutes)- Check the results in
result/Baseline/sddmm/base*.csv.
8.4 Summarize the results in all baselines.
- Go to project
result/Baseline/spmm/directory.python summarize.pyto summarize all results.
- Go to project
result/Baseline/sddmm/directory.python summarize.pyto summarize all results.
9 Reproducing the experinmental figures and tables in FlashSparse.
9.1 Reproduce the Figure 11 and Table 5. (both on H100 and RTX4090)
- Go to project
eva/plot/kernel_spmm/directory.python plot_figure11_ac.pyand check the figure infigure11.png(The plotted figure11.png on H100 corresponds to Figure 11(a) in the paper, and on RTX4090 corresponds to Figure 11(c) in the paper.)python plot_figure11_bd.pyand check the figure infigure11_sub.png. (The plotted figure11_sub.png on H100 corresponds to Figure 11(b) in the paper, and on RTX4090 corresponds to Figure 11(d) in the paper.)python profile_table5.pyand check the result intable5.txt. (The profiled table5.txt on H100 corresponds to Table5(left) in the paper, and on RTX4090 corresponds to Table5(right) in the paper.)
9.2 Reproduce the Figure 13 and Table 6. (both on H100 and RTX4090)
- Go to project
eva/plot/kernel_sddmm/directory.python plot_figure13_a.pyand check the figure infigure13(a).png.python plot_figure13_b.pyand check the figure infigure13(b).png. (The plotted figure13(a).png and figure13(b).png on H100 correspond to Figure 13(a)(b) in the paper, and on RTX4090 corresponds to Figure 13(c)(d) in the paper.)python profile_table6.pyand check the result intable6.txt. (The profiled table6.txt on H100 corresponds to Table6(left) in the paper, and on RTX4090 corresponds to Table6(right) in the paper.)
9.3 Reproduce the Figure 12. (only on H100 or RTX4090)
- Go to project
eva/plot/ablation/memory/directory.python spmm.pyand check the result inmemory_spmm.csv. (about 20 minutes)python sddmm.pyand check the result inmemory_sddmm.csv. (about 20 minutes)
python plot_spmm.pyand check the figure inspmm_mem.png.python plot_sddmm.pyand check the figure insddmm_mem.png.
9.4 Reproduce the Figure 14. (both on H100 and RTX4090)
- Go to project
eva/plot/ablation/throughput/directory.python plot_spmm.pyand check the figure infigure14(a).png.python plot_sddmm.pyand check the figure result infigure14(b).png. (The plotted figure14(a).png and figure14(b).png on H100 correspond to Figure 14(a)(b) in the paper, and on RTX4090 corresponds to Figure 14(c)(d) in the paper.)
9.5 Reproduce the Figure 15. (both on H100 and RTX4090)
- Go to project
eva/plot/ablation/access/directory.python plot.pyand check the figure infigure15.png. (The plotted figure15.png on H100 correspond to Figure 15(left) in the paper, and on RTX4090 corresponds to Figure 15(right) in the paper.)
9.6 Reproduce the Table 7. (only on H100 or RTX4090)
- Go to project
eva/plot/ablation/format/directory.python format.pyand check the result inresult.csv. (about 25 minutes)python profile.pyand check the output`.
9.7 Reproduce the Figure 16. (only on H100 or RTX4090)
- Go to project
eva/plot/gcn/directory.python plot.pyand check the figure infigure16_gcn.png.
- Go to project
eva/plot/agnn/directory.python plot.pyand check the figure infigure16_agnn.png.
9.8 Reproduce the Table 8. (only on H100 or RTX4090)
- Go to project
eva/accuracy/gcn/directory.python eva_gcn.py. (about 1 minutes)- Check the result in
result/Baseline/gcn/accuracy.csv.
Related Skills
node-connect
336.9kDiagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps
frontend-design
83.0kCreate distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.
openai-whisper-api
336.9kTranscribe audio via OpenAI Audio Transcriptions API (Whisper).
commit-push-pr
83.0kCommit, push, and open a PR
