# GraphSnapShot: Caching Local Structure for Fast Graph Learning [Efficient ML System]
🔥🔥🔥 News 9/12: Added CUDA kernel implementation for GraphSnapShot!!!
🔥🔥🔥 News 5/21: GraphSnapShot has been accepted to MLArchSys 2025 as Oral Presentation!!!
🔥🔥🔥 News 4/1: GraphSnapShot has been accepted to MLSys 2025 YPS!!!
🔥🔥🔥 News 3/25: GraphSnapShot has been accepted to MASC-SLL 2025!!!
Up to 30% training acceleration and 73% memory reduction for lossless graph ML training with GraphSnapShot!

GraphSnapShot is a framework that caches local graph structure for fast graph learning. It achieves fast storage, retrieval, and computation for large-scale graph learning: it quickly stores and updates the local topology of the graph, much like taking snapshots of the graph.



## Three system designs

1. `proto` - implemented in pure PyTorch
2. `dglsampler-simple` - a simplified variant built on dgl's `MultiLayerSampler` baseline
3. `dglsampler` - built on dgl's `MultiLayerSampler` baseline

The `dglsampler` redesign provides three caching strategies:

- FBL: full batch load
- OTF: partial (on-the-fly) cache refresh snapshot
- FCR: full cache refresh snapshot

In detail:

1. All OTF and FCR methods have two modes: independent cache and shared cache.
2. OTF has 4 methods, the combinations of (full refresh, partial refresh) x (full fetch, partial fetch).
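The two refresh policies can be contrasted with a minimal sketch. This is illustrative only, not the library's implementation: the class name, the adjacency-dict input, and the `frontier` method are all assumptions made for the example; `T`, `refresh_rate`, and `fanout` mirror the parameter names used later in this README.

```python
import random

class SnapshotCache:
    """Illustrative sketch of the two refresh policies: FCR rebuilds the
    whole cached frontier every T steps, while OTF refreshes only a
    `refresh_rate` fraction of a node's cached frontier on the fly."""

    def __init__(self, neighbors, fanout, T=50, refresh_rate=0.3, mode="FCR"):
        self.neighbors = neighbors   # node id -> list of neighbor ids
        self.fanout = fanout
        self.T = T
        self.refresh_rate = refresh_rate
        self.mode = mode
        self.step = 0
        # pre-sample a cached frontier for every node ("taking a snapshot")
        self.cache = {v: self._sample(v) for v in neighbors}

    def _sample(self, v):
        nbrs = self.neighbors[v]
        return random.sample(nbrs, min(self.fanout, len(nbrs)))

    def frontier(self, v):
        self.step += 1
        if self.mode == "FCR" and self.step % self.T == 0:
            # full cache refresh: resample every cached frontier
            self.cache = {u: self._sample(u) for u in self.neighbors}
        elif self.mode == "OTF":
            # partial (on-the-fly) refresh: swap a fraction of the cached
            # entry for freshly sampled neighbors (duplicates are possible
            # in this simplified sketch)
            cached = self.cache[v]
            k = max(1, int(len(cached) * self.refresh_rate))
            self.cache[v] = self._sample(v)[:k] + cached[k:]
        return self.cache[v]
```

The point of both policies is the same: serve frontiers from the cache on most steps, and pay the resampling cost only periodically (FCR) or incrementally (OTF).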
## Deployment

The FBL implementation is the same as the `MultiLayerSampler` implemented in dgl.

To deploy GraphSnapShot, go to the samplers directory with `cd SSDReS_Samplers` and find the following files:

- NeighborSampler_OTF_struct.py
- NeighborSampler_OTF_nodes.py
- NeighborSampler_FCR_struct.py
- NeighborSampler_FCR_struct.py
- NeighborSampler_FCR_nodes.py

The dgl sampler code can be found at:

`vim ~/anaconda3/envs/dglsampler/lib/python3.9/site-packages/dgl/sampling/neighbor.py`

Add the sampler code from SSDReS_Samplers into dgl's neighbor_sampler.py at the path below and save the changes:

`vim ~/anaconda3/envs/dglsampler/lib/python3.9/site-packages/dgl/dataloading/neighbor_sampler.py`

You can then deploy the OTF and FCR samplers at node level and struct level from neighbor_sampler and create objects of those samplers.
### FBL in execution

- https://docs.dgl.ai/en/0.8.x/_modules/dgl/sampling/neighbor.html#sample_neighbors
- https://docs.dgl.ai/en/0.9.x/generated/dgl.dataloading.NeighborSampler.html

### FCR in execution

https://github.com/NoakLiu/GraphSnapShot/assets/116571268/ed701012-9267-4860-845b-baf1c39c317c

### OTF in execution

https://github.com/NoakLiu/GraphSnapShot/assets/116571268/6fe1a566-d4e9-45ae-b654-676a2e4d6a58

### FCR-SC, OTF-SC, FBL comparison (SC is short for shared cache)

https://github.com/NoakLiu/GraphSnapShot/assets/116571268/baed1610-952c-4455-9ecf-015450b482dc
## Two types of samplers

- node-level: splits the graph into graph_static and graph_dynamic, improving CPU-GPU co-utilization.
- structure-level: avoids the inefficiency of resampling the whole k-hop structure for every node by combining static presampling with dynamic resampling, accelerating structure retrieval.
## Downstream tasks

- MultiLayer GCN - ogbn_arxiv / ogbn_products (homo)
- MultiLayer SGC - ogbn_arxiv / ogbn_products (homo)
- MultiLayer GraphSAGE - ogbn_arxiv / ogbn_products (homo)
- MultiLayer RelGraphConv - ogbn_mag (hete)

## Datasets

- ogbn_arxiv - node classification (homo)
- ogbn_products - node classification (homo)
- ogbn_mag - node classification (hete)
<p align="center">
<img src="./results/stats/ogbn-arxiv_degree_distribution.png" width="200" />
<img src="./results/stats/ogbn-products_degree_distribution.png" width="200" />
<img src="./results/stats/ogbn-mag_degree_distribution.png" width="200" />
</p>
| Feature | OGBN-ARXIV | OGBN-PRODUCTS | OGBN-MAG |
|-------------------|--------------|---------------|---------------|
| Dataset Type | Citation Network | Product Purchase Network | Microsoft Academic Graph |
| Number of Nodes | 17,735 | 24,019 | 132,534 |
| Number of Edges | 116,624 | 123,006 | 1,116,428 |
| Feature Dimension | 128 | 100 | 50 |
| Number of Classes | 40 | 89 | 112 |
| Number of Train Nodes | 9,500 | 12,000 | 41,351 |
| Number of Validation Nodes | 3,500 | 2,000 | 10,000 |
| Number of Test Nodes | 4,735 | 10,000 | 80,183 |
| Supervised Task | Node Classification | Node Classification | Node Classification |
### Design of FBL

### Design of OTF

### Design of FCR

## An example of memory reduction on the original graph

Original graph:

```
Graph(num_nodes=169343, num_edges=1166243,
      ndata_schemes={'year': Scheme(shape=(1,), dtype=torch.int64), 'feat': Scheme(shape=(128,), dtype=torch.float32)}
      edata_schemes={})
```

Dense filter --> dense graph (degree > 30):

```
Graph(num_nodes=5982, num_edges=65847,
      ndata_schemes={'year': Scheme(shape=(1,), dtype=torch.int64), 'feat': Scheme(shape=(128,), dtype=torch.float32), '_ID': Scheme(shape=(), dtype=torch.int64)}
      edata_schemes={'_ID': Scheme(shape=(), dtype=torch.int64)})
```

Cached graph (cached for FCRSampler):

```
Graph(num_nodes=5982, num_edges=30048,
      ndata_schemes={'year': Scheme(shape=(1,), dtype=torch.int64), 'feat': Scheme(shape=(128,), dtype=torch.float32), '_ID': Scheme(shape=(), dtype=torch.int64)}
      edata_schemes={'_ID': Scheme(shape=(), dtype=torch.int64)})
```

Overall, the cache shrinks the dense graph from 65,847 edges to 30,048 edges on 5,982 nodes: the cache retains about 45.6% of the edges, an edge-memory reduction of about 54.4%.
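The cached edge count above (30,048 of 65,847 edges, about 45.6% retained, roughly a 54% reduction) is consistent with a per-node frontier cache. A back-of-envelope sketch, where the single cached hop with fanout 5 is an assumption inferred from the numbers, not a documented configuration:

```python
def cached_edges(num_dense_nodes, fanouts):
    # Each cached hop stores at most `fanout` sampled neighbors per node,
    # so the cache holds at most num_nodes * sum(fanouts) edges.
    return num_dense_nodes * sum(fanouts)

full_edges = 65847                # edges in the dense ogbn-arxiv subgraph
approx = cached_edges(5982, [5])  # assumed: one cached hop with fanout 5
reduction = 1 - approx / full_edges  # roughly 0.55
```

This gives 29,910 cached edges, close to the observed 30,048, so the cache size scales with `num_nodes * fanout` rather than with the full edge count.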
## Dense graph GraphSnapShot cache for SSDReS_Samplers

Methods:

- For sparse graphs, the FBL method is deployed directly.
- For dense graphs, the SSDReS methods are deployed.

SSDReS samplers:

- dgl samplers
  - hete
    - FCR_hete
    - FCR_SC_hete
    - OTF((PR, FR)x(PF, FF))_hete
    - OTF((PR, FR)x(PF, FF))_SC_hete
  - homo
    - FCR
    - FCR_SC
    - OTF((PR, FR)x(PF, FF))
    - OTF((PR, FR)x(PF, FF))_SC
- dgl samplers simple
  - hete
    - FCR_hete
    - FCR_SC_hete
    - OTF_hete
    - OTF_SC_hete
  - homo
    - FCR
    - FCR_SC
    - OTF
    - OTF_SC
## Deployment sequence

For homographs:

1. Run `python div_graph_by_deg_homo.py` to split the graph into a dense graph and a sparse graph.
2. Deploy homo SSDReS samplers such as FCR, FCR-SC, OTF((PR, FR)x(PF, FF)), and OTF((PR, FR)x(PF, FF))-SC on the dense graph.
3. Deploy FBL on the sparse graph.

For heterographs:

1. Run `python div_graph_by_deg_hete.py` to split the graph into a dense graph and a sparse graph.
2. Deploy hete SSDReS samplers such as FCR_hete, FCR-SC_hete, OTF((PR, FR)x(PF, FF))_hete, and OTF((PR, FR)x(PF, FF))-SC_hete on the dense graph.
3. Deploy FBL on the sparse graph.
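The degree-based split performed by the `div_graph_by_deg_*` scripts can be sketched as follows. This is a hypothetical stand-in operating on a plain adjacency dict, not the scripts' actual code; the function name and the degree-30 default (taken from the dense-filter example above) are assumptions.

```python
def split_by_degree(adj, threshold=30):
    """Split an adjacency-dict graph into a dense part (degree > threshold),
    served by the SSDReS samplers, and a sparse part, served by FBL."""
    dense_nodes = {v for v, nbrs in adj.items() if len(nbrs) > threshold}
    # the dense subgraph keeps only edges between dense nodes
    dense = {v: [u for u in adj[v] if u in dense_nodes] for v in dense_nodes}
    sparse = {v: nbrs for v, nbrs in adj.items() if v not in dense_nodes}
    return dense, sparse
```

High-degree nodes are where repeated neighbor resampling is most expensive, so routing only the dense part through the snapshot caches concentrates the caching benefit where it pays off.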
### Figures for runtime memory reduction

### Figures for memory reduction

### Figures for GPU usage reduction

## Analysis

The key idea of GraphSnapShot is to cache local structure instead of the whole graph input, which yields both memory reduction and sampling efficiency.
Deployment on homo-graphs Import
from dgl.dataloading import (
DataLoader,
MultiLayerFullNeighborSampler,
NeighborSampler,
MultiLayerNeighborSampler,
BlockSampler,
NeighborSampler_FCR_struct,
NeighborSampler_FCR_struct_shared_cache,
NeighborSampler_OTF_struct_FSCRFCF,
NeighborSampler_OTF_struct_FSCRFCF_shared_cache,
NeighborSampler_OTF_struct_PCFFSCR_shared_cache,
NeighborSampler_OTF_struct_PCFFSCR,
NeighborSampler_OTF_struct_PCFPSCR_SC,
NeighborSampler_OTF_struct_PCFPSCR,
NeighborSampler_OTF_struct_PSCRFCF_SC,
NeighborSampler_OTF_struct_PSCRFCF,
# NeighborSampler_OTF_struct,
# NeighborSampler_OTF_struct_shared_cache
)
### Method explanation - homo

- NeighborSampler_FCR_struct: Full Cache Refresh; each hop has its own cached frontier.
- NeighborSampler_FCR_struct_shared_cache: Full Cache Refresh with shared cache; all hops share one cached frontier.
- NeighborSampler_OTF_struct_FSCRFCF, NeighborSampler_OTF_struct_PCFFSCR, NeighborSampler_OTF_struct_PCFPSCR, NeighborSampler_OTF_struct_PSCRFCF: the four OTF variants, covering the combinations of (full refresh, partial refresh) x (full fetch, partial fetch) described above.
- The `_shared_cache` / `_SC` counterparts of each OTF variant use the shared-cache mode instead of independent per-hop caches.
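The difference between the independent-cache and shared-cache (`_SC`) modes can be sketched in a few lines. This is illustrative only; the function name and the `sample_frontier` callback are hypothetical, not part of the library's API.

```python
def make_hop_caches(fanouts, sample_frontier, shared=False):
    """Independent mode keeps one cached frontier per hop; shared-cache
    mode samples a single frontier and lets every hop reuse a slice of it,
    trading sampling work and memory for cache contention across hops."""
    if shared:
        frontier = sample_frontier(max(fanouts))      # one cache for all hops
        return [frontier[:f] for f in fanouts]
    return [sample_frontier(f) for f in fanouts]      # one cache per hop
```

Shared cache performs a single sampling call regardless of the number of hops, which is why the `_SC` variants reduce memory further at large layer counts.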
### Deployment

```python
# FBL
sampler = NeighborSampler(
    [5, 5, 5],  # fanout for [layer-0, layer-1, layer-2]
    prefetch_node_feats=["feat"],
    prefetch_labels=["label"],
    fused=fused_sampling,
)

# FCR
sampler = NeighborSampler_FCR_struct(
    g=g,
    fanouts=[5, 5, 5],  # fanout for [layer-0, layer-1, layer-2]; alternative: [2, 2, 2]
    alpha=1.5, T=50,
    prefetch_node_feats=["feat"],
    prefetch_labels=["label"],
    fused=fused_sampling,
)

# FCR shared cache
sampler = NeighborSampler_FCR_struct_shared_cache(
    g=g,
    fanouts=[5, 5, 5],  # fanout for [layer-0, layer-1, layer-2]; alternative: [2, 2, 2]
    alpha=1.5, T=50,
    prefetch_node_feats=["feat"],
    prefetch_labels=["label"],
    fused=fused_sampling,
)

# OTF
sampler = NeighborSampler_OTF_struct_FSCRFCF(
    g=g,
    fanouts=[5, 5, 5],  # fanout for [layer-0, layer-1, layer-2]; alternative: [4, 4, 4]
    amp_rate=2, refresh_rate=0.3, T=50,  # alternatives: amp_rate=3, refresh_rate=0.4
    prefetch_node_feats=["feat"],
    prefetch_labels=["label"],
    fused=fused_sampling,
)
```