PnC
Official repository for the paper "Partition and Code: learning how to compress graphs" (NeurIPS'21) https://arxiv.org/abs/2107.01952
Install / Use
/learn @gbouritsas/PnCREADME
The PnC graph compression framework
This is the official pytorch repository for the NeurIPS'21 paper "Partition and Code: learning how to compress graphs" by Giorgos Bouritsas, Andreas Loukas, Nikolaos Karalias and Michael M. Bronstein
Paper | Slides
Abstract
Can we use machine learning to compress graph data? The absence of ordering in graphs poses a significant challenge to conventional compression algorithms, limiting their attainable gains as well as their ability to discover relevant patterns. On the other hand, most graph compression approaches rely on domain-dependent handcrafted representations and cannot adapt to different underlying graph distributions. This work aims to establish the necessary principles a lossless graph compression method should follow to approach the entropy storage lower bound. Instead of making rigid assumptions about the graph distribution, we formulate the compressor as a probabilistic model that can be learned from data and generalise to unseen instances. Our "Partition and Code" framework entails three steps: first, a partitioning algorithm decomposes the graph into elementary structures, then these are mapped to the elements of a small dictionary on which we learn a probability distribution, and finally, an entropy encoder translates the representation into bits. All three steps are parametric and can be trained with gradient descent. We theoretically compare the compression quality of several graph encodings and prove, under mild conditions, a total ordering of their expected description lengths. Moreover, we show that, under the same conditions, PnC achieves compression gains w.r.t. the baselines that grow either linearly or quadratically with the number of vertices. Our algorithms are quantitatively evaluated on diverse real-world networks obtaining significant performance improvements with respect to different families of non-parametric and parametric graph compressors.

Dependencies and Installation Instructions
Requirements:
- python 3.8.8
- pytorch>=1.8.0
- cudatoolkit>=10.2
- pytorch geometric>=1.6.3
- graph-tool>=2.37 (SBMs, dictionary atoms, etc.)
- scikit-network (partitioning algorithms)
- timeout_decorator
- wandb (optional: experiment monitoring)
- xlrd (optional: in case you want to precompute the subgraphs with subgraph isomorphism, we use it to keep track of their locations)
- xlutils (optional: as above)
- ogb (optional: in case you want to compress ogb datasets)
Recommended setup installations
conda create --name PnC python=3.8
conda activate PnC
conda install pytorch==1.8.0 cudatoolkit=10.2 -c pytorch
pip install torch-scatter -f https://pytorch-geometric.com/whl/torch-1.8.0+cu102.html
pip install torch-sparse -f https://pytorch-geometric.com/whl/torch-1.8.0+cu102.html
pip install torch-cluster -f https://pytorch-geometric.com/whl/torch-1.8.0+cu102.html
pip install torch-geometric
conda install -c conda-forge graph-tool
pip install scikit-network
pip install timeout_decorator
pip install wandb
--- optional ---
pip install xlrd
pip install xlutils
pip install ogb
Datasets
- TUDatasets: we provide the datasets originally obtained from here, as well as the splits that we used to evaluate PnC.
- ZINC: we provide the split obtained from the official repo. PLease download the dataset by running:
cd ./datasets/chemical/ZINC/
curl https://www.dropbox.com/s/feo9qle74kg48gy/molecules.zip?dl=1 -o molecules.zip -J -L -k
unzip molecules.zip -d ./
To reproduce the results from the paper, set --dataset <dataset> --dataset_name <dataset_name> --split given,
where dataset = {bioinformatics, social, chemical}, dataset_name = {MUTAG, PROTEINS, PTC, ZINC, IMDBBINARY, IMDBMULTI}.
We also provide infrastructure to compress other common datasets in the GNN literature
(see the file utils/generate_dataset.py for more details)
Usage Examples
The example commands below are selected to reproduce the numbers from Table 1 in the paper (expected results: 4.80, 2.51 and 2.45 total bpe respectively). We also provide an example command for compression of attributed graphs (expected result 4.45 total bpe). Finally, we give an example of out of distribution compression, where we train a neural compressor on MUTAG and test on ZINC (expected result: 2.63 total bpe)
Baselines
In the baselines folder we provide the computations needed to obtain the description lengths of the non-parametric models (null models and pure-partitioning based). For example, the following command will compute the description length of the MUTAG dataset after compressing via partitioning with the Louvain algorithm.
cd baselines
python non_parametric_models.py --dataset bioinformatics --dataset_name MUTAG --non_parametric_model Louvain --modified_gt_entropy False
-
To reproduce the rest of the results from the tables in the paper, simply set the parameter
--non_parametric_modelto one of the following values{uniform, edge_list, erdos_renyi, sbm, Louvain, PropagationClustering} -
NB: The description length of the partitioning-based baselines is a modified version of the entropy function provided by graph-tool. In graph-tool, the dl is computed for labelled graphs, while we are intersted in compressing unlabelled graphs. Thus, with our modification we drop the terms related to the ordering of the nodes (log n! - Σlog n_i!) (L164 in baselines/non_parametric_models.py). In order to also fit an SBM w.r.t. the modified description length, one should modify the blockmodel.py file of graph-tool (L1112 in the reference implementation). In case you don't modify the graph-tool function, please set
--modified_gt_entropy Falsein order to do the description length correction inside the script.
PnC
PnC + predefined partitioning algorithm
In the PnC_fixed_part folder we provide the training for the PnC + fixed partitioning variants, where we use a precomputed partitioning and then learn the dictionary and the parameters of the probability distribution. Example command:
cd PnC_fixed_part
python PnC_fixed_part.py --batch_size=1 --partitioning_algorithm=Louvain --dataset=bioinformatics --dataset_name=MUTAG --lr_dict=1 --num_epochs 50 --n_h_max_dict=10
We provide implementation for the following partitioning algorithms: partitioning_algorithm = {sbm, Louvain, PropagationClustering} (please refer to the paper for more details).
PnC + Neural Partitioning
In the PnC_neural_part folder we provide the implementation for our neural partitioning algorithm (iterative subgraph selection - please refer to the paper for more details). Example command:
cd PnC_neural_part
python PnC_neural_part.py --batch_size=16 --d_out=16 --num_layers 2 --dataset=bioinformatics --dataset_name=MUTAG --lr_dict=0.1 --lr_policy=0.001 --n_h_max_dict=10
PnC for attributed graphs
PnC can also compress attributed graphs. Here we provide a proof of concept implementation, that encodes the attributes using a uniform distribution. In our experiments we used the attributed MUTAG and PTC_MR datasets. Example command:
cd PnC_fixed_part
python PnC_fixed_part.py --batch_size=1 --partitioning_algorithm=Louvain --dataset=TUDataset --dataset_name=MUTAG --lr_dict=1 --n_h_max_dict=-1 --num_epochs 50 --node_attr_encoding uniform --edge_attr_encoding uniform
Out-of-distribution experiments
To compress different distributions using a pretrained neural compressor, you need to do the following:
- Before training, define
--n_max, i.e., the upper bound on the number of vertices, such that it is at least as large as the largest graph in the target distribution (this can be avoided - see the notes below to find out how). - Train the neural compressor in the dataset of your choice.
- Test on a different distribution by resuming from the appropriate folder. For example:
cd PnC_neural_part
python PnC_neural_part.py --batch_size=16 --d_out=16 --num_layers 2 --degree_as_tag False --dataset=bioinformatics --dataset_name=MUTAG --lr_dict=0.1 --lr_policy=0.001 --n_h_max_dict=10 --results_folder ood_experiments --n_max 37
(...training...)
python PnC_neural_part.py --batch_size=16 --d_out=16 --num_layers 2 --degree_as_tag False --dataset=chemical --dataset_name=ZINC --n_h_max_dict=10 --n_max 37 --resume_folder ../datasets/bioinformatics/MUTAG/results/ood_experiments/0/MPNN_sparse/checkpoints --resume True --mode test
Important files
- environment.py is the main module from which we coordinate the neural compressor, i.e., it defines the encoding schemes, keeps track of the dictionary and updates it, maps subgraphs to dictionary atoms, computes the description length, etc.
- encoding_costs.py implements various parametric and non-parametric encodings as the ones we have used in the paper.
- probabilistic_model.py contains the implementation of the learnable parameters of the dictionary (membership variables) and those of the probabilistic model.
- [agent_fixed_part.py](https://github.com/gbouritsas/PnC/blob/master/PnC_fixed_part/agent_fixed_p
Related Skills
proje
Interactive vocabulary learning platform with smart flashcards and spaced repetition for effective language acquisition.
YC-Killer
2.7kA library of enterprise-grade AI agents designed to democratize artificial intelligence and provide free, open-source alternatives to overvalued Y Combinator startups. If you are excited about democratizing AI access & AI agents, please star ⭐️ this repository and use the link in the readme to join our open source AI research team.
best-practices-researcher
The most comprehensive Claude Code skills registry | Web Search: https://skills-registry-web.vercel.app
flutter-tutor
Flutter Learning Tutor Guide You are a friendly computer science tutor specializing in Flutter development. Your role is to guide the student through learning Flutter step by step, not to provide d
