GNN4IRSpecPAH
Graph Neural Network Prediction of Infrared Spectra of Interstellar Polycyclic Aromatic Hydrocarbons
This repository provides Python code for predicting the infrared spectra of PAH molecules with graph neural network (GNN) models.
CODE USAGE INSTRUCTIONS
Conda environment configuration (package and version)
aiohappyeyeballs 2.6.1
aiohttp 3.12.15
aiosignal 1.4.0
async-timeout 5.0.1
attrs 25.3.0
bzip2 1.0.8
ca-certificates 2025.7.14
cloudpickle 3.1.1
colorama 0.4.6
contourpy 1.3.0
cycler 0.12.1
deepchem 2.8.1.dev20250723140145
dgl 1.1.2
dgllife 0.3.2
filelock 3.18.0
fonttools 4.59.0
frozenlist 1.7.0
fsspec 2025.7.0
future 1.0.0
git 2.49.0
huggingface-hub 0.33.4
hyperopt 0.2.7
importlib-resources 6.5.2
jinja2 3.1.6
kiwisolver 1.4.7
libexpat 2.7.1
libffi 3.4.6
liblzma 5.8.1
libsqlite 3.50.3
libzlib 1.3.1
matplotlib 3.9.4
mpmath 1.3.0
multidict 6.6.4
networkx 3.2.1
openssl 3.5.1
pandas 2.3.1
pillow 11.3.0
pip 25.1.1
propcache 0.3.2
psutil 7.0.0
py4j 0.10.9.9
pyparsing 3.2.3
python 3.9.23
python-dateutil 2.9.0.post0
pytz 2025.2
pyyaml 6.0.2
rdkit 2025.3.3
regex 2024.11.6
safetensors 0.5.3
setuptools 80.9.0
sympy 1.14.0
threadpoolctl 3.6.0
tk 8.6.13
tokenizers 0.21.2
torch 2.1.2+cu121
torch-cluster 1.6.3+pt21cu121
torch-geometric 2.6.1
torch-scatter 2.1.2+pt21cu121
torch-sparse 0.6.18+pt21cu121
torch-spline-conv 1.2.2+pt21cu121
torchdata 0.7.1
torchvision 0.16.2+cu121
tqdm 4.67.1
transformers 4.53.3
tzdata 2025.2
ucrt 10.0.22621.0
vc 14.3
vc14_runtime 14.44.35208
wheel 0.45.1
yarl 1.20.1
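A quick way to confirm that an environment matches the pins above is to compare installed versions against the list. The following is a minimal sketch, not part of the repository: the helper name check_versions and the chosen subset of packages are assumptions made for illustration, and distribution names reported by conda may differ slightly from the names in the list.

```python
from importlib import metadata

# Illustrative subset of the pins listed above; extend as needed.
PINNED = {
    "torch": "2.1.2",
    "torch-geometric": "2.6.1",
    "dgl": "1.1.2",
    "dgllife": "0.3.2",
    "rdkit": "2025.3.3",
    "pandas": "2.3.1",
}


def check_versions(pinned):
    """Return {package: (expected, installed_or_None)} for missing or mismatched packages.

    Hypothetical helper, not part of the repository.
    """
    problems = {}
    for name, expected in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Local build tags such as "+cu121" are ignored in the comparison.
        if installed is None or installed.split("+")[0] != expected:
            problems[name] = (expected, installed)
    return problems


if __name__ == "__main__":
    for name, (expected, installed) in check_versions(PINNED).items():
        print(f"{name}: expected {expected}, found {installed}")
```

An empty result means every checked package is installed at the pinned version.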
Code
Model Code:
The code includes four graph neural network models (AFP, GCN, GAT, MPNN) and one model based on traditional molecular fingerprints (MFP) for predicting infrared spectra, along with AFP models trained with different loss functions.
AFP model using EMD as training loss function: PAH_EMD_AFP.py
GAT model using EMD as training loss function: PAH_EMD_GAT.py
GCN model using EMD as training loss function: PAH_EMD_GCN.py
MFP model using EMD as training loss function: PAH_EMD_MFP.py
MPNN model using EMD as training loss function: PAH_EMD_MPNN.py
AFP model using HD as training loss function: PAH_HD_AFP.py
AFP model using JSD as training loss function: PAH_HD_JSD.py
AFP model using SIS as training loss function: PAH_HD_SIS.py
AFP model using TVD as training loss function: PAH_HD_TVD.py
Testing generalization capability of JSD-based AFP model: PAH_JSD_TEST_AFP.py
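To make three of the loss names above concrete, here is a minimal pure-Python sketch of textbook definitions of EMD, TVD and JSD for two spectra sampled on the same evenly spaced grid and normalized to unit area. It is an illustration under those assumptions, not the repository's implementation (which trains with tensor-based losses); the function names are hypothetical, and HD and SIS are left out because their exact definitions in this code base are not stated here.

```python
import math


def normalize(spectrum):
    """Scale non-negative intensities so that they sum to 1."""
    total = sum(spectrum)
    if total <= 0:
        raise ValueError("spectrum must have positive total intensity")
    return [v / total for v in spectrum]


def emd_1d(p, q):
    """Earth mover's distance in 1D: sum of absolute differences of the CDFs.

    Assumes unit bin spacing; multiply by the bin width for physical units.
    """
    cumulative = 0.0
    distance = 0.0
    for a, b in zip(p, q):
        cumulative += a - b
        distance += abs(cumulative)
    return distance


def tvd(p, q):
    """Total variation distance: half the L1 distance."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def _kl(p, m):
    """Kullback-Leibler divergence KL(p || m), skipping bins where p is zero."""
    return sum(a * math.log(a / b) for a, b in zip(p, m) if a > 0)


def jsd(p, q):
    """Jensen-Shannon divergence (natural log), bounded by ln 2."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)


if __name__ == "__main__":
    predicted = normalize([0.0, 2.0, 1.0, 0.0])
    reference = normalize([0.0, 1.0, 2.0, 0.0])
    print(emd_1d(predicted, reference), tvd(predicted, reference), jsd(predicted, reference))
```

EMD differs from the other two in that it depends on how far intensity has to move along the frequency axis, so a slightly shifted band is penalized less than a band placed far from its true position.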
DATA
Includes processed SMILES strings for 1,570 molecules from the NASA 3.2 PAH database, along with their corresponding high- and low-frequency spectral data.
Also includes processed SMILES strings for 997 molecules (with 50 to 100 carbon atoms) from the NASA 4.0 PAH database, along with their corresponding high- and low-frequency spectral data; this set is used to test the generalization capability of the model.
High-frequency data: 3.2_CH_Cleaner ALL_High_PAHs Dataset.pickle, 4.0_CH_Cleaner 50_100_ALL_High_PAHs Dataset.pickle.
Low-frequency data: 3.2_CH_Cleaner PAHs Dataset.pickle, 4.0_CH_Cleaner 50_100_PAHs Dataset.pickle.
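The internal layout of these pickle files is not described here, so a reasonable first step is to load one and inspect what it contains. The sketch below assumes only that the files are standard Python pickles; load_dataset and describe are hypothetical helper names, and loading may require the pinned environment (for example pandas) if the pickled object depends on it. Only unpickle files from a source you trust.

```python
import pickle
from pathlib import Path


def load_dataset(path):
    """Load one dataset pickle and return the stored object unchanged."""
    with Path(path).open("rb") as fh:
        return pickle.load(fh)


def describe(obj):
    """Return a one-line summary of the loaded object (type and length)."""
    size = len(obj) if hasattr(obj, "__len__") else "n/a"
    return f"{type(obj).__name__} with length {size}"


if __name__ == "__main__":
    # File name copied from the list above; adjust the path to your checkout.
    path = Path("3.2_CH_Cleaner PAHs Dataset.pickle")
    if path.is_file():
        print(describe(load_dataset(path)))
    else:
        print(f"{path} not found in the current directory")
```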
Run
Code execution: after setting up the environment listed above, place all data files and code files in the same directory before running a script. Prediction outputs are saved in the Fold_Predictions directory.
The Best_model file within this directory contains the AFP model trained with JSD as the loss function. It is used when running PAH_JSD_TEST_AFP.py.
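Because the scripts expect data and code side by side, a short pre-flight check can catch a misplaced file before a long training run starts. This is a minimal sketch, not part of the repository: missing_files is a hypothetical helper, and the file names are copied from the lists in this README.

```python
from pathlib import Path

# Data file names as listed in the DATA section.
DATA_FILES = [
    "3.2_CH_Cleaner ALL_High_PAHs Dataset.pickle",
    "4.0_CH_Cleaner 50_100_ALL_High_PAHs Dataset.pickle",
    "3.2_CH_Cleaner PAHs Dataset.pickle",
    "4.0_CH_Cleaner 50_100_PAHs Dataset.pickle",
]


def missing_files(directory, names):
    """Return the names from `names` that are not regular files in `directory`."""
    directory = Path(directory)
    return [name for name in names if not (directory / name).is_file()]


if __name__ == "__main__":
    # Check the script to be run together with the data files.
    required = DATA_FILES + ["PAH_EMD_AFP.py"]
    missing = missing_files(".", required)
    if missing:
        print("Missing from the working directory:")
        for name in missing:
            print("  " + name)
    else:
        print("All required files are present.")
```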
Limitations
The model is currently trained only on neutral PAHs and does not support charged molecules or isotopologues.
Predictions for molecules that significantly differ from the training dataset may have increased uncertainty.
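Inputs outside the supported scope can be flagged before prediction. The sketch below is a rough text-level heuristic and an assumption of this example, not something the repository provides: it inspects only the bracket atoms of a SMILES string, where charges ("+" or "-") and isotope labels (a leading mass number) are written. A cheminformatics toolkit such as RDKit, which is in the environment list, would give a more robust check; screen_smiles is a hypothetical name.

```python
import re

# Bracket atoms such as [NH4+], [O-] or [13C]; bonds outside brackets are ignored.
_BRACKET_ATOM = re.compile(r"\[([^\]]+)\]")


def screen_smiles(smiles):
    """Flag SMILES strings that contain charged or isotope-labelled bracket atoms."""
    charged = False
    isotope_labelled = False
    for body in _BRACKET_ATOM.findall(smiles):
        # Inside a bracket atom, "+" and "-" denote formal charge.
        if "+" in body or "-" in body:
            charged = True
        # A leading digit inside a bracket atom is an isotope mass number.
        if body[0].isdigit():
            isotope_labelled = True
    return {"charged": charged, "isotope_labelled": isotope_labelled}


if __name__ == "__main__":
    for smi in ["c1ccc2ccccc2c1", "[2H]c1ccccc1", "C[N+](C)(C)C"]:
        print(smi, screen_smiles(smi))
```

A molecule flagged either way falls outside the training scope described above and should not be passed to the model.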
