MOFDiff: Coarse-grained Diffusion for Metal-Organic Framework Design

MOFDiff is a diffusion model for generating coarse-grained (CG) MOF structures. This codebase also contains the code for deconstructing and reconstructing all-atom MOF structures, which is used to train MOFDiff and to assemble the CG structures it generates.

paper | data and pretrained models

If you find this code useful, please consider referencing our paper:

@inproceedings{
fu2024mofdiff,
title={{MOFD}iff: Coarse-grained Diffusion for Metal-Organic Framework Design},
author={Xiang Fu and Tian Xie and Andrew Scott Rosen and Tommi S. Jaakkola and Jake Allen Smith},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=0VBsoluxR2}
}

Installation
Process data
Training
Generating MOF structures
Assemble all-atom MOFs
Relax MOFs
GCMC simulations
Responsible AI FAQ
Contributing
Acknowledgement

Installation

We recommend using mamba rather than conda to install the dependencies, as it speeds up installation. First install mamba following the instructions in the mamba repository. (Note: a requirements.txt mirror of env.yml is provided for compatibility with CI/CD; however, we do not recommend building the environment with pip.)

Install dependencies via mamba:

mamba env create -f env.yml

Then install mofdiff as a package:

pip install -e .

We use MOFid for preprocessing and analysis. To perform these steps, install MOFid following the instructions in the MOFid repository. The generative modeling and MOF simulation portions of this codebase do not depend on MOFid.

Configure the .env file to set correct paths to various directories, dependent on the desired functionality. An example .env file is provided in the repository.

For model training, please set the learning-related paths.

PROJECT_ROOT: the parent MOFDiff directory
DATASET_DIR: the directory containing the .lmdb file produced by processing the data
LOG_DIR: the directory to which logs will be written
HYDRA_JOBS: the directory to which Hydra output will be written
WANDB_DIR: the directory to which WandB output will be written

For MOF relaxation and structural property calculations, please additionally set the Zeo++ path.

ZEO_PATH: path to the Zeo++ "network" binary

For GCMC simulations, please additionally set the GCMC-related paths.

RASPA_PATH: the RASPA2 parent directory
RASPA_SIM_PATH: path to the RASPA2 "simulate" binary
EGULP_PATH: path to the eGULP "egulp" binary
EGULP_PARAMETER_PATH: the directory containing the eGULP "MEPO.param" file
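
Before launching a long job, it can help to confirm that the variables for the intended functionality are actually set. The following is a minimal sketch, not part of the MOFDiff codebase: the `REQUIRED_VARS` grouping and the `missing_vars` helper are hypothetical names, and the sketch assumes the values from .env have already been exported into the process environment. Each group lists only the variables named under the corresponding heading above.

```python
import os

# Variable names are taken from this README; the group labels are illustrative.
REQUIRED_VARS = {
    "training": ["PROJECT_ROOT", "DATASET_DIR", "LOG_DIR", "HYDRA_JOBS", "WANDB_DIR"],
    "relaxation": ["ZEO_PATH"],
    "gcmc": ["RASPA_PATH", "RASPA_SIM_PATH", "EGULP_PATH", "EGULP_PARAMETER_PATH"],
}


def missing_vars(task, environ=None):
    """Return the variables listed for `task` that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS[task] if not env.get(name)]


if __name__ == "__main__":
    for task in REQUIRED_VARS:
        missing = missing_vars(task)
        status = "ok" if not missing else "missing: " + ", ".join(missing)
        print(f"{task}: {status}")
```

Passing a plain dictionary as `environ` makes the check easy to try without touching the real environment.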

Process data

You can download the preprocessed BW-DB data from Zenodo (recommended). To use the preprocessed data, please extract bw_db.tar.gz into ${oc.env:DATASET_DIR}.

Alternatively, you can download the BW-DB raw data from Materials Cloud to ${raw_path} and preprocess with the following command. This step requires MOFid.

python mofdiff/preprocessing/extract_mofid.py --df_path ${raw_path}/all_MOFs_screening_data.csv --cif_path ${raw_path}/cifs --save_path ${raw_path}/mofid
python mofdiff/preprocessing/preprocess.py --df_path ${raw_path}/all_MOFs_screening_data.csv --mofid_path ${raw_path}/mofid --save_path ${raw_path}/graphs
python mofdiff/preprocessing/save_to_lmdb.py --graph_path ${raw_path}/graphs --save_path ${raw_path}/lmdbs

The preprocessing involves three steps:

Extract the MOFid for all structures (CPU).
Construct CG MOF data objects from MOFid deconstruction results (CPU or GPU).
Save the CG MOF objects to an LMDB database (relatively fast).

Preprocessing all of BW-DB may take several days, depending on the available CPU/GPU resources.
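
Each preprocessing step reads the directory written by the previous one, so the three commands must run in order. The sketch below only builds the argument lists from a single raw-data directory to make that chaining explicit; `preprocessing_commands` is a hypothetical helper, not part of the repository, and the script paths and flags are copied from the commands above.

```python
import shlex


def preprocessing_commands(raw_path):
    """Build the three preprocessing commands for one raw-data directory."""
    raw = raw_path.rstrip("/")
    csv = f"{raw}/all_MOFs_screening_data.csv"
    return [
        # Step 1: extract MOFids from the raw CIF files.
        ["python", "mofdiff/preprocessing/extract_mofid.py",
         "--df_path", csv, "--cif_path", f"{raw}/cifs",
         "--save_path", f"{raw}/mofid"],
        # Step 2: build CG MOF data objects from the MOFid results.
        ["python", "mofdiff/preprocessing/preprocess.py",
         "--df_path", csv, "--mofid_path", f"{raw}/mofid",
         "--save_path", f"{raw}/graphs"],
        # Step 3: write the CG MOF objects to an LMDB database.
        ["python", "mofdiff/preprocessing/save_to_lmdb.py",
         "--graph_path", f"{raw}/graphs", "--save_path", f"{raw}/lmdbs"],
    ]


if __name__ == "__main__":
    # The directory below is a placeholder for wherever the raw data was downloaded.
    for cmd in preprocessing_commands("/data/bw_db_raw"):
        print(shlex.join(cmd))
        # To execute instead of print: subprocess.run(cmd, check=True)
```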

Training

Training the building block encoder

Before training the diffusion model, we need to train the building block encoder. The building block encoder is a graph neural network that encodes the building blocks of MOFs. The building block encoder is trained with the following command:

python mofdiff/scripts/train.py --config-name=bb

The default output directory is ${oc.env:HYDRA_JOBS}/bb/${expname}/, where oc.env:HYDRA_JOBS is set in .env and expname is set in configs/bb.yaml. We use Hydra for config management, and all configs are stored in configs/. You can override config values, and with them the output directory, from the command line. For example:

python mofdiff/scripts/train.py --config-name=bb expname=bwdb_bb_dim_64 model.latent_dim=64

Logging is done with wandb by default. You need to log in with wandb login before training. The training logs will be saved to the wandb project mofdiff. You can also override the wandb project with command line arguments or disable wandb logging by removing the wandb entry in the config as demonstrated here.

Training the coarse-grained diffusion model for MOFs

Training the diffusion model requires bb_encoder_path, the output directory where the building block encoder was saved. By default, this path is ${oc.env:HYDRA_JOBS}/bb/${expname}/, as defined above. Train/validation splits are defined in splits, with examples provided for BW-DB. Once the building block encoder has been trained to convergence, train the CG diffusion model with the following command:

python mofdiff/scripts/train.py data.bb_encoder_path=${bb_encoder_path}

For BW-DB, training the building block encoder takes roughly 3 days and training the diffusion model takes roughly 5 days on a single NVIDIA V100 GPU.
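
The two training stages are linked only through the encoder output directory, so it can be convenient to derive both commands from a single experiment name. The sketch below is illustrative rather than part of the codebase: `bb_encoder_dir` and `training_commands` are hypothetical helpers, and the directory layout assumes the default ${oc.env:HYDRA_JOBS}/bb/${expname}/ described above has not been changed in the configs.

```python
import os
import posixpath


def bb_encoder_dir(expname, hydra_jobs=None):
    """Compose HYDRA_JOBS/bb/expname, the default encoder output directory."""
    root = os.environ["HYDRA_JOBS"] if hydra_jobs is None else hydra_jobs
    return posixpath.join(root, "bb", expname)


def training_commands(expname, hydra_jobs=None):
    """Return the two training commands in the order they must run."""
    encoder_dir = bb_encoder_dir(expname, hydra_jobs)
    return [
        # Stage 1: building block encoder, configured by configs/bb.yaml.
        ["python", "mofdiff/scripts/train.py", "--config-name=bb", f"expname={expname}"],
        # Stage 2: CG diffusion model, pointed at the stage 1 output directory.
        ["python", "mofdiff/scripts/train.py", f"data.bb_encoder_path={encoder_dir}"],
    ]


if __name__ == "__main__":
    # The HYDRA_JOBS value here is a placeholder path.
    for cmd in training_commands("bwdb_bb_dim_64", hydra_jobs="/scratch/hydra_jobs"):
        print(" ".join(cmd))
```

The second command should only be launched after the first has converged; the sketch prints the commands and does not start any training itself.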

Generating CG MOF structures

Pretrained models can be found here. To use the pretrained models, please extract pretrained.tar.gz and bb_emb_space.tar.gz into ${oc.env:PROJECT_ROOT}/pretrained.

With a trained CG diffusion model ${diffusion_model_path}, generate random CG MOF structures with the following command, where ${bb_cache_path} is the path to the trained building block encoder file bb_emb_space.pt, either sourced from the pretrained models or generated as described above.

python mofdiff/scripts/sample.py --model_path ${diffusion_model_path} --bb_cache_path ${bb_cache_path}

To optimize MOF structures for a property defined in BW-DB (e.g., CO2 adsorption working capacity), use the following command, where ${data_path} is the path to the processed data file data.lmdb, either sourced from the pretrained models or generated as described above.

python mofdiff/scripts/optimize.py --model_path ${diffusion_model_path} --bb_cache_path ${bb_cache_path} --data_path ${data_path} --property "working_capacity_vacuum_swing [mmol/g]" --target_v 15.0

Available arguments for sample.py and optimize.py can be found in the respective files. The generated CG MOF structures will be saved as samples.pt in ${sample_path}, where ${sample_path}=${diffusion_model_path}/${sample_tag}.

The CG structures generated with the diffusion model are not guaranteed to be realizable. We need to assemble the CG structures to recover the all-atom MOF structures. The following sections describe how to assemble the CG MOF structures. None of the subsequent steps require a GPU.

Assemble all-atom MOFs

Assemble all-atom MOF structures from the CG MOF structures with the following command:

python mofdiff/scripts/assemble.py --input ${sample_path}/samples.pt

This command will assemble the CG MOF structures in ${sample_path} and save the assembled MOFs in ${sample_path}/assembled.pt. The CIF files of the assembled MOFs will be saved in ${sample_path}/cif. If the assembled MOFs came from property-driven optimization, the optimization arguments are saved to ${sample_path}/opt_args.json.
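
Because sampling and assembly both write into ${sample_path}, a quick look at which outputs exist shows how far a run has progressed. The following is a minimal sketch under that assumption; `EXPECTED_OUTPUTS` and `output_status` are hypothetical names, and only the file and directory names are taken from the text above.

```python
import tempfile
from pathlib import Path

# Output names come from this README; the keys are illustrative labels.
EXPECTED_OUTPUTS = {
    "cg_samples": "samples.pt",
    "assembled": "assembled.pt",
    "cif_dir": "cif",
    "opt_args": "opt_args.json",  # only present after property-driven optimization
}


def output_status(sample_path):
    """Map each expected output to True if it exists under sample_path."""
    root = Path(sample_path)
    return {label: (root / name).exists() for label, name in EXPECTED_OUTPUTS.items()}


if __name__ == "__main__":
    # Demo on a throwaway directory that mimics a run where only sampling finished.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "samples.pt").touch()
        for label, present in output_status(tmp).items():
            print(f"{label}: {'found' if present else 'missing'}")
```

A missing `opt_args` entry is expected for randomly sampled structures and does not indicate a failed run.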

Relax MOFs and compute structural properties

The assembled structures may not be physically plausible. These MOF structures are relaxed using the UFF force field with LAMMPS. LAMMPS has already been installed as part of the environment if you have followed the installation instructions in this README. The script for relaxing the MOF structures also computes structural properties (e.g., pore volume and surface area) with Zeo++ and the MOFids of the generated MOFs with MOFid. These packages should be installed following the instructions in their respective repositories, and the corresponding paths should be added to .env as outlined above. Each step should take no more than a few minutes to complete on a single CPU. We use multiprocessing to parallelize the computation.

Relax MOFs and compute structural properties with the following command:

python mofdiff/scripts/uff_relax.py --input ${sample_path}

This command will relax the assembled MOFs in ${sample_path}/cif and save the relaxed MOFs in ${sample_path}/relaxed. The structural properties of the relaxed MOFs will be saved in `${sample_path}/relaxed/zeo_props_relax
