AI2BMD
AI-powered ab initio biomolecular dynamics simulation
AI<sup>2</sup>BMD: AI-powered ab initio biomolecular dynamics simulation
Contents
- Overview
- Get Started
- Datasets
- System Requirements
- Advanced Setup
- Related Research
- Citation
- License
- Disclaimer
- Contacts
Overview
AI<sup>2</sup>BMD is a program for efficiently simulating protein molecular dynamics with ab initio accuracy. This repository contains the simulation program, datasets, and public materials related to AI<sup>2</sup>BMD. The main work on AI<sup>2</sup>BMD is published in Nature.
Here is an animation to illustrate how AI<sup>2</sup>BMD works.
https://github.com/user-attachments/assets/912a3e5a-c465-4dc7-8c2d-9f7807cac2a7
Get Started
The source code of AI<sup>2</sup>BMD is hosted in this repository.
We package the source code and runtime libraries into a Docker image, and provide a Python launcher program to simplify the setup process.
To run the simulation program, you don't need to clone this repository. Simply download scripts/ai2bmd and launch it (Python >= 3.7 and a Docker environment are required).
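As a quick sanity check before downloading the launcher, the two prerequisites named above can be verified from Python. This is a minimal sketch and not part of AI<sup>2</sup>BMD: the function names `python_ok` and `docker_on_path` are hypothetical, and the check only confirms that a `docker` executable is on `PATH`, not that the Docker daemon is running.

```python
import shutil
import sys

MIN_PYTHON = (3, 7)  # minimum version stated in this README


def python_ok(version=None):
    """Return True if the given (major, minor, ...) version meets the minimum."""
    if version is None:
        version = sys.version_info
    return tuple(version[:2]) >= MIN_PYTHON


def docker_on_path():
    """Return True if a `docker` executable can be found on PATH."""
    return shutil.which("docker") is not None


if __name__ == "__main__":
    print("Python >= 3.7:", python_ok())
    print("docker found: ", docker_on_path())
```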
We can run a molecular dynamics simulation as follows.
# skip the following two lines if you've already set up the launcher
wget 'https://raw.githubusercontent.com/microsoft/AI2BMD/main/scripts/ai2bmd'
chmod +x ai2bmd
# download the Chignolin protein structure data file
wget 'https://raw.githubusercontent.com/microsoft/AI2BMD/main/examples/chig.pdb'
# download the preprocessed and solvated Chignolin protein structure data files
wget --directory-prefix=chig_preprocessed 'https://raw.githubusercontent.com/microsoft/AI2BMD/main/examples/chig_preprocessed/chig-preeq.pdb'
wget --directory-prefix=chig_preprocessed 'https://raw.githubusercontent.com/microsoft/AI2BMD/main/examples/chig_preprocessed/chig-preeq-nowat.pdb'
# pull the docker image from the container registry
docker pull ghcr.io/microsoft/ai2bmd:latest
# launch the program, with all simulation parameters set to default values
# you may need to "sudo" the following line if the docker group is not configured for the user
./ai2bmd --prot-file chig.pdb --preprocess-dir chig_preprocessed --preeq-steps 0 --sim-steps 1000 --record-per-steps 1
Here we use a very simple protein, Chignolin, as an example.
The program will run a simulation with the default parameters.
The results will be placed in a new directory Logs-chig.
The directory contains the simulation trajectory file:
- chig-traj.traj: The full trajectory file in ASE binary format.
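The trajectory can be inspected with ASE, whose `ase.io.read` function reads this binary format. The sketch below assumes ASE is installed in your own Python environment (the launcher itself does not require it) and that the output path follows the `Logs-<name>/<name>-traj.traj` pattern seen in this example. `trajectory_path` and `load_frames` are hypothetical helper names.

```python
from pathlib import Path


def trajectory_path(prot_file, base_dir="."):
    """Guess the trajectory location from the protein file name.

    Assumption: the pattern is inferred from the Chignolin example
    (chig.pdb -> Logs-chig/chig-traj.traj), not from the launcher's source.
    """
    stem = Path(prot_file).stem
    return Path(base_dir) / f"Logs-{stem}" / f"{stem}-traj.traj"


def load_frames(path):
    """Read every frame of an ASE trajectory as a list of Atoms objects."""
    from ase.io import read  # imported lazily so trajectory_path works without ASE

    return read(str(path), index=":")


if __name__ == "__main__":
    traj = trajectory_path("chig.pdb")
    if traj.exists():
        frames = load_frames(traj)
        print(f"{len(frames)} frames, {len(frames[0])} atoms per frame")
    else:
        print(f"No trajectory found at {traj}")
```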
Note: Currently, AI<sup>2</sup>BMD supports MD simulations for proteins with neutral terminal caps (ACE and NME), a single chain, and standard amino acids.
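For scripted or repeated runs, the launcher invocation shown earlier can be assembled from Python. This is a sketch under the assumption that `./ai2bmd` sits in the current directory. It uses only the command-line options that appear in the example above, and `build_command` is a hypothetical helper, not part of AI<sup>2</sup>BMD.

```python
import subprocess


def build_command(prot_file, preprocess_dir, preeq_steps=0, sim_steps=1000,
                  record_per_steps=1, launcher="./ai2bmd"):
    """Assemble the launcher command line as a list of strings.

    The default values mirror the Chignolin example in this README;
    they are not necessarily the launcher's own defaults.
    """
    return [
        launcher,
        "--prot-file", str(prot_file),
        "--preprocess-dir", str(preprocess_dir),
        "--preeq-steps", str(preeq_steps),
        "--sim-steps", str(sim_steps),
        "--record-per-steps", str(record_per_steps),
    ]


if __name__ == "__main__":
    cmd = build_command("chig.pdb", "chig_preprocessed")
    print(" ".join(cmd))
    # Uncomment to actually start the simulation (requires Docker and the launcher):
    # subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting issues when file names contain spaces.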
Datasets
Protein Unit Dataset
The Protein Unit Dataset covers about 20 million dipeptide conformations calculated at the DFT level. It can be downloaded with the following commands:
# skip the following two lines if you've already set up the launcher
wget 'https://raw.githubusercontent.com/microsoft/AI2BMD/main/scripts/ai2bmd'
chmod +x ai2bmd
# you may need to "sudo" the following line if the docker group is not configured for the user
./ai2bmd --download-training-data
When it finishes, the current working directory will be populated by the numpy data files (*.npz).
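To see what was downloaded without loading the arrays, note that an `.npz` file is a zip archive holding one `.npy` member per array. The sketch below lists the array names in each file using only the standard library. `list_npz_arrays` is a hypothetical helper, and no particular array names are assumed; they are whatever the dataset defines. To load the data itself, use `numpy.load`.

```python
import zipfile
from pathlib import Path


def list_npz_arrays(directory="."):
    """Map each *.npz file in `directory` to the names of the arrays it holds."""
    contents = {}
    for path in sorted(Path(directory).glob("*.npz")):
        with zipfile.ZipFile(path) as archive:
            # Each array is stored as "<name>.npy" inside the archive.
            contents[path.name] = [
                name[:-4] for name in archive.namelist() if name.endswith(".npy")
            ]
    return contents


if __name__ == "__main__":
    for filename, arrays in list_npz_arrays().items():
        print(filename, arrays)
```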
AIMD-Chig Dataset
The AIMD-Chig dataset consists of 2 million conformations of the 166-atom Chignolin, along with their corresponding potential energy and atomic forces calculated using Density Functional Theory (DFT) at the M06-2X/6-31G* level.
- Read the article AIMD-Chig: Exploring the conformational space of a 166-atom protein Chignolin with ab initio molecular dynamics.
- Find the story The first whole conformational molecular dynamics dataset for proteins at ab initio accuracy and the novel computational technologies behind it.
- Get the dataset AIMD-Chig.
System Requirements
Hardware Requirements
The AI<sup>2</sup>BMD program runs on x86-64 GNU/Linux systems. We recommend a machine with the following specs:
- CPU: 8+ cores
- Memory: 32+ GB
- GPU: CUDA-enabled GPU with 8+ GB memory
The program has been tested on the following GPUs:
- A100
- V100
- RTX A6000
- Titan RTX
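The recommended specs can be compared against the current machine with a short script. This is a sketch only: it assumes a Linux host, reads total memory through `os.sysconf`, and queries GPU memory by calling `nvidia-smi` (assumed to be installed with the NVIDIA driver). The helper names are hypothetical.

```python
import os
import shutil
import subprocess

# Recommended minimums from the list above.
MIN_CORES = 8
MIN_MEMORY_GB = 32
MIN_GPU_MEMORY_GB = 8


def meets_recommendation(cores, memory_gb, gpu_memory_gb):
    """Compare measured values against the recommended minimums."""
    return {
        "cpu": cores >= MIN_CORES,
        "memory": memory_gb >= MIN_MEMORY_GB,
        "gpu": gpu_memory_gb >= MIN_GPU_MEMORY_GB,
    }


def total_memory_gb():
    """Total physical memory in GiB (Linux)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024**3


def largest_gpu_memory_gb():
    """Memory of the largest visible NVIDIA GPU in GiB, or 0.0 if none is found."""
    if shutil.which("nvidia-smi") is None:
        return 0.0
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    # Keep only numeric tokens; nvidia-smi reports the values in MiB.
    values = [float(v) for v in result.stdout.split() if v.replace(".", "", 1).isdigit()]
    return max(values, default=0.0) / 1024


if __name__ == "__main__":
    print(meets_recommendation(os.cpu_count() or 0, total_memory_gb(), largest_gpu_memory_gb()))
```

Reported totals are often slightly below the nominal size (a "32 GB" machine may report a little under 32 GiB), so treat a near miss as a pass.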
Software Requirements
The program has been tested on the following systems:
- OS: Ubuntu 20.04, Docker: 27.1
- OS: ArchLinux, Docker: 26.1
Advanced Setup
Environment
The runtime libraries and requirements are packed into a Docker image for convenience and practicality. Before launching the Docker image, you need to install the Docker software (see https://docs.docker.com/engine/install/ for more details) and add the user to the docker group with the following commands:
sudo groupadd docker
sudo usermod -aG docker $USER
newgrp docker
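Whether the current session already belongs to the docker group (and therefore whether `sudo` is still needed for the launcher) can be checked from Python. This is a sketch using the standard `grp` and `os` modules on a Unix-like system; `in_docker_group` is a hypothetical helper name.

```python
import grp
import os


def in_docker_group():
    """Return True if the current process belongs to the `docker` group."""
    try:
        docker_gid = grp.getgrnam("docker").gr_gid
    except KeyError:
        # The group does not exist yet (run `sudo groupadd docker` first).
        return False
    # os.getgroups() reflects the running session, so a freshly added
    # membership only shows up after `newgrp docker` or a new login.
    return docker_gid in os.getgroups() or docker_gid == os.getgid()


if __name__ == "__main__":
    print("docker group active in this session:", in_docker_group())
```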
Protein File Preparation
The input file for AI<sup>2</sup>BMD should be in .pdb format.
If hydrogen atoms are missing in the .pdb file, hydrogens should be added.
Then, the protein should be capped with ACE (acetyl) at the N-terminus and NME (N-methyl) at the C-terminus. These steps can be efficiently done using the PyMOL software with the following commands as a reference.
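import pymol
from pymol import cmd

pymol.finish_launching()
cmd.load("your_protein.pdb", "molecule")
cmd.h_add("molecule")  # add missing hydrogens

# Placeholders: replace with your object name, chain ID, and terminal residue numbers
molecule, chain = "molecule", "A"
n_term_resi, c_term_resi = 1, 10

cmd.wizard("mutagenesis")
# Cap the N-terminus with ACE
cmd.get_wizard().set_n_cap("acet")
selection = "/%s//%s/%s" % (molecule, chain, n_term_resi)  # selection of N-term
cmd.get_wizard().do_select(selection)
cmd.get_wizard().apply()
# Cap the C-terminus with NME
cmd.get_wizard().set_c_cap("nmet")
selection = "/%s//%s/%s" % (molecule, chain, c_term_resi)  # selection of C-term
cmd.get_wizard().do_select(selection)
cmd.get_wizard().apply()
cmd.set_wizard()  # close the wizard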
Next, you can use AmberTools' pdb4amber utility to adjust atom names in the .pdb file, specifically ensuring compatibility for ACE and NME as required by ai2bmd. The atom names for ACE and NME should conform to the following:
- ACE: C, O, CH3, H1, H2, H3
- NME: N, CH3, H, HH31, HH32, HH33
pdb4amber -i your_protein.pdb -o processed_your_protein.pdb
In addition, please verify that there are no TER separators in the protein chain and that the residue numbering starts from 1 without gaps.
After completing the above steps, your .pdb file should resemble the following format:
ATOM 1 H1 ACE 1 10.845 8.614 5.964 1.00 0.00 H
ATOM 2 CH3 ACE 1 10.143 9.373 5.620 1.00 0.00 C
ATOM 3 H2 ACE 1 9.425 9.446 6.437 1.00 0.00 H
ATOM 4 H3 ACE 1 9.643 9.085 4.695 1.00 0.00 H
ATOM 5 C ACE 1 10.805 10.740 5.408 1.00 0.00 C
ATOM 6 O ACE 1 10.682 11.417 4.442 1.00 0.00 O
...
ATOM 170 N NME 12 9.499 8.258 10.367 1.00 0.00 N
ATOM 171 H NME 12 9.393 8.028 11.345 1.00 0.00 H
ATOM 172 CH3 NME 12 8.845 7.223 9.569 1.00 0.00 C
ATOM 173 HH31 NME 12 7.842 6.990 9.925 1.00 0.00 H
ATOM 174 HH32 NME 12 8.798 7.589 8.543 1.00 0.00 H
ATOM 175 HH33 NME 12 9.418 6.305 9.435 1.00 0.00 H
END
You can also use the protein files in the examples folder as a reference. Note that the machine learning potential currently does not support proteins with disulfide bonds well. We will update it soon.
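The requirements above (cap atom names, no TER records, gap-free residue numbering from 1) can be checked automatically before launching a simulation. The following is a minimal sketch, not part of AI<sup>2</sup>BMD. It assumes an unsolvated input file whose ATOM records use the standard fixed-column PDB layout, and `check_pdb` is a hypothetical helper name.

```python
import sys

# Required cap atom names, as listed above.
CAP_ATOMS = {
    "ACE": {"C", "O", "CH3", "H1", "H2", "H3"},
    "NME": {"N", "CH3", "H", "HH31", "HH32", "HH33"},
}


def check_pdb(lines):
    """Return a list of problems found in the given PDB lines (empty if none)."""
    problems = []
    residues = []  # (resSeq, resName) in order of first appearance
    atoms = {}     # (resSeq, resName) -> set of atom names
    for line in lines:
        record = line[:6].strip()
        if record == "TER":
            problems.append("TER record found")
        elif record in ("ATOM", "HETATM"):
            # Fixed-column fields from the PDB format specification.
            name = line[12:16].strip()
            res_name = line[17:20].strip()
            res_seq = int(line[22:26])
            key = (res_seq, res_name)
            if key not in atoms:
                residues.append(key)
                atoms[key] = set()
            atoms[key].add(name)

    if not residues:
        return problems + ["no ATOM records found"]

    numbers = [seq for seq, _ in residues]
    if numbers != list(range(1, len(numbers) + 1)):
        problems.append("residue numbering must start at 1 and have no gaps")

    for key, expected_cap in ((residues[0], "ACE"), (residues[-1], "NME")):
        if key[1] != expected_cap:
            problems.append(f"expected {expected_cap} at residue {key[0]}, found {key[1]}")
        elif atoms[key] != CAP_ATOMS[expected_cap]:
            problems.append(f"{expected_cap} atom names differ from the required set")
    return problems


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as handle:
        for problem in check_pdb(handle):
            print(problem)
```

Saved as a script, it takes the path of the .pdb file as its only argument and prints nothing when all checks pass.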
Preprocess
During preprocessing, the solvated system is built and then goes through energy minimization and optional pre-equilibration stages. Currently, AI<sup>2</sup>BMD provides two preprocessing methods, selected via the argument preprocess_method.
If you choose the FF19SB method, the system goes through solvation, energy minimization, heating, and several pre-equilibration stages. To accelerate preprocessing with multiple CPU cores and GPUs, you need to obtain the AMBER software packages and modify the corresponding commands in src/AIMD/preprocess.py.
If you choose the AMOEBA method, the system goes through solvation and energy minimization stages. We highly recommend performing pre-equilibration simulations so that the simulation system is fully relaxed.
Simulation
AI<sup>2</sup>BMD provides two modes for performing production simulations, selected via the argument mode. In the default mode, fragment, the protein is fragmented into dipeptides, which are then evaluated by the machine learning potential at every simulation step.
AI<sup>2</sup>BMD also supports training the machine learning potential yourself and performing simulations without fragmentation. In the visnet mode, the potential energy and atomic forces of the protein are calculated directly by the ViSNet model on the whole molecule, without fragmentation. When using this mode, you need to train the ViSNet model on data for your molecules yourself, upload the model to src/ViSNet, and pass the corresponding value to the argument ckpt-type. In this way, you can use the AI<sup>2</sup>BMD simulation program to simulate any kind of molecule, not only proteins. To train the ViSNet model yourself, please check out the branch ViSNet for the source code, instructi
