AI2BMD: AI-powered ab initio biomolecular dynamics simulation

Overview
Get Started
Datasets
System Requirements
Advanced Setup
Related Research
Citation
License
Disclaimer
Contacts

Overview

AI2BMD is a program for efficiently simulating protein molecular dynamics with ab initio accuracy. This repository contains the simulation program, datasets, and public materials related to AI2BMD. The main content of AI2BMD is published in Nature.

Here is an animation to illustrate how AI2BMD works.

https://github.com/user-attachments/assets/912a3e5a-c465-4dc7-8c2d-9f7807cac2a7

Get Started

The source code of AI2BMD is hosted in this repository. We package the source code and runtime libraries into a Docker image, and provide a Python launcher program to simplify the setup process. To run the simulation program, you don't need to clone this repository. Simply download scripts/ai2bmd and launch it (Python >= 3.7 and a Docker environment are required).

We can run a molecular dynamics simulation as follows.

# skip the following two lines if you've already set up the launcher
wget 'https://raw.githubusercontent.com/microsoft/AI2BMD/main/scripts/ai2bmd'
chmod +x ai2bmd
# download the Chignolin protein structure data file
wget 'https://raw.githubusercontent.com/microsoft/AI2BMD/main/examples/chig.pdb'
# download the preprocessed and solvated Chignolin protein structure data files
wget --directory-prefix=chig_preprocessed 'https://raw.githubusercontent.com/microsoft/AI2BMD/main/examples/chig_preprocessed/chig-preeq.pdb'
wget --directory-prefix=chig_preprocessed 'https://raw.githubusercontent.com/microsoft/AI2BMD/main/examples/chig_preprocessed/chig-preeq-nowat.pdb'
# pull the docker image from the container registry
docker pull ghcr.io/microsoft/ai2bmd:latest
# launch the program, with all simulation parameters set to default values
# you may need to "sudo" the following line if the docker group is not configured for the user
./ai2bmd --prot-file chig.pdb --preprocess-dir chig_preprocessed --preeq-steps 0 --sim-steps 1000 --record-per-steps 1

Here we use Chignolin, a very simple protein, as an example. Parameters that are not given on the command line keep their default values.

The results will be placed in a new directory Logs-chig. The directory contains the simulation trajectory file:

chig-traj.traj: The full trajectory file in ASE binary format.
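The trajectory can be inspected from Python with ASE. The following is a minimal sketch, assuming the ase package is installed in your Python environment and that the run above produced Logs-chig/chig-traj.traj. The helper name summarize_trajectory is ours and is not part of AI2BMD.

```python
from typing import Tuple


def summarize_trajectory(path: str) -> Tuple[int, int]:
    """Return (number of frames, number of atoms per frame) of an ASE trajectory."""
    # Imported lazily so this module still loads where ASE is not installed.
    from ase.io import read

    # index=":" asks ASE for every frame, returned as a list of Atoms objects.
    frames = read(path, index=":")
    return len(frames), len(frames[0])


# Example usage (the path assumes the run shown above):
# n_frames, n_atoms = summarize_trajectory("Logs-chig/chig-traj.traj")
# print(n_frames, "frames,", n_atoms, "atoms per frame")
```

Each frame is an ase.Atoms object, so per-frame coordinates are available through frames[i].get_positions().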

Note: Currently, AI2BMD supports MD simulations for single-chain proteins that consist of standard amino acids and have neutral terminal caps (ACE and NME).

Datasets

Protein Unit Dataset

The Protein Unit Dataset covers about 20 million conformations of dipeptides calculated at the DFT level. It can be downloaded with the following commands:

# skip the following two lines if you've already set up the launcher
wget 'https://raw.githubusercontent.com/microsoft/AI2BMD/main/scripts/ai2bmd'
chmod +x ai2bmd
# you may need to "sudo" the following line if the docker group is not configured for the user
./ai2bmd --download-training-data

When the download finishes, the current working directory will contain the NumPy data files (*.npz).
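To see what the downloaded files contain, the arrays in each .npz archive can be listed with NumPy. This is a minimal sketch, assuming numpy is installed. The array names inside the files are not documented here, so the code only reports whatever it finds. The helper names describe_npz and describe_directory are ours.

```python
import glob
import os
from typing import Dict, Tuple

import numpy as np


def describe_npz(path: str) -> Dict[str, Tuple[int, ...]]:
    """Map each array name stored in an .npz archive to its shape."""
    # np.load opens the archive lazily; an array is read only when accessed.
    with np.load(path) as archive:
        return {name: archive[name].shape for name in archive.files}


def describe_directory(directory: str = ".") -> Dict[str, Dict[str, Tuple[int, ...]]]:
    """Describe every .npz file found directly inside the given directory."""
    pattern = os.path.join(directory, "*.npz")
    return {path: describe_npz(path) for path in sorted(glob.glob(pattern))}


# Example usage, run from the directory that holds the downloaded files:
# for path, arrays in describe_directory().items():
#     print(path, arrays)
```

Accessing an array reads it fully into memory, so on a machine with limited RAM it may be safer to inspect one file at a time.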

AIMD-Chig Dataset

The AIMD-Chig dataset consists of 2 million conformations of the 166-atom Chignolin, along with their corresponding potential energy and atomic forces calculated using Density Functional Theory (DFT) at the M06-2X/6-31G* level.

Read the article AIMD-Chig: Exploring the conformational space of a 166-atom protein Chignolin with ab initio molecular dynamics.
Find the story The first whole conformational molecular dynamics dataset for proteins at ab initio accuracy and the novel computational technologies behind it.
Get the dataset AIMD-Chig.

System Requirements

Hardware Requirements

The AI2BMD program runs on x86-64 GNU/Linux systems. We recommend a machine with the following specs:

CPU: 8+ cores
Memory: 32+ GB
GPU: CUDA-enabled GPU with 8+ GB memory

The program has been tested on the following GPUs:

A100
V100
RTX A6000
Titan RTX

Software Requirements

The program has been tested on the following systems:

OS: Ubuntu 20.04, Docker: 27.1
OS: ArchLinux, Docker: 26.1

Advanced Setup

Environment

The runtime libraries and requirements are packed into a Docker image for convenience and practicality. Before launching the Docker image, you need to install Docker (see https://docs.docker.com/engine/install/ for more details) and add your user to the docker group with the following commands:

sudo groupadd docker
sudo usermod -aG docker $USER
newgrp docker

Protein File Preparation

The input file for AI2BMD should be in .pdb format. If hydrogen atoms are missing from the .pdb file, they must be added first. Then, the protein should be capped with ACE (acetyl) at the N-terminus and NME (N-methyl) at the C-terminus. These steps can be done efficiently in PyMOL, using the following commands as a reference.

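import pymol
from pymol import cmd

pymol.finish_launching()
cmd.load("your_protein.pdb", "molecule")
cmd.h_add("molecule")  # add missing hydrogens

# Placeholders: adjust these values to match your structure
molecule = "molecule"  # object name used in cmd.load above
chain = "A"            # chain identifier of the protein
n_resi = "1"           # residue number of the N-terminal residue
c_resi = "10"          # residue number of the C-terminal residue

cmd.wizard("mutagenesis")
cmd.get_wizard().set_n_cap("acet")  # cap the N-terminus with ACE
selection = "/%s//%s/%s" % (molecule, chain, n_resi)  # selection of N-term
cmd.get_wizard().do_select(selection)
cmd.get_wizard().apply()

cmd.get_wizard().set_c_cap("nmet")  # cap the C-terminus with NME
selection = "/%s//%s/%s" % (molecule, chain, c_resi)  # selection of C-term
cmd.get_wizard().do_select(selection)
cmd.get_wizard().apply()

cmd.set_wizard()  # close the wizard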

Next, you can use the pdb4amber utility from AmberTools to adjust the atom names in the .pdb file so that the ACE and NME caps are compatible with AI2BMD. The atom names for ACE and NME should conform to the following:

ACE: C, O, CH3, H1, H2, H3
NME: N, CH3, H, HH31, HH32, HH33

pdb4amber -i your_protein.pdb -o processed_your_protein.pdb

In addition, please verify that there are no TER separators within the protein chain and that the residue numbering starts from 1 without gaps.

After completing the above steps, your .pdb file should resemble the following format:

ATOM      1  H1  ACE     1      10.845   8.614   5.964  1.00  0.00           H
ATOM      2  CH3 ACE     1      10.143   9.373   5.620  1.00  0.00           C
ATOM      3  H2  ACE     1       9.425   9.446   6.437  1.00  0.00           H
ATOM      4  H3  ACE     1       9.643   9.085   4.695  1.00  0.00           H
ATOM      5  C   ACE     1      10.805  10.740   5.408  1.00  0.00           C
ATOM      6  O   ACE     1      10.682  11.417   4.442  1.00  0.00           O
...
ATOM    170  N   NME    12       9.499   8.258  10.367  1.00  0.00           N
ATOM    171  H   NME    12       9.393   8.028  11.345  1.00  0.00           H
ATOM    172  CH3 NME    12       8.845   7.223   9.569  1.00  0.00           C
ATOM    173 HH31 NME    12       7.842   6.990   9.925  1.00  0.00           H
ATOM    174 HH32 NME    12       8.798   7.589   8.543  1.00  0.00           H
ATOM    175 HH33 NME    12       9.418   6.305   9.435  1.00  0.00           H
END
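The requirements listed above (cap atom names, no TER records, residue numbering from 1 without gaps) can be checked before launching a simulation. The following is a minimal sketch using only the Python standard library. It assumes the fixed-column PDB layout shown in the example, and the function name check_pdb is ours, not part of AI2BMD.

```python
from typing import Iterable, List

# Atom names required for the terminal caps, as listed above.
ACE_ATOMS = {"C", "O", "CH3", "H1", "H2", "H3"}
NME_ATOMS = {"N", "CH3", "H", "HH31", "HH32", "HH33"}


def check_pdb(lines: Iterable[str]) -> List[str]:
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    residues = []  # (residue number, residue name, atom names), in file order
    for line in lines:
        record = line[:6].strip()
        if record == "TER":
            problems.append("TER record found")
        elif record in ("ATOM", "HETATM"):
            atom = line[12:16].strip()     # columns 13-16: atom name
            resname = line[17:20].strip()  # columns 18-20: residue name
            resnum = int(line[22:26])      # columns 23-26: residue number
            if not residues or residues[-1][0] != resnum:
                residues.append((resnum, resname, set()))
            residues[-1][2].add(atom)
    if not residues:
        return problems + ["no ATOM records found"]

    numbers = [residue[0] for residue in residues]
    if numbers != list(range(1, len(numbers) + 1)):
        problems.append("residue numbering must start at 1 without gaps")

    # The first residue must be the ACE cap and the last one the NME cap.
    for (num, name, atoms), cap, expected in (
        (residues[0], "ACE", ACE_ATOMS),
        (residues[-1], "NME", NME_ATOMS),
    ):
        if name != cap:
            problems.append(f"residue {num} is {name}, expected {cap}")
        elif atoms != expected:
            problems.append(
                f"{cap} atom names {sorted(atoms)} differ from {sorted(expected)}"
            )
    return problems


# Example usage:
# with open("processed_your_protein.pdb") as handle:
#     for problem in check_pdb(handle):
#         print(problem)
```

The check is intended for the capped input protein file, not for the solvated files produced by preprocessing.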

You can also use the protein files in the examples folder as a reference. Note that the machine learning potential does not yet handle proteins with disulfide bonds well. We will update it soon.

Preprocess

During preprocessing, the solvated system is built and then goes through energy minimization and optional pre-equilibrium stages. Currently, AI2BMD provides two preprocessing methods, selected via the argument preprocess_method.

If you choose the FF19SB method, the system goes through solvation, energy minimization, heating, and several pre-equilibrium stages. To accelerate preprocessing with multiple CPU cores and GPUs, you need to obtain the AMBER software package and modify the corresponding commands in src/AIMD/preprocess.py.

If you choose the AMOEBA method, the system goes through solvation and energy minimization stages. We highly recommend performing pre-equilibrium simulations so that the simulation system is fully relaxed.

Simulation

AI2BMD provides two modes for production simulations, selected via the argument mode. In the default fragment mode, the protein is fragmented into dipeptides, which are then evaluated by the machine learning potential at every simulation step.

AI2BMD also lets you train the machine learning potential yourself and run simulations without fragmentation. In the visnet mode, the potential energy and atomic forces of the protein are calculated directly by the ViSNet model on the whole molecule, without fragmentation. To use this mode, you need to train a ViSNet model on data for your molecules, upload the model to src/ViSNet, and set the argument ckpt-type to the corresponding value. In this way, you can use the AI2BMD simulation program to simulate any kind of molecule, not only proteins. To train the ViSNet model yourself, please check out the branch ViSNet for the source code, instructi
