Predicting and modeling protein complexes with deep learning

Accurate descriptions of protein-protein interactions are essential for understanding biological systems. Can we predict protein-protein interactions given an arbitrary pair of protein sequences and, more generally, can we identify higher-order protein complexes among an arbitrary number of protein sequences? AF2Complex was born to address these questions by taking advantage of AlphaFold, the sophisticated neural network models that DeepMind originally designed for predicting structural models of single protein sequences. We extended it not only to model known protein-protein interactions, but also to predict possible interactions among multiple proteins by using the confidence of the predicted structural models. The approach can be applied to challenging scenarios such as transient interactions of membrane proteins that are difficult to capture experimentally.

In a nutshell, AF2Complex is an enhanced version of AlphaFold with many features useful for real-world scenarios involving protein complexes. Its initial development was based on AlphaFold version v2.0.1, released by DeepMind in July 2021. After DeepMind released AlphaFold-Multimer version v2.1.1 in November 2021, AF2Complex was updated to support it and the multimer models released later on. Details of our initial development, including large-scale performance evaluations and exemplary applications, have been described in this publication.

In follow-up work, we further demonstrate how to use AF2Complex to conduct large-scale virtual screening to discover novel protein-protein interactions. Using the E. coli envelopome (all proteins within the cell envelope) as the screening library, AF2Complex was applied to proteins from the outer membrane biogenesis pathway. Unexpected protein-protein interactions with profound implications were revealed.

You may test examples of AF2Complex or explore protein-protein interactions within the E. coli proteome in Google Cloud using this Colab notebook.

Updates and Features

Version 1.4.1 (2024-04-08)

Option to pair MSA by top N per species or organism
Various experimental options and improvements

Version 1.4.0 (2023-01-30)

Support AF-Multimer v3 multimer models (based on AF v2.3.1)
More options for input feature generation

Version 1.3.0 (2022-08-30)

Google Colab notebook access
Added multiple MSA pairing options
Enhanced interface score assessment
Clustering predicted complexes

Version 1.2.2 (2022-03-10)

Minor bug fix
Support AF-Multimer v2 neural network models (AF v2.2.0)

Version 1.2.0 (2022-02-19)

Added support for AF-Multimer neural network models (AF v2.1.0) in both paired and unpaired MSA modes
Domain cropping of pre-generated full monomer features (in unpaired MSA mode only)
Checkpoint option for model inference
Refactored code

Version 1.0.0 (2021-11-09)

Predicting structural models of a protein complex
Paired MSAs not required for complex modeling
Metrics for evaluating structural models of protein-protein interfaces
Option to save the intermediate models during recycles
Added genome, super, economy presets
Modularized workflow including feature generation, DL model inference and MD minimization

Installation

The latest package has essentially the same software dependencies and hardware requirements as AlphaFold version v2.3.2. If you have already installed AlphaFold, only this package and one additional Python module (networkx) are required. If you have not installed AlphaFold 2, please follow its official installation guide first. Note that if you just want to evaluate the examples we provide, you do not need to install any sequence library or third-party sequence search tools, because the input features of the examples have been prepared for you. After resolving all required Python dependencies, you are (almost) good to go.

You also need the AlphaFold deep neural network models trained by DeepMind. Running this package requires the DL models with the TM-score prediction capability (i.e., monomer_ptm or multimer models). In AF's releases, these models are named params_model_x_ptm.npz (AF version 2.0.x), params_model_x_multimer.npz (AF version 2.1.x), params_model_x_multimer_v2.npz (AF version 2.2.x), or params_model_x_multimer_v3.npz (AF version 2.3.x). The examples we provide use the original monomer_ptm and the most recent multimer_v3 models, but you could use other versions of the multimer models as well. Installing the AlphaFold computing environment could take hours, depending on whether you choose to do a full installation that includes all sequence libraries. Downloading these sequence and PDB template libraries is time-consuming.
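Before a run, it can help to confirm that the parameter files for the chosen model family are actually present. The following is a minimal sketch, not part of AF2Complex: the function names and the dictionary keys are hypothetical, the parameter directory is whatever location you downloaded the models to, and the "x" in the file names is assumed to run over AlphaFold's five models per family.

```python
from pathlib import Path

# File-name suffix per model family, following the naming listed above.
# The dictionary keys are labels chosen for this sketch only.
MODEL_SUFFIXES = {
    "monomer_ptm": "_ptm",          # AF version 2.0.x
    "multimer": "_multimer",        # AF version 2.1.x
    "multimer_v2": "_multimer_v2",  # AF version 2.2.x
    "multimer_v3": "_multimer_v3",  # AF version 2.3.x
}


def expected_param_files(family, indices=range(1, 6)):
    """Return the parameter file names expected for one model family."""
    suffix = MODEL_SUFFIXES[family]
    return [f"params_model_{i}{suffix}.npz" for i in indices]


def missing_param_files(params_dir, family):
    """List the expected parameter files that are absent from params_dir."""
    params_dir = Path(params_dir)
    return [
        name
        for name in expected_param_files(family)
        if not (params_dir / name).is_file()
    ]
```

Calling `missing_param_files(your_params_dir, "multimer_v3")` returns an empty list when all five multimer_v3 files are in place.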

After you have set up AlphaFold 2, simply clone this repository by

git clone https://github.com/FreshAirTonight/af2complex

and follow the guide to run the demo examples.

Examples

Under the "example" directory, there are three CASP14 multimeric targets: H1065, H1072, and H1060v4. The goal is to predict the structures of these target complexes: one heterodimer (A1:B1), one heterotetramer (A2:B2), and one challenging homo-dodecamer (A12). For your convenience, the input features to the deep learning models are provided so that you can perform the structure prediction directly. Please follow the detailed instructions on how to test these examples.

Feature generation

We provide the pre-generated features for our benchmark data sets, and for the E. coli proteome (~4,400 proteins), at Zenodo. For AF2Complex v1.3.0 and above, a new set of input features for E. coli has been generated so that one can use the multiple MSA pairing modes. We recommend that you check out this new feature library.

If you apply this package to a new target, the first step is to generate input features. For efficient computing, we have created a staged AF2 workflow and provide a script, run_af2c_fea.py, for the feature generation stage. This script outputs features (in Python pickle format) for an individual protein sequence. Multiple options for feature generation are provided, including both the original monomer data pipeline and modified pipelines that add species information for later MSA pairing and search the full PDB library for templates. These input features of individual monomers are then used in the next stage, model inference. For a complex target, AF2Complex assembles the features of its monomeric components for model prediction. In this way, the input features of monomers can be re-used for predicting many combinations of protein-protein interactions. Please check out this guide for options and an example run script, run_fea_gen.sh.

AF2Complex version 1.3.0 and above supports reading regular or gzipped feature pickle files directly. This reduces the storage requirement for high-throughput PPI virtual screening.
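If you want to inspect or compress feature files yourself, the standard library is sufficient. The following is a minimal sketch, assuming that a gzipped feature file is an ordinary pickle compressed with gzip and recognizable by its .gz suffix; the function names are hypothetical and this is not AF2Complex's own loader.

```python
import gzip
import pickle
from pathlib import Path


def load_features(path):
    """Load a feature pickle, transparently handling a .gz suffix."""
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rb") as handle:
        return pickle.load(handle)


def save_features(features, path):
    """Write a feature pickle; a .gz suffix selects gzip compression."""
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "wb") as handle:
        pickle.dump(features, handle, protocol=pickle.HIGHEST_PROTOCOL)
```

As with any pickle, only load feature files from sources you trust, since unpickling can execute arbitrary code.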

Target syntax

After collecting the input features of monomers, you may predict a complex structure using the script run_af2c_mod.py, which runs the deep learning model inference. The stoichiometry of your target, be it a monomer or a complex, is defined in an input list file. In the examples we provide, the target list files are under the subdirectory targets. The general format of one target is as follows:

A:2/B:2/C/D/E <total_length> <output_name>

where the first column defines the stoichiometry of the complex, e.g., A:2/B:2/C/D/E, using the IDs of the individual sequences. A :<num> after a protein ID defines its homo copy number, and / separates distinct monomers. The IDs of monomers are also used as the names of the sub-directories where their input features are located. The second column, <total_length>, is the total number of amino acids in the putative complex. The length is parsed but not used by the model inference Python script; it is intended for a job scheduler during batch job submission on a computing cluster. The third column, <output_name>, is the name of the output sub-directory for the predicted structural models.

In the example above, the target complex is made of five protein sequences named A to E, and proteins A and B each have two copies. During a prediction, the program will look for the individual input features of A to E under the input feature directory, e.g., $input_dir/A/features.pkl(.gz), and then assemble them into the features for complex structure prediction. If you provide only a single protein without a copy number, e.g., A <seq_length>, it reverts to a structural prediction of the single protein A.
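The way a target line maps onto feature files can be sketched in a few lines of Python. This is an illustration of the format described above, not the parser used by run_af2c_mod.py: the function names are hypothetical, the sketch covers only the basic ID and copy-number form, it returns None when the output name column is omitted, and checking for features.pkl before features.pkl.gz is an arbitrary choice.

```python
from pathlib import Path


def parse_target_line(line):
    """Split one target line into (monomers, total_length, output_name).

    monomers is a list of (sequence_id, copy_number) tuples.
    """
    fields = line.split()
    if len(fields) < 2:
        raise ValueError(f"expected at least two columns: {line!r}")
    monomers = []
    for token in fields[0].split("/"):
        name, sep, num = token.partition(":")
        monomers.append((name, int(num) if sep else 1))
    total_length = int(fields[1])
    # Assumption of this sketch: a missing third column yields None.
    output_name = fields[2] if len(fields) > 2 else None
    return monomers, total_length, output_name


def find_feature_file(input_dir, monomer_id):
    """Locate features.pkl or features.pkl.gz under input_dir/monomer_id."""
    base = Path(input_dir) / monomer_id
    for name in ("features.pkl", "features.pkl.gz"):
        candidate = base / name
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"no feature file found under {base}")


def chain_feature_files(input_dir, monomers):
    """Return one feature path per chain, repeating a path for homo copies."""
    chains = []
    for monomer_id, copies in monomers:
        path = find_feature_file(input_dir, monomer_id)
        chains.extend([path] * copies)
    return chains
```

For the line `A:2/B:2/C/D/E 1000 out`, `parse_target_line` yields five monomers and `chain_feature_files` yields seven chain entries, while only five feature files are read from disk. The length 1000 here is a placeholder, not a value from the examples.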

A more advanced example that restricts modeling to certain domains within a sequence is as follows:

A|19-200;500-700:2/B/C 1788 A2BC

where the residue ranges 19 to 200 and 500 to 700 are taken from A's full-length input features for modeling A2BC, which is composed of two copies of A and a single copy each of B and C, with a total size of 1788 AAs. The domain modeling capability makes it convenient to model parts of a large sequence, and also avoids possible errors caused by using a pa
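The domain syntax in the example can be read with a small parser. The following is a minimal sketch, assuming that residue ranges are 1-based and inclusive at both ends; the function names are hypothetical and this is not the parser used by AF2Complex.

```python
def parse_monomer(token):
    """Parse one monomer token such as 'A|19-200;500-700:2'.

    Returns (sequence_id, ranges, copy_number), where ranges is a list
    of (start, end) tuples and is empty when no domain is specified.
    """
    copies = 1
    if ":" in token:
        token, num = token.rsplit(":", 1)
        copies = int(num)
    name, _, range_text = token.partition("|")
    ranges = []
    if range_text:
        for part in range_text.split(";"):
            start, end = part.split("-")
            ranges.append((int(start), int(end)))
    return name, ranges, copies


def cropped_length(ranges):
    """Number of residues kept, assuming inclusive range ends."""
    return sum(end - start + 1 for start, end in ranges)
```

Under the inclusive-range assumption, `A|19-200;500-700` keeps 182 + 201 = 383 residues per copy of A.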
