The 3DFI pipeline automates protein structure predictions, structural homology searches and alignments with putative structural homologs at the genome scale. Protein structures predicted in PDB format are searched against a local copy of the RCSB PDB database with Foldseek or GESAMT (General Efficient Structural Alignment of Macromolecular Targets) from the CCP4 package. Known PDB structures can also be searched against sets of predicted structures to identify potential structural homologs in predicted datasets. These structural homologs are then aligned for visual inspection with ChimeraX.

<hr size="8" width="100%"> <details open> <summary>Show/hide section: TOC</summary>

Introduction
Getting started
The 3DFI pipeline process in detail
Miscellaneous
- Useful scripts
- Alternate predictors
  - trRosetta
  - trRosetta2
Funding and acknowledgments
How to cite
References

</details> <hr size="8" width="100%"> <details open> <summary>Show/hide section: Introduction</summary>

Introduction

About function inferences

Inferring the function of proteins using computational approaches usually involves performing some sort of homology search based on sequences or structures. In sequence-based searches, nucleotide or amino acid sequences are queried against known proteins or motifs using tools such as BLAST, DIAMOND, HHBLITS or HMMER, but those searches may fail if the proteins investigated are highly divergent. In structure-based searches, proteins are searched instead at the 3D level for structural homologs.

Why structural homologs?

Because structure often confers function in biology, structural homologs often share similar functions, even if the building blocks are not the same (e.g. a wheel made of wood or steel is still a wheel regardless of its composition). Using this approach, we might be able to infer putative functions for proteins that share little to no similarity at the sequence level with known proteins, assuming that a structural match can be found.

What is needed for structure-based homology searches?

To perform structure-based predictions we need 3D structures, either determined experimentally or predicted computationally, that we can query against other structures, such as those from the RCSB PDB. We also need tools that can search for homology at the structural level. Several tools are now available to predict protein structures, many of which are implemented as web servers for ease of use. A listing can be found at CAMEO, a website that evaluates their accuracy and reliability. Fewer tools are available to perform searches at the 3D level (e.g. Foldseek, GESAMT, SSM). Foldseek is available as a standalone program, GESAMT is distributed as part of the CCP4 package, and SSM is implemented in PDBeFold.
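As an illustration of what a structure-based search looks like from the command line, the sketch below assembles a `foldseek easy-search` call in Python. It is not how 3DFI itself invokes Foldseek (the pipeline ships its own wrapper scripts); the function name and all paths (`predictions/`, `db/pdb`, `hits.m8`, `tmp`) are hypothetical placeholders. The command is only built and printed, not executed.

```python
import shlex
from pathlib import Path


def foldseek_easy_search_cmd(query_dir, target_db, out_file, tmp_dir="tmp"):
    """Build the argument list for a `foldseek easy-search` run.

    All paths are hypothetical placeholders; nothing is executed here.
    """
    return [
        "foldseek", "easy-search",
        str(Path(query_dir)),  # folder of predicted structures (PDB files)
        str(Path(target_db)),  # local Foldseek database, e.g. built from the RCSB PDB
        str(Path(out_file)),   # tabular list of hits
        str(Path(tmp_dir)),    # scratch directory used by Foldseek
    ]


cmd = foldseek_easy_search_cmd("predictions/", "db/pdb", "hits.m8")
print(shlex.join(cmd))

# To actually run the search (requires Foldseek on the PATH):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Keeping the command as a list rather than a single string avoids shell-quoting problems when paths contain spaces.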

Why this pipeline?

Although predicting the structure of a protein and searching for structural homologs can be done online, for example by using SWISS-MODEL and PDBeFold, genomes often code for thousands of proteins, and applying this approach at the genome scale through web portals would be time-consuming and error-prone. We implemented the 3DFI pipeline to enable the use of structure-based homology searches at a genome-wide level from the command line.

</details> <hr size="8" width="100%"> <details open> <summary>Show/hide section: Getting started</summary>

Getting started

Recommended hardware

The 3DFI pipeline was tested on Fedora 33/34 Linux workstations (Workstation 1 - AMD Ryzen 5950X, NVIDIA RTX A6000, 128 GB RAM; Workstation 2 - AMD Ryzen 3900X, NVIDIA RTX 2070S, 64 GB RAM; Workstation 3 - 2x Intel Xeon E5-2640, NVIDIA GTX 1070, 128 GB RAM).

The following hardware is recommended to use 3DFI:

A CUDA-enabled NVIDIA GPU (>= 24 GB VRAM; >= 6.1 compute capability)
A fast 4 TB+ SSD
At least 64 GB of RAM

The deep-learning-based protein structure predictors AlphaFold2 and RoseTTAFold leverage the NVIDIA CUDA framework to accelerate computations on existing GPUs. Although small proteins might fit within 8 GB of video RAM (VRAM), larger proteins will require more VRAM (the RoseTTAFold authors used a 24 GB VRAM GPU in their paper). Both AlphaFold2 and RoseTTAFold can run without GPU acceleration, but doing so is much slower and is only recommended for small numbers of proteins. The template-based protein structure predictor RaptorX does not require any GPU.
Both AlphaFold2 and RoseTTAFold leverage hhblits from HH-suite3 to perform hidden Markov model searches as part of their protein structure prediction processes. These searches are I/O intensive and can be greatly sped up by putting the databases to query onto an NVMe SSD. Because the AlphaFold databases are over 2.2 TB in size once uncompressed, a fast SSD of at least 4 TB is recommended to store all databases in a single location. Running hhblits on hard drives is possible (if slower), but we have seen hhblits searches crash on a few occasions when a hard drive's I/O was being saturated.
Using AlphaFold2 with its --full_dbs preset can require a large amount of system memory. The AlphaFold --reduced_dbs preset uses a smaller memory footprint. 3DFI was tested on machines with a minimum of 64 GB of RAM but may work on machines with more modest specifications.
If investigating large datasets, a 10 TB+ storage solution is recommended (hard drives are fine) to store the results. AlphaFold often outputs over 50 GB of data per protein.
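The recommendations above can be checked on a given Linux machine with a short script. The following is a minimal sketch and not part of 3DFI: the function names and default thresholds are assumptions taken from the recommended values (64 GB of RAM, a 4 TB drive for the databases), and the GPU query relies on `nvidia-smi` being installed alongside the NVIDIA driver.

```python
import os
import shutil
import subprocess

GB = 10 ** 9  # decimal gigabytes, as memory and drive sizes are usually advertised


def total_ram_gb():
    """Total physical memory in GB (Linux; uses os.sysconf)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / GB


def disk_total_gb(path="."):
    """Total capacity, in GB, of the filesystem holding `path`."""
    return shutil.disk_usage(path).total / GB


def gpu_summary():
    """Return 'name, memory' lines from nvidia-smi, or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total",
             "--format=csv,noheader"],
            capture_output=True, text=True, timeout=30,
        )
    except (OSError, subprocess.SubprocessError):
        return None
    return result.stdout.strip() if result.returncode == 0 else None


def hardware_warnings(db_path=".", min_ram_gb=64, min_disk_gb=4000):
    """Compare this machine to the (assumed) thresholds; return warnings."""
    warnings = []
    ram = total_ram_gb()
    if ram < min_ram_gb:
        warnings.append(f"RAM: {ram:.0f} GB found, {min_ram_gb} GB recommended")
    disk = disk_total_gb(db_path)
    if disk < min_disk_gb:
        warnings.append(f"disk: {disk:.0f} GB at {db_path!r}, "
                        f"{min_disk_gb} GB recommended")
    if gpu_summary() is None:
        warnings.append("GPU: no NVIDIA GPU detected via nvidia-smi")
    return warnings


for line in hardware_warnings():
    print("WARNING:", line)
```

Formatted drives report slightly less than their advertised capacity, so the disk threshold is best treated as a guide rather than a strict pass/fail test. Point `db_path` at the directory intended for the AlphaFold and RoseTTAFold databases.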

Software requirements

The 3DFI pipeline requires the following software to perform protein structure predictions, structural homology searches/alignments and visualization:

The lightweight aria2 download utility.
At least one of the following protein structure prediction tools:
- A customized version of AlphaFold2 (Deep-learning-based)
  - Requires Docker
- RoseTTAFold (Deep-learning-based)
  - Requires Conda, PyRosetta (Python-3.7 Release)
- RaptorX (Template-based)
  - Requires MODELLER
A structural homology search tool:
- Foldseek
- GESAMT via CCP4
An alignment/visualization tool:
- ChimeraX 1.3+
Perl5 and the additional scripting module:
- [PerlIO
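Before installing the remaining pieces, it can help to see which of the tools listed above are already on the PATH. The sketch below is not part of 3DFI; the executable names (`aria2c`, `docker`, `conda`, `foldseek`, `gesamt`, `chimerax`, `perl`) are assumptions about a typical Linux installation and may differ on other systems. MODELLER and PyRosetta are omitted because they are not exposed as a single fixed executable name.

```python
import shutil

# Tool label -> candidate executable names (assumed; adjust to your install).
TOOLS = {
    "aria2": ["aria2c"],
    "Docker (AlphaFold2)": ["docker"],
    "Conda (RoseTTAFold)": ["conda"],
    "Foldseek": ["foldseek"],
    "GESAMT (CCP4)": ["gesamt"],
    "ChimeraX": ["chimerax", "ChimeraX"],
    "Perl 5": ["perl"],
}


def locate(tools=TOOLS, which=shutil.which):
    """Map each tool label to the first executable found on the PATH, or None."""
    found = {}
    for label, executables in tools.items():
        paths = (which(exe) for exe in executables)
        found[label] = next((p for p in paths if p), None)
    return found


for label, path in locate().items():
    print(f"{label:22s} {path or 'NOT FOUND'}")
```

Only one structure predictor and one search tool are needed, so a `NOT FOUND` entry is not necessarily a problem.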
