
3DFI

The 3DFI pipeline predicts the 3D structure of proteins and searches for structural homology in the 3D space.


<p align="left"><img src="https://github.com/PombertLab/3DFI/blob/master/Images/Logo.png" alt="3DFI - Three-dimensional function inference" width="800"></p>

The 3DFI pipeline automates protein structure predictions, structural homology searches and alignments with putative structural homologs at the genome scale. Protein structures predicted in PDB format are searched against a local copy of the RCSB PDB database with Foldseek or GESAMT (General Efficient Structural Alignment of Macromolecular Targets) from the CCP4 package. Known PDB structures can also be searched against sets of predicted structures to identify potential structural homologs in predicted datasets. These structural homologs are then aligned for visual inspection with ChimeraX.


<hr size="8" width="100%"> <details open> <summary><b><i>Show/hide section: TOC</i></b></summary>

Table of contents

</details> <hr size="8" width="100%"> <details open> <summary><b><i>Show/hide section: Introduction</i></b></summary>

Introduction

About function inferences

Inferring the function of proteins using computational approaches usually involves performing some sort of homology search based on sequences or structures. In sequence-based searches, nucleotide or amino acid sequences are queried against known proteins or motifs using tools such as BLAST, DIAMOND, HHBLITS or HMMER, but those searches may fail if the proteins investigated are highly divergent. In structure-based searches, proteins are searched instead at the 3D level for structural homologs.

Why structural homologs?

Because structure often confers function in biology, structural homologs often share similar functions, even if the building blocks are not the same (e.g. a wheel made of wood or steel is still a wheel regardless of its composition). Using this approach, we might be able to infer putative functions for proteins that share little to no similarity at the sequence level with known proteins, assuming that a structural match can be found.

What is needed for structure-based homology searches?

To perform structure-based predictions we need 3D structures, either determined experimentally or predicted computationally, that we can query against other structures, such as those from the RCSB PDB. We also need tools that can search for homology at the structural level. Several tools are now available to predict protein structures, many of which are implemented as web servers for ease of use. A listing can be found at CAMEO, a website that evaluates their accuracy and reliability. Fewer tools are available to perform searches at the 3D level (e.g. Foldseek, GESAMT, SSM). Foldseek is available as a standalone program, GESAMT is distributed as part of the CCP4 package, and SSM is implemented in PDBeFold.

Why this pipeline?

Although predicting the structure of a protein and searching for structural homologs can be done online, for example by using SWISS-MODEL and PDBeFold, genomes often code for thousands of proteins, and applying this approach on a genome scale through web portals would be time-consuming and error-prone. We implemented the 3DFI pipeline to enable structure-based homology searches at a genome-wide level from the command line.

</details> <hr size="8" width="100%"> <details open> <summary><b><i>Show/hide section: Getting started</i></b></summary>

Getting started

Recommended hardware

The 3DFI pipeline was tested on Fedora 33/34 Linux workstations (Workstation 1 - AMD Ryzen 5950X, NVIDIA RTX A6000, 128 GB RAM; Workstation 2 - AMD Ryzen 3900X, NVIDIA RTX 2070S, 64 GB RAM; Workstation 3 - 2x Intel Xeon E5-2640, NVIDIA GTX 1070, 128 GB RAM).

The following hardware is recommended to use 3DFI:

  • A CUDA-enabled NVIDIA GPU (>= 24 GB VRAM; >= 6.1 compute capability)
  • A fast 4 TB+ SSD
  • At least 64 GB of RAM

  1. The deep-learning-based protein structure predictors AlphaFold2 and RoseTTAFold leverage the NVIDIA CUDA framework to accelerate computations on GPUs. Although small proteins might fit within 8 GB of video RAM (VRAM), larger proteins will require more VRAM (the RoseTTAFold authors used a 24 GB VRAM GPU in their paper). Both AlphaFold2 and RoseTTAFold can run without GPU acceleration, but doing so is much slower and is only recommended for small numbers of proteins. The template-based protein structure predictor RaptorX does not require a GPU.

  2. Both AlphaFold2 and RoseTTAFold leverage hhblits from HH-suite3 to perform hidden Markov model searches as part of their protein structure prediction processes. These searches are I/O intensive and can be greatly sped up by putting the databases to query onto an NVMe SSD. Because the AlphaFold databases are over 2.2 TB in size once uncompressed, a fast SSD of at least 4 TB is recommended to store all databases in a single location. Running hhblits on hard drives is possible (if slower), but we have seen hhblits searches crash on a few occasions when a hard drive's I/O was being saturated.

  3. Using AlphaFold2 with its --full_dbs preset can require a large amount of system memory. The AlphaFold --reduced_dbs preset has a smaller memory footprint. 3DFI was tested on machines with a minimum of 64 GB of RAM but may work on machines with more modest specifications.

  4. If investigating large datasets, a 10 TB+ storage solution is recommended (hard drives are fine) to store the results. AlphaFold often outputs over 50 GB of data per protein.
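
The recommendations above can be checked on a candidate machine before installing anything. The following is a minimal sketch, not part of 3DFI: the function names, the `db_path` argument and the idea of comparing free disk space against the roughly 2.2 TB of uncompressed AlphaFold databases are illustrative assumptions. It assumes a Linux system, reads GPU memory through `nvidia-smi` when that tool is on the PATH, and uses decimal units (1 GB = 10^9 bytes), so the reported figures may differ slightly from vendor labels.

```python
import os
import shutil
import subprocess

# Thresholds taken from the recommendations above, in decimal units.
MIN_RAM_GB = 64
MIN_VRAM_GB = 24
ALPHAFOLD_DB_TB = 2.2  # approximate uncompressed size of the AlphaFold databases


def total_ram_gb():
    """Total physical memory in GB (Linux; relies on os.sysconf)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1e9


def free_disk_tb(path):
    """Free space in TB on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 1e12


def gpu_vram_gb():
    """Return [(gpu_name, vram_gb), ...]; empty if nvidia-smi is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, timeout=30,
        )
    except (OSError, subprocess.SubprocessError):
        return []
    if result.returncode != 0:
        return []
    gpus = []
    for line in result.stdout.splitlines():
        name, _, mib = line.rpartition(",")
        try:
            # nvidia-smi reports memory.total in MiB when units are suppressed.
            gpus.append((name.strip(), int(mib.strip()) * 2**20 / 1e9))
        except ValueError:
            continue  # skip lines that do not end with an integer
    return gpus


def check_hardware(db_path):
    """Compare this machine against the recommendations; db_path is hypothetical."""
    ram = total_ram_gb()
    disk = free_disk_tb(db_path)
    gpus = gpu_vram_gb()
    best_vram = max((vram for _, vram in gpus), default=0.0)
    gpu_detail = f"{best_vram:.0f} GB on largest GPU" if gpus else "no NVIDIA GPU detected"
    return [
        ("RAM", ram >= MIN_RAM_GB, f"{ram:.0f} GB total"),
        ("Database disk", disk >= ALPHAFOLD_DB_TB, f"{disk:.1f} TB free at {db_path}"),
        ("GPU VRAM", best_vram >= MIN_VRAM_GB, gpu_detail),
    ]


if __name__ == "__main__":
    for label, ok, detail in check_hardware("."):
        print(f"{'OK  ' if ok else 'WARN'} {label}: {detail}")
```

A `WARN` line does not mean the pipeline cannot run: as noted above, smaller GPUs can handle small proteins, and machines with less memory may still work with the AlphaFold --reduced_dbs preset.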

Software requirements

The 3DFI pipeline requires the following software to perform protein structure predictions, structural homology searches/alignments and visualization:

  1. The lightweight aria2 download utility tool.
  2. At least one of the following protein structure prediction tools:
  3. A structural homology search tool:
  4. An alignment/visualization tool:
  5. Perl5 and the additional scripting module:
    • [PerlIO