Folddisco
Fast indexing and search of discontinuous motifs in protein structures
Folddisco is a tool for searching discontinuous motifs in protein structures. It is designed to handle large-scale protein databases efficiently, enabling the detection of structural motifs across thousands of proteomes or millions of structures.
Publications
Webserver
Search protein structure motifs against the AlphaFoldDB and PDB in seconds using the Folddisco webserver: search.foldseek.com/folddisco 🚀
Installation
# Install from Bioconda
conda create -n folddisco -c conda-forge -c bioconda folddisco
# Install through docker
docker pull ghcr.io/steineggerlab/folddisco:master
# Precompiled binary for Linux x86-64
wget https://mmseqs.com/folddisco/folddisco-linux-x86_64.tar.gz; tar xvfz folddisco-linux-x86_64.tar.gz; export PATH=$(pwd)/folddisco/bin/:$PATH
# Precompiled binary for Linux ARM64
wget https://mmseqs.com/folddisco/folddisco-linux-arm64.tar.gz; tar xvfz folddisco-linux-arm64.tar.gz; export PATH=$(pwd)/folddisco/bin/:$PATH
# macOS (universal, works on Apple Silicon and Intel Macs)
wget https://mmseqs.com/folddisco/folddisco-macos-universal.tar.gz; tar xvfz folddisco-macos-universal.tar.gz; export PATH=$(pwd)/folddisco/bin/:$PATH
Compile from source
Compiling from source requires the Rust toolchain (Cargo). Installation instructions are available here.
git clone https://github.com/steineggerlab/folddisco.git
cd folddisco
cargo install --features foldcomp --path .
Quick start
Folddisco queries an index of geometric hashes precomputed from protein structures.
Download pre-built database
You can download the pre-built human proteome index and use it to search for a common motif, like a zinc finger.
This example is fully self-contained. You can copy and paste the entire block into your terminal.
# Download human proteome index. Use wget or aria2 to download the index.
mkdir -p index && cd index
aria2c https://opendata.mmseqs.org/folddisco/h_sapiens_folddisco.tar.lz4
# Extract the index
lz4 -dc h_sapiens_folddisco.tar.lz4 | tar -xvf -
cd ..
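The download block above relies on aria2c (or wget), lz4, and tar being installed. As a small pre-flight sketch, the helper below lists which of the given tools are missing from PATH. The function name missing_tools is made up for this example and is not part of Folddisco.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of Folddisco): print every tool name
# passed as an argument that cannot be found on PATH, one per line.
missing_tools() {
  local tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      printf '%s\n' "$tool"
    fi
  done
}

# Check the tools used by the download and extraction commands.
missing=$(missing_tools aria2c lz4 tar | tr '\n' ' ')
if [ -n "$missing" ]; then
  echo "Install these tools before downloading: $missing" >&2
fi
```

If you prefer wget over aria2c, pass wget instead of aria2c to the check.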
Pre-built Indices
Download pre-built index files:
- Human proteome
- E. coli proteome
- AFDB proteome of 16 model organisms
- Swiss-Prot
- AFDB50
- ESM30
- To get the old version of Folddisco indices, please visit https://opendata.mmseqs.org/folddisco/
  - *.tar.gz indices are legacy indices (version 1.0), which are not compatible with version 2.0. Please use *.tar.lz4 indices for version 2.0.
  - AFDB50 (afdb50_v4_folddisco* + afdb50_v4*)
  - ESM30 (highquality_clust30_folddisco* + highquality_clust30*)
Build a custom index
The command below reads all PDB or mmCIF files from the data/serine_peptidases folder and generates an index at index/serine_peptidases_folddisco.
folddisco index -p data/serine_peptidases -i index/serine_peptidases_folddisco
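For larger collections, the Indexing section of this README recommends -m big once a database exceeds roughly 65k structures. The sketch below counts structure files and prints that flag only when the count passes a threshold. The helper name index_mode_flag and the *.pdb / *.cif file patterns are assumptions made for illustration; adjust them to how your structures are stored.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of Folddisco): print "-m big" when a directory
# holds more structure files than a threshold, otherwise print nothing.
# Assumption: structures are plain *.pdb or *.cif files, possibly in subfolders.
index_mode_flag() {
  local dir="$1"
  local threshold="${2:-65000}"   # README guidance: >65k structures
  local count
  count=$(find "$dir" -type f \( -name '*.pdb' -o -name '*.cif' \) | wc -l | tr -d ' ')
  if [ "$count" -gt "$threshold" ]; then
    printf '%s\n' "-m big"
  fi
}

# Example use (requires folddisco; the substitution is left unquoted on
# purpose so that "-m big" is split into two arguments):
# folddisco index -p data/serine_peptidases -i index/serine_peptidases_folddisco \
#   $(index_mode_flag data/serine_peptidases)
```

For a small folder such as data/serine_peptidases the helper prints nothing, so the command is identical to the one above.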
Querying a Single Motif
To search for a specific structural motif, you'll use three main flags:
- -p: Provides the query protein's structure file (PDB/mmCIF).
- -q: Specifies the comma-separated list of residues that form your motif.
- -i: Points to the target database index you want to search against.
If you omit the -q flag, folddisco defaults to a "whole structure" search. It will find all possible motifs from your entire query protein and search for them in the index.
# Search for the catalytic triad from 4CHA.pdb against the indexed peptidases.
folddisco query -i index/serine_peptidases_folddisco -p query/4CHA.pdb -q B57,B102,C195
Residue & motif syntax
The query motif can be customized with the following syntax.
- Residues: B57 = chain B, residue number 57. Ranges are inclusive: 1-10.
- Lists: comma-separated: B57,B102,C195.
- Substitutions: :<ALT> allows alternatives:
  - Single amino acid: 164:H
  - Set: 247:ND (Asp or Asn)
  - Wildcard/categories:
    - X: any amino acid
    - p: positively charged (Arg, His, Lys)
    - n: negatively charged (Asp, Glu)
    - h: polar (Asn, Gln, Ser, Thr, Tyr)
    - b: hydrophobic (Ala, Cys, Gly, Ile, Leu, Met, Phe, Pro, Val)
    - a: aromatic (His, Phe, Trp, Tyr)
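To make the token structure concrete, the sketch below splits a motif string on commas and separates each token into its position part and its optional alternatives. split_motif is a hypothetical helper written for illustration, not Folddisco's own parser; it does not expand ranges or validate residue numbers.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of Folddisco): print one line per motif token
# as "<position><TAB><alternatives>", using "-" when no alternatives are given.
split_motif() {
  local motif="$1" token pos alt
  local -a tokens
  IFS=',' read -r -a tokens <<< "$motif"
  for token in "${tokens[@]}"; do
    pos="${token%%:*}"            # text before the first colon
    if [ "$token" = "$pos" ]; then
      alt="-"                     # no ":<ALT>" suffix on this token
    else
      alt="${token#*:}"           # text after the first colon
    fi
    printf '%s\t%s\n' "$pos" "$alt"
  done
}

# The enolase motif used later in this README.
split_motif "164:H,195,221,247:ND,297:H"
```

Each output line pairs a residue position (or range) with the amino acids or category letters allowed at that position.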
Searching Multiple Motifs (Batch Mode)
To search for many motifs at once, you can provide a single query file to the -q flag (and omit the -p flag).
This file must be a tab-separated text file with these columns:
- Column 1: Path to the query structure (PDB/mmCIF).
- Column 2: Comma-separated list of motif residues.
- Column 3: (Optional) path to the output file (default: stdout).
# Search the motifs listed in a query file against the pre-downloaded human proteome index (see the pre-built database download above)
folddisco query -i index/h_sapiens_folddisco -q query/serine_peptidase.txt
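As a rough sketch, a batch query file with the columns described above can be written with printf, which makes the tab separators explicit. The file name query/my_motifs.txt and the output path result/4CHA_triad.tsv are made up for illustration; the structures and motifs are the ones used elsewhere in this README.

```bash
#!/usr/bin/env bash
# Hypothetical file name, chosen for illustration only.
query_file="query/my_motifs.txt"
mkdir -p "$(dirname "$query_file")"

# Column 1: structure path, column 2: motif residues, column 3 (optional): output path.
printf '%s\t%s\t%s\n' "query/4CHA.pdb" "B57,B102,C195" "result/4CHA_triad.tsv" > "$query_file"
printf '%s\t%s\n' "query/1G2F.pdb" "F207,F212,F225,F229" >> "$query_file"

# Sanity check: every line must have 2 or 3 tab-separated fields.
if awk -F'\t' 'NF < 2 || NF > 3 { bad = 1 } END { exit bad + 0 }' "$query_file"; then
  echo "query file looks well-formed: $query_file"
fi

# Then run the batch search (requires folddisco and a downloaded index):
# folddisco query -i index/h_sapiens_folddisco -q query/my_motifs.txt
```

Writing the file with printf avoids editors that silently convert tabs to spaces, which would break the column layout.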
Commands
Usage of Query Module
folddisco query -i <INDEX> -p <QUERY_PDB> [-q <QUERY_RESIDUES> -d <DISTANCE_THRESHOLD> -a <ANGLE_THRESHOLD> --skip-match -t <THREADS>]
Important parameters:
- -d: Distance threshold in Å; increases sensitivity during the prefilter (default: 0.5)
- -a: Angle threshold in degrees; increases sensitivity during the prefilter (default: 5)
- --skip-match: Skips residue matching and RMSD calculation (prefilter only, much faster with the same ranking)
- --top: Only report the top N hits from the prefilter (controls speed and result size)
- -t: Threads used for search
- -v: Verbose output
Example Querying
# Search with default settings. This prints matching motifs sorted by RMSD.
folddisco query -p query/4CHA.pdb -q B57,B102,C195 -i index/h_sapiens_folddisco -t 6
folddisco query -p query/1G2F.pdb -q F207,F212,F225,F229 -i index/h_sapiens_folddisco -d 0.5 -a 5 -t 6
folddisco query -p query/1LAP.pdb -q 250,255,273,332,334 -i index/h_sapiens_folddisco --skip-match -t 6 # Skip residue matching
# Query file given as separate text file
folddisco query -q query/zinc_finger.txt -i index/h_sapiens_folddisco -t 6 -d 0.5 -a 5
# Querying a whole structure
folddisco query -i index/h_sapiens_folddisco -p query/1G2F.pdb -t 6 --skip-match
# For a long query, low `--sampling-ratio` can be used to speed up the search
folddisco query -i index/h_sapiens_folddisco -p query/1G2F.pdb -t 6 --skip-match --sampling-ratio 0.3
# Using a query file with distance and angle thresholds
folddisco query -i index/h_sapiens_folddisco -q query/knottin.txt -d 0.5 -a 5 --skip-match -t 6
# Query with amino-acid substitutions and range.
# Alternative amino acids can be given after colon.
# X: substitute to any amino acid, p: positive-charged, n: negative-charged, h: hydrophilic, b: hydrophobic, a: aromatic
# Here is an enolase query with 3 substitutions: allow His at 164, Asp or Asn at 247, and His at 297. (Download the e_coli_folddisco index first.)
folddisco query -p query/2MNR.pdb -q 164:H,195,221,247:ND,297:H -i index/e_coli_folddisco -d 0.5 -a 5 --top 10 --header --per-structure
# A range can be given with a dash. This queries the first 10 residues plus the 11th residue with substitution to any amino acid.
folddisco query -p query/4CHA.pdb -q 1-10,11:X -i index/h_sapiens_folddisco -t 6 --serial-index
# Advanced query with filtering and sorting
## Based on connected node and rmsd
folddisco query -q query/zinc_finger.txt -i index/h_sapiens_folddisco -t 6 --connected-node 0.75 --rmsd 1.0
## Coverage based filtering & top N filtering without residue matching
folddisco query -q query/zinc_finger.txt -i index/h_sapiens_folddisco -t 6 --covered-node 3 --top 1000 --per-structure --skip-match
# Print top 100 structures with sorting by score
folddisco query -p query/4CHA.pdb -q B57,B102,C195 -i index/h_sapiens_folddisco -t 6 --top 100 --per-structure --sort-by-score
folddisco query -q query/zinc_finger.txt -i index/h_sapiens_folddisco -t 6 --covered-node 4 --top 100 --sort-by-score --per-structure --skip-match
# Comprehensive filtering with multiple criteria
folddisco query -q query/zinc_finger.txt -i index/h_sapiens_folddisco -t 6 -d 0.5 -a 10.0 --ca-distance 1.0 --covered-node-ratio 0.3 --max-node-ratio 0.35 --rmsd 5.0 --tm-score 0.2 --gdt-ts 0.25 --gdt-ha 0.15 --chamfer-distance 5.5 --hausdorff-distance 12.0 --sort-by node_count,gdt_ts,rmsd,idf --format-output tid,node_count,gdt_ts,rmsd,idf,matching_residues,query_residues
Indexing
Usage of Index Module
folddisco index -p <PDB_DIR|FOLDCOMP_DB> -i <INDEX_PATH> -t <THREADS> [-d <DISTANCE_BINS> -a <ANGLE_BINS> -y <FEATURE_TYPE>]
Important parameters:
- -d: Distance threshold in Å for pairs to be included (default: 16)
- -a: Bin size of angle (default: 4)
- -m: For big databases (>65k structures), enable -m big for efficiency. Mode big generates an 8GB fixed-size offset.
- -t: Threads used for search
- -v: Verbose output
- --type: Define which features se
