skDER (& CiDDER)
skDER (& CiDDER): efficient & high-resolution dereplication of microbial genomes to select representatives.
Warning: Please make sure to use version 1.0.7 or greater to avoid a bug in previous versions!
Contents
- Installation
- Overview
- Algorithmic Details
- Application examples & commands
- Alternative approaches and comparisons
- Test case
- Usage
- Citation notice
- Representative genomes for select bacterial taxa
<img src="https://raw.githubusercontent.com/raufs/skDER/main/images/Logo.png" alt="drawing" width="300"/> <img src="https://github.com/raufs/skDER/blob/main/images/Logo2.png" alt="drawing" width="223.5"/>
> [!IMPORTANT]
> In v1.3.3, we introduced a `low_mem_greedy` option for low-memory dereplication of the top 20 or so taxa which are particularly well sequenced (e.g. those with >10k or >20k genomes available). As we showed in the manuscript, dereplication by skDER/CiDDER or other methods is typically not very memory-intensive when applied to an input set of <5,000 genomes, but memory needs can expand when you go beyond this. The `low_mem_greedy` mode was not included in the manuscript, but for stats on how it compares, please check out this wiki page. The quality of the representatives selected was slightly lower, for instance, not as many distinct ortholog groups were sampled per representative genome selected as with the standard `greedy` approach. This is because `low_mem_greedy` does not account for "connectivity" (aka "centrality") in prioritizing selections, but it is considerably faster and more computationally efficient, leveraging skani's `search` function through a greedy/iterative approach that prioritizes based only on N50. As an example, we were able to dereplicate >20,000 Staphylococcus genomes from GTDB R220 in around 2.25 hours using 20 threads and ~1 GB of memory with the command: `skder -t Staphylococcus -c 20 -r R220 -o Staph_R220_skDER_LMG_Results/ -auto -d low_mem_greedy`. For those interested in using this on their laptops, genomes can still add up in size, so make sure you have an appropriate amount of disk space available for the number of genomes you plan to dereplicate.
Installation
Bioconda
Note: for some setups, it is critical to specify the conda-forge channel before the bioconda channel to properly configure channel priority and ensure a successful installation.
Recommended: For a significantly faster installation process, use mamba in place of conda in the below commands, by installing mamba in your base conda environment.
conda create -n skder_env -c conda-forge -c bioconda skder
conda activate skder_env
> [!NOTE]
> 🍎 For Mac users with Apple Silicon chips, you might need to specify `CONDA_SUBDIR=osx-64` prior to `conda create` as described here. So you would issue: `CONDA_SUBDIR=osx-64 conda create -n skder_env -c conda-forge -c bioconda skder`.
Installation with mgecut (for removing MGEs prior to dereplication assessment)
To also use the option to prune out positions corresponding to MGEs using either PhiSpy or geNomad:
conda create -n skder_env -c conda-forge -c bioconda skder genomad=1.8.0 phispy "keras>=2.7,<3.0" "tensorflow>=2.7,<2.16"
conda activate skder_env
> [!TIP]
> geNomad is actively under development and new databases and versions are released intermittently. In the above code, we suggest installing v1.8.0 because it has been tested to work with database version 1.5, which we suggest downloading below; these are the versions we used for the skDER/CiDDER manuscript. For the most up-to-date installation of geNomad and its database, please consult its git repo. If there are issues with installation, please first try to install geNomad by itself in an individual conda environment. If that succeeds, which suggests the issue is due to conflicts that have arisen between geNomad and skDER/CiDDER dependencies, please let us know via a GitHub issue and we will attempt to resolve it.
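If installation conflicts do come up, creating a standalone environment containing only geNomad can help isolate the cause. A minimal sketch (the environment name below is arbitrary, and `genomad --help` is assumed to be the standard help invocation):

```shell
# create a conda environment containing only geNomad v1.8.0, to check
# whether it installs cleanly on its own (environment name is arbitrary)
conda create -n genomad_only_env -c conda-forge -c bioconda genomad=1.8.0
conda activate genomad_only_env

# print the help menu to confirm the install worked
genomad --help
```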
Docker
Download the bash wrapper script to simplify usage for skDER or CiDDER:
# download the skDER wrapper script and make it executable
wget https://raw.githubusercontent.com/raufs/skDER/refs/heads/main/Docker/skDER/run_skder.sh
chmod a+x run_skder.sh
# or download the CiDDER wrapper script and make it executable
wget https://raw.githubusercontent.com/raufs/skDER/refs/heads/main/Docker/CiDDER/run_cidder.sh
chmod a+x run_cidder.sh
# test it out!
./run_skder.sh -h
./run_cidder.sh -h
Optionally, if you are interested in filtering MGEs using geNomad, download the relevant databases:
wget https://zenodo.org/records/8339387/files/genomad_db_v1.5.tar.gz?download=1
mv genomad_db_v1.5* genomad_db_v1.5.tar.gz
tar -zxvf genomad_db_v1.5.tar.gz
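If extraction fails, it may be worth sanity-checking that the archive downloaded intact by listing its contents first (uses the same file name as above):

```shell
# list the first few entries of the database archive without extracting,
# to confirm the download is a valid gzipped tarball
tar -tzf genomad_db_v1.5.tar.gz | head -5
```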
Conda Manual
# 1. clone Git repo and change directories into it!
git clone https://github.com/raufs/skDER/
cd skDER/
# 2. create conda environment using yaml file and activate it!
conda env create -f skDER_env.yml -n skDER_env
conda activate skDER_env
# 3. complete python installation with the following command:
pip install -e .
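After installation, a quick sanity check is to print the help menu (assuming the `skder` entry point was placed on your PATH by the step above):

```shell
# confirm the skder program is installed and reachable
skder -h
```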
Overview
skDER
skDER performs dereplication of genomes using skani average nucleotide identity (ANI) and aligned fraction (AF) estimates together with either a dynamic-programming or greedy-based approach. It assesses pairwise ANI & AF estimates to determine whether two genomes are similar to each other and then chooses which genome is better suited to serve as a representative based on assembly N50 (favoring the more contiguous assembly) and connectedness (favoring genomes deemed similar to a greater number of alternate genomes).
Compared to dRep by Olm et al. 2017 and galah, skDER does not use a divide-and-conquer approach based on primary clustering with a less-accurate ANI estimator (e.g. MASH or dashing) followed by greedy clustering/dereplication based on more precise ANI estimates (for instance computed using FastANI) in a secondary round. skDER instead leverages advances in accurate yet speedy ANI calculations by skani by Shaw and Yu to simply take a "one-round" approach (albeit skani triangle itself uses a preliminary 80% ANI cutoff based on k-mer sketches, which we by default increase to 90% in skDER). skDER is also primarily designed for selecting distinct genomes for a taxonomic group for comparative genomics rather than for metagenomic application.
skDER, specifically the "dynamic programming" based approach, can still be used for metagenomic applications if users are cautious and filter out MAGs or individual contigs which have high levels of contamination, which can be assessed using CheckM or charcoal. To support this application with the realization that most MAGs likely suffer from incompleteness, we have introduced a parameter/cutoff for the max alignment fraction difference for each pair of genomes. For example, if the AF for genome 1 to genome 2 is 95% (95% of genome 1 is contained in genome 2) and the AF for genome 2 to genome 1 is 80%, then the difference is 15%. Because the default value for the difference cutoff is 10%, in that example the genome with the larger value will automatically be regarded as redundant and become disqualified as a potential representative genome.
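The AF-difference rule described above can be illustrated with a small shell sketch; the variable names and hard-coded values are purely illustrative, not skDER's actual implementation:

```shell
# Illustrative sketch of the max AF-difference cutoff described above.
af_1_to_2=95   # 95% of genome 1 aligns to (is contained in) genome 2
af_2_to_1=80   # 80% of genome 2 aligns to genome 1
cutoff=10      # default max AF-difference cutoff

# absolute difference between the two alignment fractions
diff=$(( af_1_to_2 - af_2_to_1 ))
if (( diff < 0 )); then diff=$(( -diff )); fi

if (( diff > cutoff )); then
  # the genome with the larger AF (here genome 1, which is mostly
  # contained in genome 2) is treated as redundant
  echo "genome 1 disqualified as a potential representative"
else
  echo "both genomes remain candidate representatives"
fi
```

Here the difference is 15%, which exceeds the 10% default, so genome 1 (the one mostly contained in the other) is disqualified.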
skDER features three distinct algorithms for dereplication (details can be found below):
- dynamic approach: approximates selection of a single representative genome per transitive cluster - results in a concise listing of representative genomes - well suited for metagenomic applications.
- greedy approach: performs selection based on greedy set cover type approach - better suited to more comprehensively select representative genomes and sample more of a taxon's pangenome [current default].
- greedy low-memory approach: performs selection iteratively using a greedy set cover type approach where genomes chosen as representatives are prioritized solely based on N50. Should result in lower-quality representative selections compared to the standard greedy mode, which also prioritizes genomes based on connectivity, but should be more memory-efficient.
> [!NOTE]
> The skDER "greedy" algorithm refers only to the selection algorithm - the all-vs-all assessment is still performed using skani triangle. This is in contrast to dRep or galah, where "greedy" refers to their iterative process of selecting representative genomes.