RabbitTClust

RabbitTClust: enabling fast clustering analysis of millions of bacteria genomes with MinHash sketches

`RabbitTClust v.2.4.0`

RabbitTClust is a fast and memory-efficient genome clustering tool based on sketch-based distance estimation. It processes large-scale datasets by combining dimensionality reduction with streaming and parallelization on modern multi-core platforms. RabbitTClust supports three algorithms for different scenarios: classical single-linkage hierarchical clustering (clust-mst), greedy incremental clustering (clust-greedy), and graph-based clustering (clust-leiden).

Installation

RabbitTClust v.2.4.0 supports only 64-bit Linux systems.

The detailed update information for this version, as well as the version history, can be found in the version_history document.

Install from bioconda

RabbitTClust is available from Bioconda.

Ensure that your machine supports at least AVX2 instructions.
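
On Linux, CPU feature flags are listed in /proc/cpuinfo, so AVX2 support can be checked before installing. The following is a minimal sketch of such a check; the helper name has_avx2 is hypothetical and not part of RabbitTClust.

```python
from pathlib import Path


def has_avx2(cpuinfo_text: str) -> bool:
    """Return True if any 'flags' line in /proc/cpuinfo-style text lists avx2."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags" and "avx2" in value.split():
            return True
    return False


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")  # Linux-only; absent on other systems
    if cpuinfo.exists():
        print("AVX2 supported:", has_avx2(cpuinfo.read_text()))
    else:
        print("/proc/cpuinfo not found; cannot check AVX2 here")
```

Only lines whose key is exactly "flags" are inspected, so a model name that happens to contain the string "avx2" does not produce a false positive.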

Install from source code

Dependencies

cmake v.3.0 or later
a compiler with C++14 support
zlib
igraph (optional, required for clust-leiden)

Compile and install

git clone --recursive https://github.com/RabbitBio/RabbitTClust.git
cd RabbitTClust
./install.sh

This will compile clust-mst and clust-greedy by default. If igraph is detected, clust-leiden will also be compiled.

Optional: Install igraph for clust-leiden

The clust-leiden module requires the igraph library. If igraph is not found during installation, you will see a warning message, but clust-mst and clust-greedy will still be available.

Option 1: Install via package manager (if available)

# Ubuntu/Debian
sudo apt-get install libigraph-dev

# macOS
brew install igraph

Option 2: Compile from source (recommended for CentOS/RHEL)

cd ~
wget https://github.com/igraph/igraph/releases/download/0.10.10/igraph-0.10.10.tar.gz
tar xzf igraph-0.10.10.tar.gz
cd igraph-0.10.10
mkdir build && cd build
cmake .. -DCMAKE_INSTALL_PREFIX=$HOME/local
make -j8 && make install

After installing igraph, return to the RabbitTClust directory and run ./install.sh again to compile clust-leiden.

Usage

# clust-mst, minimum-spanning-tree-based module for RabbitTClust
Usage: ./clust-mst [OPTIONS]
Options:
  -h,--help                   Print this help message and exit
  -t,--threads INT            set the thread number, default all CPUs of the platform
  -m,--min-length UINT        set the filter minimum length (minLen), genome length less than minLen will be ignore, default 10,000
  -c,--containment INT        use AAF distance with containment coefficient, set the containCompress, the sketch size is in proportion with 1/containCompress
  -k,--kmer-size INT          set the kmer size
  -s,--sketch-size INT        set the sketch size for Jaccard Index and Mash distance, default 1000
  -l,--list                   input is genome list, one genome per line
  -e,--no-save                not save the intermediate files, such as sketches or MST
  -d,--threshold FLOAT        set the distance threshold for clustering
  -o,--output TEXT REQUIRED   set the output name of cluster result
  -i,--input TEXT Excludes: --append
                              set the input file, single FASTA genome file (without -l option) or genome list file (with -l option)
  --presketched TEXT          clustering by the pre-generated sketch files rather than genomes
  --premsted TEXT             clustering by the pre-generated mst files rather than genomes for clust-mst
  --newick-tree               output the newick tree format file for clust-mst
  --fast                      use the kssd algorithm for sketching and distance computing for clust-mst
  --dense                     optional: enable density/ANI stats and MST noise-removal pass (high memory; default is off)
  --dedup-dist FLOAT          within each cluster, collapse near-duplicate nodes connected by forest edges with dist <= dedup-dist; output to <output>.dedup
  --reps-per-cluster INT      select up to k representatives per cluster (after optional dedup); output to <output>.reps
  --append TEXT Excludes: --input
                              append genome file or file list with the pre-generated sketch or MST files

# clust-greedy, greedy incremental clustering module for RabbitTClust
Usage: ./clust-greedy [OPTIONS]
Options:
  -h,--help                   Print this help message and exit
  -t,--threads INT            set the thread number, default all CPUs of the platform
  -m,--min-length UINT        set the filter minimum length (minLen), genome length less than minLen will be ignore, default 10,000
  -c,--containment INT        use AAF distance with containment coefficient, set the containCompress, the sketch size is in proportion with 1/containCompress
  -k,--kmer-size INT          set the kmer size
  -s,--sketch-size INT        set the sketch size for Jaccard Index and Mash distance, default 1000
  -l,--list                   input is genome list, one genome per line
  -e,--no-save                not save the intermediate files, such as sketches or MST
  -d,--threshold FLOAT        set the distance threshold for clustering
  -o,--output TEXT REQUIRED   set the output name of cluster result
  -i,--input TEXT Excludes: --append
                              set the input file, single FASTA genome file (without -l option) or genome list file (with -l option)
  --presketched TEXT          clustering by the pre-generated sketch files rather than genomes
  --append TEXT Excludes: --input
                              append genome file or file list with the pre-generated sketch or MST files
  --save-rep                  save representative inverted index for incremental clustering (note: may slightly affect performance)
  --dense                     optional: enable density/ANI-related MST post-processing (high memory; default is off; KSSD presketched path)

# clust-leiden, graph-based clustering module for RabbitTClust (requires igraph)
Usage: ./clust-leiden [OPTIONS]
Options:
  -h,--help                   Print this help message and exit
  -t,--threads INT            set the thread number, default all CPUs of the platform
  -m,--min-length UINT        set the filter minimum length (minLen), genome length less than minLen will be ignore, default 10,000
  -k,--kmer-size INT          set the kmer size
  -l,--list                   input is genome list, one genome per line
  -e,--no-save                not save the intermediate files, such as sketches
  -d,--threshold FLOAT        set the distance threshold for graph edge construction
  -o,--output TEXT REQUIRED   set the output name of cluster result
  -i,--input TEXT Excludes: --presketched
                              set the input file, single FASTA genome file (without -l option) or genome list file (with -l option)
  --presketched TEXT          clustering by the pre-generated sketch files rather than genomes
  --pregraph TEXT             clustering from pre-built graph (fast resolution adjustment without rebuilding graph)
  --fast                      use the kssd algorithm for sketching and distance computing (required)
  --resolution FLOAT          resolution parameter for clustering (higher = more clusters, default 1.0)
  --louvain                   use Louvain algorithm instead of Leiden (default: Leiden)
  --drlevel INT               set the dimension reduction level for Kssd sketches, default 3 with a dimension reduction of 1/4096
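
The --drlevel help text pairs level 3 with a dimension reduction of 1/4096, which equals 1/16^3. Assuming the rate follows 1/16^level for other levels as well (an inference from that one documented value, not something this README states), the expected fraction of k-mers kept can be tabulated as in the sketch below. The function name and the k-mer count are hypothetical.

```python
from fractions import Fraction


def kssd_reduction(drlevel: int) -> Fraction:
    """Assumed reduction rate 1/16**drlevel (matches the documented 1/4096 at level 3)."""
    if drlevel < 0:
        raise ValueError("drlevel must be non-negative")
    return Fraction(1, 16 ** drlevel)


if __name__ == "__main__":
    genome_kmers = 5_000_000  # hypothetical count of distinct k-mers in one genome
    for level in range(1, 5):
        rate = kssd_reduction(level)
        print(f"drlevel={level}: rate={rate}, ~{int(genome_kmers * rate)} k-mers kept")
```

Under this assumption, a higher level gives smaller sketches and faster comparison at the cost of resolution between closely related genomes.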

Example:

# input is a file list, one genome path per line:
./clust-mst -l -i bact_refseq.list -o bact_refseq.mst.clust
./clust-greedy -l -i bact_genbank.list -o bact_genbank.greedy.clust

# input is a single FASTA file in which each sequence is one genome:
./clust-mst -i bacteria.fna -o bacteria.mst.clust
./clust-greedy -i bacteria.fna -o bacteria.greedy.clust
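
The list input used with -l is a plain text file with one genome path per line. A minimal sketch for building such a file from a directory of FASTA files follows; the function name, the directory name, and the set of accepted suffixes are assumptions chosen for illustration, not RabbitTClust requirements.

```python
from pathlib import Path


def write_genome_list(genome_dir, list_path, suffixes=(".fna", ".fa", ".fasta")):
    """Write one absolute genome path per line; return the number of paths written."""
    root = Path(genome_dir)
    paths = sorted(
        p.resolve() for p in root.rglob("*") if p.is_file() and p.suffix in suffixes
    )
    Path(list_path).write_text("".join(f"{p}\n" for p in paths))
    return len(paths)


if __name__ == "__main__":
    # "genomes" and "bacteria.list" are placeholder names.
    if Path("genomes").is_dir():
        count = write_genome_list("genomes", "bacteria.list")
        print(f"wrote {count} paths to bacteria.list")
```

The resulting file can then be passed with -l -i, as in the list-based examples in this section.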

# the sketch size (or, with -c, the reciprocal of the sampling proportion), kmer size, and distance threshold can be set with the -s (-c), -k, and -d options.
./clust-mst -l -k 21 -s 1000 -d 0.05 -i bact_refseq.list -o bact_refseq.mst.clust
./clust-greedy -l -k 21 -c 1000 -d 0.05 -i bact_genbank.list -o bact_genbank.greedy.clust


# for redundancy detection with clust-greedy, input is a genome file list:
# use -d to specify the distance threshold corresponding to various degrees of redundancy.
./clust-greedy -d 0.001 -l -i bacteria.list -o bacteria.out

# v.2.1.0 or later
# the last run of clust-mst generated a folder named in year_month_day_hour-minute-second format, such as 2023_05_06_08-49-15.
# this folder contains the sketch and MST files.
# to generate clusters from the existing MST with a distance threshold of 0.045:
./clust-mst -d 0.045 --premsted 2023_05_06_08-49-15/ -o bact_refseq.mst.d.045.clust
# to generate clusters from the existing sketch files of clust-mst with a distance threshold of 0.045:
./clust-mst -d 0.045 --presketched 2023_05_06_08-49-15/ -o bact_refseq.mst.d.045.clust

# to generate clusters from the existing sketches of clust-greedy with a distance threshold of 0.001:
# folder 2023_05_06_09-37-23 contains the sketch files.
./clust-greedy -d 0.001 --presketched 2023_05_06_09-37-23/ -o bact_genbank.greedy.d.001.clust

# v.2.2.0 or later
# to generate clusters incrementally from existing partial sketches (presketch_A_dir) and an appended genome set (genome_B.list):
./clust-mst --presketched
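
The -d values in the examples above (0.05, 0.045, 0.001) are distance thresholds, and the help text for -s mentions the Jaccard index and Mash distance. Under the standard Mash definition, D = -(1/k) * ln(2j / (1 + j)) for Jaccard index j and k-mer size k. The sketch below converts between the two to give a feel for what a threshold means. It assumes this definition applies; the -c (AAF) and --fast (kssd) modes may compute distances differently. The function names are hypothetical.

```python
import math


def mash_distance(jaccard: float, k: int) -> float:
    """Mash distance D = -(1/k) * ln(2j / (1 + j)), clamped to [0, 1]."""
    if not 0.0 <= jaccard <= 1.0:
        raise ValueError("jaccard must be in [0, 1]")
    if jaccard == 0.0:
        return 1.0  # no shared k-mers: conventional maximum distance
    d = -math.log(2.0 * jaccard / (1.0 + jaccard)) / k
    return min(max(d, 0.0), 1.0)


def jaccard_for_distance(distance: float, k: int) -> float:
    """Invert the formula: j = 1 / (2 * exp(k * D) - 1)."""
    return 1.0 / (2.0 * math.exp(k * distance) - 1.0)


if __name__ == "__main__":
    k = 21  # k-mer size used in the -k 21 examples
    for d in (0.05, 0.045, 0.001):
        j = jaccard_for_distance(d, k)
        print(f"d={d}: Jaccard ~ {j:.4f}")
```

Because the distance depends on k, a threshold chosen for one k-mer size corresponds to a different Jaccard index at another.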
