RabbitTClust

RabbitTClust: enabling fast clustering analysis of millions of bacteria genomes with MinHash sketches

RabbitTClust v.2.4.0

RabbitTClust is a fast and memory-efficient genome clustering tool based on sketch-based distance estimations. It enables processing of large-scale datasets by combining dimensionality reduction techniques with streaming and parallelization on modern multi-core platforms. RabbitTClust supports classical single-linkage hierarchical (clust-mst), greedy incremental clustering (clust-greedy), and graph-based clustering (clust-leiden) algorithms for different scenarios.

Installation

RabbitTClust v.2.4.0 supports only 64-bit Linux systems.

The detailed update information for this version, as well as the version history, can be found in the version_history document.

Install from bioconda

RabbitTClust is available from Bioconda.

Ensure that your machine supports at least AVX2 instructions.
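One way to verify the AVX2 requirement before installing is to inspect the CPU flags that Linux exposes in /proc/cpuinfo. The sketch below is illustrative only; `has_avx2` is a hypothetical helper name, not part of RabbitTClust.

```shell
#!/bin/sh
# has_avx2 (hypothetical helper): succeed if a CPU flags line lists "avx2"
# as a whole word, so that "avx" or "avx512f" alone do not count.
has_avx2() {
    case " $1 " in
        *" avx2 "*) return 0 ;;
        *)          return 1 ;;
    esac
}

# On Linux, the first "flags" line of /proc/cpuinfo lists the CPU features.
if [ -r /proc/cpuinfo ] && has_avx2 "$(grep -m1 '^flags' /proc/cpuinfo)"; then
    echo "AVX2 supported"
else
    echo "AVX2 not detected"
fi
```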

Install from source code

Dependencies

  • cmake v.3.0 or later
  • c++14
  • zlib
  • igraph (optional, required for clust-leiden)

Compile and install

git clone --recursive https://github.com/RabbitBio/RabbitTClust.git
cd RabbitTClust
./install.sh

This will compile clust-mst and clust-greedy by default. If igraph is detected, clust-leiden will also be compiled.

Optional: Install igraph for clust-leiden

The clust-leiden module requires the igraph library. If igraph is not found during installation, you will see a warning message, but clust-mst and clust-greedy will still be available.

Option 1: Install via package manager (if available)

# Ubuntu/Debian
sudo apt-get install libigraph-dev

# macOS
brew install igraph

Option 2: Compile from source (recommended for CentOS/RHEL)

cd ~
wget https://github.com/igraph/igraph/releases/download/0.10.10/igraph-0.10.10.tar.gz
tar xzf igraph-0.10.10.tar.gz
cd igraph-0.10.10
mkdir build && cd build
cmake .. -DCMAKE_INSTALL_PREFIX=$HOME/local
make -j8 && make install

After installing igraph, return to the RabbitTClust directory and run ./install.sh again to compile clust-leiden.
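Because Option 2 installs igraph under $HOME/local rather than a system directory, the build may need a hint about where to look. The following is a minimal sketch, assuming that install.sh locates igraph through CMake's standard package search or pkg-config (the source does not confirm this). `prepend_path` is a hypothetical helper, and the `lib/pkgconfig` subdirectory is an assumption that may be `lib64/pkgconfig` on some systems.

```shell
#!/bin/sh
# Assumed install prefix, matching -DCMAKE_INSTALL_PREFIX=$HOME/local above.
IGRAPH_PREFIX="$HOME/local"

# prepend_path (hypothetical helper): put directory $1 in front of the
# colon-separated list $2, without leaving a trailing colon when $2 is empty.
prepend_path() {
    if [ -n "$2" ]; then
        printf '%s:%s\n' "$1" "$2"
    else
        printf '%s\n' "$1"
    fi
}

# CMake consults CMAKE_PREFIX_PATH and pkg-config consults PKG_CONFIG_PATH.
CMAKE_PREFIX_PATH="$(prepend_path "$IGRAPH_PREFIX" "${CMAKE_PREFIX_PATH:-}")"
PKG_CONFIG_PATH="$(prepend_path "$IGRAPH_PREFIX/lib/pkgconfig" "${PKG_CONFIG_PATH:-}")"
export CMAKE_PREFIX_PATH PKG_CONFIG_PATH

# Then, from the RabbitTClust directory:
# ./install.sh
```

If igraph is still reported as missing, the warning printed by install.sh is the place to look for the search path it actually uses.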

Usage

# clust-mst, minimum-spanning-tree-based module for RabbitTClust
Usage: ./clust-mst [OPTIONS]
Options:
  -h,--help                   Print this help message and exit
  -t,--threads INT            set the thread number, default all CPUs of the platform
  -m,--min-length UINT        set the filter minimum length (minLen), genome length less than minLen will be ignore, default 10,000
  -c,--containment INT        use AAF distance with containment coefficient, set the containCompress, the sketch size is in proportion with 1/containCompress
  -k,--kmer-size INT          set the kmer size
  -s,--sketch-size INT        set the sketch size for Jaccard Index and Mash distance, default 1000
  -l,--list                   input is genome list, one genome per line
  -e,--no-save                not save the intermediate files, such as sketches or MST
  -d,--threshold FLOAT        set the distance threshold for clustering
  -o,--output TEXT REQUIRED   set the output name of cluster result
  -i,--input TEXT Excludes: --append
                              set the input file, single FASTA genome file (without -l option) or genome list file (with -l option)
  --presketched TEXT          clustering by the pre-generated sketch files rather than genomes
  --premsted TEXT             clustering by the pre-generated mst files rather than genomes for clust-mst
  --newick-tree               output the newick tree format file for clust-mst
  --fast                      use the kssd algorithm for sketching and distance computing for clust-mst
  --dense                     optional: enable density/ANI stats and MST noise-removal pass (high memory; default is off)
  --dedup-dist FLOAT          within each cluster, collapse near-duplicate nodes connected by forest edges with dist <= dedup-dist; output to <output>.dedup
  --reps-per-cluster INT      select up to k representatives per cluster (after optional dedup); output to <output>.reps
  --append TEXT Excludes: --input
                              append genome file or file list with the pre-generated sketch or MST files

# clust-greedy, greedy incremental clustering module for RabbitTClust
Usage: ./clust-greedy [OPTIONS]
Options:
  -h,--help                   Print this help message and exit
  -t,--threads INT            set the thread number, default all CPUs of the platform
  -m,--min-length UINT        set the filter minimum length (minLen), genome length less than minLen will be ignore, default 10,000
  -c,--containment INT        use AAF distance with containment coefficient, set the containCompress, the sketch size is in proportion with 1/containCompress
  -k,--kmer-size INT          set the kmer size
  -s,--sketch-size INT        set the sketch size for Jaccard Index and Mash distance, default 1000
  -l,--list                   input is genome list, one genome per line
  -e,--no-save                not save the intermediate files, such as sketches or MST
  -d,--threshold FLOAT        set the distance threshold for clustering
  -o,--output TEXT REQUIRED   set the output name of cluster result
  -i,--input TEXT Excludes: --append
                              set the input file, single FASTA genome file (without -l option) or genome list file (with -l option)
  --presketched TEXT          clustering by the pre-generated sketch files rather than genomes
  --append TEXT Excludes: --input
                              append genome file or file list with the pre-generated sketch or MST files
  --save-rep                  save representative inverted index for incremental clustering (note: may slightly affect performance)
  --dense                     optional: enable density/ANI-related MST post-processing (high memory; default is off; KSSD presketched path)

# clust-leiden, graph-based clustering module for RabbitTClust (requires igraph)
Usage: ./clust-leiden [OPTIONS]
Options:
  -h,--help                   Print this help message and exit
  -t,--threads INT            set the thread number, default all CPUs of the platform
  -m,--min-length UINT        set the filter minimum length (minLen), genome length less than minLen will be ignore, default 10,000
  -k,--kmer-size INT          set the kmer size
  -l,--list                   input is genome list, one genome per line
  -e,--no-save                not save the intermediate files, such as sketches
  -d,--threshold FLOAT        set the distance threshold for graph edge construction
  -o,--output TEXT REQUIRED   set the output name of cluster result
  -i,--input TEXT Excludes: --presketched
                              set the input file, single FASTA genome file (without -l option) or genome list file (with -l option)
  --presketched TEXT          clustering by the pre-generated sketch files rather than genomes
  --pregraph TEXT             clustering from pre-built graph (fast resolution adjustment without rebuilding graph)
  --fast                      use the kssd algorithm for sketching and distance computing (required)
  --resolution FLOAT          resolution parameter for clustering (higher = more clusters, default 1.0)
  --louvain                   use Louvain algorithm instead of Leiden (default: Leiden)
  --drlevel INT               set the dimension reduction level for Kssd sketches, default 3 with a dimension reduction of 1/4096
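The clust-leiden options listed above can be combined into a single invocation. The sketch below only prints the command so it can be reviewed first. The file names and the threshold value are hypothetical, `leiden_cmd` is an illustrative helper, and only flags from the help text are used.

```shell
#!/bin/sh
# Hypothetical input list and output name.
GENOME_LIST="genomes.list"
OUTPUT="genomes.leiden.clust"

# leiden_cmd (hypothetical helper): compose a clust-leiden command line.
#   $1 genome list, $2 edge distance threshold, $3 resolution, $4 output name
# --fast is included because the help text marks it as required.
leiden_cmd() {
    printf './clust-leiden --fast -l -i %s -d %s --resolution %s -o %s\n' \
        "$1" "$2" "$3" "$4"
}

# Print the command; pipe the output to sh to execute it.
leiden_cmd "$GENOME_LIST" 0.05 1.0 "$OUTPUT"
```

According to the help text, a higher --resolution yields more clusters, so changing the third argument is the natural knob to explore.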

Example:

# input is a file list, one genome path per line:
./clust-mst -l -i bact_refseq.list -o bact_refseq.mst.clust
./clust-greedy -l -i bact_genbank.list -o bact_genbank.greedy.clust
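A genome list for the -l option is a plain text file with one genome path per line. One way to build it is with find, as sketched here. `make_genome_list` is a hypothetical helper, and the file extensions are assumptions about how the FASTA files are named.

```shell
#!/bin/sh
# make_genome_list (hypothetical helper): write one FASTA path per line.
#   $1 directory to scan recursively, $2 output list file
make_genome_list() {
    find "$1" -type f \( -name '*.fna' -o -name '*.fa' -o -name '*.fasta' \) \
        | sort > "$2"
}

# Demo on a throwaway directory holding two empty placeholder files.
demo_dir="$(mktemp -d)"
touch "$demo_dir/genome_A.fna" "$demo_dir/genome_B.fna"
make_genome_list "$demo_dir" "$demo_dir/genomes.list"
cat "$demo_dir/genomes.list"
rm -rf "$demo_dir"
```

The resulting file can be passed with -l -i in the same way as bact_refseq.list above.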

# input is a single FASTA file in which each sequence is treated as one genome:
./clust-mst -i bacteria.fna -o bacteria.mst.clust
./clust-greedy -i bacteria.fna -o bacteria.greedy.clust

# the sketch size (reciprocal of sampling proportion), kmer size, and distance threshold can be specified by -s (-c), -k, and -d options.
./clust-mst -l -k 21 -s 1000 -d 0.05 -i bact_refseq.list -o bact_refseq.mst.clust
./clust-greedy -l -k 21 -c 1000 -d 0.05 -i bact_genbank.list -o bact_genbank.greedy.clust


# for redundancy detection with clust-greedy, input is a genome file list:
# use -d to specify the distance threshold corresponding to various degrees of redundancy.
./clust-greedy -d 0.001 -l -i bacteria.list -o bacteria.out

# v.2.1.0 or later
# the previous run of clust-mst generated a folder named in year_month_day_hour-minute-second format, such as 2023_05_06_08-49-15.
# this folder contains the sketch and MST files.
# to generate clusters from the existing MST with a distance threshold of 0.045:
./clust-mst -d 0.045 --premsted 2023_05_06_08-49-15/ -o bact_refseq.mst.d.045.clust
# to generate clusters from the existing sketch files of clust-mst with a distance threshold of 0.045:
./clust-mst -d 0.045 --presketched 2023_05_06_08-49-15/ -o bact_refseq.mst.d.045.clust
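Since --premsted reuses a saved MST, several thresholds can be tried without sketching the genomes again. The loop below is a sketch that only prints the commands. The folder name is taken from the example above, while the threshold values, the output naming scheme, and the `sweep_cmds` helper are hypothetical.

```shell
#!/bin/sh
# Folder written by a previous clust-mst run, and a hypothetical output prefix.
MST_DIR="2023_05_06_08-49-15/"
PREFIX="bact_refseq.mst"

# sweep_cmds (hypothetical helper): print one clust-mst command per threshold.
#   $1 MST folder, $2 output prefix, remaining arguments: distance thresholds
sweep_cmds() {
    dir="$1"
    prefix="$2"
    shift 2
    for d in "$@"; do
        printf './clust-mst -d %s --premsted %s -o %s.d%s.clust\n' \
            "$d" "$dir" "$prefix" "$d"
    done
}

# Print the commands; pipe the output to sh to execute them.
sweep_cmds "$MST_DIR" "$PREFIX" 0.01 0.03 0.05
```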

# to generate clusters from the existing sketches of clust-greedy with a distance threshold of 0.001:
# folder 2023_05_06_09-37-23 contains the sketch files.
./clust-greedy -d 0.001 --presketched 2023_05_06_09-37-23/ -o bact_genbank.greedy.d.001.clust

# v.2.2.0 or later
# to cluster incrementally from existing partial sketches (presketch_A_dir) plus an appended genome set (genome_B.list):
./clust-mst --presketched 
