Centrifuger

Taxonomic classifier for sequencing reads (bulk and single-cell data, short and long read) using FM-index with run-block compressed BWT.

Described in:

Song, L., Langmead, B. Centrifuger: lossless compression of microbial genomes for efficient and accurate metagenomic sequence classification. Genome Biol. 2024 Apr 25;25(1):106. doi: 10.1186/s13059-024-03244-4. Best Paper Award at RECOMB 2024.

Copyright (C) 2023-, Li Song

What is Centrifuger?

Centrifuger is an efficient taxonomic classification method that compares sequencing reads against a microbial genome database. It implements a novel lossless compression method, run-block compressed BWT, along with other strategies to efficiently reduce the size of microbial genome databases such as RefSeq. For example, Centrifuger can classify reads against the 2023 RefSeq prokaryotic genomes, which contain about 140G nucleotides, using 43 GB of memory. Despite running on a compressed data structure, Centrifuger is also highly efficient and can process a typical sequencing sample within an hour.

Install

  1. Clone the GitHub repo, e.g. with git clone https://github.com/mourisl/centrifuger.git
  2. Run make in the repo directory

You will find the executable files in the downloaded directory. If you want to run Centrifuger without specifying the directory, you can either add the directory of Centrifuger to the environment variable PATH or create a soft link ("ln -s") of the file "centrifuger" to a directory in PATH.

Centrifuger depends on pthreads.

Centrifuger is also available from Bioconda. You can install Centrifuger with conda install -c conda-forge -c bioconda centrifuger.

Usage

Build index

Usage: ./centrifuger-build [OPTIONS]
  Required:
    -r FILE: reference sequence file (can use multiple -r to specify more than one input file)
        or
    -l FILE: list of reference sequence file stored in <file>, one sequence file per row. Can include the taxonomy ID mapping information in the second column.
    --taxonomy-tree FILE: taxonomy tree, i.e., nodes.dmp file
    --name-table FILE: name table, i.e., names.dmp file
  Optional:
    --conversion-table FILE: seqID to taxID conversion file
      When not set, expect -l option and the -l file should have two columns as "file taxID"
    -o STRING: output prefix [centrifuger]
    -t INT: number of threads [1]
    --build-mem STR: automatic infer bmax and dcv to match memory constraints, can use T,G,M,K to specify the memory size [not used]
    --bmax INT: block size for blockwise suffix array sorting [16777216]
    --dcv INT: difference cover period [4096]
    --offrate INT: SA/offset is sampled every (2^<int>) BWT chars [4]
    --subset-tax INT: only consider the subset of input genomes under taxonomy node <int> [0]
    --concat-tax-genome: concatenate the genomes with the same taxID and discard the seqID information [not used]
    --ignore-uncategorized-genome: ignore genomes whose seqID or taxID is missing or uncategorized. [include all]
    --checkpoint: add checkpoint (files [output_prefix]_checkpoint.[123]) for resuming index construction. [not used]

The default --bmax and --dcv values may be inefficient when building indexes for larger genome databases. In that case, use the --build-mem option to give a rough estimate of the available memory.
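The two-column list accepted by -l ("file taxID", one sequence file per row) can be generated with a short script. The following is a minimal sketch, not part of Centrifuger: the file names are hypothetical, the taxonomy IDs are only illustrative, and the tab separator is an assumption, since the usage text only says the file has two columns.

```python
import tempfile
from pathlib import Path


def write_file_list(entries, path):
    """Write one 'file<TAB>taxID' row per reference sequence file.

    The tab separator is an assumption; the usage text only states
    that the -l file has two columns.
    """
    with open(path, "w") as out:
        for fasta, tax_id in entries:
            out.write(f"{fasta}\t{tax_id}\n")


# Hypothetical file names with illustrative taxonomy IDs.
entries = [
    ("library/genomeA.fna.gz", 562),
    ("library/genomeB.fna.gz", 1280),
]

list_path = Path(tempfile.mkdtemp()) / "file.list"
write_file_list(entries, list_path)
print(list_path.read_text(), end="")
```

The resulting file would then be passed to centrifuger-build with -l, in which case --conversion-table can be left unset.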

Here is a list of pre-built indexes:

| Title | Link | Size/~Memory | Date |
|-------|------|------|------|
| Refseq human, bacteria, archaea, virus + SARS-CoV2-variants from GenBank | Zenodo | 41G | 2023/10/01 |
| GTDB r226 | Dropbox | 164G | 2025/05/20 |
| GTDB r226 + Refseq human, virus, fungi, and contaminant (UniVec,EmVec)| Dropbox | 166G | 2025/05/20 |
| NCBI core nt | Dropbox | 212G | 2025/06/11 |

(For the files on the Dropbox, you can right-click and "copy link" for each individual file and use "wget" on that link to download the file through the command line.)

Classification

Usage: ./centrifuger [OPTIONS] > classification.tsv
  Required:
    -x FILE: index prefix
    -1 FILE -2 FILE: paired-end read files
      or
    -u FILE: single-end read file
      or
    -i FILE: interleaved paired-end read file
      or
    --sample-sheet FILE: list of sample files, each row: "read1 read2 barcode UMI output". Use dot(.) to represent no such file.
  Optional:
    -t INT: number of threads [1]
    -k INT: report up to <int> distinct, primary assignments for each read pair [1]
    --un STR: output unclassified reads to files with the prefix of <str>, e.g. <str>_1/2.fq.gz
    --cl STR: output classified reads to files with the prefix of <str>
    --barcode STR: path to the barcode file
    --UMI STR: path to the UMI file
    --read-format STR: format for read, barcode and UMI files, e.g. r1:0:-1,r2:0:-1,bc:0:15,um:16:-1 for paired-end files with barcode and UMI
    --min-hitlen INT: minimum length of partial hits [auto]
    --hitk-factor INT: resolve at most <int>*k entries for each hit [40; use 0 for no restriction]
    --merge-readpair: merge overlapped paired-end reads and trim adapters 
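The --read-format example above (r1:0:-1,r2:0:-1,bc:0:15,um:16:-1) can be read as a list of name:start:end fields. The sketch below shows one way to interpret such a string. It assumes 0-based coordinates, an inclusive end, and -1 meaning "to the end of the sequence", which is inferred from the example and not confirmed here. The helper names and the test sequence are made up for illustration and are not Centrifuger code.

```python
from typing import Dict, Tuple


def parse_read_format(spec: str) -> Dict[str, Tuple[int, int]]:
    """Parse 'r1:0:-1,bc:0:15' into {'r1': (0, -1), 'bc': (0, 15)}."""
    fields = {}
    for part in spec.split(","):
        name, start, end = part.split(":")
        fields[name] = (int(start), int(end))
    return fields


def extract(seq: str, start: int, end: int) -> str:
    """Assumed semantics: 0-based start, inclusive end, -1 = end of sequence."""
    return seq[start:] if end == -1 else seq[start:end + 1]


fmt = parse_read_format("r1:0:-1,r2:0:-1,bc:0:15,um:16:-1")

# Made-up 28-base sequence: a 16-base barcode followed by a 12-base UMI.
barcode_read = "ACGTACGTACGTACGT" + "TTTTGGGGCCCC"
barcode = extract(barcode_read, *fmt["bc"])
umi = extract(barcode_read, *fmt["um"])
print(barcode, umi)
```

Under these assumptions, bc:0:15 selects the first 16 bases and um:16:-1 selects everything after them.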

Quantification (taxonomic profiling)

Usage: ./centrifuger-quant [OPTIONS] > report.tsv
  Required:
    -x FILE: index prefix
    -c FILE: classification file
  optional:
    --min-score INT: only consider reads with score at least <int> 
    --min-length INT: only consider reads with classified length at least <int>
    --output-format INT: output format. (0:centrifuge,default, 1:Metaphlan, 2:CAMI)        

The quantification results are affected by the "-k" option of the classification program "centrifuger". Increasing "-k" yields more ambiguous but more specific classification results, which can potentially improve the quantification.
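To see why "-k" matters for quantification, consider how reads with several assignments contribute to read counts. The sketch below is a simplified illustration, not the centrifuger-quant algorithm: it assumes each read's weight is split evenly across its assignments, and it ignores the taxonomy tree and the abundance estimation. The read and taxonomy IDs are hypothetical.

```python
from collections import defaultdict


def tally(rows):
    """rows: iterable of (read_id, tax_id, num_matches) tuples.

    Each read contributes a total weight of 1, split evenly across
    its num_matches assignments (a simplifying assumption).
    """
    num_reads = defaultdict(float)
    num_unique = defaultdict(int)
    for _read_id, tax_id, num_matches in rows:
        num_reads[tax_id] += 1.0 / num_matches
        if num_matches == 1:
            num_unique[tax_id] += 1
    return dict(num_reads), dict(num_unique)


# Hypothetical assignments: read r2 is reported for two taxa (as with -k 2).
rows = [
    ("r1", "taxA", 1),
    ("r2", "taxA", 2),
    ("r2", "taxB", 2),
    ("r3", "taxB", 1),
]
num_reads, num_unique = tally(rows)
print(num_reads, num_unique)
```

With a larger "-k", more reads resemble r2 here: they add fractional weight to several taxa instead of being forced onto a single assignment.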

Input/Output

The primary input to Centrifuger is the index of the genome database (-x), and gzipped or uncompressed read fastq files (-1/-2 for paired; -u for single-end).

The output is written to stdout in TSV format, as follows:

readID    seqID   taxID score      2ndBestScore    hitLength    queryLength numMatches
1_1       MT019531.1     2697049   4225       0               80   80      1

The first column is the read ID from a raw sequencing read (e.g., 1_1 in the example).
The second column is the sequence ID of the genomic sequence to which the read is classified (e.g., MT019531.1).
The third column is the taxonomic ID of the genomic sequence in the second column (e.g., 2697049).
The fourth column is the score for the classification, which is the weighted sum of hits (e.g., 4225).
The fifth column is the score for the next best classification (e.g., 0).
The sixth column is the number of base pairs of the read that match the genomic sequence found by Centrifuger (e.g., 80).
The seventh column is the length of a read or the combined length of mate pairs (e.g., 80).
The eighth column is the number of classifications for this read, indicating how many assignments were made in the output (e.g., 1).
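The eight columns described above can be loaded into typed records for downstream filtering. This is a minimal sketch that assumes the file is tab-separated and starts with the header line shown in the example. The Classification type and read_classifications function are illustrative names, not part of Centrifuger. The taxonomy ID is kept as a string so the sketch does not assume it is always numeric.

```python
import csv
import io
from typing import Iterator, NamedTuple


class Classification(NamedTuple):
    read_id: str
    seq_id: str
    tax_id: str
    score: int
    second_best_score: int
    hit_length: int
    query_length: int
    num_matches: int


def read_classifications(handle) -> Iterator[Classification]:
    """Yield one record per data row, skipping the header line."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # header: readID, seqID, taxID, ...
    for row in reader:
        if not row:
            continue
        # Only the first eight columns are used; any extras are ignored.
        yield Classification(row[0], row[1], row[2],
                             *(int(x) for x in row[3:8]))


# The example row from above, written with tab separators.
sample = (
    "readID\tseqID\ttaxID\tscore\t2ndBestScore\thitLength\tqueryLength\tnumMatches\n"
    "1_1\tMT019531.1\t2697049\t4225\t0\t80\t80\t1\n"
)
records = list(read_classifications(io.StringIO(sample)))
for rec in records:
    # Fraction of the read covered by the hit.
    print(rec.read_id, rec.tax_id, rec.hit_length / rec.query_length)
```

For a real run, pass an open file handle for classification.tsv in place of the in-memory sample.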

The "centrifuger-quant" program estimates the abundance of each taxonomy ID, and its quantification output has 7 columns.

name	taxID	taxRank	genomeSize	numReads	numUniqueReads	abundance
Legionella_pneumophila_subsp._pneumophila_str._Philadelphia_1	272624	strain	3397753	50	48	0.392641

The first column is the name of a genome, or the name corresponding to a taxonomic ID (the second column) at a rank higher than the strain (e.g., Legionella_pneumophila_subsp._pneumophila_str._Philadelphia_1).
The second column is the taxonomic ID (e.g., 272624).
The third column is the taxonomic rank (e.g., strain).
The fourth column is the length of the genome sequence (e.g., 3397753).
The fifth column is the number of reads classified to some genomic sequences (multi-classified reads are evenly distributed) under this taxonomy node (e.g., 50).
The sixth column is the number of reads uniquely classified to a genomic sequence under this taxonomy node (e.g., 48).
The seventh column is the proportion of this genome normalized by its genomic length (e.g., 0.392641).
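The quantification report can be read in the same way. The sketch below assumes the default output format (0, centrifuge style), tab-separated with the header line shown above. The read_report function is an illustrative name, not part of Centrifuger. numReads is parsed as a float in case fractional counts appear.

```python
import csv
import io


def read_report(handle):
    """Return report rows as dicts, sorted by abundance (highest first)."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        row["genomeSize"] = int(row["genomeSize"])
        row["numReads"] = float(row["numReads"])
        row["numUniqueReads"] = float(row["numUniqueReads"])
        row["abundance"] = float(row["abundance"])
        rows.append(row)
    rows.sort(key=lambda r: r["abundance"], reverse=True)
    return rows


# The example row from above, written with tab separators.
sample = (
    "name\ttaxID\ttaxRank\tgenomeSize\tnumReads\tnumUniqueReads\tabundance\n"
    "Legionella_pneumophila_subsp._pneumophila_str._Philadelphia_1"
    "\t272624\tstrain\t3397753\t50\t48\t0.392641\n"
)
report = read_report(io.StringIO(sample))
for row in report:
    print(row["taxID"], row["taxRank"], row["abundance"])
```

For a real run, pass an open file handle for report.tsv in place of the in-memory sample.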

Practical notes

  • Create index for genomes from NCBI.

You can use "centrifuger-download" to download reference sequences from NCBI. The following two commands download the NCBI taxonomy to taxonomy/ in the current directory, and all complete archaeal, bacterial and viral genomes to library/.

./centrifuger-download -o taxonomy taxonomy
./centrifuger-download -o library -d "archaea,bacteria,viral" refseq > seqid2taxid.map

To add the human (taxonomy ID 9606) or mouse (taxonomy ID 10090) genome to the downloaded files, you can use the following commands:

# human: T2T-CHM13
./centrifuger-download -o library -d "vertebrate_mammalian" -t 9606 refseq >> seqid2taxid.map
# human: hg38 reference genome
./centrifuger-download -o library -d "vertebrate_mammalian" -a "Chromosome" -t 9606 -c 'reference genome' refseq >> seqid2taxid.map
# mouse
./centrifuger-download -o library -d "vertebrate_mammalian" -a "Chromosome" -t 10090 -c 'reference genome' refseq >> seqid2taxid.map

To build the index, first put the downloaded files in a list (this differs from Centrifuge, where the files need to be concatenated) and then run centrifuger-build:

find library -type f -name "*.fna.gz" > file.list # use *_dustmasked.fna.gz as the file list if using dustmasker in centrifuger-download 

## build centrifuger index with 4 threads on a server 
