CCMetagen
Microbiome classification pipeline
CCMetagen processes sequence alignments produced with KMA, which implements the ConClave sorting scheme to achieve highly accurate read mappings. The pipeline is fast enough to use the whole NCBI nt collection as reference, facilitating the inclusion of understudied organisms, such as microbial eukaryotes, in metagenome surveys. CCMetagen produces ranked taxonomic results in user-friendly formats that are ready for publication or downstream statistical analyses.
If you use this tool, please cite CCMetagen and KMA.
Besides the guidelines below, we also provide a tutorial to reproduce our metagenome classification analyses of the microbiome of wild birds here.
The guidelines below describe how to use the command-line version of the CCMetagen pipeline.
CCMetagen is also available as a web service at https://cge.food.dtu.dk/services/CCMetagen/. Note that we recommend using this command-line version to analyze data exceeding 1.5Gb.
Installation
We recommend installing CCMetagen through conda. This will install CCMetagen along with all of its required dependencies.
After installing conda, you can create a new environment with CCMetagen through the following command:
conda create -n ccmetagen ccmetagen -c bioconda -c conda-forge
You can then activate your environment with:
conda activate ccmetagen
The -n ccmetagen flag will name the environment as ccmetagen, but you can choose any different name that you'd like. The ccmetagen after that specifies that you'd like the CCMetagen package installed into that environment.
Finally, the -c bioconda -c conda-forge specifies that you'd like to use the Bioconda and Conda-Forge channels, which host CCMetagen and its dependencies.
You can also install CCMetagen from source, or using pip, the Python package manager. For that, follow the installation instructions in the deprecated README file in the docs.
Check your CCMetagen installation by running CCMetagen.py --version on your command line.
Databases
After installing CCMetagen, you will need a reference database to perform taxonomic classification. There are two ways to obtain this:
Option 1: Download the indexed (ready-to-go) nt database from here.
Download the ncbi_nt_kma file (103GB zipped file) or the RefSeq_bf.zip (90GB zipped file)
Unzip the database, e.g.: unzip ncbi_nt_kma.
The nt database contains the whole NCBI nucleotide collection (from 2019, updated database to be released soon!), and is therefore suitable for identifying a range of microorganisms, including prokaryotes and eukaryotes.
There are two versions of the nt database: the one mentioned above, and another that does not contain environmental or artificial sequences. The file ncbi_nt_no_env_11jun2019.zip contains all NCBI nt entries excluding the descendants of environmental eukaryotes (taxid 61964), environmental prokaryotes (48479), unclassified sequences (12908) and artificial sequences (28384).
Option 2: Build your own reference database (recommended!)
Follow the instructions on the KMA website to index the database.
It is important that taxids are incorporated in sequence headers for processing with CCMetagen. Sequence headers should look like
>1234|sequence_description, where 1234 is the taxid.
We provide scripts to rename sequences in the nt database here.
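Before indexing a custom database, it can help to confirm that every header carries a taxid. The following is a minimal sketch, assuming the header shape described above (a numeric taxid, a pipe, then the description). The function names `taxid_from_header` and `find_bad_headers` are hypothetical and are not part of CCMetagen or KMA.

```python
import re

# Assumed header shape, taken from the example above: ">taxid|description",
# where the taxid is numeric.
HEADER_RE = re.compile(r"^>(\d+)\|(.*)$")


def taxid_from_header(header):
    """Return the taxid as an int if the header matches, else None."""
    match = HEADER_RE.match(header.strip())
    return int(match.group(1)) if match else None


def find_bad_headers(fasta_lines):
    """Collect FASTA header lines that do not start with a numeric taxid."""
    return [
        line.strip()
        for line in fasta_lines
        if line.startswith(">") and taxid_from_header(line) is None
    ]


# Made-up records for illustration only.
example = [
    ">1234|sequence_description",
    "ACGTACGT",
    ">seq42 header without a taxid",
    "GGCCTTAA",
]
print(find_bad_headers(example))
```

Any header reported by `find_bad_headers` would need to be renamed before running kma_index.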
If you want to use the RefSeq database, the format is similar to the one required for Kraken. The Opiniomics blog describes how to download sequences in an adequate format. Note that you still need to build the index with KMA:
kma_index -i refseq.fna -o refseq_indexed -NI -Sparse -
or, for faster analysis:
kma_index -i refseq.fna -o refseq_indexed -NI -Sparse TG
Quick Start
- First map sequence reads (or contigs) to the database with KMA.
For paired-end files:
kma -ipe $SAMPLE_R1 $SAMPLE_R2 -o sample_out_kma -t_db $db -t $th -1t1 -mem_mode -and -apm f
For single-end files:
kma -i $SAMPLE -o sample_out_kma -t_db $db -t $th -1t1 -mem_mode -and
If you want to calculate abundance in reads per million (RPM) or in number of reads (fragments), or if you want to calculate the proportion of mapped reads, add the flag -ef (extended features):
kma -ipe $SAMPLE_R1 $SAMPLE_R2 -o sample_out_kma -t_db $db -t $th -1t1 -mem_mode -and -apm f -ef
Where:
- $db is the path to the reference database
- $th is the number of threads
- $SAMPLE_R1 is the path to mate 1 of a paired-end metagenome/metatranscriptome sample (fastq or fasta)
- $SAMPLE_R2 is the path to mate 2 of a paired-end metagenome/metatranscriptome sample (fastq or fasta)
- $SAMPLE is the path to a single-end metagenome/metatranscriptome file (reads or contigs)
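When processing many samples, the KMA call can be assembled programmatically. The sketch below only reuses the flags shown in the commands above. The helper name `build_kma_command` and the file paths are hypothetical, and the command is printed rather than executed.

```python
import shlex


def build_kma_command(db, out_prefix, threads, reads, paired=False, extended=False):
    """Assemble the KMA argument list using only the flags shown in this README."""
    if paired:
        if len(reads) != 2:
            raise ValueError("paired-end mode needs exactly two read files")
        cmd = ["kma", "-ipe", reads[0], reads[1]]
    else:
        if len(reads) != 1:
            raise ValueError("single-end mode needs exactly one read file")
        cmd = ["kma", "-i", reads[0]]
    cmd += ["-o", out_prefix, "-t_db", db, "-t", str(threads),
            "-1t1", "-mem_mode", "-and"]
    if paired:
        # The paired-end example above also passes "-apm f".
        cmd += ["-apm", "f"]
    if extended:
        # "-ef" makes KMA write the extended features (.mapstat) file.
        cmd.append("-ef")
    return cmd


# Placeholder paths for illustration only.
cmd = build_kma_command(
    db="db/ncbi_nt_kma",
    out_prefix="sample_out_kma",
    threads=4,
    reads=["sample_R1.fastq", "sample_R2.fastq"],
    paired=True,
    extended=True,
)
print(shlex.join(cmd))
```

To run the command, pass the list to `subprocess.run(cmd, check=True)`. Keeping the arguments as a list avoids shell quoting problems with unusual file names.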
Then run CCMetagen.py:
CCMetagen.py -i $sample_out_kma.res -o results
Where $sample_out_kma.res is the alignment results file produced by KMA.
Note that if you are running CCMetagen from the local folder (instead of adding it to your path), you may need to add 'python' before CCMetagen: python CCMetagen.py -i $sample_out_kma.res -o results
Done! This applies an additional quality filter and outputs a text file with ranked taxonomic classifications and a Krona graph file for interactive visualization.
An example of the CCMetagen output can be found here (.csv file) and here (.html file).
<img src=docs/tutorial/figs_tutorial/krona_photo.png width="500" height="419.64">
In the .csv file, you will find the depth (abundance) of each match.
Abundance units
Depth can be estimated in four ways:
- By applying an additional correction for template length (default in KMA and CCMetagen);
- By counting the number of nucleotides matching the reference sequence (use flag --depth_unit nc);
- By calculating depth in Reads Per Million (RPM, use flag --depth_unit rpm); or
- By counting the number of fragments (i.e. the number of PE reads matching the reference sequence, use flag --depth_unit fr).
If you want RPM or fragment units, you will need to supply the .mapstat file generated with KMA (which you get when running kma with the flag '-ef').
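The dependency between the depth unit and the .mapstat file can be checked before launching a run. This is a minimal sketch, assuming the unit names listed above (kma, nc, rpm, fr) and the -map flag used elsewhere in this README. The helper `build_ccmetagen_command` is hypothetical and only builds the argument list.

```python
VALID_UNITS = {"kma", "nc", "rpm", "fr"}
# Units that, according to the text above, require the .mapstat file.
NEEDS_MAPSTAT = {"rpm", "fr"}


def build_ccmetagen_command(res_file, out_prefix, depth_unit="kma", mapstat=None):
    """Build a CCMetagen.py argument list and validate the depth unit."""
    if depth_unit not in VALID_UNITS:
        raise ValueError(f"unknown depth unit: {depth_unit}")
    if depth_unit in NEEDS_MAPSTAT and mapstat is None:
        raise ValueError(f"--depth_unit {depth_unit} requires a .mapstat file")
    cmd = ["CCMetagen.py", "-i", res_file, "-o", out_prefix]
    if mapstat is not None:
        cmd += ["-map", mapstat]
    if depth_unit != "kma":
        # "kma" is the default unit, so the flag is only added for the others.
        cmd += ["--depth_unit", depth_unit]
    return cmd


print(build_ccmetagen_command("sample_out_kma.res", "results",
                              depth_unit="rpm",
                              mapstat="sample_out_kma.mapstat"))
```

Requesting rpm or fr without a .mapstat path raises a `ValueError` immediately, which is cheaper than discovering the missing file after KMA has finished.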
Balancing sensitivity and specificity
You can adjust the stringency of the taxonomic assignments by adjusting the minimum coverage (--coverage), the minimum abundance (--depth), and the minimum level of sequence similarity (--query_identity). Coverage is the percentage of bases in the reference sequence that is covered by the consensus sequence (your query). It can be over 100% when the consensus sequence is larger than the reference (due to insertions, for example). You can also adjust the KMA settings to facilitate the identification of more distantly related taxa (see below).
If you change the default depth unit, we recommend adjusting the minimum abundance (--depth) accordingly, so that taxa found in low abundance are still removed. For example, you can use -d 200 (200 nucleotides) when using --depth_unit nc, which is similar to -d 0.2 when using the default '--depth_unit kma' option. If you choose to calculate abundances in RPM, you may want to adjust the minimum abundance according to your sequencing depth. For example, to calculate abundances in RPM, and filter out all matches with fewer than one read per million:
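To see how the three thresholds interact, the filter can be sketched on a small tab-separated table. This is only an illustration, not CCMetagen's implementation. It assumes the KMA .res column names Template_Coverage, Query_Identity and Depth, uses made-up rows, and the cut-offs passed in are illustrative values rather than documented defaults.

```python
import csv
import io

# Made-up rows; a real KMA .res file has more columns.
SAMPLE_RES = (
    "#Template\tTemplate_Coverage\tQuery_Identity\tDepth\n"
    "1234|taxon_a\t101.50\t98.20\t3.10\n"
    "5678|taxon_b\t12.00\t99.00\t5.00\n"
    "9012|taxon_c\t85.00\t40.00\t0.90\n"
)


def passes(row, min_coverage, min_identity, min_depth):
    """True when a match meets all three minimum thresholds."""
    return (
        float(row["Template_Coverage"]) >= min_coverage
        and float(row["Query_Identity"]) >= min_identity
        and float(row["Depth"]) >= min_depth
    )


def filter_matches(text, min_coverage, min_identity, min_depth):
    """Return the template names whose rows pass the thresholds."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [
        row["#Template"]
        for row in reader
        if passes(row, min_coverage, min_identity, min_depth)
    ]


kept = filter_matches(SAMPLE_RES, min_coverage=20, min_identity=50, min_depth=0.2)
print(kept)
```

In this toy table the first row is kept even though its coverage exceeds 100%, the second is dropped for low coverage, and the third for low query identity. Raising any threshold trades sensitivity for specificity.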
CCMetagen.py -i $sample_out_kma.res -o results -map $sample_out_kma.mapstat --depth_unit rpm --depth 1
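As a worked example of what a threshold of 1 RPM means in practice, the arithmetic can be sketched as follows. This assumes the usual definition of RPM (mapped reads divided by total reads, times one million). The read counts are invented, and CCMetagen's internal calculation may differ in detail.

```python
def reads_per_million(mapped_reads, total_reads):
    """Scale a read count to reads per million of the total library size."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return mapped_reads * 1_000_000 / total_reads


total = 20_000_000  # hypothetical library size
for mapped in (10, 20, 250):
    rpm = reads_per_million(mapped, total)
    decision = "keep" if rpm >= 1 else "drop"
    print(f"{mapped} reads -> {rpm} RPM -> {decision}")
```

With 20 million reads, a template needs at least 20 mapped reads to reach 1 RPM, so the same --depth value becomes stricter in absolute read counts as the library gets larger.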
If you would like to know the proportion of reads mapped to each template, run kma with the '-ef' flag. This will generate a file with the '.mapstat' extension. Then provide this file to CCMetagen (-map $sample_out_kma.mapstat) and add the flag '-ef y':
CCMetagen.py -i $sample_out_kma.res -o results -map $sample_out_kma.mapstat -ef y
This will filter the .mapstat file, removing the templates that did not pass CCMetagen's quality control, add the percentage of mapped reads for each template, and output a file with the extension 'stats_csv'. It will also print the overall proportion of reads mapped to these templates in the terminal. For more details about the additional columns of this file, please check KMA's manual.
When working with highly complex environments for which reference databases are scarce (e.g. many soil and marine metagenomes), it is common to obtain a low proportion of classified reads, especially if the sequencing depth is low. For a more sensitive analysis, besides relaxing the CCMetagen settings, you can adjust the KMA aligner settings, for example by removing the -and and the -apm f flags, so that you can get a match even
