CoMeta
CoMeta (Classification of metagenomes) is a tool used to assigns a query read (DNA fragment) from metagenomic sample into one of the groups (the most common the taxon).
Install / Use
/learn @jkawulok/CoMetaREADME
CoMeta
CoMeta (<b>C</b>lassification <b>o</b>f <b>meta</b>genomes) is a tool used to assign a query read (a DNA fragment) from metagenomic sample into one of the groups previously prepared by the user. Typically, this is one of the taxonomic rank (e.g., phylum, genus), however prepared groups may contain sequences having various functions.
Licence: CoMeta software distributed under GNU GPL 2 licence.
This repository contains the current version of the program. The initial version of the Cometa is described in paper: <u><i>Jolanta Kawulok, Sebastian Deorowicz (2015) CoMeta: Classification of Metagenomes Using k-mers. PLoS ONE 10(4): e0121453. doi:10.1371/journal.pone.0121453 </i> </u> and is available at http://sun.aei.polsl.pl/cometa/.
Contact: <a href="mailto:jolanta.kawulok@polsl.pl">Jolanta Kawulok</a>
Contents
<ol type="A"> <li> Preparation of the working environment </li> <li> Description of the use of files that automate the databases building and classification </li> <li> Description of the various stages of building k-mer databases and read classification </li> <li> Selection of CoMeta parameters </li> </ol>A. Preparation of the working environment:
Installation, download files and programs, adding the taxonomic id (tax number) to the reference sequence
I ***** Necessary files and programs *****
1. Programs:
1.1 BLAST+ - Useful for reference sequences extraction from NCBI database
1.2 bioperl - For dividing reference sequence to the taxonomic groups
2. Files:
2.1 Set of reference sequences:
Prepare set of reference sequences which include number gi (in FASTA
format). In order classify to taxon, data includes nucleotide sequences could be downloaded from
ftp://ftp.ncbi.nlm.nih.gov/blast/db/ website (nt.00.tar.gz, nt.01.tar.gz, nt.02.tar.gz,...) and extracted.
For extracting sequences (from NCBI), unzip the nt.xx.tar.gz file and use command "blastdbcmd" from blast package from NCBI. E.g.:
$ ./blastdbcmd -entry all -db nt.xx -out sequences_ntxx.fa
where xx is number of nt file.
2.2 Taxonomic data:
Download and unzip the two taxonomic data from NCBI website: ftp://ftp.ncbi.nih.gov/pub/taxonomy
i. file: taxdump.tar.gz, which include:
- names.dmp – Taxonomy names
- nodes.dmp – Taxonomy nodes (hierarchy)
$ wget -c -P ./NCBI_tree_tax ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
$ cd ./NCBI_tree_tax/
$ gunzip taxdump.tar.gz
$ tar -xvf taxdump.tar
ii. file: gi_taxid_nucl.dmp.gz, which include
- gi_taxid_nucl.dmp
$ wget -c -P ./NCBI_tree_tax ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.gz
$ gunzip -c ./NCBI_tree_tax/gi_taxid_nucl.dmp.gz > ./NCBI_tree_tax/gi_taxid_nucl.dmp
2.3 Module for bioperl:
Bio::LITE::Taxonomy::NCBI - For dividing reference sequence to the taxonomic groups
2.4 Boost version 1.51 or higher (for Boost/filesystem and Boost/thread libraries) - for instalation CoMeta program
change BOOST_LIB and BOOST_H in makefile to the directories where Boost is installed
II ***** CoMeta program *****
1. Directory structure:
bin - main directory of CoMeta (programs after compilation will be stored here), also it includes perl and shell scipts
src - folder with source codes
example - folder with sample data
makefile - file defines the prototype and library files used and the order of compilation
readme.txt - this file
2. Binaries:
After compilation you will obtain six binaries:
- tsk
- seq_gi2tax
- cometa
- class2best
- genlist
- num_class
III ***** Preparing reference sequences for taxonomic classification *****
In order to build the k-mer database for taxonomic classification, the taxonomic id (tax number) have to be added to the
single-line description of each reference sequence based on gi number.
In this purpose, use the program seq_gi2tax with file gi_taxid_nucl.dmp. Due to the huge file size, we suggest to divide it
into smaller parts. Before the first start of the program, use the argument -div1 to split the file:
$ ./seq_gi2tax -filGT<name> -pGT<path> -div1
And then attributing tax number can be started:
$ ./seq_gi2tax -filGT<name> -pGT<path> -in<path_name> -out<path_name>
where:
-in<path_name> - path and file name for input file with sequence which include GI number - from I.2.1 point(e.g., ./NT/sequences_nt00.fa)
-out<path_name> - path and file name for output file with sequence, where tax number is added (e.g., ./NT/sequences_nt00_TAX.fa)
-filGT<name> - file name with relation between gi number and tax number (default: gi_taxid_nucl.dmp)
-pGT<path> - path where is file with relation between gi number and tax number (path for gi_taxid_nucl.dmp file)
-div<0/1> - dividing gi_taxid_nucl.dmp file (default 0).
B. Description of the use of files that automate the databases building and classification
a simple example of usage is in "Example_auto_cometa.txt" file
There are two scripts to automate the building of databases and reads classification:
* CoMeta_bud_class_first - The script compares ready with each group. This is a classification used to start the taxonomic classification
or classification groups created by the user.
* CoMeta_bud_class_sing - The script for reads classification, which takes into account where reads have been classified to the higher level
(for the taxonomic classification).
I ***** CoMeta_bud_class_first *****
$ CoMeta_bud_class_first -maindir <DIR_MAIN_BIN> -S <FILE_META_SET> -WS <PATH_META_SET> -WD <DIR_TEMP_BIN> -tall <NR_THREADS> -t <NR_THREADCOMETA> \
-mr <NR_MEM> -proc <PROC_SIM> -k <LEN_KMER> -stepk <LEN_STEP> -mc <MATCH_CUT_OFF> -mm <0/1> -cl <-1/0/1> -suffDESCRP <SUFFIX_DESCRP> \
-listgr <LIST_GROUP> -OTU <TAX_SING_LEV> -dirin <DIR_MAIN_SCORE_CLASS> -WK <DIR_KMER_DATABASE> -diroutref <PATH_SEQ_REF> -OTUPREV <TAX_PREV_LEV> \
-dirncbitax <DIR_TAX_NCBI> -dirinref <PATH_IN_START_SEQ_REF> -divseq <0/1>
Description of the script parameters:
-maindir <DIR_MAIN_BIN> -the path where are all scripts
-S <FILE_META_SET> - file name of metagenomic set
-WS <PATH_META_SET> - path where is FILE_META_SET file
-WD <DIR_TEMP_BIN> - working directory for temporary files
-t <NR_THREADCOMETA> - total number of computional threads for single k-mer database and classification (default: 4)
-tall <NR_THREADS> - total number of computation threads (default: equal to no. of system cores). At the same time, "NUMJOBS" databases
are built and reads are compared with "NUMJOBS" groups, where NUMJOBS=NR_THREADS/NR_THREADCOMETA. Therefore, the multiple
of "NR_THREADCOMETA" is recommended.
-mr <NR_MEM> - max amount of RAM in GB
-proc <PROC_SIM> - similarity the best results, which are taken into account; default 100[%];
-k <LEN_KMER> - k-mer length (max 32); default: 24
-stepk <LEN_STEP> - k' - length of offset sliding window (default: length of k-mer)
-mc <MATCH_CUT_OFF> - the percent identity to classify a match (default: 5);
-mm <0/1> - taking into account the mismatch files; 0 - NO; 1 - YES
-cl<-1/0/1> - when reads are classified to a few groups, then reads are assignment to: -1 - any group; 0 - random group; 1 - all of these groups;
default: -1 \n"
-suffDESCRP <SUFFIX_DESCRP> - additional description of results (suffix)
-listgr <LIST_GROUP> - the list of substrings, which must be included in the name of the group to which the query reads are compared. For example,
in the folder of reference sequences/k-mer databases, there are data for bacterium, viruses, and eukaryotes, and we only
want to classify to bacteria and virus. Then, this command would be <-listgr "Bacteria Viruses"> (assuming that these names
appear in the file names).
-OTU <TAX_SING_LEV> - the generic name of the group to which reads are classified. For example, for the taxonomic classification: "phylum",
our: "mikedb".
-dirin <DIR_MAIN_SCORE_CLASS> - the path, where "TAX_SING_LEV" folder is created with the results
-WK <DIR_KMER_DATABASE> - the path, where "TAX_SING_LEV" folder is created with the k-mer databases
-diroutref <PATH_SEQ_REF> - the path, where "DIR_KMER_DATABASE/TAX_SING_LEV" folder contains reference sequences. For taxonomic classification,
the script creates "TAX_SING_LEV" folder with reference sequences (see the following parameters).
Parameters only for taxonomic classification:
-OTUPREV <TAX_PREV_LEV> - the name of a higher level to which reads are classified (eg, if TAX_SING_LEV = "phylum", "TAX_PREV_LEV" = "superkingom".
-dirncbitax <DIR_TAX_NCBI> - the path to the files of names.dmp and nodes.dmp
-dirinref <PATH_IN_START_SEQ_REF> - the path to folder with reference sequences that have not yet been divided by "TAX_SING_LEV" but containing
the taxonomy id
-divseq <0/1> - if reference sequences shall be divided by TAX_SING_LEV. 1 - yes, for the taxonomic classification, 0 - no, for their own
classification, or if it was earlier done
II ***** CoMeta_bud_class_sing *****
Reads classification is based on a higher level to which reads were classified. This script is executed for the target and each intermediate
classification levels.
$ CoMeta_bud_class_sing -maindir <DIR_MAIN_BIN> -S <FILE_META_SET> -WS <PATH_META_SET> -WD <DIR_TEMP_BIN> -tall <NR_THREADS> -t <NR_THREADCOMETA> \
-mr <NR_MEM> -proc <PROC_SIM> -k <LEN_KMER> -stepk <LEN_STEP> -mc <MATCH_CUT_OFF> -mm <0/1> -cl <-1/0/1> -suffDESCRP <SUFFIX_DESCRP> \
-OTU <TAX_SING_LEV> -dirin <DIR_MAIN_SCORE_CLASS> -WK <DIR_KMER_DATABASE> -diroutref <PATH_SEQ_REF> -OTUPREV <TAX_PREV_LEV> \
-dirncbitax <DIR_TAX_NCBI>
Description of the parameters is the same as for the script: CoMeta_bud_class_first.
Compared to the script CoMeta_bud_class
