CoMeta

CoMeta (Classification of Metagenomes) is a tool that assigns a query read (a DNA fragment) from a metagenomic sample to one of the groups previously prepared by the user. Typically, a group corresponds to a taxon at a chosen taxonomic rank (e.g., phylum or genus); however, the prepared groups may also contain sequences with various functions.

Licence: CoMeta is distributed under the GNU GPL 2 licence.

This repository contains the current version of the program. The initial version of CoMeta is described in the paper: Jolanta Kawulok, Sebastian Deorowicz (2015) CoMeta: Classification of Metagenomes Using k-mers. PLoS ONE 10(4): e0121453. doi:10.1371/journal.pone.0121453, and is available at http://sun.aei.polsl.pl/cometa/.

Contact: <a href="mailto:jolanta.kawulok@polsl.pl">Jolanta Kawulok</a>

<ol type="A"> <li> Preparation of the working environment </li> <li> Description of the use of files that automate the databases building and classification </li> <li> Description of the various stages of building k-mer databases and read classification </li> <li> Selection of CoMeta parameters </li> </ol>

A. Preparation of the working environment:

Installation, downloading the required files and programs, and adding the taxonomic id (tax number) to the reference sequences

I ***** Necessary files and programs *****
1.	Programs:
	1.1 BLAST+ - used to extract reference sequences from the NCBI database
	1.2 bioperl - used to divide reference sequences into taxonomic groups

2.	Files:
	2.1	Set of reference sequences:
		Prepare a set of reference sequences in FASTA format whose descriptions include the gi number.
		For classification to taxa, data containing nucleotide sequences can be downloaded from
		ftp://ftp.ncbi.nlm.nih.gov/blast/db/ (nt.00.tar.gz, nt.01.tar.gz, nt.02.tar.gz, ...) and extracted.
	
		To extract the sequences, unpack the nt.xx.tar.gz file and use the "blastdbcmd" command from the NCBI BLAST+ package, e.g.:
			$ ./blastdbcmd -entry all -db nt.xx -out sequences_ntxx.fa
		where xx is the number of the nt file.
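
Because the nt database is split across several volumes, the extraction command has to be repeated for each of them. The following is a minimal sketch, assuming that the volumes are numbered consecutively with two digits and that blastdbcmd is available on the PATH. The function name nt_extract_cmds is hypothetical; it only prints the commands so that they can be reviewed before anything is run.

```shell
# Print one blastdbcmd call per nt volume (dry run; nothing is executed).
# Usage: nt_extract_cmds FIRST LAST   (plain integers, e.g. 0 2)
nt_extract_cmds() {
    i=$1
    while [ "$i" -le "$2" ]; do
        vol=$(printf '%02d' "$i")   # 0 -> 00, 1 -> 01, ...
        echo "blastdbcmd -entry all -db nt.$vol -out sequences_nt$vol.fa"
        i=$((i + 1))
    done
}

nt_extract_cmds 0 2
```

Once the printed commands look right, they can be executed by piping the output to sh, e.g. `nt_extract_cmds 0 2 | sh`.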
	
	2.2	Taxonomic data:
		Download and unpack the two taxonomic data files from the NCBI website: ftp://ftp.ncbi.nih.gov/pub/taxonomy
			i.	file: taxdump.tar.gz, which includes:
				- names.dmp 	– Taxonomy names
				- nodes.dmp 	– Taxonomy nodes (hierarchy)
					$ wget -c  -P ./NCBI_tree_tax  ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
					$ cd  ./NCBI_tree_tax/
					$ gunzip taxdump.tar.gz
					$ tar -xvf taxdump.tar
				
			ii.	file: gi_taxid_nucl.dmp.gz, which includes:
				- gi_taxid_nucl.dmp
					$ wget -c -P ./NCBI_tree_tax  ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.gz
					$ gunzip -c ./NCBI_tree_tax/gi_taxid_nucl.dmp.gz > ./NCBI_tree_tax/gi_taxid_nucl.dmp
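
After the downloads above, it can be useful to confirm that the three files used in the later steps are in place. This is a minimal sketch, assuming the ./NCBI_tree_tax directory from the commands above; the function name check_tax_files is hypothetical and is not part of CoMeta.

```shell
# Report which of the expected taxonomy files are missing or empty.
# Prints one file name per line; returns non-zero if any file is missing.
check_tax_files() {
    dir=$1
    missing=0
    for f in names.dmp nodes.dmp gi_taxid_nucl.dmp; do
        if [ ! -s "$dir/$f" ]; then     # -s: file exists and is not empty
            echo "$f"
            missing=1
        fi
    done
    return "$missing"
}

if check_tax_files ./NCBI_tree_tax; then
    echo "all taxonomy files present"
fi
```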
			
	2.3	Module for bioperl:
		Bio::LITE::Taxonomy::NCBI - used to divide reference sequences into taxonomic groups
		
	2.4 Boost version 1.51 or higher (for the Boost/filesystem and Boost/thread libraries) - needed to compile the CoMeta program;
		change BOOST_LIB and BOOST_H in the makefile to the directories where Boost is installed
		
		
II ***** CoMeta program *****
1.	Directory structure:
		bin			- main directory of CoMeta (compiled programs are stored here); it also includes the Perl and shell scripts
		src			- folder with source codes
		example 	- folder with sample data
		makefile	- file defines the prototype and library files used and the order of compilation
		readme.txt	- this file
	
	
2.	Binaries:
	After compilation you will obtain six binaries:
		- tsk
		- seq_gi2tax
		- cometa
		- class2best
		- genlist
		- num_class
		
		
III ***** Preparing reference sequences for taxonomic classification *****
To build the k-mer database for taxonomic classification, the taxonomic id (tax number) has to be added to the 
single-line description of each reference sequence, based on its gi number. 
For this purpose, use the program seq_gi2tax with the file gi_taxid_nucl.dmp. Because this file is very large, we suggest dividing it 
into smaller parts. Before the first run of the program, use the argument -div1 to split the file:
	$ ./seq_gi2tax -filGT<name> -pGT<path> 	-div1
The tax numbers can then be assigned:
	$ ./seq_gi2tax -filGT<name> -pGT<path> -in<path_name> -out<path_name> 		
where:
	-in<path_name> - path and name of the input file with sequences that include the gi number, from point I.2.1 (e.g., ./NT/sequences_nt00.fa) 
	-out<path_name> - path and name of the output file with sequences to which the tax number has been added (e.g., ./NT/sequences_nt00_TAX.fa)
	-filGT<name> - name of the file with the relation between gi numbers and tax numbers (default: gi_taxid_nucl.dmp)
	-pGT<path> - path to the directory containing the file with the relation between gi numbers and tax numbers (the gi_taxid_nucl.dmp file)
	-div<0/1> - whether to divide the gi_taxid_nucl.dmp file: 1 - yes, 0 - no (default: 0).
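
When several extracted FASTA files have to be processed, the seq_gi2tax call can be generated for each of them. The sketch below rests on several assumptions: option values are attached directly to the option name (as in the synopsis above), inputs are named sequences_nt*.fa, and outputs receive the _TAX suffix used in the examples. The function name gi2tax_cmds is hypothetical, and it only prints the commands.

```shell
# Print one seq_gi2tax call per extracted FASTA file (dry run).
# Usage: gi2tax_cmds SEQ_DIR TAX_DIR
gi2tax_cmds() {
    seq_dir=$1
    tax_dir=$2
    for in_file in "$seq_dir"/sequences_nt*.fa; do
        [ -e "$in_file" ] || continue                  # glob matched nothing
        case $in_file in *_TAX.fa) continue ;; esac    # skip earlier outputs
        out_file=${in_file%.fa}_TAX.fa
        echo "./seq_gi2tax -filGTgi_taxid_nucl.dmp -pGT$tax_dir -in$in_file -out$out_file"
    done
}

gi2tax_cmds ./NT ./NCBI_tree_tax
```

The split with -div1 is not included here, since it only needs to be done once before the first run.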

B. Description of the use of files that automate the databases building and classification

A simple example of usage is given in the "Example_auto_cometa.txt" file.

There are two scripts that automate database building and read classification:
* CoMeta_bud_class_first - compares reads with each group. This classification is used to start the taxonomic classification 
							or to classify into groups created by the user.
* CoMeta_bud_class_sing - classifies reads taking into account the higher level to which the reads have already been classified 
							(for the taxonomic classification).
							
I ***** CoMeta_bud_class_first *****
	$ CoMeta_bud_class_first -maindir <DIR_MAIN_BIN> -S <FILE_META_SET>  -WS <PATH_META_SET>  -WD <DIR_TEMP_BIN>  -tall <NR_THREADS>  -t <NR_THREADCOMETA> \
	-mr <NR_MEM>  -proc <PROC_SIM>  -k <LEN_KMER>  -stepk <LEN_STEP>  -mc <MATCH_CUT_OFF>  -mm <0/1>  -cl <-1/0/1>  -suffDESCRP <SUFFIX_DESCRP> \
	-listgr <LIST_GROUP>  -OTU <TAX_SING_LEV>  -dirin <DIR_MAIN_SCORE_CLASS>  -WK <DIR_KMER_DATABASE>  -diroutref <PATH_SEQ_REF>  -OTUPREV <TAX_PREV_LEV> \
	-dirncbitax <DIR_TAX_NCBI>  -dirinref <PATH_IN_START_SEQ_REF>  -divseq <0/1>  

Description of the script parameters:
	-maindir <DIR_MAIN_BIN> - the path where all the scripts are located 
	-S <FILE_META_SET> - file name of the metagenomic set
	-WS <PATH_META_SET> - path to the directory containing the FILE_META_SET file
	-WD <DIR_TEMP_BIN> - working directory for temporary files
	-t <NR_THREADCOMETA> - total number of computational threads for a single k-mer database and classification (default: 4)
	-tall <NR_THREADS> - total number of computational threads (default: equal to the number of system cores). "NUMJOBS" databases 
						are built at the same time, and reads are compared with "NUMJOBS" groups at the same time, where 
						NUMJOBS=NR_THREADS/NR_THREADCOMETA. Therefore, a multiple of "NR_THREADCOMETA" is recommended. 
	-mr <NR_MEM> - max amount of RAM in GB
	-proc <PROC_SIM> - similarity of the best results that are taken into account (default: 100 [%])
	-k <LEN_KMER> - k-mer length (max 32; default: 24)
	-stepk <LEN_STEP> - k', the offset of the sliding window (default: the k-mer length)
	-mc <MATCH_CUT_OFF> - the percent identity required to classify a match (default: 5)
	-mm <0/1> - whether to take the mismatch files into account: 0 - no; 1 - yes
	-cl <-1/0/1> - when reads are classified to several groups, the reads are assigned to: -1 - any group; 0 - random group; 1 - all of these groups
					(default: -1)
	-suffDESCRP <SUFFIX_DESCRP> - additional description of results (suffix)
	-listgr <LIST_GROUP> - the list of substrings that must be included in the name of a group for the query reads to be compared with it. For example, 
							if the folder of reference sequences/k-mer databases contains data for bacteria, viruses, and eukaryotes, and we only 
							want to classify to bacteria and viruses, the option would be <-listgr "Bacteria Viruses"> (assuming that these names
							appear in the file names).
	-OTU <TAX_SING_LEV> - the generic name of the group to which reads are classified. For example, for the taxonomic classification: "phylum", 
							for a custom classification: "mikedb".
	-dirin <DIR_MAIN_SCORE_CLASS> - the path where the "TAX_SING_LEV" folder with the results is created
	-WK <DIR_KMER_DATABASE> - the path where the "TAX_SING_LEV" folder with the k-mer databases is created
	-diroutref <PATH_SEQ_REF> - the path where the "DIR_KMER_DATABASE/TAX_SING_LEV" folder contains reference sequences. For taxonomic classification, 
								the script creates the "TAX_SING_LEV" folder with reference sequences (see the following parameters).		
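
Two of the parameters above can be illustrated with a short sketch: the NUMJOBS value derived from -tall and -t, and the substring selection performed by -listgr. It assumes integer division for NUMJOBS and that a group is selected when its name contains at least one of the listed substrings. The function names and the group names in the example are hypothetical and are not taken from the CoMeta scripts.

```shell
# NUMJOBS = NR_THREADS / NR_THREADCOMETA (integer division assumed).
num_jobs() {
    echo $(( $1 / $2 ))
}

# Keep only group names that contain at least one of the given substrings.
# Usage: filter_groups "SUBSTR1 SUBSTR2 ..." < list_of_names
filter_groups() {
    while IFS= read -r name; do
        for sub in $1; do               # unquoted on purpose: split on spaces
            case $name in
                *"$sub"*) echo "$name"; break ;;
            esac
        done
    done
}

num_jobs 16 4          # prints 4: four databases/groups handled at a time
printf '%s\n' Bacteria_phylum Viruses_phylum Eukaryota_phylum |
    filter_groups "Bacteria Viruses"
```

With 16 threads in total and 4 threads per job, the first call prints 4; the second keeps Bacteria_phylum and Viruses_phylum and drops Eukaryota_phylum.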

Parameters only for taxonomic classification:
	-OTUPREV <TAX_PREV_LEV> - the name of the higher level to which reads have been classified (e.g., if TAX_SING_LEV = "phylum", then TAX_PREV_LEV = "superkingdom").
	-dirncbitax <DIR_TAX_NCBI> - the path to the names.dmp and nodes.dmp files	
	-dirinref <PATH_IN_START_SEQ_REF> - the path to the folder with reference sequences that have not yet been divided by "TAX_SING_LEV" but already contain 
										the taxonomy id
	-divseq <0/1> - whether the reference sequences should be divided by TAX_SING_LEV: 1 - yes (for the taxonomic classification); 0 - no (for a custom 
					classification, or if the division was done earlier)
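
To make the relation between -k and -stepk more concrete, the sketch below lists the k-mers of a read taken with a window of length k that moves by k' positions. It only illustrates the two parameters and is not CoMeta's implementation; the function name kmers is hypothetical, and a short k is used for readability.

```shell
# List the k-mers of a read taken at offsets 0, STEP, 2*STEP, ...
# Usage: kmers READ K [STEP]   (STEP defaults to K, like -stepk)
kmers() {
    read_seq=$1
    k=$2
    step=${3:-$k}
    len=${#read_seq}
    pos=1                                   # cut -c positions are 1-based
    while [ $((pos + k - 1)) -le "$len" ]; do
        printf '%s\n' "$read_seq" | cut -c "$pos-$((pos + k - 1))"
        pos=$((pos + step))
    done
}

kmers ACGTACGTAC 4      # step = k: ACGT, ACGT
kmers ACGTACGTAC 4 2    # step = 2: ACGT, GTAC, ACGT, GTAC
```

With the default step the windows do not overlap; a smaller step yields more, overlapping k-mers per read.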
	
	
II ***** CoMeta_bud_class_sing *****
Read classification is based on the higher level to which the reads were classified. This script is executed for the target level and for each intermediate 
classification level.

	$ CoMeta_bud_class_sing -maindir <DIR_MAIN_BIN> -S <FILE_META_SET>  -WS <PATH_META_SET>  -WD <DIR_TEMP_BIN>  -tall <NR_THREADS>  -t <NR_THREADCOMETA> \
	-mr <NR_MEM>  -proc <PROC_SIM>  -k <LEN_KMER>  -stepk <LEN_STEP>  -mc <MATCH_CUT_OFF>  -mm <0/1>  -cl <-1/0/1>  -suffDESCRP <SUFFIX_DESCRP> \
	-OTU <TAX_SING_LEV>  -dirin <DIR_MAIN_SCORE_CLASS>  -WK <DIR_KMER_DATABASE>  -diroutref <PATH_SEQ_REF>  -OTUPREV <TAX_PREV_LEV> \
	-dirncbitax <DIR_TAX_NCBI> 
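
The order in which the two scripts are applied across taxonomic levels can be sketched as follows. This assumes that CoMeta_bud_class_first is run once for the starting level and CoMeta_bud_class_sing for every later level, each time with -OTUPREV set to the level classified just before. The function name level_plan and the list of ranks are illustrative, and only the -OTU and -OTUPREV options are shown.

```shell
# Print which script applies to each rank, with its -OTU / -OTUPREV values.
# Usage: level_plan RANK_ABOVE_FIRST RANK1 RANK2 ...   (ranks listed top-down)
level_plan() {
    prev=$1
    shift
    script=CoMeta_bud_class_first
    for rank in "$@"; do
        echo "$script -OTU $rank -OTUPREV $prev"
        prev=$rank
        script=CoMeta_bud_class_sing    # every later level uses the "sing" script
    done
}

level_plan superkingdom phylum class order
```

The remaining options (paths, threads, k-mer settings) stay the same across levels and would be appended to each printed line.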

The parameters are the same as for the CoMeta_bud_class_first script.
Compared to the script CoMeta_bud_class
