INTRO

The RDP Classifier is a naive Bayesian classifier which was developed to provide rapid taxonomic placement based on rRNA sequence data. The RDP Classifier can rapidly and accurately classify bacterial and archaeal 16s rRNA sequences, and Fungal LSU sequences. It provides taxonomic assignments from domain to genus, with confidence estimates for each assignment. The RDP Classifier likely can be adapted to additional phylogenetically coherent bacterial taxonomies. The online version of RDP Classifier can be found at http://rdp.cme.msu.edu/classifier/classifier.jsp.

How to cite Classifier? Wang, Q, G. M. Garrity, J. M. Tiedje, and J. R. Cole. 2007. Naive Bayesian Classifier for Rapid Assignment of rRNA Sequences into the New Bacterial Taxonomy. Appl Environ Microbiol. 73(16):5261-7.

Gene Copy Number Adjustment Classifier provides gene copy number adjustment for 16S gene sequences (see http://rdp.cme.msu.edu/classifier/class_help.jsp#copynumber). The precompiled Classifier was trained with the 16S gene copy number data provided by rrnDB website. The Classifier can be trained with user-provided gene copy number data. See How to Train the Classifier below. Classifier outputs both copy number adjusted and unadjusted assignment counts in the hierarchical output files.

BIOM Format

The Classifier can take an input minimal (or rich) dense BIOM file as input with an optional Metadata file, and produces a rich dense BIOM file. If an input cluster BIOM file ( version 1.0) is provided, along with the representative sequences input (output from Clustering rep-seqs subcommand with "-c" option to make sure rep seq Ids match the cluster Ids in the BIOM file, see http://rdp.cme.msu.edu/tutorials/stats/RDPtutorial_statistics.html), the classification result of each sequence will replace the taxonomy of the corresponding cluster. If a metadata file is provided, the information will replace the metadata of the corresponding sample. The resulting rich dense BIOM file can be used by thirdparty tools such as phyloseq or QIIME etc.

In order to use BIOM files as input, the format must be specified in the command line with "-f biom". Then, the biom file is specified with the command "-m /path/to/biom_file.biom". Including a Metadata file is optional and can be included by using the command "-d /path/to/metadata.txt".

QUICKSTART

ant jar Some commands in this tutorial depend on RDP Clustering. See RDPTools (https://github.com/rdpstaff/RDPTools) to install.

USAGE

There are many subcommands offered by the Classifier package. The default subcommand is classify.

java -Xmx1g -jar /path/to/classifier.jar USAGE: ClassifierMain <subcommand> <subcommand args ...> classify - classify one or multiple samples crossvalidate - cross validate accuracy testing libcompare - compare two samples loot - leave one (sequence or taxon) out accuracy testing merge-detail - merge classification detail result files to create a taxon assignment counts file merge-count - merge multiple taxon assignment count files to into one count file random-sample - random select a subset or subregion of sequences rm-dupseq - remove identical or any sequence contained by another sequence rm-partialseq - remove partial sequences taxa-sim - calculate and plot the similarities within taxa train - retrain classifier

Classify one or more samples usage: [options] <samplefile>[,idmappingfile] ... -c,--conf <arg> assignment confidence cutoff used to determine the assignment count for each taxon. Range [0-1], Default is 0.8. -f,--format <arg> tab-delimited output format: [allrank|fixrank|biom|filterbyconf|db]. Default is allRank. allrank: outputs the results for all ranks applied for each sequence: seqname, orientation, taxon name, rank, conf, ... fixrank: only outputs the results for fixed ranks in order: domain, phylum, class, order, family, genus biom: outputs rich dense biom format if OTU or metadata provided filterbyconf: only outputs the results for major ranks as in fixrank, results below the confidence cutoff were bin to a higher rank unclassified_node db: outputs the seqname, trainset_no, tax_id, conf. -g,--gene <arg> 16srrna, fungallsu, fungalits_warcup, fungalits_unite. Default is 16srrna. This option can be overwritten by -t option -h,--hier_outfile <arg> tab-delimited output file containing the assignment count for each taxon in the hierarchical format. Default is null. -o,--outputFile <arg> tab-delimited text output file for classification assignment. -q,--queryFile legacy option, no longer needed -t,--train_propfile <arg> property file containing the mapping of the training files if not using the default. Note: the training files and the property file should be in the same directory. -w,--minWords <arg> minimum number of words for each bootstrap trial. Default(maximum) is 1/8 of the words of each sequence. Minimum is 5

[Example command to classify sequences ]: java -Xmx1g -jar /path/to/classifier.jar classify -c 0.5 -o usga_classified.txt -h soil_hier.txt samplefiles/USGA_2_4_B_trimmed.fasta

To speedup classification when large number of duplicate sequences exist in the inputs, you can dereplicate the input files first and use both the unique sequence fasta and idmapping file as input. The classification assignment output only contains the results of the unique sequences, the assignment counts in the hier_out_file are expanded to reflect the full sets. The hier_outfile can be imported to Excel to make plots, or loaded into R program as a data matrix.

[Example command to classify sequences with idmapping]: java -jar /path/to/Clustering.jar derep -u -o Native_1_4_A_derep.fasta Native_1_4_A.ids Native_1_4_A.sample samplefiles/Native_1_4_A_trimmed.fasta java -Xmx1g -jar /path/to/classifier.jar classify -c 0.5 -o native_classified.txt -h soil_hier.txt Native_1_4_A_derep.fasta,Native_1_4_A.ids java -Xmx1g -jar /path/to/classifier.jar classify -c 0.5 -f fixrank -o soil_classified.txt -h soil_hier.txt Native_1_4_A_derep.fasta,Native_1_4_A.ids samplefiles/USGA_2_4_B_trimmed.fasta

Notes: The bootstrap assignment strategy has been changed to avoid over-predication problem when multiple genera are tied for highest score occurred during bootstrap trials. This happens when every sequence in multiple genera (say N) contains the same partial sequence. One of the genera will be randomly chosen from the list of N genera with the highest tie score. If the tie score occurred during the genus assignment deterministic step, the first genus will be chosen. In this way, the genus assignment will remain deterministic but the bootstrap score will be close to 1/N .

By default, the Classifier output the results for all ranks applied for each sequence. Some users found the format "fixrank" useful to load into third party analysis tools. When "fixrank" is specified, the Classifier outputs the results in a fixed rank order as described above. In case of missing ranks in the lineage, the bootstrap value and the taxon name from the immediate lower rank will be reported. This eliminates the gaps in the lineage, but also introduces non-existing taxon name and rank. User should interpret the "fixrank" results with caution.

By default the Classifier chooses a subset of 1/8 of all the possible overlapping words from the query sequence for each bootstrap trial. The Classifier uses the minWords if the minWords is larger than 1/8 of words. Choosing more words helps gaining higher bootstrap values for short query sequence. Using larger "minWords" will increase the run time since the run time is proportional to the number and the length of the query sequences.
Compare two samples This command combines classification with a statistical test to flag taxa differing significantly between libraries.

[Example command from a terminal]: java -Xmx1g -jar /path/to/classifier.jar libcompare -q1 samplefiles/Native_1_4_A_trimmed.fasta -q2 samplefiles/USGA_2_4_B_trimmed.fasta -c 0.5 -o libcompare.txt
Merge classification results If you have classified samples at different time using the same training set, you can use this command to merge the classification results and reproduce the hier_outfile with the assignment counts from all the samples from the input. Each input classification result is treated as one sample. Note taxon and rank filter options only affect the assignment output, not the hier_outfile.

[Example command to merge two classification results]: java -Xmx1g -jar /path/to/classifier.jar merge-detail -h soil_hier.txt -o merged_classified.txt native_classified.txt,Native_1_4_A.ids usga_classified.txt

[Example command to merge two classification results, filter the classification output by confidence ]: java -Xmx1g -jar /path/to/classifier.jar merge-detail -h soil_hier.txt -o merged_classified.txt -f filterbyconf -c 0.5 native_classified.txt,Native_1_4_A.ids usga_classified.txt

[Example command to merge two classification results, only output classification results assigned to Alphaproteobacteria and confidence at family >= 0.5 ]: create a file

Classifier

Install / Use

README