CAFA2
Matlab Evaluation codes for the 2nd CAFA experiment
Install / Use
/learn @yuxjiang/CAFA2README
CAFA2
MATLAB evaluation codes used for the 2nd CAFA challenge. The CAFA2 paper is published in Genome Biology, and you can also find the latest arXiv version here.
How to build baseline predictors
BLAST predictor
Requirements
-
"Training" data
-
Sequences in FASTA format.
-
Annotations (MFO terms for exmaple) for each of these sequences. This data needs to be prepared ahead of time as a two-column CSV file (delimited by
TAB)<sequence ID> <GO term ID>where
<sequence ID>would be of any ID systems (e.g., UniProt accession number), as long as they are consistant with those used in the FASTA file.
-
-
NCBI BLAST tool (used 2.2.29+ for this document)
-
Query sequences in FASTA format.
Step-by-step
-
STEP 1: Load annotations of training sequences.
-
Load ontology structure(s)
Ontologies need to be load into a specific MATLAB structure which will be later used in evaluation. Here we provide two "adapters" for (i) OBO files or (ii) parsed plain-text files.
-
Load OBO files
ont = pfp_ontbuild('ontology.obo');Note that a typical gene ontology OBO file contains all three GO ontologies (i.e., MFO, BPO, and CCO), therefore,
pfp_ontbuildreturns a cell of THREE ontology strcutures instead:onts = pfp_ontbuild('go.obo');By default, they are ordered as BPO, CCO, MFO, alphabetically. You can also double check the
.ont_typefield of each returning structure. -
Load plain-text files
If you have already parsed an ontology, you can also save its term description and structure into the following two files and then load them into the same MATLAB structure as if using
pfp_ontbuild:ont = pfp_loadont('terms.tsv', 'relationship.tsv');where
terms.tsvis a two column file contains<term ID>and<term description>;relationship.tsvis a three column file contains<term ID> <relationship> <term ID>, (e.g.,GO:XXXXXXX is_a GO:YYYYYYY). Both files are delimited byTABand do not have header lines.
-
-
Load annotations onto the ontology structure(s)
Once
ontis created, you can load a list of sequence annotations using terms in this ontology.oa = pfp_oabuild(ont, 'annotation.dat');where
annotation.datis a two column tab-delimited file having<sequence ID> <term ID>annotation pairs in each line.
-
-
STEP 2: Prepare BLAST results
-
Run
blastpon the query sequences against the "training" sequences by setting output format to be the following:blastp ... -outfmt "6 qseqid sseqid evalue length pident nident" -out blastp.out -
Load the tabular output file (
blastp.outas shown above) into MATLAB:B = pfp_importblastp('blastp.out');
-
-
STEP 3: Build the BLAST predictor
Run the follow command in MATLAB to get a prediction structure:
blast = pfp_blast(qseqid, B, oa);where
qseqidis a cell list of query sequences on which you need scores. Note that it can be just a subset of all those you BLAST'ed.Bis the structure imported step 2, whileoais the ontology annotation structure loaded in step 1.Also, extra options can be specified as additional arguments so as to choose which feature you would like to use for creating BLAST predictions. By default, it used
sid: sequence identity. See the documentation inpfp_blast.mfor more details. Thus,blastwill be the BLAST predictor in MATLAB for evaluation.
Naive predictor
To build a *naive* predictor, all you need is the ontology annotation
structure `oa` that you have as in the step 1 of making a *BLAST* predictor.
Then run the following in MATLAB:
```matlab
naive = pfp_naive(qseqid, oa);
```
How to evaluate your own predictions on CAFA2 benchmarks
Evaluation codes are provided mainly for reproducing results in CAFA2
experiments. However, one may also use a subset of codes under `matlab/` to
evaluate their own protein function predictors.
Prerequisites
-
Represent protein sequences using CAFA2 target ID systems (e.g.,
T96060000019). Please checkbenchmark/folders for lists of benchmark proteins that needs to be covered. -
Save predictions in CAFA2 submission format according to CAFA rule. Although, headers (including
AUTHOR,MODEL,KEYWORDSandACCURACY) and footer (END) are optional. (Seecafa_import.mfor details)
Quick guide
-
Load ontologies into MATLAB structures.
-
You can use pre-built MATLAB structure for the same ontologies used in CAFA2 evaluation, which are located as
*.matfiles underontology/folder. -
We also provide functions for loading user specified ontologies, see
pfp_ontbuild.m. Note that it is suggested to use pre-built ontologies in order to compare results against published methods.
-
-
Prepare ground-truth annotations.
-
Similarly, ground-truth annotations for CAFA2
3681benchmark proteins are pre-built and saved as*.mat*files underbenchmark/groundtruth/. -
User specified annotations can be built using
pfp_oabuild.m, note that proteins have to use the same ID system as used for predictions. Also, see the comments for input arguments inpfp_oabuild.mfor details.oa = pfp_oabuild(ont, <annotation file>);
-
-
Load predictions into MATLAB structures.
This can be done by execute the following command in MATALB:
pred = cafa_import(<prediction file>, ont, false);with the 2nd argument
ontas the ontology structure built in the first step. We specify the 3rd argument to befalseindicating our<prediction file>don't contain headers and footer. -
Load benchmark protein IDs.
-
Protein IDs must be loaded as a
cellarray. You can use the following function:benchmark = pfp_loaditem(<benchmark list file>, 'char'); -
Various CAFA2 benchmark protein lists are prepared under
benchmark/lists/, load any one that meets your requirement.
-
-
Evaluation. (sequence-centered)
-
The easiest way to get an performance evaluation is to use the following function (in the case of F-max):
fmax = pfp_seqmetric(benchmark, pred, oa, 'fmax');See
pfp_seqmetric.mfor other metrics. -
Alternatively, you can compute confusion matrix so as to expose intermediate variables:
-
Make a confusion matrix structure
cm = pfp_seqcm(benchmark, pred, oa); -
Convert the
cmstructure to metrics of interest, here "precision-recall"seq_pr.metricwould have 101 precision-recall pairs corresponding to 101 thresholds from0.00up to1.00with step size0.01. You can use it to draw a PR curve.seq_pr = pfp_convcmstruct(cm, 'pr');
-
-
How to "replicate" CAFA2 evaluation experiment
Caveat
Due to CAFA rules, the organizers of CAFA cannot release the submitted
predictions from participants. Therefore, it is technically not possible to
replicate exact results (figures and tables) in the CAFA2 paper. Also, this
repository is not originally designed to be a software that is reusable as a
whole for protein function prediction tasks in general or even for future
CAFA challenges. As a result, the pipeline is not fully automatized and
manual input is necessary occasionally.
Please also notice that this pipeline is only tested on Linux version of
MATLAB (2016b), it is not guaranteed to work on other OS (code might have to
be adapted accordingly). We also used Bioinformatics toolbox for topological
ordering of ontology terms (`graphtopoorder`) in some Matlab functions.
However, it should be fairly easy to implement your own version if this
toolbox is not available.
With that being said, we provide scripts along with a minialist guideline to
assist researchers who would like to evaluate their own methods using CAFA2
benchmarks (along with their annotations by the time stated in the paper) so
as to compare their performances against CAFA2 baselines and possibly
against other methods.
Step-by-step
-
Download this repository to your local filesystem, say
/path/to/cafa2_repo, hereafter<cafa2repo>. -
Prepare an empty folder that have write permission, say
/path/to/another/folder, hereafter,<mydir>, for holding evaluation results. -
In Matlab, change working directory to
<cafa2repo>/matlaband setup<mydir>:cd <cafa2repo>/matlab; cafa_setup('<cafa2repo>', '<mydir>');This command sets up empty folders inside
<mydir>where intermediate/final results will sit. -
Place your plain-text prediction file into
<mydir>.Note that prediction files should be using CAFA format:
<target ID> <term ID> <score>for each line but without HEADER (those lines start withMODEL,AUTHOR,KEYWORDSetc.) or FOOTER (theENDline). Filename is suppose to beM001(M
