h1. Automated Assignment of Human Readable Descriptions (AHRD)

Short descriptions in sequence databases help readers quickly grasp the key facts about a sequence, for example in search results. We developed a new program called “Automated Assignment of Human Readable Descriptions” (AHRD) that aims to select descriptions and Gene Ontology terms that are concise, informative and precise. AHRD outperforms competing methods and can overcome problems caused by wrong annotations, lack of similar sequences and partial alignments.

h2. Table of contents

"Getting started":#1-getting-started

"Requirements":#11-requirements

"Installation":#12-installation

"Get AHRD":#121-get-ahrd

"Build the executable jar":#122-build-the-executable-jar

"Usage":#2-usage

"AHRD example usages":#21-ahrd-example-usages

"Input":#22-input

"Required input data":#221-required-input-data

"Optional input data":#222-optional-input-data

"Required config files":#223-required-config-files

"Test custom blacklists and filters":#2231-test-custom-blacklists-and-filters

"Batcher":#23-batcher

"Output":#24-output

"Tab-Delimited Table":#241-tab-delimited-table

"Fasta-Format":#242-fasta-format

"AHRD run using BLASTX results":#25-ahrd-run-using-blastx-results

"Parameter Optimization":#26-parameter-optimization

"Optimization in parallel (Trainer-Batcher)":#261-optimization-in-parallel-(trainer-batcher)

"Computing F-Scores for selected parameter sets (AHRD-Evaluator)":#27-computing-f-scores-for-selected-parameter-sets-ahrd-evaluator

"Algorithm":#3-algorithm

"Pseudo-Code":#31-pseudo-code

"Used Formulae and Parameters":#32-used-formulae-and-parameters

"Parameters":#33-parameters

"Parameters controlling the parsing of tabular sequence similarity search result tables (legacy BLAST, BLAST+, and BLAT)":#331-parameters-controlling-the-parsing-of-tabular-sequence-similarity-search-result-tables-legacy-blast-blast-and-blat

"Parameters controlling Gene Ontology term annotations":#332-parameters-controlling-gene-ontology-term-annotations

"Prefer reference proteins as candidates that have GO Term annotations":#3320-prefer-reference-proteins-as-candidates-that-have-go-term-annotations

"Custom reference Gene Ontology annotations (non UniprotKB GOA)":#3321-custom-reference-gene-ontology-annotations-non-uniprotkb-goa

"Custom Gene Ontology Database":#33221-custom-gene-ontology-database

"Testing":#4-testing

"License":#5-license

"Authors":#6-authors

"References":#7-references

h2. 1 Getting started

h3. 1.1 Requirements

AHRD is a Java-Program which requires @Java 1.7@ or higher and @ant@.

h3. 1.2 Installation

h4. 1.2.1 Get AHRD

Copy (clone) AHRD to your computer using git via command-line, then change into AHRD's directory, and finally use the latest stable version:

<pre>
git clone https://github.com/groupschoof/AHRD.git
cd AHRD
git checkout tags/v3.3.3
</pre>

Alternatively, without using @git@, you can download AHRD version @v3.3.3@ ("zip":https://github.com/groupschoof/AHRD/archive/v3.3.3.zip or "tar.gz":https://github.com/groupschoof/AHRD/archive/v3.3.3.tar.gz) and extract it.

h4. 1.2.2 Build the executable jar

Running

will create the executable JAR-File: @./dist/ahrd.jar@

h2. 2 Usage

All AHRD-Inputs are passed to AHRD in a single YML-File. See @./ahrd_example_input.yml@ for details. For background on the YAML format, see "Wikipedia/YAML":http://en.wikipedia.org/wiki/YAML.

AHRD needs a FASTA file of amino acid sequences and, for each database searched, a file containing the results of the respective BLAST search. In our example we searched three databases: Uniprot/trEMBL, Uniprot/Swissprot and TAIR10. Note that AHRD is generic and can use any number of BLAST databases, which do not have to be the ones named above. For example, when annotating genes from a fungal genome, searching yeast databases may be more appropriate than using TAIR (Arabidopsis thaliana).

All parameters can be set manually, or the default ones can be used as given in the example input file @ahrd_example_input.yml@ (see sections "2.1":#21-ahrd-example-usages and "3.2":#32-used-formulae-and-parameters for more details).

To parallelize protein function annotation, AHRD can be run on batches; the recommended batch size is between 1,000 and 2,000 proteins. If you want to annotate very large protein sets or have little memory available, use the included Batcher to split your input data into batches of appropriate size (see section "2.3":#23-batcher). Note: With Java 7 or higher, AHRD is quite fast and batching might no longer be necessary.

h3. 2.1 AHRD example usages

There are two template AHRD input files provided that you should use according to your use case. All example input files are stored in @./test/resources@ and are named @ahrd_example_input*.yml@. You can run AHRD on any of these use cases with <pre>java -Xmx2g -jar ./dist/ahrd.jar your_use_case_file.yml</pre>

|_. Use Case |_. Template File |
| Annotate your Query proteins with Human Readable Descriptions (HRD) | @./test/resources/ahrd_example_input.yml@ |
| Annotate your Query proteins with HRD and Gene Ontology (GO) terms | @./test/resources/ahrd_example_input_go_prediction.yml@ |

h3. 2.2 Input

Example files for all input files can be found under @./test/resources/@. NOTE: Only files containing @example@ in their filename should be used as template input files. Other YAML files are used for testing purposes.

h4. 2.2.1 Required input data

Protein sequences in fasta format

Sequence Similarity Search (@blastp@ or @blat@) results in tabular format

(If you run AHRD in batches, the BLAST search results need to be batched in the same way as the FASTA files.)
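As an illustration of batching the query proteins, the following is a minimal sketch that splits the lines of a FASTA file into batches holding a fixed number of proteins. The class @FastaBatchSplitter@ and its method are hypothetical helpers for this README and are not part of AHRD; the BLAST results for each batch would still have to be split along the same query identifiers.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of AHRD): splits FASTA lines into batches. */
final class FastaBatchSplitter {

    /**
     * Groups FASTA lines into batches of at most proteinsPerBatch records.
     * A record starts at a header line beginning with '>'.
     */
    static List<List<String>> split(List<String> fastaLines, int proteinsPerBatch) {
        if (proteinsPerBatch < 1) {
            throw new IllegalArgumentException("proteinsPerBatch must be >= 1");
        }
        List<List<String>> batches = new ArrayList<List<String>>();
        List<String> current = new ArrayList<String>();
        int recordsInCurrent = 0;
        for (String line : fastaLines) {
            if (line.startsWith(">")) {
                // A new record begins; close the current batch if it is full.
                if (recordsInCurrent == proteinsPerBatch) {
                    batches.add(current);
                    current = new ArrayList<String>();
                    recordsInCurrent = 0;
                }
                recordsInCurrent++;
            }
            current.add(line);
        }
        if (!current.isEmpty()) {
            batches.add(current);
        }
        return batches;
    }
}
```

With the batch size recommended above, a call would look like @FastaBatchSplitter.split(lines, 1000)@, where @lines@ holds the lines of your query FASTA file.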

Recommended Sequence Similarity Search:

For your query proteins, run independent BLAST searches, e.g. against the three databases mentioned above:

<pre> blastp -outfmt 6 -query query_sequences_AA.fasta -db uniprot_swissprot.fasta -out query_vs_swissprot.txt </pre>
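To show what the resulting tabular file contains, here is a minimal sketch that parses one row of BLAST+ @-outfmt 6@ output, assuming the default twelve-column layout (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The class @BlastTabularHit@ is a hypothetical illustration and is not AHRD's own parser, whose column handling is configurable (see section "3.3.1":#331-parameters-controlling-the-parsing-of-tabular-sequence-similarity-search-result-tables-legacy-blast-blast-and-blat).

```java
/** Hypothetical illustration: one row of a BLAST+ "-outfmt 6" table. */
final class BlastTabularHit {
    final String queryId;
    final String subjectId;
    final double percentIdentity;
    final int queryStart;
    final int queryEnd;
    final int subjectStart;
    final int subjectEnd;
    final double eValue;
    final double bitScore;

    private BlastTabularHit(String[] f) {
        // Default columns: qseqid sseqid pident length mismatch gapopen
        //                  qstart qend sstart send evalue bitscore
        queryId = f[0];
        subjectId = f[1];
        percentIdentity = Double.parseDouble(f[2]);
        queryStart = Integer.parseInt(f[6]);
        queryEnd = Integer.parseInt(f[7]);
        subjectStart = Integer.parseInt(f[8]);
        subjectEnd = Integer.parseInt(f[9]);
        eValue = Double.parseDouble(f[10]);
        bitScore = Double.parseDouble(f[11]);
    }

    /** Parses a single tab-separated result line. */
    static BlastTabularHit parse(String line) {
        String[] fields = line.split("\t");
        if (fields.length < 12) {
            throw new IllegalArgumentException(
                "Expected 12 tab-separated columns, got " + fields.length);
        }
        return new BlastTabularHit(fields);
    }

    public static void main(String[] args) {
        // Identifiers and values are made up for illustration only.
        BlastTabularHit hit = parse(
            "query_1\tsubject_1\t85.5\t200\t29\t0\t1\t200\t5\t204\t1e-100\t350");
        System.out.println(hit.queryId + " -> " + hit.subjectId
            + " (bit score " + hit.bitScore + ")");
    }
}
```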

h4. 2.2.2 Optional input data

If you want AHRD to predict your query proteins' functions with Gene Ontology (GO) terms, you need to provide the GO annotations of the reference proteins. See section "3.3.2":#332-parameters-controlling-gene-ontology-term-annotations for more details.

h4. 2.2.3 Required config files

Input yml with all paths and parameters according to your needs (see @ahrd_example_input.yml@ and section "3.3 Parameters":#33-parameters)

Blacklists and filters (they can either be used as provided or can be adapted to your needs and databases). Each of these files contains a list of valid Java regular expressions, one per line. For details on Java regular expressions please refer to http://docs.oracle.com/javase/tutorial/essential/regex/.

Description blacklist (Argument @blacklist: ./test/resources/blacklist_descline.txt@) - Any Blast-Hit's description matching one of the regular expressions in this file will be ignored.

Description filter for each single blast database (Argument @filter: ./test/resources/filter_descline_sprot.txt@) - Any part of a Blast-Hit description that matches any one of the regular expressions in this file will be deleted from the description.

Token blacklist (Argument @token_blacklist: ./test/resources/blacklist_token.txt@) - Blast-Hit descriptions are composed of words. Any word matching any one of the regular expressions in this file will be ignored by AHRD's scoring. It will not be deleted from the description, however, and thus might still be seen in the final output.
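The three roles described above can be sketched in a few lines of Java. This is only an illustration and not AHRD's actual implementation: the class @DescriptionFilters@, its method names, and the example patterns in the usage note are hypothetical, and the sketch assumes that a regular expression counts as a match when it is found anywhere in the text (@Matcher.find()@).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch of the three regex lists; not AHRD's own code. */
final class DescriptionFilters {

    /** Description blacklist: true if the whole description should be ignored. */
    static boolean isBlacklisted(String text, List<Pattern> blacklist) {
        for (Pattern p : blacklist) {
            if (p.matcher(text).find()) { // assumption: partial match is enough
                return true;
            }
        }
        return false;
    }

    /** Description filter: deletes every part matching one of the patterns. */
    static String applyFilter(String description, List<Pattern> filter) {
        String result = description;
        for (Pattern p : filter) {
            result = p.matcher(result).replaceAll("");
        }
        return result.trim();
    }

    /** Token blacklist: words kept for scoring; the description is unchanged. */
    static List<String> scoringTokens(String description, List<Pattern> tokenBlacklist) {
        List<String> tokens = new ArrayList<String>();
        for (String word : description.split("\\s+")) {
            if (!word.isEmpty() && !isBlacklisted(word, tokenBlacklist)) {
                tokens.add(word);
            }
        }
        return tokens;
    }
}
```

For example, with a hypothetical filter pattern @\sOS=.*@, the description @Protein kinase OS=Arabidopsis thaliana@ would be reduced to @Protein kinase@, while a hypothetical token blacklist entry @(?i)^putative$@ would only exclude the word @Putative@ from scoring.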

h5. 2.2.3.1 Test custom blacklists and filters

As explained in 2.2.3, AHRD makes use of blacklists and filters provided as Java regular expressions. You can test your own custom blacklists and filters:

Put the strings representing Blast-Hit descriptions, or words contained in them, in the file @./test/resources/regex_list.txt@. Note that each line is interpreted as a single entry.

Put the Java regular expressions you want to test in file @./test/resources/match_list.txt@, using one regular expression per line.

Execute @ant test.regexs@ and study the output.

Example output for the test string "activity" and the regular expressions "(?i)interacting" and "(?i)activity", applied in series:

<pre>
[junit] activity
[junit] (?i)interacting -> activity
[junit] (?i)activity -> 
</pre>

The above example demonstrates that the first regular expression does not match anything in the test string "activity", so the string is left unchanged. After the second regular expression is applied nothing remains, because the matched substring has been filtered out. As you can see, this test applies all provided regular expressions in order of appearance and shows what remains of the test string after each filtering step.
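The serial filtering shown above can be reproduced with a short Java sketch. It is not the code behind @ant test.regexs@; the class @SerialRegexFilter@ is a hypothetical stand-in that deletes every match of each regular expression in turn and records what remains, without the @[junit]@ prefix.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for the regex test; not AHRD's own code. */
final class SerialRegexFilter {

    /**
     * Applies the regular expressions in order of appearance. The first
     * returned line is the test string; each further line shows a regex
     * and what remains after deleting all of its matches.
     */
    static List<String> trace(String testString, List<String> regexes) {
        List<String> lines = new ArrayList<String>();
        lines.add(testString);
        String remaining = testString;
        for (String regex : regexes) {
            remaining = remaining.replaceAll(regex, "");
            lines.add(regex + " -> " + remaining);
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String> regexes = Arrays.asList("(?i)interacting", "(?i)activity");
        for (String line : trace("activity", regexes)) {
            System.out.println(line);
        }
    }
}
```

Because each expression works on the output of the previous one, the order of the lines in a filter file can change the final result.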

h3. 2.3 Batcher

AHRD provides a function to generate several input.yml files for large datasets that have been split into batches of query proteins. For each of these batches the user is expected to provide the batch's query proteins in FASTA format and one BLAST result file for each database searched. The AHRD batcher will then generate a unique input.yml file per batch, plus an entry in a batch shell script, so that AHRD can be executed on the batches in parallel, for example on a compute cluster using a batch system like LSF. We recommend this for very large datasets (more than a genome) or computers with low RAM.

To generate the mentioned input.yml files and batcher shell script that can subsequently be used to start AHRD in parallel use the batcher function as follows:

<pre>java -cp ./dist/ahrd.jar ahrd.controller.Batcher ./batcher_input_example.yml</pre>

You will have to edit @./batcher_input_example.yml@ and provide the following arguments. Note that files in the mentioned directories are interpreted as belonging to one unique batch if and only if they have identical file names.
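The file-name rule can be illustrated with a small Java sketch. It is not the Batcher's implementation: the class @BatchFileMatcher@ is hypothetical, and the sketch simply assumes one directory of FASTA batches plus one directory of BLAST results per database, each given as a list of file names.

```java
import java.util.Collection;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical sketch of the file-name rule; not the Batcher's own code. */
final class BatchFileMatcher {

    /**
     * Returns the file names that occur in every directory. Each such
     * name identifies one complete batch: a FASTA file and one BLAST
     * result file per searched database.
     */
    static SortedSet<String> commonBatchNames(
            List<? extends Collection<String>> fileNamesPerDirectory) {
        SortedSet<String> common = new TreeSet<String>();
        boolean first = true;
        for (Collection<String> names : fileNamesPerDirectory) {
            if (first) {
                common.addAll(names);
                first = false;
            } else {
                // Keep only names also present in this directory.
                common.retainAll(names);
            }
        }
        return common;
    }
}
```

The name lists could be obtained with @java.io.File.list()@ on each directory. A file name that is missing from any one directory does not form a complete batch in this sketch, so checking the result before starting a large run can catch misnamed files early.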
