TALON

TALON is a Python package for identifying and quantifying known and novel genes/isoforms in long-read transcriptome data sets. TALON is technology-agnostic in that it works from mapped SAM files, allowing data from different sequencing platforms (i.e. PacBio and Oxford Nanopore) to be analyzed side by side.

Installation
Running TALON
Working with the TALON results
Citing TALON

Reads must be aligned to the reference genome and oriented in the forward direction (5'->3') prior to using TALON. We recommend the Minimap2 aligner - please see their GitHub page here for recommended long-read parameters by technology. Please note that TALON requires the SAM MD tag, so Minimap2 should be run with the --MD flag enabled. In principle, you can use any other long-read alignment software provided that an MD tag is generated.

We also recommend correcting the aligned reads with TranscriptClean to fix artifactual noncanonical splice junctions, though this is not strictly necessary for TALON to run.

To learn more about how TALON works, please see our preprint in BioRxiv: https://www.biorxiv.org/content/10.1101/672931v1

<a name="installation"></a>Installation

Newer versions of TALON (v4.0+) are designed to be run with Python 3.6+.

To install TALON, simply download the files using Github's "Download ZIP" button, then unzip them in the directory where you would like to store the program. Alternately, you can download a specific version of the program from the Releases tab.

Go to the directory and run:

pip install cython
pip install .

This will install TALON. You can now run the commands from anywhere.

NOTE: Talon versions 4.2 and lower are not installable. Check the README of those releases to see how you can run the scripts from the install directory, or visit the wiki here.

<a name="how_to_run"></a>How to run

For a small, self-contained example with all necessary files included, see https://github.com/mortazavilab/TALON/tree/master/example

<a name="label_reads"></a>Flagging reads for internal priming

Current long-read platforms that rely on poly-(A) selection are prone to internal priming artifacts. These occur when the oligo-dT primer binds off-target to A-rich sequences inside an RNA transcript rather than at the end. Therefore, we recommend running the talon_label_reads utility on each of your SAM files separately to record the fraction of As in the n-sized window immediately following each read alignment (reference genome sequence). The default n value is 20 bp, but you can adjust this to match the length of the T sequence in your primer if desired. The output of talon_label_reads is a SAM file with the fraction As recorded in the fA:f custom SAM tag. Non-primary alignments are omitted. This SAM file can now be used as your input to the TALON annotator.

Usage: talon_label_reads [options]

Options:
  -h, --help            show this help message and exit
  --f=SAM_FILE          SAM file of transcripts
  --g=GENOME_FILE       Reference genome fasta file
  --t=THREADS           Number of threads to run
  --ar=FRACA_RANGE_SIZE
                        Size of post-transcript interval to compute fraction
                        As on. Default = 20
  --tmpDir=TMP_DIR      Path to directory for tmp files. Default =
                        tmp_label_reads
  --deleteTmp           If this option is set, the temporary directory
                        generated by the program will be removed at the end of
                        the run.
  --o=OUTPREFIX         Prefix for outfiles

<a name="db_init"></a>Initializing a TALON database

The first step in using TALON is to initialize a SQLite database from the GTF annotation of your choice (i.e. GENCODE). This step is done using talon_initialize_database, and only needs to be performed once for your analysis. Keep track of the build and annotation names you choose, as these will be used downstream when running TALON and its utilities.

NOTE: The GTF file you use must contain genes, transcripts, and exons. If the file does not contain explicit gene and/or transcript entries, key tables of the database will be empty and you will experience problems in the downstream analysis. Please see our GTF troubleshooting section for help.

Usage: talon_initialize_database [options]

Options:
  -h, --help           Show help message and exit
  --f                  GTF annotation file
  --g                  The name of the reference genome build that the annotation describes. Use a short and memorable name since you will need to specify the genome build when you run TALON later.
  --a                  The name of the annotation (for metadata purposes)
  --l                  Minimum required transcript length (default = 0 bp)
  --idprefix           Prefix for naming novel discoveries in eventual TALON runs (default = 'TALON')
  --5p                 Maximum allowable distance (bp) at the 5' end during annotation (default = 500 bp)
  --3p                 Maximum allowable distance (bp) at the 3' end during annotation (default = 300 bp)
  --o                  Output prefix for the database

<a name="run_talon"></a>Running TALON

Now that you've initialized your database and checked your reads for evidence of internal priming, you're ready to annotate them. The input database is modified in place to track and quantify transcripts in the provided dataset(s). In a talon run, each input SAM read is compared to known and previously observed novel transcript models on the basis of its splice junctions. This allows us to not only assign a novel gene or transcript identity where appropriate, but to track new transcript models and characterize how they differ from known ones. The types of novelty assigned are shown in this diagram. <img align="left" width="450" src="figs/novelty.png">

To run the talon annotator, create a comma-delimited configuration file with the following four columns: name, sample description, platform, sam file (full path). There should be one line for each dataset, and dataset names must be unique. If you decide later to add more datasets to an existing analysis, you can do so by creating a new config file for this data and running TALON again on the existing database.

If you're using the --cb option, the dataset names will be pulled from the SAM CB tag, making the first column of the config file unnecessary. Accordingly, TALON expects that when the --cb tag is provided, the config file only includes the following: sample description, platform, sam file (full path).

Please note that TALON versions 4.4+ can be run in multithreaded fashion for a much faster runtime.

usage: talon [-h] [--f CONFIG_FILE] [--cb] [--db FILE,] [--build STRING,]
             [--threads THREADS] [--cov MIN_COVERAGE]
             [--identity MIN_IDENTITY] [--nsg] [--o OUTPREFIX]

optional arguments:
  -h, --help            show this help message and exit  
  --f CONFIG_FILE       Dataset config file: dataset name, sample description,
                        platform, sam file (comma-delimited)  
  --db FILE,            TALON database. Created using
                        talon_initialize_database
  --cb                  Use cell barcode tags to determine dataset. Useful for
                        single-cell data. Requires 3-entry config file.
  --build STRING,       Genome build (i.e. hg38) to use. Must be in the
                        database.
  --threads THREADS, -t THREADS
                        Number of threads to run program with.
  --cov MIN_COVERAGE, -c MIN_COVERAGE
                        Minimum alignment coverage in order to use a SAM
                        entry. Default = 0.9
  --identity MIN_IDENTITY, -i MIN_IDENTITY
                        Minimum alignment identity in order to use a SAM
                        entry. Default = 0.8
  --nsg, --create_novel_spliced_genes
                        Make novel genes with the intergenic novelty label for
                        transcripts that don't share splice junctions with any
                        other models
  --tmpDir
                        Path to directory for tmp files. Default = `talon_tmp/`
  --o OUTPREFIX         Prefix for output files

TALON generates two output files in the course of a run. The QC log (file with suffix 'QC.log') is useful for tracking why a particular read was or was not included in the TALON analysis.

<details> <summary>QC log format</summary>

Columns:

dataset
read_ID
passed_QC (1/0)
primary_mapped (1/0)
read_length
fraction_aligned
Identity

</details>

The second output file (suffix 'read_annot.tsv') appears at the very end of the run and contains a line for every read that was successfully annotated.

<details> <summary>Read annotation file format</summary>

Columns:

Name of individual read
Name of dataset the read belongs to
Name of genome build used in TALON run
Chromosome
Read start position (1-based). This refers to the 5' end start, so for reads on the - strand, this number will be larger than the read end (col 6).
Read end position (1-based). This refers to the 3' end stop, so for reads on the - strand, this will be smaller than the read start (col 5)

TALON

Install / Use

README

TALON

Table of contents

<a name="installation"></a>Installation

<a name="how_to_run"></a>How to run

<a name="label_reads"></a>Flagging reads for internal priming

<a name="db_init"></a>Initializing a TALON database

<a name="run_talon"></a>Running TALON