TALON
Technology agnostic long read analysis pipeline for transcriptomes
<img align="left" width="450" src="figs/TALON.png">TALON is a Python package for identifying and quantifying known and novel genes/isoforms in long-read transcriptome data sets. TALON is technology-agnostic in that it works from mapped SAM files, allowing data from different sequencing platforms (e.g. PacBio and Oxford Nanopore) to be analyzed side by side.
Reads must be aligned to the reference genome and oriented in the forward direction (5'->3') prior to using TALON. We recommend the Minimap2 aligner; see the Minimap2 GitHub page for recommended long-read parameters by technology. TALON requires the SAM MD tag, so Minimap2 should be run with the --MD flag enabled. In principle, you can use any other long-read alignment software provided that it generates an MD tag.
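Because a missing MD tag is easy to overlook, it can help to check a SAM file before going further. The following is a minimal sketch, not part of TALON: the function name `reads_missing_md` and the toy records are made up for illustration, and it assumes plain-text SAM input.

```python
def reads_missing_md(sam_lines):
    """Return the names of mapped SAM records that carry no MD tag.

    Hypothetical helper for illustration; it is not a TALON function.
    """
    missing = []
    for line in sam_lines:
        if line.startswith("@"):
            continue  # SAM header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        if flag & 0x4:
            continue  # unmapped read, no MD tag expected
        # Optional tags start at column 12 and look like TAG:TYPE:VALUE
        if not any(tag.startswith("MD:Z:") for tag in fields[11:]):
            missing.append(fields[0])
    return missing


# Toy records (invented data): read2 has no MD tag
example = [
    "@HD\tVN:1.6\tSO:unsorted",
    "read1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\t*\tNM:i:0\tMD:Z:10",
    "read2\t0\tchr1\t200\t60\t10M\t*\t0\t0\tACGTACGTAC\t*\tNM:i:0",
]
print(reads_missing_md(example))  # ['read2']
```

For a real file, you could pass an open file handle instead of the list, since the function only iterates over lines.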
We also recommend correcting the aligned reads with TranscriptClean to fix artifactual noncanonical splice junctions, though this is not strictly necessary for TALON to run.
To learn more about how TALON works, please see our preprint on bioRxiv: https://www.biorxiv.org/content/10.1101/672931v1
<a name="installation"></a>Installation
Newer versions of TALON (v4.0+) are designed to be run with Python 3.6+.
To install TALON, download the files using GitHub's "Download ZIP" button, then unzip them in the directory where you would like to store the program. Alternatively, you can download a specific version of the program from the Releases tab.
Go to the directory and run:
pip install cython
pip install .
This will install TALON. You can now run the commands from anywhere.
NOTE: TALON versions 4.2 and lower are not installable. Check the README of those releases to see how you can run the scripts from the install directory, or visit the wiki.
<a name="how_to_run"></a>How to run
For a small, self-contained example with all necessary files included, see https://github.com/mortazavilab/TALON/tree/master/example
<a name="label_reads"></a>Flagging reads for internal priming
Current long-read platforms that rely on poly-(A) selection are prone to internal priming artifacts. These occur when the oligo-dT primer binds off-target to A-rich sequences inside an RNA transcript rather than at the end. Therefore, we recommend running the talon_label_reads utility on each of your SAM files separately to record the fraction of As in the n-sized window immediately following each read alignment (reference genome sequence). The default n value is 20 bp, but you can adjust this to match the length of the T sequence in your primer if desired. The output of talon_label_reads is a SAM file with the fraction As recorded in the fA:f custom SAM tag. Non-primary alignments are omitted. This SAM file can now be used as your input to the TALON annotator.
Usage: talon_label_reads [options]
Options:
-h, --help show this help message and exit
--f=SAM_FILE SAM file of transcripts
--g=GENOME_FILE Reference genome fasta file
--t=THREADS Number of threads to run
--ar=FRACA_RANGE_SIZE
Size of post-transcript interval to compute fraction
As on. Default = 20
--tmpDir=TMP_DIR Path to directory for tmp files. Default =
tmp_label_reads
--deleteTmp If this option is set, the temporary directory
generated by the program will be removed at the end of
the run.
--o=OUTPREFIX Prefix for outfiles
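To make the fraction-of-As idea concrete, here is a minimal sketch of the calculation for a read on the + strand. It is an illustration under stated assumptions, not the talon_label_reads implementation: the function name `fraction_as`, the 0-based exclusive `aln_end` coordinate, and the toy sequence are all hypothetical, and minus-strand reads (which would need the reverse-complemented window on the other side of the alignment) are not handled.

```python
def fraction_as(genome_seq, aln_end, n=20):
    """Fraction of A bases in the n-bp reference window after an alignment.

    genome_seq: reference sequence of the chromosome (string)
    aln_end:    0-based, exclusive end of the alignment (assumed convention)
    n:          window size, 20 bp to mirror the default described above
    """
    window = genome_seq[aln_end:aln_end + n].upper()
    if not window:
        return 0.0  # alignment ends at the chromosome end
    return window.count("A") / len(window)


# Toy reference (invented): a 10-bp "transcript" followed by an A-rich stretch
chrom = "GATTACAGGC" + "AAAAAAAAAACCGGTTAAAA"
print(fraction_as(chrom, aln_end=10))        # 0.7
print(fraction_as(chrom, aln_end=10, n=10))  # 1.0
```

A high value suggests that the read may have been primed from a genomic A-rich stretch rather than from a true poly-(A) tail.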
<a name="db_init"></a>Initializing a TALON database
The first step in using TALON is to initialize a SQLite database from the GTF annotation of your choice (e.g. GENCODE). This step is done using talon_initialize_database, and only needs to be performed once for your analysis. Keep track of the build and annotation names you choose, as these will be used downstream when running TALON and its utilities.
NOTE: The GTF file you use must contain genes, transcripts, and exons. If the file does not contain explicit gene and/or transcript entries, key tables of the database will be empty and you will experience problems in the downstream analysis. Please see our GTF troubleshooting section for help.
Usage: talon_initialize_database [options]
Options:
-h, --help Show help message and exit
--f GTF annotation file
--g The name of the reference genome build that the annotation describes. Use a short and memorable name since you will need to specify the genome build when you run TALON later.
--a The name of the annotation (for metadata purposes)
--l Minimum required transcript length (default = 0 bp)
--idprefix Prefix for naming novel discoveries in eventual TALON runs (default = 'TALON')
--5p Maximum allowable distance (bp) at the 5' end during annotation (default = 500 bp)
--3p Maximum allowable distance (bp) at the 3' end during annotation (default = 300 bp)
--o Output prefix for the database
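Since the GTF file must contain gene, transcript, and exon entries, a quick pre-check can save a failed run. The sketch below is not part of TALON; the helper names and the toy GTF lines are made up for illustration. It simply counts the feature types in column 3 of the file.

```python
from collections import Counter

REQUIRED_FEATURES = ("gene", "transcript", "exon")


def count_gtf_features(gtf_lines):
    """Count how often each feature type (GTF column 3) occurs."""
    counts = Counter()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # comment or blank line
        fields = line.split("\t")
        if len(fields) >= 9:
            counts[fields[2]] += 1
    return counts


def missing_features(gtf_lines):
    """Return the required feature types that never appear in the GTF."""
    counts = count_gtf_features(gtf_lines)
    return [feature for feature in REQUIRED_FEATURES if counts[feature] == 0]


# Toy annotation (invented) that lacks an explicit gene entry
demo_gtf = [
    "##description: toy annotation",
    'chr1\ttoy\ttranscript\t100\t500\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    'chr1\ttoy\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
]
print(missing_features(demo_gtf))  # ['gene']
```

An empty list means all three feature types are present at least once; it does not guarantee that every transcript has a matching gene entry.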
<a name="run_talon"></a>Running TALON
Now that you've initialized your database and checked your reads for evidence of internal priming, you're ready to annotate them. The input database is modified in place to track and quantify transcripts in the provided dataset(s). In a TALON run, each input SAM read is compared to known and previously observed novel transcript models on the basis of its splice junctions. This allows us not only to assign a novel gene or transcript identity where appropriate, but also to track new transcript models and characterize how they differ from known ones. The types of novelty assigned are shown in this diagram. <img align="left" width="450" src="figs/novelty.png">
To run the talon annotator, create a comma-delimited configuration file with the following four columns: name, sample description, platform, sam file (full path). There should be one line for each dataset, and dataset names must be unique. If you decide later to add more datasets to an existing analysis, you can do so by creating a new config file for this data and running TALON again on the existing database.
If you're using the --cb option, the dataset names will be pulled from the SAM CB tag, making the first column of the config file unnecessary. Accordingly, when the --cb option is provided, TALON expects the config file to include only the following three columns: sample description, platform, sam file (full path).
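The config file layout described above can be generated with a few lines of Python. This is a minimal sketch, not a TALON utility: the function name `write_talon_config`, the dataset names, the platform labels, and the paths are all hypothetical. It enforces the column count and the unique-name rule, and rejects fields containing commas on the assumption that the file is split on plain commas.

```python
import io


def write_talon_config(rows, handle, use_cell_barcodes=False):
    """Write comma-delimited config lines to an open text handle.

    Without --cb: name, sample description, platform, sam file (4 columns).
    With --cb:    sample description, platform, sam file (3 columns).
    """
    expected = 3 if use_cell_barcodes else 4
    seen_names = set()
    for row in rows:
        fields = [str(field) for field in row]
        if len(fields) != expected:
            raise ValueError(f"expected {expected} columns, got {len(fields)}: {fields}")
        if any("," in field for field in fields):
            raise ValueError(f"fields must not contain commas: {fields}")
        if not use_cell_barcodes:
            if fields[0] in seen_names:
                raise ValueError(f"duplicate dataset name: {fields[0]}")
            seen_names.add(fields[0])
        handle.write(",".join(fields) + "\n")


# Hypothetical datasets; replace with your own names and full SAM paths
rows = [
    ("rep1", "sampleA", "PacBio", "/path/to/rep1_labeled.sam"),
    ("rep2", "sampleA", "ONT", "/path/to/rep2_labeled.sam"),
]
buffer = io.StringIO()
write_talon_config(rows, buffer)
print(buffer.getvalue(), end="")
```

To write a real file, pass a handle from `open("config.csv", "w")` instead of the in-memory buffer.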
Please note that TALON versions 4.4+ can be run in multithreaded fashion for a much faster runtime.
usage: talon [-h] [--f CONFIG_FILE] [--cb] [--db FILE,] [--build STRING,]
[--threads THREADS] [--cov MIN_COVERAGE]
[--identity MIN_IDENTITY] [--nsg] [--o OUTPREFIX]
optional arguments:
-h, --help show this help message and exit
--f CONFIG_FILE Dataset config file: dataset name, sample description,
platform, sam file (comma-delimited)
--db FILE, TALON database. Created using
talon_initialize_database
--cb Use cell barcode tags to determine dataset. Useful for
single-cell data. Requires 3-entry config file.
--build STRING, Genome build (i.e. hg38) to use. Must be in the
database.
--threads THREADS, -t THREADS
Number of threads to run program with.
--cov MIN_COVERAGE, -c MIN_COVERAGE
Minimum alignment coverage in order to use a SAM
entry. Default = 0.9
--identity MIN_IDENTITY, -i MIN_IDENTITY
Minimum alignment identity in order to use a SAM
entry. Default = 0.8
--nsg, --create_novel_spliced_genes
Make novel genes with the intergenic novelty label for
transcripts that don't share splice junctions with any
other models
--tmpDir
Path to directory for tmp files. Default = `talon_tmp/`
--o OUTPREFIX Prefix for output files
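If you launch TALON from a Python script, the options above can be assembled into an argument list. The sketch below uses only flags that appear in the usage text; the function name `build_talon_command` and the file names are hypothetical, and the command is only printed here, not executed.

```python
import shlex


def build_talon_command(config, db, build, outprefix, threads=None,
                        min_coverage=None, min_identity=None,
                        use_cell_barcodes=False):
    """Assemble a talon command line as a list of arguments."""
    cmd = ["talon", "--f", config, "--db", db, "--build", build]
    if use_cell_barcodes:
        cmd.append("--cb")
    if threads is not None:
        cmd += ["--threads", str(threads)]
    if min_coverage is not None:
        cmd += ["--cov", str(min_coverage)]
    if min_identity is not None:
        cmd += ["--identity", str(min_identity)]
    cmd += ["--o", outprefix]
    return cmd


# Hypothetical file names for illustration
cmd = build_talon_command("config.csv", "talon.db", "hg38", "my_run", threads=4)
print(shlex.join(cmd))
# talon --f config.csv --db talon.db --build hg38 --threads 4 --o my_run
```

Passing the list to `subprocess.run(cmd, check=True)` would run it, provided TALON is installed and the files exist. Remember that the database is modified in place.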
TALON generates two output files in the course of a run. The QC log (file with suffix 'QC.log') is useful for tracking why a particular read was or was not included in the TALON analysis.
Columns:
- dataset
- read_ID
- passed_QC (1/0)
- primary_mapped (1/0)
- read_length
- fraction_aligned
- Identity
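A quick way to use the QC log is to tally how many reads passed per dataset. The sketch below is an illustration only: it assumes the log is tab-delimited, starts with a header row, and has its columns in the order listed above. None of these details is confirmed here, so check your own file first. The function name `summarize_qc` and the toy rows are made up.

```python
import csv
from collections import Counter


def summarize_qc(lines, has_header=True):
    """Return {dataset: (reads_passed, reads_total)} from QC log lines.

    Assumes tab-delimited rows with dataset in column 1 and
    passed_QC (1/0) in column 3.
    """
    reader = csv.reader(lines, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the header row
    passed, total = Counter(), Counter()
    for row in reader:
        if len(row) < 7:
            continue  # skip incomplete rows
        dataset, passed_qc = row[0], row[2]
        total[dataset] += 1
        if passed_qc == "1":
            passed[dataset] += 1
    return {dataset: (passed[dataset], total[dataset]) for dataset in total}


# Toy log (invented values)
demo_log = [
    "dataset\tread_ID\tpassed_QC\tprimary_mapped\tread_length\tfraction_aligned\tIdentity",
    "rep1\tr1\t1\t1\t1500\t0.98\t0.95",
    "rep1\tr2\t0\t1\t800\t0.50\t0.90",
    "rep2\tr3\t1\t1\t2100\t0.99\t0.97",
]
print(summarize_qc(demo_log))  # {'rep1': (1, 2), 'rep2': (1, 1)}
```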
The second output file (suffix 'read_annot.tsv') appears at the very end of the run and contains a line for every read that was successfully annotated.
Columns:
- Name of individual read
- Name of dataset the read belongs to
- Name of genome build used in TALON run
- Chromosome
- Read start position (1-based). This refers to the 5' end start, so for reads on the - strand, this number will be larger than the read end (col 6).
- Read end position (1-based). This refers to the 3' end stop, so for reads on the - strand, this number will be smaller than the read start (col 5).
