Qapa
RNA-seq Quantification of Alternative Polyadenylation
Install / Use
/learn @morrislab/QapaREADME
RNA-seq Quantification of Alternative Polyadenylation (QAPA)
Analysis of alternative polyadenylation (APA) from RNA-seq data (human and mouse). QAPA consists of two main components:
- Extraction and annotation of 3′ UTR sequences from gene models
- Calculation of relative usage of alternative 3′ UTR isoforms based on transcript-level abundance.
Note that QAPA itself does not perform transcript quantification. It relies on other tools such as Sailfish and Salmon.
Installation
QAPA consists of both Python and R scripts. A conda virtual environment
can be created using the provided environment.yml file.
-
Clone the repo:
git clone https://github.com/morrislab/qapa.git cd qapa -
Create the environment
conda env create -f environment.yml conda activate qapa -
Test that
qapacommand is availableqapa --help
Usage
QAPA has three sub-commands:
build: Generate a 3′ UTR library from annotationsfasta: Extract sequences for indexing by transcript quantification toolsquant: Calculate relative usage of alternative 3′ UTR isoforms
1) Build 3′ UTRs from annotation (build)
Download pre-compiled libraries
Pre-compiled libraries for human and mouse are available for download below. Otherwise, continue reading to build from scratch.
Updated in v1.3.0, the following libraries are pre-compiled with Polyasite V2:
(Old) Versions v1.2.3 and earlier:
Prepare annotation files
To run build, gene and poly(A) annotation sources need to be prepared:
A. Gene annotation
-
Ensembl gene metadata table from Biomart.
Human and mouse tables are provided in the
examplesfolder. To obtain a fresh copy, download a table of Ensembl Genes from Biomart with the following attributes:- Ensembl Gene ID
- Ensembl Transcript ID
- Gene Type
- Transcript Type
- Gene Name
Alternatively, download via MySQL (see here for more details):
mysql --user=anonymous --host=martdb.ensembl.org --port=5316 -A ensembl_mart_89 \ -e "select stable_id_1023 as 'Gene stable ID', stable_id_1066 as 'Transcript stable ID', \ biotype_1020 as 'Gene type', biotype_1064 as 'Transcript type', \ display_label_1074 as 'Gene name' from mmusculus_gene_ensembl__transcript__main" \ > ensembl_identifiers.txtTo change the species, replace the table name (e.g. for human, use
hsapiens_gene_ensembl__transcript__main). -
GENCODE gene prediction annotation table
Download from UCSC Table Browser or alternatively via MySQL (see here for more details). For example, to download mm10 gene predictions:
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A \ -e "select * from wgEncodeGencodeBasicVM9" mm10 > gencode.basic.txtAlternatively, if you are starting from a GTF/GFF file, you can convert it to genePred format using the UCSC tool
gtfToGenePred:gtfToGenePred -genePredExt custom_genes.gtf custom_genes.genePredNote that it is important to include the
-genePredExtoption!
B. Poly(A) site annotation (optional)
This step can be skipped, otherwise continue reading. Two options are available:
Option 1: standard approach using PolyASite and GENCODE poly(A) track (as described in the paper)
-
PolyASite database
Download BED files (human or mouse) from http://polyasite.unibas.ch/.
-
GENCODE poly(A) sites track
Download from UCSC Table Browser or alternatively via MySQL. For example, to download mm10 annotations:
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A \ -e "select chrom, txStart, txEnd, name2, score, strand \ from wgEncodeGencodePolyaVM9 where name2 = 'polyA_site'" -N mm10 \ > gencode.polyA_sites.bed
Option 2: use custom BED track of poly(A) sites
A custom BED file of poly(A) sites can be used to annotate 3′ UTRs. Each entry must contain the start (0-based) and end coordinate of a poly(A) site.
Commands
Once the data files have been prepared, we can then use build to create the 3'
UTR library. See qapa build -h for usage details. The following
describes several example use cases:
-
To extract 3′ UTRs from annotation, run:
qapa build --db ensembl_identifiers.txt -g gencode.polyA_sites.bed -p clusters.mm10.bed gencode.basic.txt > output_utrs.bed -
If using a custom BED file, replace the
-gand-poptions with-o:qapa build --db ensembl_identifiers.txt -o custom_sites.bed gencode.basic.txt > output_utrs.bed -
If using a custom genePred file converted from GTF, supply the file as in 1. (e.g. the first positional argument):
qapa build --db ensembl_identifiers.txt -o custom_sites.bed custom_genes.genePred > output_utrs.bed -
If bypassing the poly(A) annotation step, include the
-Noption:qapa build -N --db ensembl_identifiers.txt gencode.basic.txt > output.utrs.bed
Results will be saved in the file output_utrs.bed (default is STDOUT).
It is important that the sequence IDs are not modified as it will be parsed by
quant below.
Additional notes:
- 3' UTRs that contain introns will be skipped.
- Chromosome names that contain underscores are currently not supported and will be skipped.
- Only genes with
Gene Type = 'protein_coding'will be considered.
Troubleshooting tips
- Use
--debugoption to produce more verbose logging messages - Use
--saveand--temp <dir_path>to save intermediate files generated byqapa build.<dir_path>should be a user accessible directory.
2) Extract 3′ UTR sequences (fasta)
To extract sequences from the BED file prepared by build, a reference genome in
FASTA format is required. e.g. http://hgdownload.soe.ucsc.edu/downloads.html.
Then, run the command:
qapa fasta -f genome.fa output_utrs.bed output_sequences.fa
Essentially fasta is a wrapper that calls bedtools getfasta. Note that
genome.fa must be uncompressed. Sequences will be saved in
output_sequences.fa.
Since the initial publication, Salmon added a 'selective alignment' procedure (Srivastava, Malik et al., Genome Biology, 2020). The 'full' selective alignment jointly aligns the reads to the input transcriptome and the genome, discarding alignments that map better to the genome than the transcriptome. To generate a fasta file compatible with Salmon's selective alignment procedure (a 'decoy-aware transcriptome'), run qapa fasta with the --decoys flag:
qapa fasta -f genome.fa --decoys output_utrs.bed output_sequences.fa
This will save the transcript sequences appended with the genome sequence from genome.fa to output_sequences.fa. This also generates a file named decoys.txt, which contains the sequence identifiers from the genome sequence to differentiate target transcripts from decoy sequences. The name of the text file can be controlled using the optional -d or --decoys-output-file argument.
3) Quantify 3′ UTR isoform usage (quant)
Expression quantification of 3′ UTR isoforms must be carried out first using the
FASTA file prepared by fasta as the index. For example, to index the sequences
using Salmon:
salmon index -t output_sequences.fa -i utr_library
If you generated a genome-augmented fasta file with qapa fasta --decoys, additionally pass the decoys.txt file to salmon index:
salmon index -t output_sequences.fa -i utr_library --decoys decoys.txt
Following expression quantification, QAPA expects the results to be located inside its own sub-directory. For example, typical Sailfish/Salmon results may appear with the following directory structure:
project/
|-- sample1/quant.sf
|-- sample2/quant.sf
|-- (etc.)
The quant sub-command calls two R scripts:
create_merged_table.R and compute_pau.R. The first script combines the
quantifications from each sample into a single table. The second script computes
the relative proportion of each isoform in a gene, measured as Poly(A) Usage
(PAU) (compute_pau.R).
qapa quant --db ensembl_identifiers.mm10.txt project/sample*/quant.sf > pau_results.txt
Results will be saved in the file pau_results.txt (default is STDOUT).
For advanced usage, the R scripts can be run on its own. Run
create_merged_table.R -h and compute_pau.R -h for usage details.
The final output format is as follows:
Column | Description
------ | -----------
APA_ID | unique identifier consisting of in the format <Ensembl Gene ID>_<number>_<(P\|D\|S)>, where P = proximal, D = distal, and S = single
Transcript | one or more Ensembl Transcript IDs
Gene | Ensembl Gene ID
Gene_Name | gene symbol
Chr | chromosome
LastExon.Start | start coordinate of last exon
LastExon.End | end coordinate of last exon
Strand | + or -
UTR3.Start | start coordinate of 3′ UTR
UTR3.End | end coordinate of 3′ UTR
L
