Yanagi: Transcript Segment Library Construction for RNA-Seq Quantification

Source code based on the work presented in Yanagi: Fast and interpretable segment-based alternative splicing and gene expression analysis (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2947-6)

Update Nov 18th, 2019: Major changes have been pushed to improve the usability of the pipeline. This update also introduces Yanagi-count, an alignment tool based on RapMap's quasi-mapping, as the preferred way to obtain segment counts.

Abstract

Analysis of differential alternative splicing from RNA-seq data is complicated by the fact that many RNA-seq reads map to multiple transcripts, and that annotated transcripts from a given gene are often a small subset of many possible complete transcripts for that gene. Here we describe Yanagi, a tool which segments a transcriptome into disjoint regions to create a segments library from a complete transcriptome annotation that preserves all of its consecutive regions of a given length L while distinguishing annotated alternative splicing events in the transcriptome. In this paper, we formalize this concept of transcriptome segmentation and propose an efficient algorithm for generating segment libraries based on a length parameter dependent on specific RNA-Seq library construction. The resulting segment sequences can be used with pseudo-alignment tools to quantify expression at the segment level. We characterize the segment libraries for the reference transcriptomes of Drosophila melanogaster and Homo sapiens. Finally, we demonstrate the utility of quantification using a segment library based on an analysis of differential exon skipping in Drosophila melanogaster and Homo sapiens. The notion of transcript segmentation as introduced here and implemented in Yanagi will open the door for the application of lightweight, ultra-fast pseudo-alignment algorithms in a wide variety of analyses of transcription variation.

Usage

SET UP

Requirements

Yanagi has been developed and tested in Python 3.7 and R 3.5. Yanagi uses the following modules:

Python:
- tqdm
R (Bioconductor):
- GenomicFeatures
- Biostrings

Download

Download Yanagi by cloning the repository, either through the Clone or download button at the top right of the repository page or by running git clone. Then change into the newly created directory that contains the Yanagi source.

git clone https://github.com/HCBravoLab/yanagi.git
cd yanagi

Command and subcommand structure

Yanagi works with a command/subcommand structure:

yanagi.py subcommand options

where the subcommand can be one of these options:

preprocess : Preprocesses the transcriptome annotation by breaking exons into disjoint exonic bins and finding their transcript mappings.
segment : Generates a set of maximal L-disjoint segments from the preprocessed transcriptome annotation.
align : Pseudo-aligns reads (single or paired-end) to the segments and obtains segment counts (single-segment or segment-pair counts).
psiCalc : Calculates PSI values of alternative splicing events based on their segment mappings.

Note: This tutorial assumes that all commands are executed from inside the directory where Yanagi was downloaded (refer to the previous Download section).
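
For readers unfamiliar with the command/subcommand pattern, the following is a minimal sketch of how such an interface is commonly built with Python's argparse. It is an illustration only, not Yanagi's actual source: the function name build_parser is hypothetical, only a subset of the options documented in this README is wired in, and whether each option is mandatory is left unspecified.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented subcommand layout."""
    parser = argparse.ArgumentParser(prog="yanagi.py")
    # One sub-parser per subcommand; the chosen name lands in args.subcommand.
    subs = parser.add_subparsers(dest="subcommand", required=True)

    pre = subs.add_parser("preprocess")
    pre.add_argument("-gtf")  # annotation file (GTF)
    pre.add_argument("-fa")   # genome sequence file (FASTA)
    pre.add_argument("-o")    # work directory

    seg = subs.add_parser("segment")
    seg.add_argument("-l", "--max-overlap", type=int)
    seg.add_argument("-wd", "--work-dir")

    # Options of the remaining subcommands are omitted in this sketch.
    subs.add_parser("align")
    subs.add_parser("psiCalc")
    return parser


args = build_parser().parse_args(["segment", "-l", "100", "-wd", "work"])
print(args.subcommand, args.max_overlap, args.work_dir)
```

Each subcommand gets its own option namespace, which is why -o can mean the work directory in preprocess and an output name prefix in segment.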

Annotation Preprocessing

Exons (and retained introns) in the transcriptome annotation can overlap within a gene (e.g. in 3'/5' splicing) or across genes. For Yanagi to guarantee the L-disjointness property of the generated segments, a preprocessing step is needed to generate disjoint exonic bins. Yanagi generates the disjoint exonic bins and their transcript mappings from an input annotation file (GTF format) and the genome sequence file (FASTA format). (Note that the .fa file should contain the genome sequence, not the transcript sequences.)
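
The idea of breaking overlapping exons into disjoint bins can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Yanagi's implementation: intervals are inclusive (start, end) pairs on a single strand of a single chromosome, the function name disjoint_bins is hypothetical, and the coordinates are made up.

```python
def disjoint_bins(exons):
    """Split possibly overlapping inclusive intervals into disjoint bins."""
    # Every exon start opens a bin, and the position right after every
    # exon end opens the next one.
    cuts = sorted({s for s, _ in exons} | {e + 1 for _, e in exons})
    bins = []
    for lo, nxt in zip(cuts, cuts[1:]):
        hi = nxt - 1
        # Keep the piece only if at least one exon fully covers it;
        # otherwise it is an intronic or intergenic gap.
        if any(s <= lo and hi <= e for s, e in exons):
            bins.append((lo, hi))
    return bins


# Two overlapping exons and one separate exon (made-up coordinates).
print(disjoint_bins([(1, 10), (5, 15), (20, 25)]))
# → [(1, 4), (5, 10), (11, 15), (20, 25)]
```

Each original exon is then a run of consecutive bins, which is the kind of information the transcripts-to-bins mapping records.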

Command and options

To preprocess the transcriptome annotation for segmentation, run the following command:

python yanagi.py preprocess -gtf <gtf-file> -fa <fasta-file> -o <work-directory>

Note that throughout this tutorial, the same directory <work-directory> is used as the working directory wherever one is needed by the different commands.
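
When the pipeline is driven from a script, a small helper makes the shared working directory explicit. The sketch below only assembles argument lists from the flags documented in this README; the helper name pipeline_commands and the file names are hypothetical, and nothing is executed.

```python
def pipeline_commands(gtf, fasta, work_dir, read_length):
    """Build the preprocess and segment calls around one work directory."""
    preprocess = ["python", "yanagi.py", "preprocess",
                  "-gtf", gtf, "-fa", fasta, "-o", work_dir]
    # segment reads the files that preprocess wrote into the same directory.
    segment = ["python", "yanagi.py", "segment",
               "-l", str(read_length), "-wd", work_dir]
    return preprocess, segment


pre, seg = pipeline_commands("annotation.gtf", "genome.fa", "work", 100)
for cmd in (pre, seg):
    print(" ".join(cmd))
# To run a command from the yanagi directory, one could pass the list to
# subprocess.run(cmd, check=True).
```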

Output files

The preprocess operation outputs two main files:

disjoint_bins.tsv: A file with the structural and sequence information of each constructed disjoint exonic bin.

	chr	start	end	strand	seq
1	1	11869	11871	+	GTT
2	1	11872	11873	+	AA
3	1	11874	12009	+	CTTGCCGTCAGCCTTTTCTTTGACCTCTTCTTTCTGTTCATGTGTATTTGCTGTCTCT...
4	1	12010	12057	+	GTGTCTGACTTCCAGCAACTGCTGGCCTGTGCCAGGGTGCAAGCTGAG
...

txs2bins.tsv: A file with transcripts-to-bins information.

	chr	geneID	txID	bins	strand
1	1	ENSG00000223972	ENST00000456328	1,2,3,4,5,6,8,9,11,12,13,14,15,16,17,18	+
2	1	ENSG00000223972	ENST00000515242	2,3,4,5,6,8,9,12,13,14,15,16,17,18,19	+
3	1	ENSG00000223972	ENST00000518655	3,4,5,6,7,8,9,14,15,17,18	+
...

This step also generates a third file, exons2bins.tsv, which maps the exons/introns annotated in the .gtf file to the disjoint exonic bins (reported in disjoint_bins.tsv) that serve as the building blocks of the splice graph used inside Yanagi.
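
For downstream scripting, the txs2bins.tsv layout shown above can be loaded with Python's standard csv module. This is a minimal sketch that assumes the file is tab-separated and that its header starts with an unnamed index column, as in the listing; the function name read_txs2bins is hypothetical. The demo row is copied from the sample above.

```python
import csv
import io


def read_txs2bins(handle):
    """Map each transcript ID to its ordered list of bin IDs."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {
        row["txID"]: [int(b) for b in row["bins"].split(",")]
        for row in reader
    }


# In practice one would pass open("<work-directory>/txs2bins.tsv").
demo = io.StringIO(
    "\tchr\tgeneID\ttxID\tbins\tstrand\n"
    "3\t1\tENSG00000223972\tENST00000518655\t3,4,5,6,7,8,9,14,15,17,18\t+\n"
)
tx_bins = read_txs2bins(demo)
print(tx_bins["ENST00000518655"])
```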

Alternative Splicing Events Generation

If the downstream analysis involves studying alternative splicing events present in the transcriptome, this step is needed to prepare the annotation of those events (skip it otherwise). Yanagi uses the same event definitions and code as SUPPA (the eventGenerator command).

To generate the list of events from the (unzipped) GTF of the transcriptome, run the following command:

python eventGenerator.py -i <gtf-file> -o <output-directory-and-prefix> -f ioe -e <list-of-event-types-space-separated>

List of options available:

-e | --event-type: (only used for local AS events) space separated list of events to generate from the following list:
- SE: Skipping exon (SE)
- SS: Alternative 5' (A5) or 3' (A3) splice sites (generates both)
- MX: Mutually Exclusive (MX) exons
- RI: Retained intron (RI)
- FL: Alternative First (AF) and Last (AL) exons (generates both)

A description and definition of each event type can be found on SUPPA's page. The command generates a separate .ioe file listing the events of each event type provided in the event-type option. The shell script merge_ioe_files.sh can be edited to merge the separate .ioe files into one file, or to filter out events outside of the primary transcriptome assembly.
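
The merge itself is simple enough to sketch in Python for readers who prefer not to edit the shell script. The sketch assumes that every .ioe file begins with exactly one header line and that all files share that header; it is not a port of merge_ioe_files.sh and does no filtering. The function name merge_ioe and the placeholder lines are hypothetical.

```python
def merge_ioe(tables):
    """Concatenate .ioe tables (lists of lines), keeping a single header."""
    merged = []
    for lines in tables:
        if not lines:
            continue  # skip empty files
        header, body = lines[0], lines[1:]
        if not merged:
            merged.append(header)  # keep the header of the first file only
        merged.extend(body)
    return merged


# Placeholder lines stand in for real .ioe content.
se_events = ["<header>", "<SE event 1>", "<SE event 2>"]
ri_events = ["<header>", "<RI event 1>"]
print(merge_ioe([se_events, ri_events]))
```

With real files, each table would come from reading one .ioe file line by line, and the merged list would be written back out as a single file.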

Segments Generation

This command performs Yanagi's main operation: preparing the segments library that is later used for RNA-seq read alignment. Yanagi takes the preprocessed transcriptome as input to build the segments graph, which is then parsed to generate minimal L-disjoint segments.

Yanagi's Segments Example

Fig 1. The figure shows an illustrative example of transcriptome segmentation of one gene with three transcripts. The example shows the final segments generated by yanagi and how reads are aligned to them.

Command and options

To segment the transcriptome, run the following command:

python yanagi.py segment -l <read-length> -wd <work-directory>

List of options available:

-l | --max-overlap: This integer parameter controls the maximum overlap between any two (L-disjoint) generated segments. A typical choice for -l is the expected read length. Refer to Yanagi's paper for more details on L-disjointness.
-wd | --work-dir: The work directory where the preprocessed annotation files exist (the same output directory used in the preprocess subcommand). This directory must contain the two files disjoint_bins.tsv and txs2bins.tsv.
-o | --output-name: (Optional) A name prefix used to name the output files. If not provided, the default output files are named in the format segs_<L>.
-ioe | --events-annotation: (Optional) A list of .ioe files annotating the alternative splicing events present in the corresponding transcriptome. Needed only if downstream analysis of alternative splicing events is planned. More details in the Segment-Based PSI Calculation section.
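
One way to build intuition for the -l parameter is to look for stretches of L consecutive bases shared by two segment sequences. The check below is a sketch under an assumed, sequence-level reading of the overlap bound; the formal definition of L-disjointness is given in Yanagi's paper, and repeated genomic sequence could make this check report matches that the formal definition would not. The function name shared_lmers and the toy sequences are hypothetical.

```python
def shared_lmers(seg_a, seg_b, L):
    """Return the substrings of length L present in both sequences."""
    lmers_a = {seg_a[i:i + L] for i in range(len(seg_a) - L + 1)}
    lmers_b = {seg_b[i:i + L] for i in range(len(seg_b) - L + 1)}
    return lmers_a & lmers_b


# Toy sequences: the last 3 bases of the first equal the first 3 of the second.
a, b = "ACGTACGG", "CGGTTT"
print(shared_lmers(a, b, 4))  # → set()
print(shared_lmers(a, b, 3))  # → {'CGG'}
```

In this toy case the two sequences overlap by 3 bases, so no 4-base stretch is shared, while a 3-base stretch is.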

Output files

The segmentation operation outputs three files:

<output-name>.fa: A FASTA file of the segments library representing the transcriptome.

>SEG0000001
GCTAGATGCGGACACCTGGACCGCCGCGCCGAGGCTCCCGGCGCTCGCTGCTCCCGCGGCCCGCGCCATGCCCTCCT...
>SEG0000002
CCTGGACCGCCGCGCCGAGGCTCCCGGCGCTCGCTGCTCCCGCGGCCCGCGCCATGCCCTCCTACACGGT...
>SEG0000003
GGAATGACTTCGCCGACTTTGAGAAAATCTTTGTCAAGATCAGCAACACTATTTCTGAGCGGGTCATGAATCACTG...
>SEG0000004
GATCCGGCGCTGCACAGAGCTGCCCGAGAAGCTCCCGGTGACCACGGAGATGGTAGAGTGCAGCCTGGAG...
...
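
The segments FASTA can be inspected with a few lines of standard-library Python, for example to look at segment lengths before alignment. This is a minimal reader sketch; the function name read_fasta is hypothetical, and the demo sequences are short placeholders rather than real segments.

```python
import io


def read_fasta(handle):
    """Return {record_id: sequence} from a FASTA stream."""
    records, name = {}, None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: the ID is the first token after '>'.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            # Sequence lines may be wrapped; collect and join them later.
            records[name].append(line)
    return {seg_id: "".join(parts) for seg_id, parts in records.items()}


# In practice one would pass open("<output-name>.fa").
demo = io.StringIO(">SEG0000001\nACGT\nAC\n>SEG0000002\nGGCC\n")
segments = read_fasta(demo)
print({seg_id: len(seq) for seg_id, seq in segments.items()})
```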

<output-name>.fa.meta: A file of metadata describing the structure of each segment and how it was formed.

segID	chrom	geneID	txAnnIDs	binIDs	st	end	strand
SEG0000001	10	ENSG00000012779	ENST00000542434	57010,57011	45869661	45869774	+
SEG0000002	10	ENSG00000012779	ENST00000374391,ENST00000542434	57011,57012,5701
