README for RSEM

Bo Li (bli28 at mgh dot harvard dot edu)

Introduction
Compilation & Installation
Usage
- Build RSEM references using RefSeq, Ensembl, or GENCODE annotations
- Build RSEM references for untypical organisms
Example
Simulation
Generate Transcript-to-Gene-Map from Trinity Output
Differential Expression Analysis
Prior-Enhanced RSEM (pRSEM)
Authors
Acknowledgements
License

<a name="introduction"></a> Introduction

RSEM is a software package for estimating gene and isoform expression levels from RNA-Seq data. The RSEM package provides a user-friendly interface and supports threads for parallel computation of the EM algorithm, single-end and paired-end read data, quality scores, variable-length reads and RSPD estimation. In addition, it provides posterior mean and 95% credibility interval estimates for expression levels. For visualization, it can generate BAM and Wiggle files in both transcript and genomic coordinates. Genomic-coordinate files can be visualized by both the UCSC Genome Browser and the Broad Institute's Integrative Genomics Viewer (IGV). Transcript-coordinate files can be visualized by IGV. RSEM also has its own scripts to generate transcript read depth plots in PDF format. A unique feature of RSEM is that the read depth plots can be stacked, with read depth contributed by unique reads shown in black and read depth contributed by multi-reads shown in red. In addition, models learned from data can also be visualized. Last but not least, RSEM contains a simulator.

<a name="compilation"></a> Compilation & Installation

To compile RSEM, simply run

make

For Cygwin users, run

make cygwin=true

To compile EBSeq, which is included in the RSEM package, run

make ebseq

To install RSEM, simply put the RSEM directory in your environment's PATH variable. Alternatively, run

make install

By default, RSEM executables are installed to /usr/local/bin. You can change the installation location by setting DESTDIR and/or prefix variables. The RSEM executables will be installed to ${DESTDIR}${prefix}/bin. The default values of DESTDIR and prefix are DESTDIR= and prefix=/usr/local. For example,

make install DESTDIR=/home/my_name prefix=/software

will install RSEM executables to /home/my_name/software/bin.
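
As a minimal sketch of how the two variables combine, the snippet below computes the directory that make install would write to and adds it to PATH for the current session. The helper name rsem_bin_dir is hypothetical and exists only for illustration; the paths mirror the example above.

```shell
# Hypothetical helper: print ${DESTDIR}${prefix}/bin using the documented
# defaults (DESTDIR is empty, prefix is /usr/local) when arguments are omitted.
rsem_bin_dir() {
    printf '%s%s/bin\n' "${1:-}" "${2:-/usr/local}"
}

rsem_bin_dir                            # prints /usr/local/bin
rsem_bin_dir /home/my_name /software    # prints /home/my_name/software/bin

# Make the installed executables visible to the current shell session.
export PATH="$(rsem_bin_dir /home/my_name /software):$PATH"
```

To make the change permanent, the export line would normally go into your shell start-up file.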

Note that make install does not install the EBSeq-related scripts, such as rsem-generate-ngvector, rsem-run-ebseq, and rsem-control-fdr. However, rsem-generate-data-matrix, which generates a count matrix for differential expression analysis, is installed.

Prerequisites

A C++ compiler, Perl and R must be installed.

To use the --gff3 option of rsem-prepare-reference, Python is also required to be installed.

To take advantage of RSEM's built-in support for the Bowtie/Bowtie 2/STAR/HISAT2 alignment program, you must have Bowtie/Bowtie2/STAR/HISAT2 installed.
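
One way to check these prerequisites before compiling is sketched below. The helper missing_tools is hypothetical, and the executable names are assumptions about a typical setup: g++ stands in for whichever C++ compiler you use, Rscript for R, and python may be python3 on your system.

```shell
# Hypothetical helper: print every named program that is not found on PATH,
# one per line. Prints nothing when all of them are available.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Core requirements (executable names are assumptions, see above).
missing_tools g++ make perl Rscript

# Optional: Python for --gff3, plus whichever aligner you plan to use.
missing_tools python bowtie bowtie2 STAR hisat2
```

Only the aligner you intend to run needs to be present, so a missing entry in the second list is not necessarily a problem.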

<a name="usage"></a> Usage

I. Preparing Reference Sequences

RSEM can extract reference transcripts from a genome if you provide it with gene annotations in a GTF/GFF3 file. Alternatively, you can provide RSEM with transcript sequences directly.

Please note that GTF files generated from the UCSC Table Browser do not contain isoform-gene relationship information. However, if you use the UCSC Genes annotation track, this information can be recovered by downloading the knownIsoforms.txt file for the appropriate genome.
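
The isoform-gene relationships can then be passed to rsem-prepare-reference through its --transcript-to-gene-map option, which expects one gene ID and one transcript ID per line, separated by a tab (run rsem-prepare-reference --help to confirm the format for your version). The sketch below checks a map file against that layout first. The helper check_gene_map is hypothetical.

```shell
# Hypothetical helper: succeed only if every line of the map file has exactly
# two non-empty, tab-separated fields (gene or cluster ID, then transcript ID).
check_gene_map() {
    awk -F '\t' '
        NF != 2 || $1 == "" || $2 == "" { bad = 1 }
        END { exit bad + 0 }
    ' "$1"
}
```

With placeholder file names for the UCSC GTF file and the genome, the check and the build could be combined as

check_gene_map knownIsoforms.txt && rsem-prepare-reference --gtf ucsc_genes.gtf --transcript-to-gene-map knownIsoforms.txt genome.fa ref/human_ucsc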

To prepare the reference sequences, you should run the rsem-prepare-reference program. Run

rsem-prepare-reference --help

to get usage information or visit the rsem-prepare-reference documentation page.

<a name="built"></a> Build RSEM references using RefSeq, Ensembl, or GENCODE annotations

RefSeq and Ensembl are two frequently used annotations. For human and mouse, GENCODE annotations are also available. In this section, we show how to build RSEM references using these annotations. Note that it is important to pair each annotation file with the genome from the same source. In addition, we recommend that users use the primary assemblies of genomes. Without loss of generality, we use the human genome as an example and also build Bowtie indices.

For RefSeq, the genome and annotation file in GFF3 format can be found at RefSeq genomes FTP:

ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/

For example, the human genome and GFF3 file are located in the subdirectory vertebrate_mammalian/Homo_sapiens/all_assembly_versions/GCF_000001405.31_GRCh38.p5. GCF_000001405.31_GRCh38.p5 was the latest annotation version when this section was written.

Download and decompress the genome and annotation files to your working directory:

ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/all_assembly_versions/GCF_000001405.31_GRCh38.p5/GCF_000001405.31_GRCh38.p5_genomic.fna.gz
ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/all_assembly_versions/GCF_000001405.31_GRCh38.p5/GCF_000001405.31_GRCh38.p5_genomic.gff.gz
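
The download and decompression step can be scripted. The sketch below assumes curl and gunzip are installed; the helper names decompressed_name and fetch_and_gunzip are hypothetical.

```shell
# Print the name a downloaded archive will have once decompressed:
# everything after the last slash of the URL, without the .gz suffix.
decompressed_name() {
    archive="${1##*/}"
    printf '%s\n' "${archive%.gz}"
}

# Download one gzip-compressed file into the current directory under its
# remote name, then decompress it in place.
fetch_and_gunzip() {
    curl -fsSL -O "$1" && gunzip -f "${1##*/}"
}
```

For the two RefSeq files above, usage would look like

base=ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/all_assembly_versions/GCF_000001405.31_GRCh38.p5
fetch_and_gunzip "$base/GCF_000001405.31_GRCh38.p5_genomic.fna.gz"
fetch_and_gunzip "$base/GCF_000001405.31_GRCh38.p5_genomic.gff.gz"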

GCF_000001405.31_GRCh38.p5_genomic.fna contains all top-level sequences, including patches and haplotypes. To obtain the primary assembly, run the following RSEM Python script:

rsem-refseq-extract-primary-assembly GCF_000001405.31_GRCh38.p5_genomic.fna GCF_000001405.31_GRCh38.p5_genomic.primary_assembly.fna

Then type the following command to build RSEM references:

rsem-prepare-reference --gff3 GCF_000001405.31_GRCh38.p5_genomic.gff \
		       --trusted-sources BestRefSeq,Curated\ Genomic \
		       --bowtie \
		       GCF_000001405.31_GRCh38.p5_genomic.primary_assembly.fna \
		       ref/human_refseq

In the above command, --trusted-sources tells RSEM to only extract transcripts from RefSeq sources like BestRefSeq or Curated Genomic. By default, RSEM trusts all sources. There is also a --gff3-RNA-patterns option, whose default is mRNA. Setting --gff3-RNA-patterns mRNA,rRNA will allow RSEM to extract all mRNAs and rRNAs from the genome. See the rsem-prepare-reference documentation page for more details.

Because the gene and transcript IDs (e.g. gene1000, rna28655) extracted from RefSeq GFF3 files are hard to understand, it is recommended to turn on the --append-names option in rsem-calculate-expression for better interpretation of quantification results.

For Ensembl, the genome and annotation files can be found at Ensembl FTP.

Download and decompress the human genome and GTF files:

ftp://ftp.ensembl.org/pub/release-83/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
ftp://ftp.ensembl.org/pub/release-83/gtf/homo_sapiens/Homo_sapiens.GRCh38.83.gtf.gz

Then use the following command to build RSEM references:

rsem-prepare-reference --gtf Homo_sapiens.GRCh38.83.gtf \
		       --bowtie \
		       Homo_sapiens.GRCh38.dna.primary_assembly.fa \
		       ref/human_ensembl

If you want to use a GFF3 file instead, which is unnecessary and not recommended, you should add the option --gff3-RNA-patterns transcript because mRNA is replaced by transcript in Ensembl GFF3 files.

GENCODE only provides human and mouse annotations. The genome and annotation files can be found on the GENCODE website.

Download and decompress the human genome and GTF files:

ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_24/GRCh38.primary_assembly.genome.fa.gz
ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_24/gencode.v24.annotation.gtf.gz

Then type the following command:

rsem-prepare-reference --gtf gencode.v24.annotation.gtf \
		       --bowtie \
		       GRCh38.primary_assembly.genome.fa \
		       ref/human_gencode

Similar to Ensembl annotation, if you want to use GFF3 files (not recommended), add option --gff3-RNA-patterns transcript.
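
For completeness, a sketch of the GFF3 variant is given below. It wraps the same rsem-prepare-reference call as above but swaps --gtf for --gff3 and adds the RNA pattern. The function name build_from_gff3 is hypothetical and all file names passed to it are placeholders.

```shell
# Hypothetical wrapper: build an RSEM reference with Bowtie indices from an
# Ensembl- or GENCODE-style GFF3 file, where transcripts are labeled
# "transcript" rather than "mRNA".
# Arguments: GFF3 file, genome FASTA file, reference name.
build_from_gff3() {
    gff3="$1"; genome="$2"; ref="$3"
    rsem-prepare-reference --gff3 "$gff3" \
                           --gff3-RNA-patterns transcript \
                           --bowtie \
                           "$genome" "$ref"
}
```

A call would take the form build_from_gff3 annotation.gff3 GRCh38.primary_assembly.genome.fa ref/human_gencode_gff3, with annotation.gff3 replaced by your decompressed GFF3 file.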

<a name="untypical"></a> Build RSEM references for untypical organisms

For untypical organisms, such as viruses, you may only have a GFF3 file that contains genes but no transcripts. In that case, turn on --gff3-genes-as-transcripts so that RSEM treats each gene as a unique transcript.

Here is an example command:

rsem-prepare-reference --gff3 virus.gff \
               --gff3-genes-as-transcripts \
               --bowtie \
               virus.genome.fa \
               ref/virus

II. Calculating Expression Values

To calculate expression values, you should run the rsem-calculate-expression program. Run

rsem-calculate-expression --help

to get usage information or visit the rsem-calculate-expression documentation page.

Calculating expression values from single-end data

For single-end models, users have the option of providing a fragment length distribution via the --fragment-length-mean and --fragment-length-sd options. The specification of an accurate fragment length distribution is important for the accuracy of expression level estimates from single-end data. If the fragment length mean and sd are not provided, RSEM will not take a fragment length distribution into consideration.
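
If you have a list of fragment lengths for your library, the two values can be derived from it. The sketch below is one way to do so and is not part of RSEM: the helper fragment_stats is hypothetical, it assumes one length per line on standard input, and it reports the sample standard deviation.

```shell
# Hypothetical helper: read one fragment length per line and print the mean
# and the sample standard deviation, separated by a space.
fragment_stats() {
    awk '{ n++; sum += $1; sumsq += $1 * $1 }
         END {
             if (n < 2) exit 1            # sd is undefined for fewer than 2 values
             mean = sum / n
             var = (sumsq - n * mean * mean) / (n - 1)
             if (var < 0) var = 0         # guard against rounding error
             printf "%.1f %.1f\n", mean, sqrt(var)
         }'
}

printf '180\n200\n220\n' | fragment_stats    # prints 200.0 20.0
```

The results can then be handed to rsem-calculate-expression. Here fragment_lengths.txt, reads.fq and sample are placeholder names:

set -- $(fragment_stats < fragment_lengths.txt)
rsem-calculate-expression --fragment-length-mean "$1" --fragment-length-sd "$2" reads.fq ref/human_ensembl sample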

Using an alternative aligner

By default, RSEM automates the alignment of reads to reference transcripts using the Bowtie aligner. Turning on --bowtie2 for both rsem-prepare-reference and rsem-calculate-expression allows RSEM to use the Bowtie 2 alignment program instead.
