RSEM
RSEM: accurate quantification of gene and isoform expression from RNA-Seq data
README for RSEM
Bo Li (bli28 at mgh dot harvard dot edu)
Table of Contents
- Introduction
- Compilation & Installation
- Usage
- Example
- Simulation
- Generate Transcript-to-Gene-Map from Trinity Output
- Differential Expression Analysis
- Prior-Enhanced RSEM (pRSEM)
- Authors
- Acknowledgements
- License
<a name="introduction"></a> Introduction
RSEM is a software package for estimating gene and isoform expression levels from RNA-Seq data. The RSEM package provides a user-friendly interface and supports threads for parallel computation of the EM algorithm, single-end and paired-end read data, quality scores, variable-length reads, and RSPD (read start position distribution) estimation. In addition, it provides posterior mean and 95% credibility interval estimates for expression levels. For visualization, it can generate BAM and Wiggle files in both transcript and genomic coordinates. Genomic-coordinate files can be visualized by both the UCSC Genome Browser and the Broad Institute's Integrative Genomics Viewer (IGV). Transcript-coordinate files can be visualized by IGV. RSEM also has its own scripts to generate transcript read depth plots in PDF format. A unique feature of RSEM is that the read depth plots can be stacked, with read depth contributed by unique reads shown in black and read depth contributed by multi-reads shown in red. In addition, models learned from data can also be visualized. Last but not least, RSEM contains a simulator.
<a name="compilation"></a> Compilation & Installation
To compile RSEM, simply run
make
For Cygwin users, run
make cygwin=true
To compile EBSeq, which is included in the RSEM package, run
make ebseq
To install RSEM, simply put the RSEM directory in your environment's PATH variable. Alternatively, run
make install
By default, RSEM executables are installed to /usr/local/bin. You
can change the installation location by setting DESTDIR and/or
prefix variables. The RSEM executables will be installed to
${DESTDIR}${prefix}/bin. The default values of DESTDIR and
prefix are DESTDIR= and prefix=/usr/local. For example,
make install DESTDIR=/home/my_name prefix=/software
will install RSEM executables to /home/my_name/software/bin.
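As a quick sanity check, the install location can be composed by hand before running make install. The sketch below reuses the DESTDIR and prefix values from the example above and prepends the resulting directory to PATH for the current session. The variable name bindir is only illustrative.

```shell
# Values taken from the example above; adjust to your own setup.
DESTDIR=/home/my_name
prefix=/software

# make install places executables in ${DESTDIR}${prefix}/bin.
bindir="${DESTDIR}${prefix}/bin"
echo "RSEM executables would be installed to: ${bindir}"

# Make the installed executables visible in the current shell session.
PATH="${bindir}:${PATH}"
export PATH
```

To make the change permanent, the PATH line would typically go in your shell's startup file.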
Note that make install does not install EBSeq-related scripts,
such as rsem-generate-ngvector, rsem-run-ebseq, and
rsem-control-fdr. However, rsem-generate-data-matrix, which generates
the count matrix for differential expression analysis, is installed.
Prerequisites
A C++ compiler, Perl, and R must be installed.
To use the --gff3 option of rsem-prepare-reference, Python must also
be installed.
To take advantage of RSEM's built-in support for the Bowtie/Bowtie 2/STAR/HISAT2 alignment programs, you must have Bowtie/Bowtie 2/STAR/HISAT2 installed.
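Before compiling, it can help to see which of these tools are already on your PATH. The following is a minimal sketch, not part of RSEM. The helper name check_tools is hypothetical, and the executable names g++ and python are assumptions, since your system may provide a different compiler or only python3.

```shell
# check_tools NAME...: report which of the given executables are on PATH.
# Names that are not found are collected in the variable "missing".
check_tools() {
  missing=""
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found:   $tool"
    else
      echo "missing: $tool"
      missing="${missing}${missing:+ }${tool}"
    fi
  done
}

# Required tools first, then the optional aligners.
check_tools g++ perl R python bowtie bowtie2 STAR hisat2
```

Only the aligner you actually plan to use needs to be present.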
<a name="usage"></a> Usage
I. Preparing Reference Sequences
RSEM can extract reference transcripts from a genome if you provide it with gene annotations in a GTF/GFF3 file. Alternatively, you can provide RSEM with transcript sequences directly.
Please note that GTF files generated from the UCSC Table Browser do not contain isoform-gene relationship information. However, if you use the UCSC Genes annotation track, this information can be recovered by downloading the knownIsoforms.txt file for the appropriate genome.
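One way to supply the recovered isoform-gene relationships is the --transcript-to-gene-map option of rsem-prepare-reference (check rsem-prepare-reference --help for your version). The sketch below only assembles and prints the command, and runs it when the tool and all inputs are present. The file names ucsc_genes.gtf and genome.fa and the reference name ref/ucsc_example are hypothetical placeholders.

```shell
# Hypothetical file names; substitute your own downloads.
gtf=ucsc_genes.gtf
map=knownIsoforms.txt
genome=genome.fa
ref_name=ref/ucsc_example

# Assemble the command as positional parameters so it can be inspected first.
set -- rsem-prepare-reference \
  --gtf "$gtf" \
  --transcript-to-gene-map "$map" \
  --bowtie \
  "$genome" "$ref_name"
ucsc_cmd="$*"
echo "Command: $ucsc_cmd"

# Run only when the tool and all input files are actually present.
if command -v "$1" >/dev/null 2>&1 && [ -f "$gtf" ] && [ -f "$map" ] && [ -f "$genome" ]; then
  mkdir -p "$(dirname "$ref_name")"   # ensure the output directory exists
  "$@"
else
  echo "Skipping run: rsem-prepare-reference or an input file is missing."
fi
```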
To prepare the reference sequences, you should run the
rsem-prepare-reference program. Run
rsem-prepare-reference --help
to get usage information or visit the rsem-prepare-reference documentation page.
<a name="built"></a> Build RSEM references using RefSeq, Ensembl, or GENCODE annotations
RefSeq and Ensembl are two frequently used annotations. For human and mouse, GENCODE annotations are also available. In this section, we show how to build RSEM references using these annotations. Note that it is important to pair the genome with the annotation file from the same annotation source. In addition, we recommend using the primary assemblies of genomes. Without loss of generality, we use the human genome as an example and also build Bowtie indices.
For RefSeq, the genome and annotation file in GFF3 format can be found on the RefSeq genomes FTP site:
ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/
For example, the human genome and GFF3 file are located in the subdirectory
vertebrate_mammalian/Homo_sapiens/all_assembly_versions/GCF_000001405.31_GRCh38.p5. GCF_000001405.31_GRCh38.p5
was the latest annotation version when this section was written.
Download and decompress the genome and annotation files to your working directory:
ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/all_assembly_versions/GCF_000001405.31_GRCh38.p5/GCF_000001405.31_GRCh38.p5_genomic.fna.gz
ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/all_assembly_versions/GCF_000001405.31_GRCh38.p5/GCF_000001405.31_GRCh38.p5_genomic.gff.gz
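The download step could be scripted roughly as follows. This is a sketch that assumes curl and gunzip are available. The helper name fetch_and_gunzip and the variables base and stem are illustrative, and the download calls are left commented out because the genome file is large.

```shell
# Directory and file stem taken from the two URLs above.
base=ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/all_assembly_versions/GCF_000001405.31_GRCh38.p5
stem=GCF_000001405.31_GRCh38.p5_genomic

# fetch_and_gunzip URL: download URL into the current directory, then decompress it.
fetch_and_gunzip() {
  curl -O "$1" && gunzip "$(basename "$1")"
}

genome_url="${base}/${stem}.fna.gz"
gff_url="${base}/${stem}.gff.gz"
echo "$genome_url"
echo "$gff_url"

# Uncomment to download and decompress into the working directory:
# fetch_and_gunzip "$genome_url"
# fetch_and_gunzip "$gff_url"
```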
GCF_000001405.31_GRCh38.p5_genomic.fna contains all top-level
sequences, including patches and haplotypes. To obtain the primary
assembly, run the following RSEM Python script:
rsem-refseq-extract-primary-assembly GCF_000001405.31_GRCh38.p5_genomic.fna GCF_000001405.31_GRCh38.p5_genomic.primary_assembly.fna
Then type the following command to build RSEM references:
rsem-prepare-reference --gff3 GCF_000001405.31_GRCh38.p5_genomic.gff \
--trusted-sources BestRefSeq,Curated\ Genomic \
--bowtie \
GCF_000001405.31_GRCh38.p5_genomic.primary_assembly.fna \
ref/human_refseq
In the above command, --trusted-sources tells RSEM to extract
transcripts only from RefSeq sources such as BestRefSeq or Curated Genomic. By
default, RSEM trusts all sources. There is also a
--gff3-RNA-patterns option, whose default is mRNA. Setting
--gff3-RNA-patterns mRNA,rRNA allows RSEM to extract all mRNAs
and rRNAs from the genome. See the rsem-prepare-reference documentation page
for more details.
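To illustrate how --gff3-RNA-patterns combines with the other options, here is a sketch that wraps the RefSeq command in a shell function taking the pattern list as its argument. The function name prepare_refseq is hypothetical. The file names and reference name are those used in the command above.

```shell
# prepare_refseq PATTERNS: build the RefSeq reference, extracting only the
# RNA feature types named in the comma-separated PATTERNS list.
prepare_refseq() {
  patterns=$1
  rsem-prepare-reference --gff3 GCF_000001405.31_GRCh38.p5_genomic.gff \
    --gff3-RNA-patterns "$patterns" \
    --trusted-sources "BestRefSeq,Curated Genomic" \
    --bowtie \
    GCF_000001405.31_GRCh38.p5_genomic.primary_assembly.fna \
    ref/human_refseq
}
```

Calling prepare_refseq mRNA,rRNA would then request both mRNAs and rRNAs, while prepare_refseq mRNA matches the default behavior.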
Because the gene and transcript IDs (e.g. gene1000, rna28655)
extracted from RefSeq GFF3 files are hard to understand, it is
recommended to turn on the --append-names option in
rsem-calculate-expression for better interpretation of
quantification results.
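A hedged sketch of such a run is shown below. As we understand it, --append-names appends gene and transcript names to the IDs in the results files; confirm this with rsem-calculate-expression --help. The read file sample_reads.fq and the sample name sample_refseq are hypothetical, and the command runs only if the tool and the read file are present.

```shell
reads=sample_reads.fq        # hypothetical single-end FASTQ file
ref_name=ref/human_refseq    # reference built from the RefSeq GFF3 file
sample=sample_refseq         # hypothetical prefix for the output files

# Assemble and print the command before deciding whether to run it.
set -- rsem-calculate-expression --append-names "$reads" "$ref_name" "$sample"
append_names_cmd="$*"
echo "Command: $append_names_cmd"

# Run only when the tool and the read file exist.
if command -v "$1" >/dev/null 2>&1 && [ -f "$reads" ]; then
  "$@"
fi
```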
For Ensembl, the genome and annotation files can be found on the Ensembl FTP site.
Download and decompress the human genome and GTF files:
ftp://ftp.ensembl.org/pub/release-83/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
ftp://ftp.ensembl.org/pub/release-83/gtf/homo_sapiens/Homo_sapiens.GRCh38.83.gtf.gz
Then use the following command to build RSEM references:
rsem-prepare-reference --gtf Homo_sapiens.GRCh38.83.gtf \
--bowtie \
Homo_sapiens.GRCh38.dna.primary_assembly.fa \
ref/human_ensembl
If you want to use a GFF3 file instead, which is unnecessary and not
recommended, add the option --gff3-RNA-patterns transcript,
because mRNA is replaced by transcript in Ensembl GFF3 files.
GENCODE provides only human and mouse annotations. The genome and annotation files can be found on the GENCODE website.
Download and decompress the human genome and GTF files:
ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_24/GRCh38.primary_assembly.genome.fa.gz
ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_24/gencode.v24.annotation.gtf.gz
Then type the following command:
rsem-prepare-reference --gtf gencode.v24.annotation.gtf \
--bowtie \
GRCh38.primary_assembly.genome.fa \
ref/human_gencode
As with the Ensembl annotation, if you want to use GFF3 files (not
recommended), add the option --gff3-RNA-patterns transcript.
<a name="untypical"></a> Build RSEM references for untypical organisms
For untypical organisms, such as viruses, you may have a GFF3 file that contains only genes and no transcripts. In that case, turn on --gff3-genes-as-transcripts so that RSEM treats each gene as a unique transcript.
Here is an example command:
rsem-prepare-reference --gff3 virus.gff \
--gff3-genes-as-transcripts \
--bowtie \
virus.genome.fa \
ref/virus
II. Calculating Expression Values
To calculate expression values, you should run the
rsem-calculate-expression program. Run
rsem-calculate-expression --help
to get usage information or visit the rsem-calculate-expression documentation page.
Calculating expression values from single-end data
For single-end models, users have the option of providing a fragment
length distribution via the --fragment-length-mean and
--fragment-length-sd options. The specification of an accurate fragment
length distribution is important for the accuracy of expression level
estimates from single-end data. If the fragment length mean and sd are
not provided, RSEM will not take a fragment length distribution into
consideration.
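For single-end reads, the two options are passed together. The following is a minimal sketch, assuming a single-end FASTQ file and a reference built with Bowtie indices. The function name quantify_single_end and its argument order are illustrative assumptions.

```shell
# quantify_single_end READS REF SAMPLE MEAN SD
# Run rsem-calculate-expression on single-end reads with an explicit
# fragment length distribution (mean and standard deviation).
quantify_single_end() {
  reads=$1
  ref_name=$2
  sample=$3
  mean=$4
  sd=$5
  rsem-calculate-expression \
    --fragment-length-mean "$mean" \
    --fragment-length-sd "$sd" \
    "$reads" "$ref_name" "$sample"
}
```

A call might look like quantify_single_end reads.fq ref/human_ensembl sample1 200 50, where the file name, sample name, and the values 200 and 50 are placeholders only. Use the mean and standard deviation measured for your own library.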
Using an alternative aligner
By default, RSEM automates the alignment of reads to reference
transcripts using the Bowtie aligner. Turning on --bowtie2 for
rsem-prepare-reference and rsem-calculate-expression will allow
RSEM to use the Bowtie 2 alignment program inste
