SkillAgentSearch skills...

Ballgown

Bioconductor package "ballgown", devel version. Isoform-level differential expression analysis in R.

Install / Use

/learn @alyssafrazee/Ballgown
About this skill

Quality Score

0/100

Supported Platforms

Universal

README

Build Status

Introduction and Preprocessing

Ballgown is a software package designed to facilitate flexible differential expression analysis of RNA-Seq data. It also provides functions to organize, visualize, and analyze the expression measurements for your transcriptome assembly.

Before using the Ballgown R package, a few preprocessing steps are necessary:

  1. RNA-Seq reads should be aligned to a reference genome.
  2. A transcriptome should be assembled, or a reference transcriptome should be downloaded.
  3. Expression for the features (transcript, exon, and intron junctions) in the transcriptome should be estimated in a Ballgown readable format.

Two sample pipelines for preprocessing are as follows:

  1. Pipeline 1: TopHat2 (1) + Stringtie (2,3)
  2. TopHat2 [<a href="http://bioinformatics.oxfordjournals.org/content/25/9/1105.abstract">Trapnell et al. (2009)</a>] is built on the ultrafast short read mapping program Bowtie and aligns RNA-Seq reads to a genome while identifying exonic splice junctions. Sample command: tophat2 -G reference.gff -o outputDirectory -p 6 referenceIndex reads
  3. Stringtie [<a href="http://www.nature.com/nbt/journal/v33/n3/full/nbt.3122.html">M. Pertea et al. (2015)</a>] is a highly efficient assembler for RNA-Seq alignments using a novel network flow algorithm. It simultaneously assembles and quantifies expression levels for the features of the transcriptome in a Ballgown readable format (by using the option -B). One command to Stringtie satisfies steps 2 and 3 above. Sample command: stringtie -B -G reference.gff -p 6 accepted_hits.bam -o stringtie.gff
  4. Pipeline 2: TopHat2 (1) + Cufflinks (2) + Tablemaker (3)
  5. Tophat2 produces alignments as noted above.
  6. Cufflinks [<a href="http://dx.doi.org/10.1038/nbt.1621">Trapnell et al. (2010)</a>] also assembles transcriptomes from RNA-Seq data and quantifies their expression. Sample command: cufflinks -g reference.gff -o outputDirectory accepted_hits.bam
  7. Tablemaker calls Cufflinks to estimate feature expressions in a Ballgown readable format. Tablemaker access and instructions can be found here.

Installation

Start R and run:

if (!requireNamespace("BiocManager", quietly=TaRUE))
    install.packages("BiocManager")
BiocManager::install("ballgown")

Ballgown readable expression output

The files that Stringtie and Tablemaker produce for Ballgown to load is as follows:

  • e_data.ctab: exon-level expression measurements. One row per exon. Columns are e_id (numeric exon id), chr, strand, start, end (genomic location of the exon), and the following expression measurements for each sample:
    • rcount: reads overlapping the exon
    • ucount: uniquely mapped reads overlapping the exon
    • mrcount: multi-map-corrected number of reads overlapping the exon
    • cov average per-base read coverage
    • cov_sd: standard deviation of per-base read coverage
    • mcov: multi-map-corrected average per-base read coverage
    • mcov_sd: standard deviation of multi-map-corrected per-base coverage
  • i_data.ctab: intron- (i.e., junction-) level expression measurements. One row per intron. Columns are i_id (numeric intron id), chr, strand, start, end (genomic location of the intron), and the following expression measurements for each sample:
    • rcount: number of reads supporting the intron
    • ucount: number of uniquely mapped reads supporting the intron
    • mrcount: multi-map-corrected number of reads supporting the intron
  • t_data.ctab: transcript-level expression measurements. One row per transcript. Columns are:
    • t_id: numeric transcript id
    • chr, strand, start, end: genomic location of the transcript
    • t_name: Cufflinks-generated transcript id
    • num_exons: number of exons comprising the transcript
    • length: transcript length, including both exons and introns
    • gene_id: gene the transcript belongs to
    • gene_name: HUGO gene name for the transcript, if known
    • cov: per-base coverage for the transcript (available for each sample)
    • FPKM: Cufflinks-estimated FPKM for the transcript (available for each sample)
  • e2t.ctab: table with two columns, e_id and t_id, denoting which exons belong to which transcripts. These ids match the ids in the e_data and t_data tables.
  • i2t.ctab: table with two columns, i_id and t_id, denoting which introns belong to which transcripts. These ids match the ids in the i_data and t_data tables.

Loading data into R

The default directory structure produced by Stringtie and Tablemaker should mirror the extdata folder in the Ballgown pacakge:

extdata/
    sample01/
        e2t.ctab
        e_data.ctab
        i2t.ctab
        i_data.ctab
        t_data.ctab
    sample02/
        e2t.ctab
        e_data.ctab
        i2t.ctab
        i_data.ctab
        t_data.ctab
    ...
    sample20/
        e2t.ctab
        e_data.ctab
        i2t.ctab
        i_data.ctab
        t_data.ctab

Data is loaded using the ballgown function.

If your data is stored in directories matching the above structure (one root folder, subfolders named by sample, and .ctab files in the subfolders), you can use the dataDir and samplePattern arguments to load the data. samplePattern takes a regular expressions specifying the subfolders that should be included in the ballgown object:

library(ballgown)
data_directory = system.file('extdata', package='ballgown') # automatically finds ballgown's installation directory
# examine data_directory:
data_directory
## [1] "/Library/Frameworks/R.framework/Versions/3.1/Resources/library/ballgown/extdata"
# make the ballgown object:
bg = ballgown(dataDir=data_directory, samplePattern='sample', meas='all')
bg
## ballgown instance with 100 transcripts and 20 samples

If your data is stored in a directory structure other than the one specified above, you can use the samples argument in the ballgown function: samples should be a vector (1-d array) with one entry per sample, where the entry gives the path to the folder containing that sample's .ctab files.

<!-- If you choose the `samples` option, you will also need to provide a vector called `sampleNames` that corresponds to `samples` and gives each sample a unique ID. Some example code to do this is: ```R sample_IDs = c('oneSample', 'anotherSample', 'aThirdSample') sample_paths = paste0('/home/', sample_IDs, '/ballgown') sample_paths ## [1] "/home/oneSample/ballgown" ## [2] "/home/anotherSample/ballgown" ## [3] "/home/aThirdSample/ballgown" bg = ballgown(samples=sample_paths, sampleNames=sample_IDs, meas='all') ``` -->

The result from either of these approaches is an object of class ballgown (named bg in these examples).

In the rest of this document, we use bg to refer to the first example, where samples are named sample01 through sample20.

A note for large experiments (with many samples or with large genomes): loading the data might require a lot of time and memory. In these cases, it's often useful to do the data loading in non-interactive mode. More specifically, you could create a script called load.R that contains these lines:

library(ballgown)
data_directory = system.file('extdata', package='ballgown') 
bg = ballgown(dataDir=data_directory, samplePattern='sample', meas='all')
save(bg, file='bg.rda')

You could then run this script non-interactively using R CMD BATCH: from the command line, run:

R CMD BATCH load.R

This may take some time, but when it finishes, the file bg.rda will be saved in the current directory, and you can read it back into R using the load() function. Rda files are usually only a few Gb on disk, even for large experiments. It is also possible to load only a subset of all the expression measurements by changing the meas argument to the ballgown function. For example, to only load transcript-level FPKMs, set meas = 'FPKM' and to load average coverage values and read counts, set meas=c('cov', 'rcount').

See ?ballgown for detailed information on creating Ballgown objects.

Accessing assembly data

A ballgown object has six slots: structure, expr, indexes, dirs, mergedDate, and meas.

structure

The structure slot depends heavily on the GenomicRanges Bioconductor package (<a href="http://dx.doi.org/10.1371/journal.pcbi.1003118">Lawrence et al. (2013)</a>). The slot specifies the structure, i.e., genomic locations and relationships between exons, introns, and transcripts, of the transcriptome assembly. It is convenient to represent exons and introns as intervals and to represent transcripts as a set of intervals (exons), so assembled exons and introns are available as GRanges objects, and the assembled transcripts are available as a GRangesList object. This means that useful range operations, such as findOverlaps and reduce, are readily available for assembled features.

Exon, intron, and transcript structures are easily extracted from the main ballgown object:

structure(bg)$exon
## GRanges object with 633 ranges and 2 metadata columns:
##         seqnames               ranges strand   |        id transcripts
##            <Rle>            <IRanges>  <Rle>   | <integer> <character>
##     [1]       18 [24412069, 24412331]      *   |        12          10
##     [2]       22 [17308271, 17308950]      +   |        55          25
##     [3]       22 [17309432, 17310226]      +   |        56          25
##     [4]       22 [18121428, 18121652]      +   |        88          35
##     [5]       22 [18138428, 18138598]      +   |        89          35
##     ...      ...                  ...    ... ..

Related Skills

View on GitHub
GitHub Stars148
CategoryDevelopment
Updated4mo ago
Forks55

Languages

R

Security Score

77/100

Audited on Nov 21, 2025

No findings