ggcoverage - Visualize and annotate omics coverage with ggplot2


Introduction

The goal of ggcoverage is to visualize coverage tracks from genomics, transcriptomics or proteomics data. It contains functions to load data from BAM, BigWig, BedGraph, txt, or xlsx files, create genome/protein coverage plots, and add various annotations including base and amino acid composition, GC content, copy number variation (CNV), genes, transcripts, ideograms, peak highlights, HiC contact maps, contact links and protein features. It is based on and integrates well with ggplot2.

It contains three main parts:

Load the data: ggcoverage can load BAM, BigWig (.bw), BedGraph and txt/xlsx files from various omics data, including WGS, RNA-seq, ChIP-seq, ATAC-seq, proteomics, etc.
Create omics coverage plot
Add annotations: ggcoverage supports ten different annotations:
- base and amino acid annotation: Visualize genome coverage at single-nucleotide level with bases and amino acids.
- GC annotation: Visualize genome coverage with GC content
- CNV annotation: Visualize genome coverage with copy number variation (CNV)
- gene annotation: Visualize genome coverage across genes
- transcript annotation: Visualize genome coverage across different transcripts
- ideogram annotation: Visualize the position of the displayed region on the whole chromosome
- peak annotation: Visualize genome coverage with identified peaks
- contact map annotation: Visualize genome coverage with Hi-C contact map
- link annotation: Visualize genome coverage with contacts
- protein feature annotation: Visualize protein coverage with features

Installation

ggcoverage is an R package distributed as part of the CRAN repository. To install the package, start R and enter one of the following commands:

# install via CRAN (not yet available)
install.packages("ggcoverage")

# OR install via GitHub
install.packages("remotes")
remotes::install_github("showteeth/ggcoverage")

In general, it is recommended to install from the GitHub repository, which is updated more regularly.

Once ggcoverage is installed, it can be loaded like every other package:

library("ggcoverage")

Manual

ggcoverage provides two vignettes:

detailed manual: step-by-step usage
customize the plot: customize the plot and add additional layers

RNA-seq data

Load the data

The RNA-seq data used here is from Transcription profiling by high throughput sequencing of HNRNPC knockdown and control HeLa cells. We select four samples to use as examples: ERR127307_chr14, ERR127306_chr14, ERR127303_chr14 and ERR127302_chr14. All BAM files were converted to BigWig files with deepTools.

Load metadata:

# load metadata
meta_file <-
  system.file("extdata", "RNA-seq", "meta_info.csv", package = "ggcoverage")
sample_meta <- read.csv(meta_file)
sample_meta
#>        SampleName    Type Group
#> 1 ERR127302_chr14 KO_rep1    KO
#> 2 ERR127303_chr14 KO_rep2    KO
#> 3 ERR127306_chr14 WT_rep1    WT
#> 4 ERR127307_chr14 WT_rep2    WT
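
For your own data, a metadata table with the same three columns can be built directly in R. The following is a minimal sketch: the sample names and the output file name are hypothetical, and it assumes (based on the table above) that SampleName identifies each track file while Type and Group carry the labels used later for faceting and grouping.

```r
# Hypothetical sample names; replace them with names matching your track files.
my_meta <- data.frame(
  SampleName = c("sampleA", "sampleB", "sampleC", "sampleD"),
  Type = c("KO_rep1", "KO_rep2", "WT_rep1", "WT_rep2"),
  Group = c("KO", "KO", "WT", "WT")
)

# Basic sanity checks: every sample and every Type label should be unique.
stopifnot(
  !anyDuplicated(my_meta$SampleName),
  !anyDuplicated(my_meta$Type)
)

# Round-trip through a CSV file so it can be loaded with read.csv() as above.
meta_path <- file.path(tempdir(), "my_meta_info.csv")
write.csv(my_meta, meta_path, row.names = FALSE)
my_meta_csv <- read.csv(meta_path)
```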

Load track files:

# track folder
track_folder <- system.file("extdata", "RNA-seq", package = "ggcoverage")
# load bigwig file
track_df <- LoadTrackFile(
  track.folder = track_folder,
  format = "bw",
  region = "chr14:21,677,306-21,737,601",
  extend = 2000,
  meta.info = sample_meta
)
# check data
head(track_df)
#>   seqnames    start      end width strand score    Type Group
#> 1    chr14 21675306 21675950   645      *     0 KO_rep1    KO
#> 2    chr14 21675951 21676000    50      *     1 KO_rep1    KO
#> 3    chr14 21676001 21676100   100      *     2 KO_rep1    KO
#> 4    chr14 21676101 21676150    50      *     1 KO_rep1    KO
#> 5    chr14 21676151 21677100   950      *     0 KO_rep1    KO
#> 6    chr14 21677101 21677200   100      *     2 KO_rep1    KO
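
The first start value in the output (21675306) equals the region start minus the extend value, which suggests how region and extend combine. The sketch below reproduces that arithmetic in base R. parse_region is a hypothetical helper written for illustration, not a ggcoverage function, and it assumes the extension is applied symmetrically to both ends.

```r
# Hypothetical helper (not part of ggcoverage): parse a region string such as
# "chr14:21,677,306-21,737,601" and widen it by `extend` bases on each side.
parse_region <- function(region, extend = 0) {
  parts <- strsplit(region, ":", fixed = TRUE)[[1]]
  # Drop thousands separators, then split "start-end".
  coords <- strsplit(gsub(",", "", parts[2], fixed = TRUE), "-", fixed = TRUE)[[1]]
  start <- as.integer(coords[1])
  end <- as.integer(coords[2])
  list(
    chr = parts[1],
    start = max(start - extend, 1), # never go below position 1
    end = end + extend
  )
}

plot_region <- parse_region("chr14:21,677,306-21,737,601", extend = 2000)
plot_region$start
#> [1] 21675306
plot_region$end
#> [1] 21739601
```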

Prepare mark region:

# create mark region
mark_region <- data.frame(
  start = c(21678900, 21732001, 21737590),
  end = c(21679900, 21732400, 21737650),
  label = c("M1", "M2", "M3")
)
# check data
mark_region
#>      start      end label
#> 1 21678900 21679900    M1
#> 2 21732001 21732400    M2
#> 3 21737590 21737650    M3
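
Before plotting, it can help to confirm that every mark region lies inside the plotted window. The check below is a minimal base R sketch; the window bounds are computed from the region and extend values used when loading the tracks, under the assumption that extend widens the region on both sides. Note that M3 ends at 21737650, past the unextended region end (21737601), so it fits only because of the extension.

```r
# Same mark regions as above, recreated so the block runs on its own.
mark_region <- data.frame(
  start = c(21678900, 21732001, 21737590),
  end = c(21679900, 21732400, 21737650),
  label = c("M1", "M2", "M3")
)

# Plotted window: region chr14:21,677,306-21,737,601 with extend = 2000
# (assumed to be applied to both ends).
window_start <- 21677306 - 2000
window_end <- 21737601 + 2000

# A mark region is valid if it is well-formed and fully inside the window.
inside <- mark_region$start <= mark_region$end &
  mark_region$start >= window_start &
  mark_region$end <= window_end

# Labels of regions that would fall outside the plot (none for this example).
outside_labels <- mark_region$label[!inside]
```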

Load GTF

To add gene annotation, the GTF file should contain gene_type and gene_name attributes in column 9; to add transcript annotation, the GTF file should contain a transcript_name attribute in column 9.

gtf_file <-
  system.file("extdata", "used_hg19.gtf", package = "ggcoverage")
gtf_gr <- rtracklayer::import.gff(con = gtf_file, format = "gtf")
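
The attribute requirements above can be checked before plotting. The following sketch uses a hypothetical helper, missing_gtf_attributes, which is not part of ggcoverage; it only compares a vector of column names against the attributes named above. The example column names are made up, and the last (commented) line assumes the S4Vectors package is installed.

```r
# Hypothetical helper (not part of ggcoverage): report which of the required
# attribute columns are absent from a set of column names.
missing_gtf_attributes <- function(attr_names,
                                   annotation = c("gene", "transcript")) {
  annotation <- match.arg(annotation)
  required <- switch(
    annotation,
    gene = c("gene_type", "gene_name"),
    transcript = "transcript_name"
  )
  setdiff(required, attr_names)
}

# Example with made-up column names, as they might appear after import.
example_cols <- c("source", "type", "gene_id", "gene_name", "transcript_name")
missing_gtf_attributes(example_cols, "gene")
#> [1] "gene_type"
missing_gtf_attributes(example_cols, "transcript")
#> character(0)

# With the imported object from above (assumes S4Vectors is installed):
# missing_gtf_attributes(colnames(S4Vectors::mcols(gtf_gr)), "gene")
```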

Basic coverage

The basic coverage plot has two types:

facet: Create a subplot for every track (specified by facet.key). This is the default.
joint: Visualize all tracks in a single plot.

Joint view

Create a line plot for every sample (facet.key = "Type") and color the lines by sample (group.key = "Type"):

basic_coverage <- ggcoverage(
  data = track_df,
  plot.type = "joint",
  facet.key = "Type",
  group.key = "Type",
  mark.region = mark_region,
  range.position = "out"
)

basic_coverage

Create a group-average line plot (the sample is indicated by facet.key = "Type", the group by group.key = "Group"):

basic_coverage <- ggcoverage(
  data = track_df,
  plot.type = "joint",
  facet.key = "Type",
  group.key = "Group",
  joint.avg = TRUE,
  mark.region = mark_region,
  range.position = "out"
)

basic_coverage
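
Conceptually, a group average replaces the replicate tracks of each group with their mean score. The toy example below illustrates that idea in base R; it is not ggcoverage's internal computation, which may differ, and it assumes all samples share identical intervals. The data values are invented for illustration.

```r
# Toy data (invented): two intervals, four samples in two groups.
toy <- data.frame(
  start = rep(c(1, 51), times = 4),
  end = rep(c(50, 100), times = 4),
  score = c(0, 2, 2, 4, 1, 1, 3, 5),
  Type = rep(c("KO_rep1", "KO_rep2", "WT_rep1", "WT_rep2"), each = 2),
  Group = rep(c("KO", "WT"), each = 4)
)

# Mean score per interval within each group (replicates are averaged).
group_avg <- aggregate(score ~ start + end + Group, data = toy, FUN = mean)

# Look up one value: KO replicates score 0 and 2 on the first interval.
ko_first <- group_avg$score[group_avg$Group == "KO" & group_avg$start == 1]
```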

Facet view

basic_coverage <- ggcoverage(
  data = track_df,
  plot.type = "facet",
  mark.region = mark_region,
  range.position = "out"
)

basic_coverage

Custom Y-axis style

Place the Y-axis range label inside or outside the plot region with range.position:

basic_coverage <- ggcoverage(
  data = track_df,
  plot.type = "facet",
  mark.region = mark_region,
  range.position = "in"
)

basic_coverage

Use a shared or free Y-axis scale with facet.y.scale:

basic_coverage <- ggcoverage(
  data = track_df,
  plot.type = "facet",
  mark.region = mark_region,
  range.position = "in",
  facet.y.scale = "fixed"
)

basic_coverage

Add gene annotation

- The default behavior is to draw genes (transcripts), exons and UTRs with different line widths, which can be adjusted using the gene.size, exon.size and utr.size parameters.
- The frequency of the intermittent arrows (light color) can be adjusted using the arrow.num and arrow.gap parameters.
- Genomic features are colored by strand by default, which can be changed using the color.by parameter.

basic_coverage +
  geom_gene(gtf.gr = gtf_gr)

Add transcript annotation

In “loose” style (default style; each transcript occupies one line):

basic_coverage +
  geom_transcript(gtf.gr = gtf_gr, label.vjust = 1.5)

In “tight” style (attempts to place non-overlapping transcripts on one line):

basic_coverage +
  geom_transcript(
    gtf.gr = gtf_gr,
    overlap.style = "tight",
    label.vjust = 1.5
  )
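
To illustrate what packing non-overlapping transcripts onto one line can mean, here is a minimal greedy sketch in base R. assign_rows is a hypothetical helper and is not ggcoverage's implementation; the transcript coordinates are invented. Each interval, taken in order of start position, goes to the first row whose last interval ends before it starts.

```r
# Hypothetical helper, not ggcoverage's implementation: greedy row assignment
# so that intervals sharing a row never overlap.
assign_rows <- function(start, end, gap = 0) {
  ord <- order(start, end)
  row_end <- numeric(0) # rightmost end position used so far in each row
  rows <- integer(length(start))
  for (i in ord) {
    free <- which(row_end + gap < start[i])
    if (length(free) > 0) {
      r <- free[1] # reuse the first row with enough space
    } else {
      r <- length(row_end) + 1L # otherwise open a new row
    }
    row_end[r] <- end[i]
    rows[i] <- r
  }
  rows
}

# Invented transcripts: tx1/tx2 overlap, tx3/tx4 overlap, but tx3 and tx4
# both start after tx1 and tx2 have ended.
transcripts <- data.frame(
  name = c("tx1", "tx2", "tx3", "tx4"),
  start = c(100, 150, 400, 450),
  end = c(300, 350, 600, 500)
)
transcripts$row <- assign_rows(transcripts$start, transcripts$end)
transcripts$row
#> [1] 1 2 1 2
```

Here four transcripts fit on two lines, whereas a one-transcript-per-line layout would need four.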

Add ideogram

The ideogram is an overview plot showing where the displayed region lies on its chromosome. The plotting of the ideogram is implemented by the ggbio package. This package needs to be installed separately (it is only ‘Suggested’ by ggcoverage).

library(ggbio)
#> Loading required package: BiocGenerics
#> 
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:stats':
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#> 
#>     anyDuplicated, aperm, append, as.data.frame, basename, cbind,
#>     colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
#>     get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
#>     match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
#>     Position, rank, rbind, Reduce, rownames, sapply, setdiff, table,
#>     tapply, union, unique, unsplit, which.max, which.min
#> Loading required package: ggplot2
#> Registere
