Ggupset
Combination matrix axis for 'ggplot2' to create 'UpSet' plots
ggupset
Plot a combination matrix instead of the standard x-axis and create UpSet plots with ggplot2.
<img src="man/figures/README-violinexample-1.png" width="70%" />
Installation
You can install the released version of ggupset from CRAN with:
# Download package from CRAN
install.packages("ggupset")
# Or get the latest version directly from GitHub
devtools::install_github("const-ae/ggupset")
Example
This is a basic example which shows you how to solve a common problem:
# Load helper packages
library(ggplot2)
library(tidyverse, warn.conflicts = FALSE)
#> ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
#> ✔ dplyr 1.1.4 ✔ readr 2.1.5
#> ✔ forcats 1.0.0 ✔ stringr 1.5.1
#> ✔ lubridate 1.9.3 ✔ tibble 3.2.1
#> ✔ purrr 1.0.2 ✔ tidyr 1.3.1
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Load my package
library(ggupset)
In the following I will work with a tidy version of the movies dataset
from the ggplot2movies package (formerly part of ggplot2). It contains a
list of movies from IMDB, their release year, and other general
information on each movie. It also includes a list-column that records
the genres a movie belongs to (Action, Drama, Romance, etc.).
tidy_movies
#> # A tibble: 50,000 × 10
#> title year length budget rating votes mpaa Genres stars percent_rating
#> <chr> <int> <int> <int> <dbl> <int> <chr> <list> <dbl> <dbl>
#> 1 Ei ist ei… 1993 90 NA 8.4 15 "" <chr> 1 4.5
#> 2 Hamos sto… 1985 109 NA 5.5 14 "" <chr> 1 4.5
#> 3 Mind Bend… 1963 99 NA 6.4 54 "" <chr> 1 0
#> 4 Trop (peu… 1998 119 NA 4.5 20 "" <chr> 1 24.5
#> 5 Crystania… 1995 85 NA 6.1 25 "" <chr> 1 0
#> 6 Totale!, … 1991 102 NA 6.3 210 "" <chr> 1 4.5
#> 7 Visibleme… 1995 100 NA 4.6 7 "" <chr> 1 24.5
#> 8 Pang shen… 1976 85 NA 7.4 8 "" <chr> 1 0
#> 9 Not as a … 1955 135 2e6 6.6 223 "" <chr> 1 4.5
#> 10 Autobiogr… 1994 87 NA 7.4 5 "" <chr> 1 0
#> # ℹ 49,990 more rows
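For readers who have not met list-columns before, the following minimal sketch builds a tiny table of the same shape by hand. The titles, years, and genres are made up for illustration and are not taken from tidy_movies.

```r
# Hypothetical toy data: three made-up movies
toy_movies <- data.frame(
  title = c("Movie A", "Movie B", "Movie C"),
  year  = c(1990L, 2001L, 2015L),
  stringsAsFactors = FALSE
)

# A list-column holds one character vector per row,
# so rows may contain zero, one, or several genres
toy_movies$Genres <- list(
  c("Action", "Drama"),
  character(0),          # a movie without any genre annotation
  "Comedy"
)

# Number of genres per movie
lengths(toy_movies$Genres)
```

The Genres column of tidy_movies has this structure, which is why it prints as <chr> entries in the tibble above.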
ggupset makes it easy to get an immediate impression of how many movies
fall into each genre and each combination of genres. For example, there
are slightly more than 1200 Dramas in the set, more than 1000 movies that
don’t belong to any genre, and ~170 that are both Comedy and Drama.
tidy_movies %>%
distinct(title, year, length, .keep_all=TRUE) %>%
ggplot(aes(x=Genres)) +
geom_bar() +
scale_x_upset(n_intersections = 20)
#> Warning: Removed 100 rows containing non-finite outside the scale range
#> (`stat_count()`).
<img src="man/figures/README-unnamed-chunk-2-1.png" width="70%" />
Adding Numbers on top
The best feature of ggupset is that it plays well with existing
tricks from ggplot2. For example, you can easily add the counts
on top of the bars with this trick from
Stack Overflow:
tidy_movies %>%
distinct(title, year, length, .keep_all=TRUE) %>%
ggplot(aes(x=Genres)) +
geom_bar() +
geom_text(stat='count', aes(label=after_stat(count)), vjust=-1) +
scale_x_upset(n_intersections = 20) +
scale_y_continuous(breaks = NULL, limits = c(0, 1350), name = "")
#> Warning: Removed 100 rows containing non-finite outside the scale range
#> (`stat_count()`).
#> Removed 100 rows containing non-finite outside the scale range
#> (`stat_count()`).
<img src="man/figures/README-unnamed-chunk-3-1.png" width="70%" />
Reshaping quadratic data
Often enough, the raw data you start with is not in such a neat
tidy shape. But a tidy shape is a prerequisite for making ggupset plots,
so how do you get from a wide dataset to a useful one? And how do you
actually create a list-column?
Imagine we measured, for a set of genes, whether each one is a member of certain pathways. A gene can be a member of multiple pathways and we want to see which pathways have a large overlap. Unfortunately, we didn’t record the data in a tidy format but as a simple matrix.
A fictional dataset of this type is provided as the
gene_pathway_membership variable
data("gene_pathway_membership")
gene_pathway_membership[, 1:7]
#> Aco1 Aco2 Aif1 Alox8 Amh Bmpr1b Cdc25a
#> Actin dependent Cell Motility FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> Chemokine Secretion TRUE FALSE TRUE TRUE FALSE FALSE FALSE
#> Citric Acid Cycle TRUE TRUE FALSE FALSE FALSE FALSE FALSE
#> Mammalian Oogenesis FALSE FALSE FALSE FALSE TRUE TRUE FALSE
#> Meiotic Cell Cycle FALSE FALSE FALSE FALSE FALSE FALSE TRUE
#> Neuronal Apoptosis FALSE FALSE FALSE FALSE FALSE FALSE FALSE
We will first turn this matrix into a tidy tibble and then plot it.
tidy_pathway_member <- gene_pathway_membership %>%
as_tibble(rownames = "Pathway") %>%
gather(Gene, Member, -Pathway) %>%
filter(Member) %>%
select(- Member)
tidy_pathway_member
#> # A tibble: 44 × 2
#> Pathway Gene
#> <chr> <chr>
#> 1 Chemokine Secretion Aco1
#> 2 Citric Acid Cycle Aco1
#> 3 Citric Acid Cycle Aco2
#> 4 Chemokine Secretion Aif1
#> 5 Chemokine Secretion Alox8
#> 6 Mammalian Oogenesis Amh
#> 7 Mammalian Oogenesis Bmpr1b
#> 8 Meiotic Cell Cycle Cdc25a
#> 9 Meiotic Cell Cycle Cdc25c
#> 10 Chemokine Secretion Chia1
#> # ℹ 34 more rows
tidy_pathway_member is already a very good starting point for plotting
with ggplot. But we care about the genes that are members of multiple
pathways so we will aggregate the data by Gene and create a
list-column with the Pathway information.
tidy_pathway_member %>%
group_by(Gene) %>%
summarize(Pathways = list(Pathway))
#> # A tibble: 37 × 2
#> Gene Pathways
#> <chr> <list>
#> 1 Aco1 <chr [2]>
#> 2 Aco2 <chr [1]>
#> 3 Aif1 <chr [1]>
#> 4 Alox8 <chr [1]>
#> 5 Amh <chr [1]>
#> 6 Bmpr1b <chr [1]>
#> 7 Cdc25a <chr [1]>
#> 8 Cdc25c <chr [1]>
#> 9 Chia1 <chr [1]>
#> 10 Csf1r <chr [1]>
#> # ℹ 27 more rows
tidy_pathway_member %>%
group_by(Gene) %>%
summarize(Pathways = list(Pathway)) %>%
ggplot(aes(x = Pathways)) +
geom_bar() +
scale_x_upset()
<img src="man/figures/README-unnamed-chunk-7-1.png" width="70%" />
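If you prefer to skip the reshaping round trip, the list-column can also be derived directly from the logical matrix. The sketch below uses a small made-up matrix with the same layout as gene_pathway_membership (pathways in rows, genes in columns); the gene and pathway names are placeholders, and only base R is used.

```r
# Hypothetical membership matrix: pathways in rows, genes in columns
membership <- matrix(
  c(TRUE,  FALSE, TRUE,
    TRUE,  TRUE,  FALSE),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Pathway 1", "Pathway 2"), c("GeneA", "GeneB", "GeneC"))
)

# For every gene (column), collect the names of the pathways marked TRUE
pathways_per_gene <- lapply(
  colnames(membership),
  function(gene) rownames(membership)[membership[, gene]]
)

# One row per gene, with the pathways stored as a list-column
gene_pathways <- data.frame(Gene = colnames(membership), stringsAsFactors = FALSE)
gene_pathways$Pathways <- pathways_per_gene
gene_pathways
```

One difference from the pipeline above: a gene without any pathway would be kept here with an empty character vector, whereas filter(Member) drops it entirely.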
What if I need more flexibility?
The first important idea is that a list-column is just as good as a character vector in which the elements of each list entry are collapsed into one string:
tidy_movies %>%
distinct(title, year, length, .keep_all=TRUE) %>%
mutate(Genres_collapsed = sapply(Genres, function(x) paste0(sort(x), collapse = "-"))) %>%
select(title, Genres, Genres_collapsed)
#> # A tibble: 5,000 × 3
#> title Genres Genres_collapsed
#> <chr> <list> <chr>
#> 1 Ei ist eine geschissene Gottesgabe, Das <chr [1]> "Documentary"
#> 2 Hamos sto aigaio <chr [1]> "Comedy"
#> 3 Mind Benders, The <chr [0]> ""
#> 4 Trop (peu) d'amour <chr [0]> ""
#> 5 Crystania no densetsu <chr [1]> "Animation"
#> 6 Totale!, La <chr [1]> "Comedy"
#> 7 Visiblement je vous aime <chr [0]> ""
#> 8 Pang shen feng <chr [2]> "Action-Animation"
#> 9 Not as a Stranger <chr [1]> "Drama"
#> 10 Autobiographia Dimionit <chr [1]> "Drama"
#> # ℹ 4,990 more rows
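The sort() call in the snippet above matters: without it, the same set of genres written in a different order would produce two different strings and therefore two different bars. A small base R check with made-up genre vectors illustrates this. It uses vapply() instead of sapply() so that the result is guaranteed to be a character vector.

```r
# Hypothetical entries: the same two genres in different order, plus an empty entry
genres <- list(c("Drama", "Comedy"), c("Comedy", "Drama"), character(0))

# Collapsing without sorting keeps the original order of each entry
unsorted <- vapply(genres, function(x) paste0(x, collapse = "-"), character(1))

# Sorting first gives one canonical label per combination
sorted <- vapply(genres, function(x) paste0(sort(x), collapse = "-"), character(1))

unsorted  # "Drama-Comedy" "Comedy-Drama" ""
sorted    # "Comedy-Drama" "Comedy-Drama" ""
```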
We can easily make a plot using the strings as categorical axis labels
tidy_movies %>%
distinct(title, year, length, .keep_all=TRUE) %>%
mutate(Genres_collapsed = sapply(Genres, function(x) paste0(sort(x), collapse = "-"))) %>%
ggplot(aes(x=Genres_collapsed)) +
geom_bar() +
theme(axis.text.x = element_text(angle=90, hjust=1, vjust=0.5))
<img src="man/figures/README-unnamed-chunk-9-1.png" width="70%" />
Because the process of collapsing list columns into delimited strings is
fairly generic, I provide a new scale that does this automatically
(scale_x_mergelist()).
tidy_movies %>%
distinct(title, year, length, .keep_all=TRUE) %>%
ggplot(aes(x=Genres)) +
geom_bar() +
scale_x_mergelist(sep = "-") +
theme(axis.text.x = element_text(angle=90, hjust=1, vjust=0.5))
<img src="man/figures/README-unnamed-chunk-10-1.png" width="70%" />
But the problem is that it can be difficult to read those labels.
Instead I provide a third function that replaces the axis labels with a
combination matrix (axis_combmatrix()).
tidy_movies %>%
distinct(title, year, length, .keep_all=TRUE) %>%
ggplot(aes(x=Genres)) +
geom_bar() +
scale_x_mergelist(sep = "-") +
axis_combmatrix(sep = "-")
<img src="man/figures/README-unnamed-chunk-11-1.png" width="70%" />
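Conceptually, a combination matrix is the collapsed labels split back into their parts. The following base R sketch shows that idea on made-up labels. It illustrates the concept only and is not meant to describe how axis_combmatrix() is implemented internally.

```r
# Hypothetical axis labels, collapsed with "-" as in the examples above
labels <- c("Drama", "Comedy-Drama", "Action", "")

# Split each label back into its elements (fixed = TRUE treats "-" literally)
parts <- strsplit(labels, "-", fixed = TRUE)

# All distinct elements become the rows of the matrix
sets <- sort(unique(unlist(parts)))

# TRUE where a set (row) takes part in a combination (column)
comb_matrix <- vapply(parts, function(p) sets %in% p, logical(length(sets)))
dimnames(comb_matrix) <- list(sets, labels)
comb_matrix
```

This also shows why the sep argument has to match in scale_x_mergelist() and axis_combmatrix(): the separator used for collapsing is the one needed to split the labels again.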
One thing that is only possible with the scale_x_upset() function is
to automatically order the categories and genres by freq or by
degree.
tidy_movies %>%
distinct(title, year, length, .keep_all=TRUE) %>%
ggplot(aes(x=Genres)) +
geom_bar() +
scale_x_upset(order_by = "degree")
#> Warning: Removed 1076 rows containing non-finite outside the scale range
#> (`stat_count()`).
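The two orderings can be pictured as two sort keys: freq ranks combinations by how many rows they contain, while degree ranks them by how many elements they combine. The base R sketch below computes both keys for made-up data. The tie-breaking and direction used by scale_x_upset() may differ, so treat it as an illustration of the idea only.

```r
# Hypothetical genre annotations for seven made-up movies
genres <- list("Drama", "Drama", "Drama",
               c("Comedy", "Drama"), c("Drama", "Comedy"),
               "Comedy",
               c("Action", "Comedy", "Drama"))

# Canonical label per movie (sorted, then collapsed)
key <- vapply(genres, function(x) paste(sort(x), collapse = "-"), character(1))

# Frequency: number of movies per combination
freq <- table(key)

# Degree: number of genres that make up each combination
degree <- lengths(strsplit(names(freq), "-", fixed = TRUE))

# Most frequent combinations first
by_freq <- names(freq)[order(-as.integer(freq))]

# Smallest combinations first, ties broken by frequency
by_degree <- names(freq)[order(degree, -as.integer(freq))]
by_degree
```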
<img
