Welcome to the blog

Posts

My thoughts and ideas

ggplot2 exercises* | Griffith Lab

Genomic Visualization and Interpretations

ggplot2 exercises*

We have previously covered the core aspects of ggplot2. In this section we provide some exercises to reinforce concepts.

Load example data

To start things off let’s go ahead and load in a transcripts annotation database to work with. Bioconductor maintains many of these databases for different species/assemblies, here we load in one from Bioconductor for the human reference genome (build HG38). You can view the many different transcript annotation databases bioconductor offers by looking for the TxDb BiocView on bioconductor.

# install if not already installed
# if (!requireNamespace("BiocManager", quietly = TRUE))
#    install.packages("BiocManager")
# BiocManager::install("TxDb.Hsapiens.UCSC.hg38.knownGene")

# load the library
library(ggplot2)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
TxDb <- TxDb.Hsapiens.UCSC.hg38.knownGene

It is beyond the purpose of this workshop to explain everything a TxDb object stores. In the code below we are just extracting gene related data from chromosome 1-22,X,Y to plot. Note that in the dataframe we produce below the Gene ID is an Entrez ID.

# obtain gene locations from the TxDb object as a data frame
genes <- genes(TxDb, columns=c("gene_id"), single.strand.genes.only=FALSE)
genes <- as.data.frame(genes)
chromosomes <- paste0("chr", c(1:22,"X","Y"))
genes <- genes[genes$seqnames %in% chromosomes,]
genes <- unique(genes[,c("group_name", "seqnames", "start", "end", "width", "strand")])
colnames(genes) <- c("gene_id", "seqnames", "start", "end", "width", "strand")

Let’s also grab a list of immune genes and annotate our data with these as well. The immune genes here come from a pancancer immune profiling panel.

# read in the immune gene list
immuneGenes <- read.delim("http://genomedata.org/gen-viz-workshop/intro_to_ggplot2/immuneGenes.tsv")

# merge the immune gene annotations onto the gene object
genes <- merge(genes, immuneGenes, by.x=c("gene_id"), by.y=c("entrez"), all.x=TRUE)
genesOnChr6 <- genes[genes$seqnames %in% "chr6",]

Exercise 1. Immune gene locations

Okay, we have the core data we’ll be using for this section, let’s go ahead and use what we learned in the last section to answer a couple biologically relevant questions adding layers as we go to create more complex plots. Let’s first start with a relatively easy plot. Try plotting the genomic center of all immune genes on chromosome 6 with geom_jitter(), making sure to only jitter the height. You should use the genesOnChr6 object we created above, at the end you should see something like the plot below. Hint, we only annotated immune genes, you can use na.omit() to remove any rows with NA values.

Get a hint!

you can calculate the center of the gene with the start and width columns in the DF, this can be done inside or outside of ggplot2

Get a hint!

There is a parameter in geom_jitter() to control the ammount of jitter in both the x and y directions

Get a hint!

What is passed to the Y axis does not matter numerically as long as it is consistent!

What is the code to produce the plot below?

ggplot(na.omit(genesOnChr6), aes(x=start + .5*width, y=1)) + geom_jitter(width=0, alpha=.75) + xlab("Chromosomal position of genes (gene start + 0.5 * width)")

Okay now let’s try and highlight all HLA genes on the plot we just created, to do this we’ll need a column specifying which genes are HLA related, the code below will do that for you.

# make column specifying what genes are HLA or not
genesOnChr6$isHLA <- ifelse(grepl("^HLA", genesOnChr6$Name), "HLA Gene", "Not HLA Gene")

Try to mimic the plot displayed below.

Get a hint!

You only need to change one thing here in the aes() call

What is the code to produce the plot below?

ggplot(na.omit(genesOnChr6), aes(x=start + .5*width, y=1, color=isHLA)) + geom_jitter(width=0, alpha=.75) + xlab("Chromosomal position of genes (gene start + 0.5 * width)")

Finally let’s make some stylistic changes and clean the plot up a bit. Use theme() to remove the y-axis elements and add some custom colors to the points as is done in the plot below.

Get a hint!

to change the color you need to add one of the scale_*_manual(), i.e. scale_shape_manual, scale_linetype_manual() etc.

What is the code to produce the plot below?

ggplot(na.omit(genesOnChr6), aes(x=start + .5*width, y=1, color=isHLA)) + geom_jitter(width=0, alpha=.75) + scale_color_manual(values=c("seagreen3", "darkorange3")) + theme(axis.text.y=element_blank(), axis.title.y=element_blank(), axis.ticks.y=element_blank()) + xlab("Chromosomal position of genes (gene start + 0.5 * width)")

Exercise 2. Gene density by chromosome

Okay let’s try another example, lets look at the density of genes across all genomic coordinates. We’ll start simple and keep adding layers to work our way up to a final plot. Try to recreate the plot below by plotting the density of the genes acorss the genome. Use the gene center coordinate of the gene for this.

Get a hint!

Reminder, to plot the center coordinate of the gene you will need to make another columns in the data frame, you can do this within ggplot or just add another column.

What is the code to produce the plot below?

ggplot(data=genes, aes(x=start + .5*width,)) + geom_density() + xlab("Chromosomal position of genes (gene start + 0.5 * width)")

Pretty easy right? As you would expect gene density is higher toward the beginning of chromosomes simply because there is more overlap of genomic coordinates between chromosomes at the start (i.e. all chromosomes start at 1, but are of different lengths). Now let’s add to our plot by creating a rug layer for genes on the anti-sense strand. Place this rug on the top of the plot, alter the transparency, color, and size of the rug. when your done your plot should look similar to the one below.

Get a Hint!

Look at the ggplot2 documentation for geom_rug()

Get a Hint!

You will need to pass a subsetted data frame directly to geom_rug() with the data parameter

Get a Hint!

within geom_rug() you want to pay attention to the sides and length parameters, again see ggplot2 documentation

What is the code to produce the plot below

ggplot(data=genes, aes(x=start + .5*width,)) + geom_density() + geom_rug(data=genes[genes$strand == "-",], aes(x=start + .5*width), color="tomato3", sides="t", alpha=.1, length=unit(0.1, "npc")) + xlab("Chromosomal position of genes (gene start + 0.5 * width)")

While were at it, let’s go ahead and add a layer for the sense strand as well, this will go on the bottom of the plot instead of the plot. Go ahead and try and mimic the plot below.

Get a Hint!

Same concept as above, you should have 2 geom_rug() layers for this plot

What is the code to produce the plot below

ggplot(data=genes, aes(x=start + .5*width,)) + geom_density() + geom_rug(data=genes[genes$strand == "-",], aes(x=start + .5*width), color="tomato3", sides="t", alpha=.1, length=unit(0.1, "npc")) + geom_rug(data=genes[genes$strand == "+",], aes(x=start + .5*width), color="darkorchid4", sides="b", alpha=.1, length=unit(0.1, "npc")) + xlab("Chromosomal position of genes (gene start + 0.5 * width)")

We have are basic plot now, but it’s still not very informative as all the data from the different chromosomes are colliding with each other. Let’s go ahead and fix this by making multiple plots from the same data split out by chromosome. Recall from the previous section that there is a very easy way to do this. Also pay particular attention to how the axis are set for each individual plot and produce the result below.

Get a Hint!

look at the ggplot2 documentation for facet_wrap(), particularly the scales parameter in facet_wrap()

What is the code to produce the plot below

ggplot(data=genes, aes(x=start + .5*width)) + geom_density() + geom_rug(data=genes[genes$strand == "-",], aes(x=start + .5*width), color="tomato3", sides="t", alpha=.1, length=unit(0.1, "npc")) + geom_rug(data=genes[genes$strand == "+",], aes(x=start + .5*width), color="darkorchid4", sides="b", alpha=.1, length=unit(0.1, "npc")) + facet_wrap(~seqnames, scales="free") + xlab("Chromosomal position of genes (gene start + 0.5 * width)")

We can start to see some interesting trends now, specifically chr6 appeears to have many genes on the p-arm of the chromosome compared to the q-arm. We can also see a couple regions where strand bias might be present, such as in the beginning of chromosome 15. Let’s go ahead and finish things up by altering some of the theme aspects of the plot. Reproduce the plot below using theme()

What is the code to produce the plot below

ggplot(data=genes, aes(x=start + .5*width,)) + geom_density() + geom_rug(data=genes[genes$strand == "-",], aes(x=start + .5*width), color="tomato3", sides="t", alpha=.1, length=unit(0.1, "npc")) + geom_rug(data=genes[genes$strand == "+",], aes(x=start + .5*width), color="darkorchid4", sides="b", alpha=.1, length=unit(0.1, "npc")) + facet_wrap(~seqnames, scales="free") + theme_bw() + theme(axis.text.y = element_blank(), axis.ticks.y=element_blank(), axis.text.x=element_text(angle=45, hjust=1)) + xlab("Chromosomal position of genes (gene start + 0.5 * width)")

Exercise 3. Gene burden of each chromosome

Let’s try another exercise, what if we don’t care about the density of genes across chromosomes, but instead just want to know which chromosome has the highest gene burden. The code below will produce the data ggplot2 will need to plot this information

geneFreq <- plyr::count(genes$seqnames)
names(geneFreq) <- c("seqnames", "freq")
chrLength <- as.data.frame(seqlengths(TxDb))
chrLength$seqnames <- rownames(chrLength)
colnames(chrLength) <- c("length","seqnames")
chrGeneBurden <- merge(geneFreq, chrLength, all.x=TRUE, by="seqnames")
chrGeneBurden$gene_per_mb <- chrGeneBurden$freq/chrGeneBurden$length * 1000000

Go ahead and try and reproduce the plot below, remember to keep adding layers to get closer to the final product and refer to the ggplot2 documention for help.

Get a Hint!

We dont go step by step for this one, you will need geom_bar(), geom_hline(), geom_text() as the core elements

Get a Hint!

you can make a secondary axis with the scale_y_continuous() layer, you only need one scale_y_continuous()

What is the code to produce the plot below

myChrOrder <- as.character(chrGeneBurden[order(chrGeneBurden$gene_per_mb),]$seqnames)
chrGeneBurden$seqnames <- factor(chrGeneBurden$seqnames, levels=myChrOrder)

ggplot(chrGeneBurden, aes(seqnames, gene_per_mb)) + geom_bar(stat="identity") + geom_hline(yintercept = median(chrGeneBurden$gene_per_mb), linetype=2, color="tomato1") + geom_text(aes(label=freq), angle=-55, hjust=1, nudge_y=.5) + scale_x_discrete(expand=c(.05, .05)) + scale_y_continuous(sec.axis = dup_axis(name="")) + geom_point(aes(y=30, x=1), alpha=0) + theme_bw() + theme(axis.text.x=element_text(angle=45,hjust=1)) + ylab("Genes Per MB")

Exercise 4. Variant allele frequency distributions

Often it is useful to compare tumor variant allele frequencies among samples to get a sense of the tumor purity, heterogeneity, aneuploidy and to determine the existence of sub-clonal cell populations within the tumor. Let’s use the ggplot2ExampleData.tsv dataset we used in the introduction to ggplot2 section to explore this.Run the R code below to make sure you have the data loaded, then try re-creating the plots below. You’ll find hints and answers below each plot.

# load the dataset
variantData <- read.delim("http://genomedata.org/gen-viz-workshop/intro_to_ggplot2/ggplot2ExampleData.tsv")
variantData <- variantData[variantData$dataset == "discovery",]

Get a hint!

look at geom_violin(), change labels with xlab() and ylab()

What is the code to create the violin plot above?

ggplot() + geom_violin(data=variantData, aes(x=Simple_name, y=tumor_VAF)) + theme(axis.text.x=element_text(angle=90, hjust=1)) + xlab("Sample") + ylab("Variant Allele Fraction")

Looking good, but the plot looks dull, try adding some color to the violin plots and let’s see where the points for the underlying data actually reside.

Get a hint!

Try using geom_jitter() to offset points

What is the code to create the violin plot above?

ggplot(data=variantData, aes(x=Simple_name, y=tumor_VAF)) + geom_violin(aes(fill=Simple_name)) + geom_jitter(width=.1, alpha=.5) + theme(axis.text.x=element_text(angle=90, hjust=1), legend.position="none") + xlab("Sample") + ylab("Variant Allele Fraction")

Finally let’s add some more detail, specifically let’s annotate how many points actually make up each violin. The code below will construct the extra data you’ll need to make the final plot.

# library(plyr)
variantDataCount <- plyr::count(variantData, "Simple_name")
variantDataMax <- aggregate(data=variantData, tumor_VAF ~ Simple_name, max)
variantDataMerge <- merge(variantDataMax, variantDataCount)
head(variantDataMerge)

Get a hint!

You will need to pass variantDataMerge to geom_text()

What is the code to create the violin plot above?

ggplot() + geom_violin(data=variantData, aes(x=Simple_name, y=tumor_VAF, fill=Simple_name)) + geom_jitter(data=variantData, aes(x=Simple_name, y=tumor_VAF), width=.1, alpha=.5) + geom_text(data=variantDataMerge, aes(x=Simple_name, y=tumor_VAF + 5, label=freq)) + theme(axis.text.x=element_text(angle=90, hjust=1), legend.position="none") + xlab("Sample") + ylab("Variant Allele Fraction")

arranging plots with ggplot2* | Griffith Lab

Genomic Visualization and Interpretations

arranging plots with ggplot2*

We’ve gone over the basics of ggplot2 in the previous section, in this section we will go over some tecniques to create multi-panel figures within R.

Creating initial plots

Often when creating figures for publications the figure in the final manuscript is actually composed of multiple figures. You could of course achieve this using third-party programs such as illustrator and inkscape however that means having to manually redo the figure in those programs anytime there is an update or change. Instead of going through that hassel let’s learn how to do it all programatically. To achieve this we will of course make our plots using ggplot, the gtable package to manipulate the plots post-creation, and the gridExtra package to do the actual aligning. Let’s go ahead and get started.

Our first step is to make some plots. We’ll go ahead and use the same dataset as before which is available at http://genomedata.org/gen-viz-workshop/intro_to_ggplot2/ggplot2ExampleData.tsv as a reminder. Let’s go ahead and load this data in, if it isn’t in R already.

# install the ggplot2 library and load it
library(ggplot2)

# load Data
variantData <- read.delim("http://genomedata.org/gen-viz-workshop/intro_to_ggplot2/ggplot2ExampleData.tsv")

Next we’ll make a bar chart for the genes which have over 20 mutations in all samples. The commands here should be review from previous sections. If you’re having problems following refer back to the introduction to R and ggplot2 sections.

# install the plyr library
library(plyr)

# count the number of variants for each gene/mutation type and format for ggplot
geneCount <- count(variantData, vars=c("gene_name", "type"))
head(geneCount, 10)
geneCount <- geneCount[geneCount$freq > 20,]

# to order samples for ggplot also get a total count overall (SNPs + InDels)
geneCountOverall <- aggregate(data=geneCount, freq ~ gene_name, sum)
geneOrder <- geneCountOverall[order(geneCountOverall$freq),]$gene_name
geneCount$gene_name <- factor(geneCount$gene_name, levels=geneOrder)

# make the barchart
p1 <- ggplot() + geom_bar(data=geneCount, aes(x=gene_name, y=freq, fill=type), stat="identity") + xlab("Gene") + ylab("Frequency") + scale_fill_manual("Mutation", values=c("#F97F51", "#55E6C1")) + theme_bw() + theme(plot.background = element_rect(color="red", size=2))
p1

We can see from the p1 plot that KMT2D seems interesting, it’s got the most mutations and a fair number of deletions as well. To explore this a bit further lets go ahead and compare KMT2D to a few of the other highly mutated genes from our first plot. To do this we’ll make a boxplot for the genes KMT2D and TNFRSF14 comparing the conservation score. This metric comes from UCSC in our data and is essentially a score from 0-1 of how conserved a region is across vertebrates. A high score would mean the region is highly conserved and probably an important region of the genome.

geneCompare1 <- variantData[variantData$gene_name %in% c("KMT2D", "TNFRSF14"),]
geneCompare1 <- geneCompare1[,c("gene_name", "trv_type", "ucsc_cons")]
geneCompare1 <- geneCompare1[geneCompare1$trv_type == "missense",]
geneCompare1$ucsc_cons <- as.numeric(as.character(geneCompare1$ucsc_cons))
p2 <- ggplot() + geom_boxplot(data=geneCompare1, aes(x=gene_name, y=ucsc_cons, fill=gene_name)) + scale_fill_manual("Gene", values=c("#e84118", "#e1b12c")) + theme_bw() + xlab("Gene") + ylab("Conservation\nscore") + theme(plot.background = element_rect(color="dodgerblue", size=2))
p2

And we’ll go ahead and do the same thing for KMT2D and BCL2.

geneCompare2 <- variantData[variantData$gene_name %in% c("KMT2D", "BCL2"),]
geneCompare2 <- geneCompare2[,c("gene_name", "trv_type", "ucsc_cons")]
geneCompare2 <- geneCompare2[geneCompare2$trv_type == "missense",]
geneCompare2$ucsc_cons <- as.numeric(as.character(geneCompare2$ucsc_cons))
p3 <- ggplot() + geom_boxplot(data=geneCompare2, aes(x=gene_name, y=ucsc_cons, fill=gene_name)) + scale_fill_manual("Gene", values=c("#e84118", "#4cd137")) + theme_bw() + xlab("Gene") + ylab("Conservation\nscore") + theme(plot.background = element_rect(color="green", size=2))
p3

We have our boxplots for missense mutations, it would be nice to know how many data points make up those boxplots as well. To do this we will just create two quick barcharts counting the mutations in the plots defined above.

p4 <- ggplot() + geom_bar(data=geneCompare1, aes(x=gene_name, fill=gene_name)) + scale_fill_manual("Gene", values=c("#e84118", "#e1b12c")) + theme_bw() + theme(plot.background = element_rect(color="darkorange2", size=2)) + xlab("Gene") + ylab("Frequency")
p4

p5 <- ggplot() + geom_bar(data=geneCompare2, aes(x=gene_name, fill=gene_name)) + scale_fill_manual("Gene", values=c("#e84118", "#4cd137")) + theme_bw() + theme(plot.background = element_rect(color="black", size=2)) + xlab("Gene") + ylab("Frequency")
p5

Arranging plots

With our initial plots created wouldn’t it be nice if we could plot these all at once. The good news is that we can. There are a number of packages for available to achieve this. Currently the most widley used are probably gridExtra, cowplot, and egg. In this course we will use gridExtra; before working on our own data, let’s illustrate some basic concepts in gridExtra. Below we will load the grid library in order to create some visual objects to visualize, and use the gridExtra library to arrange these plots. We then create our objects to visualize, grob1-grob4. Theres no need to understand how these objects are created, this is just done to have something intuitive to use when arranging. Our next step is to create the layout for arrangment, we do this by creating a matrix where each unique element in the matrix (1, 2, 3, 4) corresponds to one of our objects to visualize. For example in the layout we use below we have 3 rows and 4 columns in which to place our visualizations. We specify the first row should all be one plot, the second row should be split between plots 2 and 3, and the third row should be split between plots 2 and 4. We then pass grid.arrange our objects to plot and the layout, so in this case grob1 corresponds to the element 1 in the matrix since grob1 is supplied first to grid.arrange.

# make objects to illustrate gridExtra functionality
library(grid)
library(gridExtra)

# make objects to visualize
# you can view these by doing grid.draw(grob1)
grob1 <- grobTree(rectGrob(gp=gpar(fill="#EE5A24", alpha=1)), textGrob("1", gp=gpar(fontsize=28)))
grob2 <- grobTree(rectGrob(gp=gpar(fill="#009432", alpha=1)), textGrob("2", gp=gpar(fontsize=28)))
grob3 <- grobTree(rectGrob(gp=gpar(fill="#0652DD", alpha=1)), textGrob("3", gp=gpar(fontsize=28)))
grob4 <- grobTree(rectGrob(gp=gpar(fill="#833471", alpha=1)), textGrob("4", gp=gpar(fontsize=28)))

# create layout for arrangement and do the arrangement
layout <- rbind(c(1, 1, 1, 1),
                c(2, 2, 3, 3),
                c(2, 2, 4, 4))
grid.arrange(grob1, grob2, grob3, grob4, layout_matrix=layout)

Note how the numbers above in the rbind expression correspond to the layout produced. The first row has object 1 spanning across four columns. The second row is split eveny between object 2 and 3. The third row is split evenly between object 2 and 4.

It is also possible to add empty cells in our layout by inserting NA into our layout matrix. Below we split out grob2 from grob3 and grob4.

layout <- rbind(c(1, 1, 1, 1, 1),
                c(2, 2, NA, 3, 3),
                c(2, 2, NA, 4, 4))

grid.arrange(grob1, grob2, grob3, grob4, layout_matrix=layout)

We can also adjust the size of element relative to our layout matrix. For example, below we say that each row (there are 3), should take up 20%, 40% and 40% of the arranged plot respectively.

layout <- rbind(c(1, 1, NA, 1, 1),
                c(2, 2, NA, 3, 3),
                c(2, 2, NA, 4, 4))

grid.arrange(grob1, grob2, grob3, grob4, layout_matrix=layout, widths=c(.2, .2, .1, .3, .2), heights=c(.2, .4, .4))

There is much more information on how gridExtra works in the various gridExtra vignettes, obviously it is a powerful package for arranging plots. Let’s try an exercise to reinforce the concepts we’ve just learned. Try recreating the plot below:

Get a hint!

the layout matrix should be 5 columns by 4 rows

solution

The solution is in solution.R

Now that we understand the basics of how gridExtra works let’s go ahead and make an attempt at a multi-panel figure. We’ll put the main barchart on top, and match the boxplots with their specific barcharts on rows 2 and 3 of a layout. At the end you should see something like the figure below. Note that for this exercise each plot has a colored border to help visualize the layout. To make the plot cleaner we would drop that by removing the “theme(plot.background = element_rect()” entries above when each plot is generated.

layout <- rbind(c(1, 1),
                c(2, 3),
                c(4, 5))
grid.arrange(p1, p4, p5, p2, p3, layout_matrix=layout)

More advanced topics related to this section are included in the course appendix here.