Welcome to the blog


My thoughts and ideas

Introduction to R Markdown | Griffith Lab

Genomic Visualization and Interpretations

Introduction to R Markdown

A useful feature within the R ecosystem is R Markdown. R Markdown (or .Rmd) files allow a user to intersperse notes with code providing a useful framework for sharing scientific reports in a transparent and easily reproduceable way. They combine both markdown syntax for styling notes and Rscripts for running code and producing figures. Reports can be output in a variety of file formats including HTML documents, PDF documents, and even presentation slides.

Installing R Markdown

  • To start let’s make a simple R Markdown file with Rstudio, you will need to install the R Markdown package from cran.
    # install R Markdown

R Markdown basics

  • Once R Markdown has been installed we can select File -> New File -> R Markdown to create an intial rmarkdown template.
  • Rstudio will ask you to choose an output format, and to add a title and author, for now we will just use the default HTML format however this can be changed at any time within the Rmarkdown template.
  • Go ahead and select okay when you have added your name and a title.

Rstudio should now have made a template for us, let’s go over a few introductory topics related to this template. At the top of the file you will see what looks like a YAML header denoted by ---. This is where the defaults for building the file are set.

You will notice that R Markdown has pre-populated some fields based on what we supplied when we initalized the markdown template. You can output the R Markdown (.Rmd) document using the Knit button in the top left hand corner of RStudio. This is the same as calling the function render() which takes the path to the R Markdown file as input. This file should end in a .Rmd extension to denote it as an rmarkdown file, though Rstudio will take care of this for you the first time you hit Knit.

Rstudio also has a convenient way to insert code using the insert button to the right. You might notice that not only does rmarkdown support R, but also bash, python and a few other languages as well. Though in order to work, these languages will need to be installed before using Knit.

  • Go ahead and hit the Knit button just to see what an R Markdown output looks like with the default example text. If you are working with the default HTML option the result will load in a new RStudio window with the option to open it in your usual web browser.

Note: you use “include=FALSE to have the chunk evaluated, but neither the code nor its output displayed.

Creating a report

Now that we’ve gone over the basics of R Markdown let’s create a real (but simple) report. First, you’ll need to download the Folicular Lymphoma data set we used in the previous ggplot2 section. Go ahead and download that dataset from http://www.genomedata.org/gen-viz-workshop/intro_to_ggplot2/ggplot2ExampleData.tsv if you don’t have it.

R Markdown documents combine text and code, the text portion of these documents use a lightweight text markup language known as markdown. This allows text to be displayed in a stylistic way on the web without having to write HTML or other web code. We won’t go over all of markdowns features however it will be good to familiarize yourself with this style.

A cheatsheet for the markdown flavor that R Markdown uses can be found by going to help -> Cheatsheets -> R Markdown Cheatsheets.

As we have mentioned you can insert a code chunk using the insert button on the top right. For example as shown below when selecting insert -> R, we get a code chunk formatted for R. However you can also add parameters to this code chunk to alter it’s default behavior. A full list of these parameters is available here.


We have created a preliminary rmarkdown file you can download here.

Fill in this document to make it more complete, and then knit it together. The steps you should follow are outlined right in this R Markdown document. You can open this file in RStudio by going to File -> Open File. An R Markdown reference is available here.

Get a hint!

Look at the R Markdown reference guide mentioned above or the cheatsheets in Rstudio.


Here is a more complete .Rmd file.

Introduction to ggplot2 | Griffith Lab

Genomic Visualization and Interpretations

Introduction to ggplot2

There are at least three primary graphics programs available within the R environment. A package for base R graphics is installed by default and provides a simple mechanism to quickly create graphs. lattice is another graphics package that attempts to improve on base R graphics by providing better defaults and the ability to easily display multivariate relationships. In particular, the package supports the creation of trellis graphs - graphs that display a variable or the relationship between variables, conditioned on one or more other variables. Finally, ggplot2 is a graphics program based on the grammar of graphics idealogy, and will be the primary focus of this course.

In this module, we will explore basic use of ggplot2 to plot genomic data. For illustration, we will use a set of mutation data from Supplemental Table S5 of the paper “Recurrent somatic mutations affecting B-cell receptor signaling pathway genes in follicular lymphoma”. You can download a cleaned up version of Supplemental Table S5 at http://genomedata.org/gen-viz-workshop/intro_to_ggplot2/ggplot2ExampleData.tsv

Introducing ggplot2 syntax

ggplot is based on a system of layering graphical objects to create a final plot. We will start by installing and loading the ggplot2 library. Next, it is important to know that ggplot expects the data passed to it to be of class data.frame. After importing our data (‘ggplot2ExampleData.tsv’), we will modify this data frame to include a ‘coverage’ (tumor_COV) variable. Then we can call the variantData data frame in our ggplot() function and compare the coverage variable to the variant allele frequency (tumor_VAF).

# install the ggplot2 library and load it

# load Supplemental Table S5
variantData <- read.delim("http://genomedata.org/gen-viz-workshop/intro_to_ggplot2/ggplot2ExampleData.tsv")

#Familiarize yourself with the data in this file by looking at the 'head' (top) of the file

# make a coverage column since this doesn't exist yet
variantData$tumor_COV <- variantData$tumor_ref_count + variantData$tumor_var_count

# start making the plot
p1 <- ggplot(data=variantData, aes(x=tumor_VAF, y=tumor_COV))

Geometric objects and aesthetic mapping

The object stored in variable p1 will generate a blank plot in the bottom right “Plots” window of Rstudio. We invoked ggplot with the function ggplot() and specified the data frame we are trying to plot. We then supplied aesthetic mappings with aes(). In essence this is specifying which columns ggplot should assign to the geometric objects aesthetics. In this specific case, we are telling ggplot that the data is in the data frame “variantData”, the column tumor_VAF should be plotted along the x-axis, and tumor_COV should be plotted along the y-axis. ggplot has determined the axis scales, given the ranges in the data supplied. However you will notice that nothing is actually plotted. At this point we have passed what data we want plotted however we have not specified how it should be plotted. This is what the various geometric objects in ggplot are used for (e.g. geom_point() for scatterplots, geom_bar() for bar charts, etc). These geometric objects are added as plot layers to the ggplot() base command using a +.

# add a point geom object to the plot (method 1)
p1 <- p1 + geom_point()

# The following is equivalent to above (method 2)
p2 <- ggplot() + geom_point(data=variantData, aes(x=tumor_VAF, y=tumor_COV))

Both plot p1 and plot p2 generate a scatter plot, comparing the column tumor_COV on the y-axis to the column tumor_VAF on the x-axis. While the plots generated by the p1 and p2 variables may appear identical, we should briefly address their differences. In method 1 (plot p1), we invoke ggplot() with a data frame (variantData) and an aesthetic mapping of tumor_VAF and tumor_COV for x and y respectively. In this method the information is passed to all subsequent geometric objects and is used as appropriate in those objects. In this case, the only geometric object we include is geom_point(). The geom_point() layer is then added using the information passed from ggplot(). Conversely, in method 2 (plot p2), we invoke ggplot() without defining the data or aesthetic mapping. This information is specified directly in the geom_point() layer. If any additional geometric objects were added as layers to the plot, we would specifically have to define the data and aesthetics within each additional layer. This is especially useful when plotting multiple datasets on the same plot (we will explore this later on).

We should also note that geometric objects can behave differently, depending upon whether the plotted variables are continuous or discrete. In the example below (plot p3), we can see that the points have been binned by chromosome name on the x-axis, while the numeric values sorted in the column “tumor_VAF” are plotted along a continuous y-axis scale. The position of the points (specified by position=’jitter’ in the geom_point() object) shifts the points apart horizontally to provide better resolution. A complete list of available geoms within ggplot is available here.

# illustrating the difference between continous and discrete data
p3 <- ggplot() + geom_point(data=variantData, aes(x=chromosome_name, y=tumor_VAF), position="jitter")

Axis scaling and manipulation

Going back to our original example (plot p1), the majority of points look like they have a coverage < 500x. However, there are outliers in this data set causing the majority of points to take up a relatively small portion of the plot. We can provide more resolution to this by the following methods:

  1. limiting the y-axis scale using scale_y_continous() or ylim()
  2. transforming the numeric values by log2() on the y-axis.
  3. transforming the y-axis to a log2 scale by specifying trans within the scale_y_continous() layer.

Note that these transformations can be applied to the x axis as well (scale_x_continous(), xlim(), etc.), as long as the x-axis is mapped to data appropriate for a continuous scale.

# method 1, set y limits for the plot
p4 <- p1 + scale_y_continuous(limits=c(0, 500))
# alternatively, the following is a shortcut method for the same
p4 <- p1 + ylim(c(0, 500))

# method 2, transform the actual values in the data frame
p5 <- ggplot() + geom_point(data=variantData, aes(x=tumor_VAF, y=log2(tumor_COV)))

# method 3, transform the axis using ggplot
p6 <- p1 + scale_y_continuous(trans="log2")

Note that adjusting the scale_y_continuous() layer will plot only the points within the specified range by setting the limits. ylim() is a shortcut that achieves the same thing. You’ll see a warning when doing this, stating that rows are being removed from the data frame that contain values outside of the specified range. There is an “out of bounds” parameter within scale_y_continuous() to control what happens with these points, but this isn’t necessarily the best method for this particular plot. In method 2 (plot p5), we actually transform the data values themselves by applying a log2 transform. This method allows us to better visualize all of the points, but it is not intuitive to interpret the log2 of a value (tumor coverage). Alternatively, method 3 (plot p6) does not transform the values, but adjusts the scale the points are plotted on and does a better job of displaying all of our data without having to convert the log2 values.

Applying different aesthetics

While these plots look pretty good, we can make them more aesthetically pleasing by defining the color of the points within the aesthetic. We can specify a color by either the hex code (hex codes explained) or by naming it from R’s internal color pallette, a full list of which is available here. Alternatively, you can list colors by typing colors() in the R terminal.

# list colors in R

# what happens when we try to add color within the aesthetic?
p7 <- ggplot() + geom_point(data=variantData, aes(x=tumor_VAF, y=tumor_COV, color="#68228B")) + scale_y_continuous(trans="log2")

# and what happens when we try to add color within the geom?
p8 <- ggplot() + geom_point(data=variantData, aes(x=tumor_VAF, y=tumor_COV), color="#68228B") + scale_y_continuous(trans="log2")

Above we chose “darkorchid4” which has a hex value of “#68228B”. However the points in the first plot (p7) are red and not the expected purple color. Our points are appearing miscolored based upon how ggplot is interpreting the aesthetic mappings. When the color aesthetic is specified for geom_point it expects a factor variable by which to color points. If we wanted to, for example, color all missense variants with one color, nonsense variants with another color, and so on we could supply a factor for variant type to the color aesthetic in this way. But, when we specified a quoted hex code, ggplot assumed we wanted to create such a factor with all values equal to the provided text string. It did this for us and then used its internal color scheme to color that variable all according to the single category in the factor variable. By specifying the color outside the aesthetic mapping, geom_point knows to apply the color ‘darkorchid4’ to all of the points specified in the geom_point() layer (p8).

Building on the above concept, we can utilize the colour aesthetic to more specifically visualize our data. For example, what if we wanted to know if the ‘discovery’ or ‘extension’ cohorts within our data (specified by the ‘dataset’ variable) had a higher tumor purity? We will use geom_density() to plot a density kernel of the tumor VAF values, but colour the cohort based upon the dataset subsets. As described above, we will supply a factor the colour aesthetic.

# get a density curve of tumor vafs
p9 <- ggplot() + geom_density(data=variantData, aes(x=tumor_VAF, color=dataset))

# let's add a bit more detail
p10 <- ggplot() + geom_density(data=variantData, aes(x=tumor_VAF, fill=dataset), alpha=.75, colour="black", adjust=.5)

# and let's change the colors some more
p11 <- p10 + scale_fill_manual(values=c("discovery"="#a13242", "extension"="#1a2930"))

In the p9 plot, we told the geom_density() layer to differentiate the data based upon the ‘dataset’ column using the colour aesthetic. We see that our result contains two density curves that use two different colored lines to specify our datasets. In p10, we are telling our geom_density() layer to differentiate the datasets using the “fill” aesthetic instead. We globally assign the line colour (“black”) and the fill transparency (alpha=0.75). In addition, we utilize the adjust parameter to reduce the smoothing geom_density() uses when computing it’s estimate. Now, our datasets are specified by the fill (or filled in color) of each density curve. In p11 (shown above), we append the scale_fill_manual() layer to manually define the fill colours we would like to appear in the plot.

As an exercise, try manually changing the line colors in p9 using a similar method as that used in p11.

Get a hint!

look at scale_colour_manual()


ggplot() + geom_density(data=variantData, aes(x=tumor_VAF, color=dataset)) + scale_color_manual(values=c("discovery"="#a13242", "extension"="#1a2930"))


Depending on the geometric object used there are up to 9 ways to map an aesthetic to a variable. These are with the x-axis, y-axis, fill, colour, shape, alpha, size, labels, and facets. Faceting in ggplot allows us to quickly create multiple related plots at once with a single command. There are two facet commands, facet_wrap() will create a 1 dimensional sequence of panels based on a one sided linear formula. Similarly facet_grid() will create a 2 dimensional grid of panels. Let’s try and answer a few quick questions about our data using facets.

# what is the most common mutation type among SNP's
p12 <- ggplot(variantData[variantData$type == "SNP",]) + geom_bar(aes(x=trv_type))

# what is the relation of tiers to mutation type
p13 <- ggplot(variantData[variantData$type == "SNP",]) + geom_bar(aes(x=trv_type, fill=tier))

# which reference base is most often mutated
p14 <- p13 + facet_wrap(~reference)

# which transitions and transversions occur most frequently
p15 <- p14 + facet_grid(variant ~ reference)

ggplot themes

Almost every aspect of a ggplot object can be altered. We’ve already gone over how to alter the display of data but what if we want to alter the display of the non data elements? Fortunately there is a function for that called theme(). You’ll notice in the previous plot some of the x-axis names are colliding with one another, let’s fix that and alter some of the theme parameters to make the plot more visually appealing.

# recreate plot p13 if it's not in your environment
p16 <- ggplot(variantData[variantData$type == "SNP",]) + geom_bar(aes(x=trv_type, fill=tier)) + facet_grid(variant ~ reference)

# load in a theme with a few presets set
p17 <- p16 + theme_bw()

# put the x-axis labels on a 45 degree angle
p18 <- p17 + theme(axis.text.x=element_text(angle=45, hjust=1))

# altering a few more visual apects
p19 <- p18 + theme(legend.position="top", strip.text=element_text(colour="white"), strip.background=element_rect(fill="black"))

# Let's remove the y-axis ticks as well
p20 <- p19 + theme(axis.title.x=element_blank())

Let’s take a few minutes to discuss what is going on here. In p17, we’ve used theme_bw(), this function just changes a series of values from the basic default theme(). There are many such “complete themes” in ggplot and a few external packages as well containing additional “complete themes” such as ggtheme. In p18, we alter the axis.text.x parameter, we can see from the documentation that axis.text.x inherits from element_text() which is just saying that any parameters in element_text() also apply to axis.text.x. In this specific case we alter the angle of text to 45 degrees, and set the horizontal justification to the right. In p19 we change the position of the legend, change the colour of the strip.text, and change the strip background. Finally in p20 (shown above) we remove the x-axis label with element_blank(), which will draw nothing.

Changing the order of aesthetic mappings

In ggplot the order in which a variable is plotted is determined by the levels of the factor for that variable. We can view the levels of a column within a dataframe with the levels() command and we can subsequently change the order of those levels with the factor() command. We can then use these to change the order in which aesthetic mappings such as plot facets and discrete axis variables are plotted. Lets look at an example using the previous faceted plots we made (p20).

# recreate plot p20 if it's not in your environment
p20 <- ggplot(variantData[variantData$type == "SNP",]) + geom_bar(aes(x=trv_type, fill=tier)) + facet_grid(variant ~ reference) + theme_bw() + theme(axis.text.x=element_text(angle=45, hjust=1), legend.position="top", strip.text=element_text(colour="white"), strip.background=element_rect(fill="black"), axis.title.x=element_blank())

# view the order of levels in the reference and trv_type columns

# reverse the order of the levels
variantData$reference <- factor(variantData$reference, levels=rev(levels(variantData$reference)))
variantData$trv_type <- factor(variantData$trv_type, levels=rev(levels(variantData$trv_type)))

# view plot p20 with the new order of variables
p20 <- ggplot(variantData[variantData$type == "SNP",]) + geom_bar(aes(x=trv_type, fill=tier)) + facet_grid(variant ~ reference) + theme_bw() + theme(axis.text.x=element_text(angle=45, hjust=1), legend.position="top", strip.text=element_text(colour="white"), strip.background=element_rect(fill="black"), axis.title.x=element_blank())

We can see that reversing the order of levels in the reference column has subsequently reversed the reference facets (right side of plot). Similarily reversing the order of the trv_type column levels has reversed the labels on the x-axis.

Saving ggplot2 plots

To save a plot or any graphical object in R, you first have to initalize a graphics device, this can be done with pdf(), png(), svg(), tiff(), and jpeg(). You can then print the plot as you normally would and close the graphics device using dev.off(). Alternatively ggplot2 has a function specifically for saving ggplot2 graphical objects called ggsave(). A helpfull tip to keep in mind when saving plots is to allow enough space for the plot to be plotted, if the plot titles for example look truncated try increasing the size of the graphics device. Changes in the aspect ratio between a graphics device height and width will also change the final appearance of plots.

# save the last plot we made.
pdf(file="p20.pdf", height=8, width=11)

# alternatively ggsave will save the last plot made
ggsave("p20.pdf", device="pdf")

Wide vs long format

In some cases, ggplot may expect data to be in ‘long’ instead of ‘wide’ format. This simply means that instead of each non-id variable having it’s own column there should be a column/columns designating key/value pairs. The long format is generally required when grouping variables, for example stacked bar charts. We can change between wide and long formats with the dcast() and melt() functions from the reshape2 package.

Consider the following example. The Orange dataset that is preloaded in your R install is in wide format and we can create a scatterplot of the data with the code below.

ggplot(Orange, aes(x=age, y=circumference)) + geom_point()

However if we want to group variables (stratify by some factor), such as when variables share a mapping aesthetic (i.e. using color to group variables age and circumference) we must convert the data to long format.

Orange2 <- melt(data=Orange, id.vars=c("Tree"))
ggplot(Orange2, aes(x=value, fill=variable)) + geom_density()

ggplot2 Practice examples

Now that we’ve had an introduction to ggplot2 let’s try a few practice examples. In the section below we will provide instructions for loading and manipulating a dataset, a plot will then be provided and we ask that you attempt to recreate it. The boxes below will give the answers.

Often it is useful to compare tumor variant allele frequencies among samples to get a sense of the tumor purity and to determine the existense of sub-clonal populations among the tumor. Let’s use the ggplot2ExampleData.tsv dataset we’ve been using to explore this. Run the R code below to make sure you have the data loaded, then try re-creating the plots below. You’ll find hints and answers below each plot.

# load the dataset
variantData <- read.delim("http://genomedata.org/gen-viz-workshop/intro_to_ggplot2/ggplot2ExampleData.tsv")
variantData <- variantData[variantData$dataset == "discovery",]

Get a hint!

look at geom_violin(), change labels with xlab() and ylab()

What is the code to create the violin plot above?

ggplot() + geom_violin(data=variantData, aes(x=Simple_name, y=tumor_VAF)) + theme(axis.text.x=element_text(angle=90, hjust=1)) + xlab("Sample") + ylab("Variant Allele Fraction")

Looking good, but the plot looks dull, try adding some color to the violin plots and let’s see where the points for the underlying data actually reside.

Get a hint!

Try using geom_jitter() to offset points

What is the code to create the violin plot above?

ggplot(data=variantData, aes(x=Simple_name, y=tumor_VAF)) + geom_violin(aes(fill=Simple_name)) + geom_jitter(width=.1, alpha=.5) + theme(axis.text.x=element_text(angle=90, hjust=1), legend.position="none") + xlab("Sample") + ylab("Variant Allele Fraction")

Finally let’s add some more detail, specifically let’s annotate how many points actually make up each violin. The code below will construct the extra data you’ll need to make the final plot.

variantDataCount <- count(variantData, "Simple_name")
variantDataMax <- aggregate(data=variantData, tumor_VAF ~ Simple_name, max)
variantDataMerge <- merge(variantDataMax, variantDataCount)

Get a hint!

You will need to pass variantDataMerge to geom_text()

What is the code to create the violin plot above?

ggplot() + geom_violin(data=variantData, aes(x=Simple_name, y=tumor_VAF, fill=Simple_name)) + geom_jitter(data=variantData, aes(x=Simple_name, y=tumor_VAF), width=.1, alpha=.5) + geom_text(data=variantDataMerge, aes(x=Simple_name, y=tumor_VAF + 5, label=freq)) + theme(axis.text.x=element_text(angle=90, hjust=1), legend.position="none") + xlab("Sample") + ylab("Variant Allele Fraction")