Griffith Lab

Introduction to R Markdown


A useful feature within the R ecosystem is R Markdown. R Markdown (.Rmd) files let a user intersperse notes with code, providing a useful framework for sharing scientific reports in a transparent and easily reproducible way. They combine markdown syntax for styling notes with R code for running analyses and producing figures. Reports can be output in a variety of file formats including HTML documents, PDF documents, and even presentation slides.

Installing R Markdown

To start, let’s make a simple R Markdown file with RStudio. You will first need to install the rmarkdown package from CRAN.
```
# install R Markdown
install.packages("rmarkdown")
```

R Markdown basics

Once R Markdown has been installed we can select File -> New File -> R Markdown to create an initial R Markdown template.

RStudio will ask you to choose an output format and to add a title and author. For now we will just use the default HTML format; this can be changed at any time within the R Markdown template.
Go ahead and select OK when you have added your name and a title.

RStudio should now have made a template for us, so let’s go over a few introductory topics related to this template. At the top of the file you will see a YAML header delimited by --- lines. This is where the defaults for building the file are set.

You will notice that R Markdown has pre-populated some fields based on what we supplied when we initialized the template. You can output the R Markdown (.Rmd) document using the Knit button in the top left hand corner of RStudio. This is the same as calling the function render(), which takes the path to the R Markdown file as input. This file should end in a .Rmd extension to denote it as an R Markdown file, though RStudio will take care of this for you the first time you hit Knit.

RStudio also has a convenient way to insert code using the Insert button to the right. You might notice that R Markdown supports not only R but also bash, python and a few other languages. For chunks in these languages to work, the languages need to be installed before using Knit.

Go ahead and hit the Knit button just to see what an R Markdown output looks like with the default example text. If you are working with the default HTML option the result will load in a new RStudio window with the option to open it in your usual web browser.

Note: use include=FALSE in a chunk header to have the chunk evaluated while displaying neither the code nor its output.
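The pieces described so far (the YAML header, a chunk with include=FALSE, and render()) can be seen together in a small script. This is a minimal sketch, not the template RStudio generates: the file name, title, author, and chunk contents are placeholders chosen for illustration, and rendering is attempted only if the rmarkdown package and pandoc are available.

```r
# Build a minimal .Rmd file line by line (file name and contents are illustrative)
fence <- strrep("`", 3)  # three backticks, which open and close a code chunk

rmd_lines <- c(
  "---",
  'title: "Example report"',
  'author: "Your Name"',
  "output: html_document",
  "---",
  "",
  "Some *markdown* text.",
  "",
  paste0(fence, "{r setup, include=FALSE}"),
  "x <- 1:10  # evaluated, but neither the code nor its output is shown",
  fence,
  "",
  paste0(fence, "{r}"),
  "summary(x)",
  fence
)

rmd_path <- file.path(tempdir(), "example_report.Rmd")
writeLines(rmd_lines, rmd_path)

# Equivalent of pressing Knit; skipped when rmarkdown or pandoc is missing
if (requireNamespace("rmarkdown", quietly = TRUE) && rmarkdown::pandoc_available()) {
  out_file <- rmarkdown::render(rmd_path, quiet = TRUE)
  print(out_file)
}
```

Changing the output field in the header (for example to pdf_document) is how the output format is switched after the template has been created.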

Creating a report

Now that we’ve gone over the basics of R Markdown let’s create a real (but simple) report. First, you’ll need the Follicular Lymphoma data set we used in the previous ggplot2 section. Go ahead and download that dataset from http://www.genomedata.org/gen-viz-workshop/intro_to_ggplot2/ggplot2ExampleData.tsv if you don’t have it.

R Markdown documents combine text and code. The text portion of these documents uses a lightweight text markup language known as markdown. This allows text to be displayed in a stylistic way on the web without having to write HTML or other web code. We won’t go over all of markdown’s features, but it will be good to familiarize yourself with this style.

A cheatsheet for the markdown flavor that R Markdown uses can be found by going to help -> Cheatsheets -> R Markdown Cheatsheets.

As we have mentioned, you can insert a code chunk using the Insert button on the top right. For example, selecting Insert -> R gives a code chunk formatted for R. You can also add parameters to this code chunk to alter its default behavior. A full list of these parameters is available here.

Exercises

We have created a preliminary rmarkdown file you can download here.

Fill in this document to make it more complete, and then knit it together. The steps you should follow are outlined right in this R Markdown document. You can open this file in RStudio by going to File -> Open File. An R Markdown reference is available here.

Get a hint!

Look at the R Markdown reference guide mentioned above or the cheatsheets in Rstudio.

Answer

Here is a more complete .Rmd file.

Introduction to shiny


Interactive graphics is an emerging area within R. There are many libraries available to make interactive visualizations, however most of these libraries are still quite new. In this sub-module we will give a brief overview of shiny, a web application framework within R for building interactive web pages. Using shiny we will build a simple application to display our data using reactive data sets and ggplot.

Install shiny

The shiny package is available on CRAN and is fairly easy to install using install.packages(). Go ahead and install and load the package. The package comes with 11 example apps that can be viewed using the runExample() function. We will be building our own app from scratch, but feel free to try out a few of these examples to get a feel for what shiny can do. Shiny also provides a nice gallery of example applications and even a genomics example plotting cancer genomics data in a circos-style application.

# install and load shiny
install.packages("shiny")
library(shiny)

# list the built in shiny app examples
runExample()

# run one of these examples in Rstudio
runExample("06_tabsets")

What shiny is actually doing here is converting the R code to html pages and serving those on a random port using the ip address 127.0.0.1 which is localhost on most computers. In simplified terms these html pages are simply being hosted by your own computer. If you are in Rstudio your web application should have been opened automatically, however you can also view these with any modern web browser by going to the web address listed after calling runExample(). It should look something like this: http://127.0.0.1:4379.

After checking it out, use the escape key to stop the shiny app.

Structure of a shiny app

The basic code to run any shiny app is split into two parts: the server (e.g., server.R) and user interface (e.g., ui.R). The server script is the back end of our shiny web app and contains the instructions to build the app. The user interface script is the front end and is essentially what a user views and interacts with. Both of these files should be in the same directory for the app to work properly.

Go ahead and make a folder for our shiny app called “testApp”.

Next create the following two scripts there: ui.R and server.R. This is the bare minimum for a shiny app and will generate an empty web application.

Put the following in a file: ui.R

# load shiny library
library(shiny)

# set up front end
shinyUI(fluidPage(
))

Put the following in a file: server.R

# load shiny library
library(shiny)

# set up back end
shinyServer(function(input, output) {
})

To view/test your app simply type the runApp(port=7777) command in your R/Rstudio terminal. For convenience in this tutorial, we have selected a specific port instead of letting shiny choose one randomly.

Make sure that your current working directory in R is set to the top level of “testApp” where you put server.R and ui.R.

You can use getwd() and setwd() to print and set this respectively.

Example:

getwd()
setwd("/Users/mgriffit/Desktop/testApp")
getwd()
runApp(port=7777)

If successful, Rstudio will display a new window with your application running. Alternatively you can view your app in a web browser at http://127.0.0.1:7777. So far, all you should see is an empty page.

Loading data into the shiny back end (server)

Now that we’ve got a basic framework up let’s go ahead and load some data and answer a few questions. The data we will use is supplemental table 6 from the paper “Comprehensive genomic analysis reveals FLT3 activation and a therapeutic strategy for a patient with relapsed adult B-lymphoblastic leukemia”. The data contains variant allele frequency (VAF) values from a targeted capture sequencing study of an adult B-lymphoblastic leukemia (B-ALL) patient with 11 samples of various cell populations and timepoints.

You can download the table here. For simplicity, make a “data” directory in your app and place the data file there.

We can load this data into shiny as you would any other data in R. Just be sure to do this in the server.R script and place the code within the unnamed function. Add the following to your server.R script to make the data available within the shiny server.

server.R

# load shiny library
library(shiny)

# set up back end
shinyServer(function(input, output) {
    # load the data
    amlData <- read.delim("data/shinyExampleData.tsv")
})

Sending output to the shiny front end (UI)

Now that we have data let’s make a quick plot showing the distribution of VAF for the normal skin sample (Skin_d42_I_vaf) in comparison to the initial tumor marrow core sample (MC_d0_clot_A_vaf) and send it to the app’s user interface.

We’ll need to first create the plot on the back end (i.e. server.R). We can use any graphics library for this, but here we use ggplot2.

In order to be compatible with the shiny UI we call a Render function, in this case renderPlot(), which takes an expression (i.e. a set of instructions) and produces a plot. The curly braces in renderPlot() just contain the expression used to create the plot and are useful if the expression takes up more than one line. renderPlot() will do some minimal pre-processing of the object returned by the expression and store it in the list-like “output” object.

Notice that in the ui.R file we have added a mainPanel() which, as it sounds, instructs the app to create a main panel on the user interface. Now that we have somewhere to display our plot we can link what was created on the back end to the front end. This is done with the Output family of functions. In this case our output is a plot generated by renderPlot() in the server.R file and stored in the list-like output object as “scatterPlot”.

We use plotOutput() to provide this link to the front end and give it the output ID, which is just the name of the object stored in the list-like output object.

Note that when providing this link the type of object created with a Render function must correspond to the Output function. In this example we use renderPlot() and plotOutput(), but other functions exist for other data types, such as renderText() and textOutput().

ui.R

# load shiny library
library(shiny)

# set up front end
shinyUI(fluidPage(
    mainPanel(plotOutput("scatterPlot"))
))

server.R

# load shiny library
library(shiny)

# set up back end
shinyServer(function(input, output) {
    # load the data
    amlData <- read.delim("data/shinyExampleData.tsv")

    # construct a plot to show the data
    library(ggplot2)
    output$scatterPlot <- renderPlot({
        p1 <- ggplot(amlData, aes(x=Skin_d42_I_vaf, y=MC_d0_clot_A_vaf)) + geom_point()
        p1
    })
})

Once again, to view/test your app simply type the runApp(port=7777) command in your R/Rstudio terminal and go to http://127.0.0.1:7777.

This should happen automatically from Rstudio. If your previous app is still running you may need to stop and restart it and/or refresh your browser. You should now see a ggplot graphic in your browser (see below). But, so far, nothing is interactive about this plot. We will allow some basic user input and interactivity in the next section.
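The same Render/Output pairing applies to other kinds of output. As a minimal sketch, the example below pairs renderTable() with tableOutput() to show the first rows of a data frame. It deviates from the tutorial in two ways that are assumptions made for brevity: it uses R's built-in mtcars data rather than the VAF table, and it defines the front end and back end in a single script combined with shinyApp() instead of separate ui.R and server.R files. The output ID "previewTable" is a made-up name.

```r
library(shiny)

# front end: tableOutput() displays whatever is stored in output$previewTable
ui <- fluidPage(
  mainPanel(tableOutput("previewTable"))
)

# back end: renderTable() is the Render function that pairs with tableOutput()
server <- function(input, output) {
  output$previewTable <- renderTable({
    head(mtcars)
  })
}

# combine both halves into one app object (nothing is launched yet)
app <- shinyApp(ui = ui, server = server)

# runApp(app, port = 7777)  # uncomment to launch the app
```

The ID string passed to tableOutput() must match the name assigned on the output object exactly, including capitalization.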

Now, using what we’ve learned so far, try to add some text to our web app by passing it from the back end to the front end.

Get a hint!

Look at the help for textOutput() and renderText().

Solution

These files contain a correct answer: ui.R, server.R

When you’ve completed the above exercise try and answer a few of the questions below.

Why would you want to pass text from the back end to the front end as opposed to just rendering it in the front end?

By passing the text from the backend we have the ability to make the text reactive, i.e. it could change based on what the web app is displaying.

If you did not care whether the text was reactive, what could you do? Try adding some text by modifying only the ui.R file.

You could simply use any of the HTML builder functions present in the shiny package; one that would work is p().

Sending input from the front end

Now that we know how to link output from the back end to the front end, let’s do the opposite and link user input from the front end to the back end. Essentially this is giving the user control to manipulate user interface objects. Specifically let’s allow the user to choose which sample Variant Allele Fraction (VAF) columns in the data set to plot on the x and y axis of our scatter plot.

Let’s start with the ui.R file. Below, we have added the sidebarLayout() schema which will create a layout with a side bar and a main panel. Within this layout we define a sidebarPanel() and a mainPanel(). Within the sidebarPanel() we define two drop down selectors with selectInput().

Importantly, within these functions we assign an inputId, which is what will be passed to the back end. On the back end (server.R) we’ve already talked about the “output” argument of the unnamed function; a second argument exists called “input”. This is the argument used to communicate from the front end to the back end, and in our case it holds the information passed from each selectInput() call with the ids “x_axis” and “y_axis”.

To make our plot reactively change based on this input we simply call up this information within the ggplot call.

You might have noticed that we are using aes_string() instead of aes(). This is only necessary because “input$x_axis” and “input$y_axis” are passed as strings, so we need to let ggplot know this so that the non-standard evaluation typically used with aes() is not performed.

ui.R

#load shiny library
library(shiny)

# define the vaf column names
axis_options <- c("Skin_d42_I_vaf", "MC_d0_clot_A_vaf", "MC_d0_slide_A_vaf", "BM_d42_I_vaf",
                  "M_d1893_A_vaf", "M_d3068_A_vaf", "SB_d3072_A_rna_vaf", "SB_d3072_A_vaf",
                  "BM_d3072_A_vaf", "SL_d3072_I_vaf", "MC_d3107_A_vaf", "BM_d3137_I_vaf",
                  "M_d3219_I_vaf", "BM_d4024_I_vaf")

# set up front end
shinyUI(fluidPage(

  # set up the UI layout with a side and main panel
  sidebarLayout(

    # set the side panel to allow for user input
    sidebarPanel(
      selectInput(inputId="x_axis", label="x axis", choices=axis_options, selected="Skin_d42_I_vaf"),
      selectInput(inputId="y_axis", label="y axis", choices=axis_options, selected="MC_d0_clot_A_vaf")
    ),

    # set the plot panel
    mainPanel(
      plotOutput("scatterPlot")
    )
  )
))

server.R

# load shiny library
library(shiny)

# set up back end
shinyServer(function(input, output) {
  # load the data
  amlData <- read.delim("data/shinyExampleData.tsv")

  # construct a plot to show the data
  library(ggplot2)
  output$scatterPlot <- renderPlot({
    p1 <- ggplot(amlData, aes_string(x=input$x_axis, y=input$y_axis)) + geom_point()
    p1 <- p1 + xlab("Variant Allele Fraction") + ylab("Variant Allele Fraction")
    p1
  })
})
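Newer releases of ggplot2 mark aes_string() as deprecated. A commonly used alternative is to keep aes() and look the column up by name with the .data pronoun. The sketch below illustrates this outside of shiny, using the built-in mtcars data and two hypothetical variables (x_col, y_col) standing in for input$x_axis and input$y_axis.

```r
library(ggplot2)

# column names held as strings, as they would arrive from selectInput()
x_col <- "wt"
y_col <- "mpg"

# .data[[name]] looks up a column by its string name inside aes()
p <- ggplot(mtcars, aes(x = .data[[x_col]], y = .data[[y_col]])) +
  geom_point() +
  xlab(x_col) +
  ylab(y_col)

p
```

Inside the server function the same pattern would read aes(x = .data[[input$x_axis]], y = .data[[input$y_axis]]).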

Once again, to view/test your app simply type the runApp(port=7777) command in your R/RStudio terminal and go to http://127.0.0.1:7777. This should happen automatically from RStudio. If your previous app is still running you may need to stop and restart it and/or simply refresh your browser. You should now see a ggplot scatterplot graphic in your browser (see below) as before. Now, however, you should also see drop-down menus that allow you to select which data to plot and visualize. You have created your first interactive shiny application!

Exercises

We have given a very quick overview of shiny, and have really only scratched the surface of what shiny can be used for. Using what we have learned so far, however, let’s try modifying our existing shiny app.

Right now the plot looks fairly bland. Try adding the ability for the user to enter a column name as text to color points by. For example, try coloring by the column names “Class” or “Clonal.Assignment”. Use your existing ui.R and server.R files as a starting point. If successful, you should be able to restart/refresh your shiny app and see something like the following:

Get a hint!

You will want to use textInput() within the ui.R file for this and then link the input to the ggplot call.

Solution

These files contain the correct answer: ui.R, server.R

Hosting your shiny app on the web

To make your new shiny app accessible on the web you have several options. The simplest is to just sign up for an account at www.shinyapps.io. Once you sign up shinyapps.io will walk you through the process of installing (STEP 1) and authorizing (STEP 2) the rsconnect library (see below).

If set up correctly you will be able to deploy your app (STEP3) with:

library(rsconnect)
rsconnect::deployApp('path/to/your/app')

Alternatively, simply select the ‘Publish’ button in the top-right of a running Shiny App from Rstudio (see below).

Either process should create an app at https://[your_account].shinyapps.io/[yourApp]/ using the name for the account you created at shinyapps.io and the name you set for your App during the publication process. However, the free shinyapps.io account is limited to 5 applications and 25 active hours of runtime (any time your application is not idle). Upgrading to a pay account will increase the allowed numbers of applications, active hours, and add options for authentication.

For a longer-term, do-it-yourself, possibly cheaper solution, you will need a web server with the separate Shiny Server Open Source software running on it, along with your Shiny App. There are many ways you could set this up. One option would be to do something like the following: (1) Start an Ubuntu Linux Amazon AWS instance; (2) Log in to your AWS Linux box; (3) Install R, the shiny R library, and any other R libraries that your shiny app needs (e.g., ggplot2, rmarkdown, etc.); (4) Install and start the shiny-server; (5) Copy your shiny application files (R and Rda files) to the shiny-server folder on your Linux server; (6) In a browser, navigate to the public IP address of the Linux server. Detailed instructions are available on this blog post. Unfortunately, for authentication (password protection support) you will need to upgrade to the pay version - Shiny Server Pro.

Advanced ggplot2


This appendix is a continuation of the arrangingPlots section. Here we go over some advanced concepts for aligning plot elements and manipulating grob objects. Some of the objects we’ll be working with come from that section, so make sure you have run that code!

Aligning plots part 1

Our plot from the arrangingPlots section is looking pretty good, but you might notice an unfortunate issue: the boxplots don’t align with their respective barcharts. Don’t worry, it’s fairly easy to fix in this case. Before we start, however, we need to go down a rabbit hole and obtain a basic understanding of grobs, tableGrobs and viewports.

First off, grob is short for “grid graphical object”, from the low-level graphics package grid. Think of it as a set of instructions for creating a graphical object (i.e. a plot). All of ggplot2’s graphical elements are really composed of grobs because ggplot2 uses grid underneath. A TableGrob is a class from the gtable package that provides an easier way to view and manipulate groups of grobs; it is the intermediary between ggplot2 and grid. A “viewport” is a graphics region that describes where a grob or group of grobs is placed on a graphics device. When we have been calling grid.arrange() in our previous examples, what we were really doing is arranging viewports which contain groups of grobs.
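To make the distinction concrete, here is a minimal sketch using only the grid package, independent of the plots from the arrangingPlots section. The grob contents and the viewport position and size are arbitrary values chosen for illustration.

```r
library(grid)

# grobs are descriptions of things to draw; creating them draws nothing
box <- rectGrob(gp = gpar(col = "black", fill = "grey90"))
label <- textGrob("a grob inside a viewport")

# a viewport is a rectangular region of the graphics device
vp <- viewport(x = 0.5, y = 0.5, width = 0.6, height = 0.4)

grid.newpage()
pushViewport(vp)   # make the region the current drawing context
grid.draw(box)     # the rectangle fills the viewport, not the whole page
grid.draw(label)
popViewport()      # return to the top-level viewport
```

Changing only the viewport's x, y, width, or height moves and resizes both grobs together, which is the behavior grid.arrange() relies on when it lays out plots.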

To illustrate grobs and viewports a bit further let’s convert our arranged plot to a grob and take a look at it.

grob <- arrangeGrob(p1, p4, p5, p2, p3, layout_matrix=layout)
grob

You’ll notice a couple of things right away: the table grob is composed of 5 individual grobs, and they are arranged in a 3 row, 2 column layout. The z column denotes the order in which grobs are plotted. The cells column tells us where the grob is located within the viewport. For example, the first element has a value of (1-1,1-2). This tells us that the grob spans rows 1 to 1 (1-1) and columns 1 to 2 (1-2) of the viewport. This is a bit easier to illustrate by viewing the actual layout with gtable_show_layout().

gtable_show_layout(grob)

After running the command above you should see something like the figure below (note that I’ve taken the liberty of overlaying the output on top of the original plot). Notice how the first grob spans rows 1-1 and columns 1-2.

We can verify that this is correct by drawing just the first grob.

grid.draw(grob$grobs[[1]])
dev.off()
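The row and column spans can also be explored on a small table built by hand. The sketch below assumes the gtable package is installed and uses plain rectangles in place of the plots; the names "top" and "left" are made up for the example.

```r
library(grid)
library(gtable)

# an empty table with 3 rows and 2 columns
gt <- gtable(widths = unit(c(1, 1), "null"),
             heights = unit(c(1, 1, 1), "null"))

# a grob spanning row 1 and columns 1 to 2, i.e. cells (1-1,1-2)
gt <- gtable_add_grob(gt, rectGrob(gp = gpar(fill = "grey80")),
                      t = 1, b = 1, l = 1, r = 2, name = "top")

# a grob occupying only row 2, column 1
gt <- gtable_add_grob(gt, rectGrob(gp = gpar(fill = "grey50")),
                      t = 2, l = 1, name = "left")

# t, b, l, r give the top, bottom, left and right cells each grob spans
gt$layout
dim(gt)

grid.newpage()
grid.draw(gt)
```

The layout data frame holds the same information that is summarized in the cells column when a table grob is printed.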

Okay, our trip down the rabbit hole is coming to an end; I’ll just mention one last thing. As alluded to already, the tableGrob we looked at is just a collection of viewports, and those viewports contain grobs. In the grob we looked at we were at the top level, so by default the viewport takes up the entire page. Inside this top level we saw 5 grobs, each of which has its own viewport. In the above command we go a layer deeper and draw one grob, which itself has its own associated viewports for the elements of the plot (legend, axis, etc.).

We glossed over quite a bit of detail in our discussion of grobs, tableGrobs and viewports, but I think we know enough to get our plots to align. To start we need to convert all of the plots we made in ggplot to grobs; we can do this with the ggplotGrob() function. Next, each viewport in the grob at this level has an associated width: the axis title has a width, the axis text has a width, and so on. We can access these widths within the table grob using tableGrob$widths, which will output a vector of these widths. We can then use the unit.pmax() function to find the maximum width for each viewport among all of our plots. From there it’s a simple matter of manually modifying and reassigning the widths for each grob and plotting the results as before.

# convert to grobs (p1 is converted as well so it can be passed to grid.arrange below)
p1_grob <- ggplotGrob(p1)
p2_grob <- ggplotGrob(p2)
p3_grob <- ggplotGrob(p3)
p4_grob <- ggplotGrob(p4)
p5_grob <- ggplotGrob(p5)

# align plots
p4_grob_widths <- p4_grob$widths
p5_grob_widths <- p5_grob$widths
p2_grob_widths <- p2_grob$widths
p3_grob_widths <- p3_grob$widths

maxWidth <- unit.pmax(p4_grob_widths, p5_grob_widths, p2_grob_widths, p3_grob_widths)

p4_grob$widths <- maxWidth
p5_grob$widths <- maxWidth
p2_grob$widths <- maxWidth
p3_grob$widths <- maxWidth

layout <- rbind(c(1, 1),
                c(2, 3),
                c(4, 5))
grid.arrange(p1_grob, p4_grob, p5_grob, p2_grob, p3_grob, layout_matrix=layout)

At the end you should see something like the figure below.
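The key step is unit.pmax(), which takes the element-wise (parallel) maximum across several unit vectors. A small sketch with made-up widths, using only the grid package, shows the behavior; the centimetre values are arbitrary stand-ins for the widths of two plots.

```r
library(grid)

# hypothetical column widths for two plots, three columns each
widths_a <- unit(c(1, 3, 2), "cm")
widths_b <- unit(c(2, 1, 2), "cm")

# element-wise maximum: each column gets the larger of the two widths
max_widths <- unit.pmax(widths_a, widths_b)

# convert to plain numbers to inspect the result
convertWidth(max_widths, "cm", valueOnly = TRUE)
```

Assigning the same vector of maximum widths to every plot is what forces the corresponding columns, and therefore the plotting panels, to line up.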

Aligning plots part 2

If you poked around the grob a bit you might have noticed that this only works because each plot has an equal number of viewports/grobs, all of which have an associated width. What would you do in a situation where your plots don’t have the same number of viewports? For example, what if our boxplots didn’t have a legend? Fortunately there is a simple way around this. Let’s start by removing the legend from our boxplots and converting the resulting plots to grobs.

If you take a look at the barchart (p4) and boxplot (p2) table grobs you’ll notice that they are now different sizes, as expected. The barchart is 12 x 11 and the boxplot is 12 x 9. Further, we see the boxplot is missing the grob named “guide-box”, which corresponds to the legend. We don’t necessarily care that the grob is missing (in fact it’s what we want), but we do need to add columns to the tableGrob for the boxplot to match the barchart. Examining the grobs we can see that the “guide-box” of the barchart spans columns 9-9, so we should add a placeholder column before that, at position 8. We will actually need to add 2 placeholders, as the barchart has 11 columns and our boxplot has 9. This is because we need a placeholder not only for the legend but for the whitespace between the legend and the main plot as well.

Fortunately the gtable package has a function to add columns, gtable_add_cols(). It takes the gtable object to modify, the width of the column to be added, and the position at which to add the column as arguments. For our purposes we need to specify a width, but the actual value doesn’t matter; it just needs to be a valid width, as we will be reassigning it in a minute anyway.

# remove legend from the boxplots
p2 <- p2 + theme(legend.position="none")
p3 <- p3 + theme(legend.position="none")

# and then convert these to grob objects
p2_grob <- ggplotGrob(p2)
p3_grob <- ggplotGrob(p3)

# look at one of the boxplot/barchart grob sets
p2_grob
p4_grob

p2_grob <- gtable_add_cols(p2_grob, widths=unit(1, "null"), pos=8)
p2_grob <- gtable_add_cols(p2_grob, widths=unit(1, "null"), pos=8)

p3_grob <- gtable_add_cols(p3_grob, widths=unit(1, "null"), pos=8)
p3_grob <- gtable_add_cols(p3_grob, widths=unit(1, "null"), pos=8)
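The effect of gtable_add_cols() is easier to see on toy tables than on full plots. In this sketch, which assumes the gtable package is installed, two hypothetical tables (wide with 5 columns, narrow with 3) stand in for the barchart and the boxplot.

```r
library(grid)
library(gtable)

# toy tables standing in for a plot with a legend (wide) and one without (narrow)
wide <- gtable(widths = unit(rep(1, 5), "cm"), heights = unit(1, "cm"))
narrow <- gtable(widths = unit(rep(1, 3), "cm"), heights = unit(1, "cm"))

ncol(wide)    # 5
ncol(narrow)  # 3

# add two placeholder columns after column 2; the width is arbitrary
# because it will be overwritten when the widths are aligned
narrow <- gtable_add_cols(narrow, widths = unit(1, "null"), pos = 2)
narrow <- gtable_add_cols(narrow, widths = unit(1, "null"), pos = 2)

ncol(narrow)  # 5
```

Once both tables have the same number of columns, their widths vectors have the same length and can be compared element by element.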

From here we can use the same methodology as we employed before to align the plots. You should see something like the figure below

# get the grob width for the new boxplots
p2_grob_widths <- p2_grob$widths
p3_grob_widths <- p3_grob$widths

# find the max width of all elements
maxWidth <- unit.pmax(p4_grob_widths, p5_grob_widths, p2_grob_widths, p3_grob_widths)

# assign this max width to all elements
p4_grob$widths <- maxWidth
p5_grob$widths <- maxWidth
p2_grob$widths <- maxWidth
p3_grob$widths <- maxWidth

# create a layout and plot the result
layout <- rbind(c(1, 1),
                c(2, 3),
                c(4, 5))
finalGrob <- grid.arrange(p1_grob, p4_grob, p5_grob, p2_grob, p3_grob, layout_matrix=layout)
grid.draw(finalGrob)

gTable grob modification

We’re almost done with our final plot; there’s just one more thing we’re going to cover. It might have occurred to you that if we can view grobs we can manipulate them, and you would be right. Let’s suppose that we want to color the labels in our final plot in a specific way; in particular, in the topmost plot we want to highlight in red the genes for which we have boxplots. The good news is that we can do this. The trick is to know which grobs and viewports to dig into. As a side note, it is hugely beneficial to use RStudio when doing this sort of thing to take advantage of the autocompletion feature.

To start digging in we need to look at the various grobs and their viewports. We first go into finalGrob$grobs, which will print out all grobs at this level as a list. There are 5, one for each of the plots we used with grid.arrange, and the first one in the list corresponds to the top plot, which we can access with [[]] and draw with grid.draw() to verify. Digging further through lists of grobs we can finally get to the x axis with grid.draw(finalGrob$grobs[[1]]$grobs[[7]]$children$axis$grobs[[2]]). Going just a bit further we can see that the x-axis text has a color of “grey30”, and we simply give it a new vector of colors to change the color of each label. At the end you should see something like the plot below:

# figure out the base grob we want to dig into
grid.draw(finalGrob$grobs[[1]])
dev.off()

# access x-axis
grid.draw(finalGrob$grobs[[1]]$grobs[[7]]$children$axis$grobs[[2]])
dev.off()

# access x-axis color
# note: the child name (GRID.text.6880 here) is generated by grid and will differ between sessions;
# use names(finalGrob$grobs[[1]]$grobs[[7]]$children$axis$grobs[[2]]$children) to find yours
finalGrob$grobs[[1]]$grobs[[7]]$children$axis$grobs[[2]]$children$GRID.text.6880$gp$col

# change the color of the x-axis text
finalGrob$grobs[[1]]$grobs[[7]]$children$axis$grobs[[2]]$children$GRID.text.6880$gp$col <- c("blue", "blue", "blue", "blue", "blue", "red", "red", "blue", "red")

# plot the modified grob (re-running grid.arrange() here would rebuild finalGrob and discard the color change)
grid.newpage()
grid.draw(finalGrob)
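Because the auto-generated child names change from session to session, it helps to practice the lookup on a small object first. The sketch below builds a hypothetical gTree with a single named text grob (a stand-in for an axis, not the actual structure ggplot2 produces), lists the child names, and then recolors the labels.

```r
library(grid)

# a small gTree standing in for an axis: one text grob holding three labels
labels <- textGrob(c("A", "B", "C"),
                   x = unit(c(0.25, 0.5, 0.75), "npc"),
                   gp = gpar(col = "grey30"),
                   name = "labels")
axis_tree <- gTree(children = gList(labels), name = "axis")

# list the child names rather than guessing them
names(axis_tree$children)

# read the current color, then replace it with one color per label
axis_tree$children$labels$gp$col
axis_tree$children$labels$gp$col <- c("blue", "red", "blue")

grid.newpage()
grid.draw(axis_tree)
```

The same names() call can be applied at each level of a ggplot2-derived grob to discover which element to step into next.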

Most of the material in here, specifically the modification of gtable objects, is advanced and in most cases will probably be unnecessary. But hopefully, if you need to modify these types of objects, you’ll have a basic understanding of how to go about doing it. We’ve really only scratched the surface of gtable objects, as these are low-level functions. The thing to remember is that you can modify these objects with some patience and trial and error.

Exercise

Someone has decided they want a purple border around all the legends for our final plot (don’t ask me why). We could of course do this within ggplot but let’s imagine we’ve lost the code for creating the plot and only have the grob object to work with. Follow the instructions below and modify the grob to have this purple border.

Save our finalGrob as exercise1 so we don’t overwrite anything
Dig into the newly saved exercise1 and attempt to find where to change the legend border (hint: you’re looking for something called col)
Replace the value currently in col with purple
Use grid.draw() to plot the result

solution

The solution is in solution.2.R

Q & A, Discussion, Integrated Assignments, and Working with Your Own Data


In this section we provide some additional exercises covering a range of topics to reinforce concepts and topics throughout this course series. We encourage students to attempt to do these exercises on their own. We have provided hints and an answer for each exercise however these should be used only as a last resort, students should first try searching for solutions throughout this course and other available resources throughout the web.

Additional Exercises

In 1854 there was a cholera epidemic in the Soho district of London known as the Golden Square outbreak. Ultimately a particularly virulent strain of the disease caused the deaths of 616 individuals. At this time there were two competing theories as to the cause of the outbreak. The commonly held miasma theory postulated that foul air from decaying organic matter was the cause of the disease. A physician by the name of John Snow had published the competing germ theory years earlier, specifically postulating that cholera was caused by the presence of as yet unknown germ cells which contaminated water. The Golden Square outbreak allowed John Snow, with the help of Henry Whitehead, to map the deaths of the outbreak in relation to public water pumps around the area. Eventually this work led to the debunking of miasma theory. In this exercise, try to recreate the famous map originally created by John Snow to support his theory, an example of which is shown below. You’ll need to install the package cholera and use the data frames specified below.

topics covered: ggplot2, basic R
difficulty: 3/5

install.packages("cholera")
library(cholera)
data(roads)
data(pumps)
data(fatalities.address)
data(pump.case)

Hint!

You shouldn't need to alter the roads dataframe to plot it with ggplot, take a look at the group aesthetic!

Hint

you need to merge the fatalities.address and pump.case data frames but first you'll need to convert pump.case to a data frame, look at the stack() function!

Answer

Download an Rscript with the answer Here.

roads: Data frame providing the x/y coordinates for road start and end points grouped by street.
pumps: Data frame providing coordinates and names for water pumps.
fatalities.address: Data frame providing coordinates for each anchor case address for a case of cholera (i.e. address of the first cholera case at an address)
pump.case: list of vectors associating each anchor case with a water pump id.
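The stack() hint can be tried on toy data before applying it to the real objects. In the sketch below every name and value is hypothetical (toy_pump_case, toy_addresses, the case_id column, and the ids themselves); the actual column names in the cholera package data should be checked with str() or names().

```r
# hypothetical toy data: a named list mapping pump ids to case ids
toy_pump_case <- list(p1 = c(101, 102), p2 = c(103))

# stack() turns a named list of vectors into a two-column data frame:
# "values" holds the list contents and "ind" holds the list names
stacked <- stack(toy_pump_case)
names(stacked) <- c("case_id", "pump")

# hypothetical coordinates for each case
toy_addresses <- data.frame(case_id = c(101, 102, 103),
                            x = c(1.0, 2.0, 3.0),
                            y = c(5.0, 6.0, 7.0))

# merge on the shared id so each case carries its pump assignment
merged <- merge(toy_addresses, stacked, by = "case_id")
merged
```

With the pump assignment attached to each case, the pump column can be mapped to an aesthetic such as color when plotting.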

Lecture

Module 6 Lecture

CIViC


Precision medicine refers to the use of prevention and treatment strategies that are tailored to the unique features of each individual and their disease. In the context of cancer this might involve the identification of specific mutations shown to predict response to a targeted therapy. The biomedical literature describing these associations is large and growing rapidly. Currently these interpretations exist largely in private or encumbered databases resulting in extensive repetition of effort. Realizing precision medicine will require this information to be centralized, debated and interpreted for application in the clinic. CIViC is an open access, open source, community-driven web resource for Clinical Interpretation of Variants in Cancer. Its goal is to enable precision medicine by providing an educational forum for dissemination of knowledge and active discussion of the clinical significance of cancer genome alterations. For more details refer to the 2017 CIViC publication in Nature Genetics.

ClinVar

0005-03-01T00:00:00+00:00

ClinVar aggregates information about genomic variation and its relationship to human health. To date, thousands of variants and associated phenotypes have been deposited in ClinVar. The main use case for ClinVar is when you have one or more variants observed in a human sample and you would like to know if these variants have established clinical relevance. The resource is fairly heavily biased towards germline inherited variants that predispose an individual to disease. Many diseases and phenotypes have been reported. Some of the variant observations in ClinVar come from sources that mine published literature (e.g. DoCM, OMIM, etc.). However, the majority of submissions to ClinVar come from clinical sequencing labs that identify variants in patients and report each variant along with the phenotype of the patient (see Data Sources for more details). Several types of variants are supported in ClinVar. Many are single nucleotide variants (SNVs) or small insertions and deletions, but large structural variants are also supported. In general, all variants in ClinVar are designated using the HGVS nomenclature. ClinVar strongly recommends following the ACMG guidelines in deciding on the clinical significance (pathogenicity) of variants. The following tutorial explores these concepts in greater detail.

Basic intro to the ClinVar interface

ClinVar offers a few video tutorials to get you started: Intro to ClinVar and Find All Variants with ClinVar. The ClinVar website is quite simple. The home page offers various links to frequently used sections. The tool bar at the top of each page includes the simple search bar and various menus to navigate to additional information.

To get a sense of where the data in ClinVar is coming from, visit the complete submitter list (Statistics -> List of Submitters). This list gives you an idea of some of the major players in genetic diagnostics and variant interpretation.

To get a sense of the current content of ClinVar, visit the Statistics page (Statistics -> Statistics).

If you would like to download raw ClinVar data, you can do so from their FTP site. For example, you can download the complete set of ClinVar variants in VCF format (human genome build 37 VCF or human genome build 38 VCF).

ClinVar has several key entities used to organize variant data. (1) Each submission of an interpretation for a variant is assigned a submission accession of the format SCV000000000.0. Multiple labs could report independent interpretations for the same variant. Each would be entered as a distinct record. If a lab wishes to update its submission, it can do so, and the version number will be updated to reflect this change. (2) When there are multiple submissions about the same variation/condition relationship, they are aggregated under a reference accession of the format RCV000000000.0. If the same variant is associated with multiple conditions, there will be multiple reference accessions. (3) All information about a single distinct variant is organized under one Variation ID and a corresponding Allele ID. Both are simple integers. Each allele corresponds to a distinct genomic variation that must be described using one or more HGVS expressions.

Consider the following example: NM_007294.3(BRCA1):c.5558dupA (p.Tyr1853Terfs)

Variation ID: 55628
Allele ID: 70295
Submission accessions: SCV000300281.2, SCV000329130.4, SCV000488476.1, …
Reference accessions: RCV000074357 (Breast-ovarian cancer, familial 1), …, RCV000163976 (Hereditary cancer-predisposing syndrome)

A more detailed explanation of these identifiers can be found in the Identifiers section of the ClinVar documentation.
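
The accession formats described above are regular enough to check programmatically. The following is a minimal sketch in base R, assuming only the prefix, digits, and optional version layout described in the text; classify_accession() is a hypothetical helper written for this example, not a ClinVar or Bioconductor function.

```r
# Classify ClinVar accessions by prefix and split off the version number.
# classify_accession() is a hypothetical helper for illustration only.
classify_accession <- function(acc) {
  type <- rep(NA_character_, length(acc))
  type[grepl("^SCV[0-9]+(\\.[0-9]+)?$", acc)] <- "submission"
  type[grepl("^RCV[0-9]+(\\.[0-9]+)?$", acc)] <- "reference"

  # Split "SCV000300281.2" into the accession and its version.
  parts <- strsplit(acc, ".", fixed = TRUE)
  data.frame(
    accession = acc,
    type = type,
    id = vapply(parts, function(p) p[1], character(1)),
    version = vapply(parts, function(p) {
      if (length(p) > 1) as.integer(p[2]) else NA_integer_
    }, integer(1)),
    stringsAsFactors = FALSE
  )
}

res <- classify_accession(c("SCV000300281.2", "SCV000329130.4", "RCV000074357"))
print(res)
```

Accessions written without a version, such as the reference accessions in the example above, simply get a missing version.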

Searching ClinVar

ClinVar offers several search modes: (1) you can simply type free form text in the search box near the top of every page, (2) if you know the necessary field codes, you can construct complex queries in this same search box, (3) you can use the Advanced Search Builder.

Use the basic search box to find all variants for the gene AKT1. Note that when ClinVar detects a match to a proper gene name in your search, it assumes you want to search against only the gene name field. It indicates this by adding [gene] to your search text. If you wish to search all fields for this term, you can choose the option to “Search instead for all ClinVar records that mention AKT1”.

Now try searching for a specific disease: proteus syndrome. What is the popular significance of this disease? Note that this search appears to be returning partial matches to various conditions. To get a more refined search, try placing the condition name in quotes and adding the disease tag: "proteus syndrome"[dis]. Remember that we can find possible tags to use here on the help page. Note the variant NM_005163.2(AKT1):c.49G>A (p.Glu17Lys). This variant has significance in both the germline and somatic contexts. In the germline context, it appears pathogenic for Proteus Syndrome (actually in this case, this involves somatic mosaicism).

Search for variants within a narrow window on a single chromosome: “5[chr] AND 1264532:1264552[chrpos]”. What gene does this region correspond to? What is the significance of this gene?
Use the Advanced Search page to construct the same query. Note that you can use the show index list option to see valid search options/examples for each field.

Use the Advanced Search page to construct a complex query. For example: variants submitted by “Counsyl”, that are “duplication variants”, for the disease “Renal carnitine transport defect”, and with review status “criteria provided, single submitter”.

Obtaining a list of all high quality predisposing variants in ClinVar

If you just want the complete list of high quality variants that you might consider returning results for in a clinical setting, you might decide to perform a query that requires the variant to have been reviewed by an expert panel, belong to a practice guideline, or have been submitted by multiple entities that agree on the interpretation. This list could be further limited to those that are Pathogenic and where the variant origin is stated as “germline”. If you put all of this together the query looks like this:

((("reviewed by expert panel"[Review status]) OR "criteria provided, multiple submitters, no conflicts"[Review status]) OR "practice guideline"[Review status]) AND "clinsig pathogenic"[Properties] AND "germline"[Origin]
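
Queries like this are easier to maintain when they are assembled from their parts. The sketch below rebuilds an equivalent query string in base R from the field tags used above; tag() is a hypothetical helper defined here, and the result uses one flat OR group rather than the nested parentheses, which should be logically equivalent.

```r
# Review-status values copied from the query in the text.
review_status <- c("reviewed by expert panel",
                   "criteria provided, multiple submitters, no conflicts",
                   "practice guideline")

# Hypothetical helper (not part of any package): quote a value and
# append its ClinVar field tag, e.g. "germline"[Origin].
tag <- function(value, field) sprintf('"%s"[%s]', value, field)

# Any one of the review statuses is acceptable, so join them with OR.
status_clause <- paste0("(",
                        paste(tag(review_status, "Review status"), collapse = " OR "),
                        ")")

# All remaining conditions must hold, so join them with AND.
query <- paste(status_clause,
               tag("clinsig pathogenic", "Properties"),
               tag("germline", "Origin"),
               sep = " AND ")
cat(query, "\n")
```

The printed string can be pasted into the ClinVar search box. Changing the review_status vector is then enough to loosen or tighten the quality filter.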

Download this complete data set to a tabular (TSV) file and open it in a spreadsheet editor such as Excel.

A brief primer on HGVS

In completing the exercises above in ClinVar you will have noticed that variants have rather complex looking names. This is a deliberate choice on the part of ClinVar. It is common in the literature and in day to day conversations to refer to variants with shorthand names. For example, if someone refers to BRAF V600E you may have a good idea what they are referring to. However, such names can be ambiguous. There are often multiple distinct nucleotide changes in the genome that could lead to a V600E effect in the protein. Furthermore, if you want to design an assay to detect this variant in RNA, cDNA, or genomic DNA, then V600E is not very useful. For this reason, the community has adopted a set of standards for describing variants in a much more precise way that is unambiguous and leads to less confusion. The Human Genome Variation Society (HGVS) acts as a steward for these standards. They have also released detailed guidelines (the rulebook) on the correct way to create HGVS expressions for almost any variant. These guidelines are maintained, updated, and versioned. They are available at: varnomen.hgvs.org. ClinVar also provides a simple overview of the HGVS types they use. Here is an example of the same variant (AKT1 E17K) represented as HGVS expressions at the protein, cDNA, and genome level:

ClinVar variant name: NM_005163.2(AKT1):c.49G>A (p.Glu17Lys)
HGVS for protein: NP_005154.2:p.Glu17Lys
HGVS for cDNA: NM_005163.2:c.49G>A
HGVS for genome: NC_000014.9:g.104780214C>T (GRCh38)

Each HGVS expression has several required elements. First you have a sequence accession with a version number (e.g. NM_005163.2). This tells you unambiguously what the reference sequence is (a published protein, cDNA, or genome sequence). Then you have a delimiter “:”. Next you have a letter that indicates the general type of HGVS expression to follow: “p” for protein, “c” for cDNA, “g” for genomic DNA. Finally, you have the expression that describes the sequence variation itself. The rules for these expressions depend on the type of variation (substitution, duplication, deletion, etc.) and the level of HGVS (protein, cDNA, gDNA, RNA, …).
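
To make these elements concrete, here is a minimal base R sketch that splits a simple HGVS expression into accession, version, type, and change. parse_hgvs() is a hypothetical helper that only handles the plain accession.version:type.change layout; it is not a validator, and the ClinVar display names that embed a gene symbol, such as NM_005163.2(AKT1):c.49G>A, will deliberately not match. Use a dedicated tool such as Mutalyzer for real validation.

```r
# Split simple HGVS expressions into their required elements.
# parse_hgvs() is a hypothetical helper for illustration, not a validator.
parse_hgvs <- function(x) {
  # accession . version : type . description of the change
  pattern <- "^([A-Z]{2}_[0-9]+)\\.([0-9]+):([cgpnmr])\\.(.+)$"
  hits <- regmatches(x, regexec(pattern, x))
  do.call(rbind, lapply(hits, function(hit) {
    # Expressions that do not match yield a row of missing values.
    if (length(hit) == 0) hit <- rep(NA_character_, 5)
    data.frame(accession = hit[2],
               version = as.integer(hit[3]),
               type = hit[4],
               change = hit[5],
               stringsAsFactors = FALSE)
  }))
}

parsed <- parse_hgvs(c("NP_005154.2:p.Glu17Lys",
                       "NM_005163.2:c.49G>A",
                       "NC_000014.9:g.104780214C>T"))
print(parsed)
```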

As an exercise, try manually creating valid HGVS expressions at the genome, cDNA, and protein level for: BRAF V600E. What fundamental tools would we need to do this (assuming we can not just look it up in a database)?

Here are some excellent resources for working with HGVS:

HGVS guidelines

Mutalyzer (tools for validating and converting HGVS expressions)

TransVar (tool for converting HGVS expressions)

MyVariantInfo (aggregates variant info including HGVS expressions from many sources)

A brief primer on the ACMG guidelines for germline variant interpretation

You may have noticed in the exercises above various references to the clinical significance or pathogenicity of variants. In some queries of ClinVar we limited variants to only those that are thought to be Pathogenic (i.e. if inherited, they predispose an individual to a particular disease). How does one establish whether a variant is Pathogenic, or Benign? To help establish how this should be done, the ACMG has released detailed Standards and guidelines for the interpretation of sequence variants. For the full details, you should definitely read that paper and refer back to it extensively as you try to interpret specific variants. Briefly, the ACMG established a variant classification approach that is meant to be applicable to variants in all Mendelian genes. They categorize the types of evidence that one should use to support the pathogenicity of a variant. Each type of evidence is indicated by an evidence code. Evidence codes are broken down into two main categories, those that support a pathogenic interpretation, and those that support a benign interpretation. Within these two broad groups, sub-categories are defined for different types of evidence (e.g. functional data, population frequency, inheritance patterns, algorithmic predictions, etc.). The guidelines describe how to document this evidence and how to use that evidence to make a final conclusion for the pathogenicity of each variant for each disease/phenotype. The ACMG defines five tiers for the pathogenicity of a variant: “pathogenic,” “likely pathogenic,” “uncertain significance,” “likely benign,” and “benign”.

We strongly encourage you to read the ACMG guidelines carefully. The critical summaries are provided in several tables of this paper.

Evidence and codes for classifying pathogenic variants: Table 3
Evidence and codes for classifying benign variants: Table 4
Rules for combining evidence to classify sequence variants: Table 5

Final comment: If used properly, the ACMG guidelines require a lot of evidence to define a variant as pathogenic or even likely pathogenic. Similarly strong evidence is needed to conclude a variant is benign or even likely benign. Many variants are thus initially classified as “uncertain significance”. Note that this does not mean unknown; it means uncertain, i.e. lacking certainty. Since variant classifications are meant to be used in a clinical setting, the goal is to avoid misinforming patients who may take life changing actions upon hearing these results.

ClinVar practice exercises

How many ClinVar variants are there for BRCA2 that are Germline, Pathogenic and have Expert Panel review status?

Get a hint!

Simply enter BRCA2 in the ClinVar search box, then apply the three filters requested.

Answer

At the time of writing this post, there were 2069 germline, pathogenic, expert panel reviewed BRCA2 variants.

Use ClinVar to find valid HGVS expressions at the genome, cDNA, and protein level for the following variant: PIK3CA H1047R.

Get a hint!

Simply enter PIK3CA H1047R in the ClinVar search box, then review the HGVS section.

Answer

NG_012113.2:g.90775A>G (genome), NM_006218.3:c.3140A>G (cDNA), NP_006209.2:p.His1047Arg (protein).

Find all of the variants in ClinVar that correspond to exon 2 of VHL. How many are there? Export these to a spreadsheet.

Get a hint!

Use a genome browser such as IGV to obtain the build38 coordinates for exon 2 of VHL. Then use the advanced ClinVar search box to search for variants in VHL and within the exon 2 coordinates.

Answer

To get VHL exon 2 variants use this query: (VHL[Gene Name]) AND 10146514:10146636[Base Position]. At the time this question was created there were 517 variants in VHL and 63 of these were within exon 2.

Variant annotation with VEP

0005-02-01T00:00:00+00:00

Often it will be informative to annotate variants with additional information in order to get a sense of a variant’s impact on a phenotype. One tool that makes this process quick and straightforward is the Ensembl Variant Effect Predictor (VEP). This program is available both as stand alone software written in perl and as a web based GUI. In this module we will learn how to use VEP in both forms.

Installing perl

MAC and UNIX

In order to use stand alone VEP we will first need to download and install perl, a high level scripting language. First let’s check if you have perl ≥ 5.10 already installed, open a command prompt “terminal” on your local machine and run the code below.

# check perl version
perl -v

If you see a message to the effect of “This is perl 5” then you can ignore the next bit, otherwise you will need to download and install perl. To do this navigate to the perl downloads page at https://www.perl.org/get.html and select get started for your specific operating system.

Then select ActiveState Perl. Once the installer is downloaded, follow the on screen instructions and check your install with perl -v from a terminal window.

Windows

In order to use VEP on Windows we will first need to download and install a special flavor of perl called DWIMperl. Navigate to http://dwimperl.com/windows.html and download the “Dwimperl-5.14.2.1-v7-32.exe” executable at the bottom of the page. Then run the executable and follow the on screen instructions. Once finished, search for and open Command Prompt from the “Start Menu” and type perl -v to check that the installation was successful.

Installing VEP

With perl now installed we can go ahead and download and install VEP itself. The recommended way to do this is by cloning the github repo available here. We won’t actually be cloning the repo, though you certainly could. Instead select the green Clone or download button to the right and then select Download Zip. Once the file is downloaded go ahead and unzip it.

Once that is complete you will need to navigate to the directory where you unzipped the VEP repo and run the code below.

# change to directory where vep is
cd ./ensembl-vep-release-90

# initiate the VEP installation
perl INSTALL.pl --NO_HTSLIB --NO_TEST

This will start the VEP installation process. You will be asked if you want to “download local cache files”, “download fasta files”, and “install plugins”. Because of their size we will answer “no” to all of these; however, you can always change this by re-running perl INSTALL.pl.

Finally let’s check our VEP installation. From the same directory in which you ran perl INSTALL.pl, run ./vep --help. If everything went okay you should see some VEP version numbers appear.

Running VEP

As we have mentioned previously VEP can either be run via the command line or through a web GUI. For the remainder of this section we will be doing both side by side however we should note that using VEP from the command line is more flexible and there is a greater range of features.

To start let’s go ahead and annotate the germline variant rs1799966 using the default ensembl input format. Essentially we need to create a file with 5 columns corresponding to “chromosome”, “start” (1-based), “stop” (1-based), “reference/variant”, and “strand” (corresponding to reference/variant). You can optionally have a sixth column to add a unique identifier for this row. After creating the file variant.txt we run VEP with the following options.

NOTE: A common source of confusion is the strand specified as input for VEP. This is not the strand of any specific gene/transcript. Rather it represents the strand of the reference genome and should match the reference/variant alleles specified for each variant. By convention, variant callers typically report all variants relative to the positive strand. However there may be some cases where you are starting with already annotated variants and therefore have c. notation and/or have gene- or transcript-specific strand available. Be careful not to input genomic positions with variant alleles and strand mismatched (e.g. using the positive strand reference and variant allele bases from the reference genome together with a negative transcript strand).

-i input file, the file format is automatically detected
-o output file to write results
--database make queries to public ensembl databases instead of looking for local copies
--species species for which annotations should be obtained
--everything flag to output additional annotations

# make a file with a single variant in
echo "17 43071077 43071077 T/C + variant_1" > variant.txt

# run VEP
./vep -i variant.txt -o variant.anno.txt --database --species "human" --everything
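
When there are many variants it is more convenient to build the input file from a data frame than with echo. The sketch below writes the same five required columns plus the optional identifier from R. The column names are our own labels (VEP only cares about column order), and the file is written to a temporary directory here, so change out_file to wherever you run VEP.

```r
# One row per variant, columns in the order the default Ensembl input
# format expects: chromosome, start, end, allele, strand, identifier.
variants <- data.frame(
  chromosome = "17",
  start = 43071077L,
  end = 43071077L,
  allele = "T/C",
  strand = "+",
  id = "variant_1",
  stringsAsFactors = FALSE
)

# Alleles are given relative to the reference genome strand (see the
# NOTE above), so only "+" or "-" make sense in this column.
stopifnot(all(variants$strand %in% c("+", "-")))

# Write a space-separated file without header, quotes, or row names.
out_file <- file.path(tempdir(), "variant.txt")
write.table(variants, file = out_file, sep = " ",
            quote = FALSE, row.names = FALSE, col.names = FALSE)
readLines(out_file)
```

The resulting file contains the same single line as the echo command above and can be passed to VEP with -i.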

We can do the same thing through the web interface, navigate to the vep homepage at http://www.ensembl.org/info/docs/tools/vep/index.html, and click on Launch VEP.

Then input your variant and click on Run.

This will submit the job to ensembl servers, the page will refresh every few seconds. When the job completes click on view results.

Doing so will take you to a web page where you will be able to view summary statistics, the results, and options to filter or export a file.

You have probably noticed from your exploration of the results that even though only one variant was supplied, multiple rows were output. By default VEP returns annotations for each “transcript” and each “variant consequence”. While this is informative, it is often desirable to only have one annotation per gene. We can achieve this on the command line by adding the --per_gene parameter. The same thing can be achieved through the web interface by expanding the filtering options tab and setting Restrict results to Show one selected consequence per gene. Go ahead and do that now through either the command line or web GUI.

There are many features in both forms of VEP, too many to cover in their entirety in this course. However, extensive documentation is available here for the web based version and here for the stand alone perl script.
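
To see why one variant produces several rows, it can help to load the output into R and count rows per gene. The sketch below assumes the default tab-delimited VEP output, in which metadata lines start with "##" and the column header line starts with a single "#". The lines shown are an invented excerpt with fewer columns than real output and placeholder gene and transcript ids.

```r
# Invented excerpt in the shape of default VEP output; real files have
# more columns and real Ensembl gene/transcript identifiers.
vep_lines <- c(
  "## ENSEMBL VARIANT EFFECT PREDICTOR (toy excerpt, values invented)",
  "#Uploaded_variation\tLocation\tAllele\tGene\tFeature\tConsequence",
  "variant_1\t17:43071077\tC\tGENE_A\tTRANSCRIPT_1\tmissense_variant",
  "variant_1\t17:43071077\tC\tGENE_A\tTRANSCRIPT_2\tmissense_variant",
  "variant_1\t17:43071077\tC\tGENE_B\tTRANSCRIPT_3\tdownstream_gene_variant"
)

# Drop the "##" metadata lines, then strip the leading "#" from the header.
body <- vep_lines[!grepl("^##", vep_lines)]
body[1] <- sub("^#", "", body[1])
anno <- read.delim(text = body, stringsAsFactors = FALSE)

# Each gene appears once per annotated transcript/consequence.
rows_per_gene <- table(anno$Gene)
print(rows_per_gene)

# Keeping the first row per gene is only a crude stand-in for --per_gene,
# which lets VEP choose the reported consequence itself.
first_per_gene <- anno[!duplicated(anno$Gene), ]
print(first_per_gene)
```

For a real file, replace vep_lines with readLines("variant.anno.txt").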

Exercises

Now that you have an initial VEP result, try to answer a few questions about your data. If you need help in understanding a certain column, hover over it with your cursor or look at the VEP documentation available here.

What is the maximum allele frequency observed for this variant in the 1000 Genomes European population?

Answer

0.3598

In what gene is this variant located?

Answer

BRCA1

What is the amino acid change for this variant?

Answer

S/G

Variant annotation and interpretation

0005-01-01T00:00:00+00:00

When variants are identified in the genome (or transcriptome) some kind of annotation and need for interpretation invariably follows. There are many, many tools for annotation and interpretation in different contexts and for different purposes. In this section we explore just a few of these many options. First we will learn to use Ensembl’s Variant Effect Predictor (VEP), a popular and widely used variant transcript annotator. VEP has many functions, but it is primarily used to annotate variants in the context of a set of known transcripts. The other resources we will use, ClinVar and CIViC, attempt to summarize evidence for the clinical relevance of variants in inherited human diseases and cancer respectively.

Here are some examples of variant annotation and interpretation contexts:

Population frequency/recurrence (is the variant common or rare?)
Transcripts (does the variant occur within a transcribed region of a gene? Does it affect the predicted translation of that transcript?)
Function (is the variant likely to disrupt the normal function of a gene?). There are many, many approaches to this.
- Conservation of the affected region
- Predicted biochemical significance of amino acid alterations
- Occurrence in known functional domains (e.g. the binding pocket of a kinase)
- Hot spots of variation (some patterns can suggest gain-of-function)
  - 2D hotspots
  - 3D hotspots
- Patterns that suggest loss of function
- Actual experimental evidence for the specific variant or one very similar
What other approaches can you think of?

Module 5 Lecture

Pathway visualization

0004-04-01T00:00:00+00:00

A common task after pathway analysis is constructing visualizations to represent experimental data for pathways of interest. There are many tools for this. We will focus on the bioconductor pathview package for this task.

Pathview

Pathview is used to integrate and display data on KEGG pathway maps that it retrieves through API queries to the KEGG database. Please refer to the pathview vignette and KEGG website for license information as there may be restrictions on commercial use of these API queries. Pathview itself is open source and is able to map a wide variety of biological data relevant to pathway views. In this section we will be mapping the overall expression results for a few pathways from the pathway analysis section of this course. Let’s start by installing pathview from bioconductor and loading the data we created in the previous section.

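# Install pathview from bioconductor (BiocManager replaces the retired biocLite installer)
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("pathview")
library(pathview)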

load(url("http://genomedata.org/gen-viz-workshop/pathway_visualization/pathview_Data.RData"))

Visualizing KEGG pathways

Now that we have our initial data loaded let’s choose a few pathways to visualize. The “Mismatch repair” pathway is significantly perturbed by up regulated genes, and corresponds to the following KEGG id: “hsa03430”. We can view this using the row names of the pathway dataset fc.kegg.sigmet.p.up. Let’s use our experiment’s expression in the named vector tumor_v_normal_DE.fc and view it in the context of this pathway. Two graphs will be written to your current working directory by the pathview() function: one will be the original KEGG pathway view and the second will have expression values overlaid (see below). You can find your current working directory with the function getwd().

# View the hsa03430 pathway from the pathway analysis
fc.kegg.sigmet.p.up[grepl("hsa03430", rownames(fc.kegg.sigmet.p.up), fixed=TRUE),]

# Overlay the expression data onto this pathway
pathview(gene.data=tumor_v_normal_DE.fc, species="hsa", pathway.id="hsa03430")

It is often nice to see the relationship between genes in the KEGG pathview diagrams. This can be achieved by setting the parameter kegg.native=FALSE. Below we show an example for the Fanconi anemia pathway.

# View the hsa03460 pathway from the pathway analysis
fc.kegg.sigmet.p.up[grepl("hsa03460", rownames(fc.kegg.sigmet.p.up), fixed=TRUE),]

# Overlay the expression data onto this pathway
pathview(gene.data=tumor_v_normal_DE.fc, species="hsa", pathway.id="hsa03460", kegg.native=FALSE)

Pathway analysis

0004-03-01T00:00:00+00:00

In the previous section we examined differential expression of genes from the E-GEOD-50760 data set. In this section we will use the gage package to determine if there are any coordinated differential expression patterns in that same data set.

What is gage?

Generally applicable gene-set enrichment (gage) is a popular bioconductor package for performing gene-set and pathway analysis. The package works independently of sample size, experimental design, and assay platform, and is applicable to both microarray and RNA-seq data sets. In this section we will use gage and gene sets from the “Kyoto Encyclopedia of Genes and Genomes” (KEGG) and “Gene Ontology” (GO) databases to perform pathway analysis. Let’s go ahead and install gage and load the differential expression results from the previous section.

# install gage
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c("gage","GO.db","AnnotationDbi","org.Hs.eg.db"), version = "3.8")
library(gage)

# load the differential expression results from the previous section
load(url("http://genomedata.org/gen-viz-workshop/intro_to_deseq2/tutorial/deseq2Data_v1.RData"))

# extract the results from the deseq2 data
library(DESeq2)
tumor_v_normal_DE <- results(deseq2Data, contrast=c("tissueType", "primary colorectal cancer", "normal-looking surrounding colonic epithelium"))

Setting up gene set databases

In order to perform our pathway analysis we need a list of pathways and their respective genes. The most common databases for this type of data are KEGG and GO. The gage package has two functions for querying this information in real time, kegg.gsets() and go.gsets(), both of which take a species as an argument and return a list of gene sets along with some helpful meta information for subsetting these lists. For the KEGG database, the object kg.hsa$kg.sets stores all gene sets for the queried species; kg.hsa$sigmet.idx stores the indices of the gene sets classified as signaling and metabolism, and kg.hsa$dise.idx stores the indices of those classified as disease. We use this information to extract separate lists of gene sets for the signaling and metabolism subset and for the disease subset. A similar process is used for the GO gene sets, splitting the master gene set into the three gene ontologies: “Biological Process”, “Molecular Function”, and “Cellular Component”.

# set up kegg database
kg.hsa <- kegg.gsets(species="hsa")
kegg.sigmet.gs <- kg.hsa$kg.sets[kg.hsa$sigmet.idx]
kegg.dise.gs <- kg.hsa$kg.sets[kg.hsa$dise.idx]

# set up go database
go.hs <- go.gsets(species="human")
go.bp.gs <- go.hs$go.sets[go.hs$go.subs$BP]
go.mf.gs <- go.hs$go.sets[go.hs$go.subs$MF]
go.cc.gs <- go.hs$go.sets[go.hs$go.subs$CC]

Annotating genes

We have our gene sets now; however, if you look at one of the objects containing the gene sets you’ll notice that each gene set contains a series of integers. These integers are actually entrez gene identifiers, which presents a problem as our DESeq2 results use Ensembl IDs as gene identifiers. We will need to convert our gene identifiers to the same format before we perform the pathway analysis. Fortunately bioconductor maintains genome wide annotation data for many species; you can view these species with the OrgDb bioc view. This makes converting the gene identifiers relatively straightforward. Below we use the mapIds() function to query the OrgDb object for the gene symbol, entrez id, and gene name based on the Ensembl id. Because there might not be a one to one relationship between these identifiers, we also use multiVals="first" to specify that only the first identifier should be returned in such cases.

# load in libraries to annotate data
library(AnnotationDbi)
library(org.Hs.eg.db)

# annotate the deseq2 results with additional gene identifiers
tumor_v_normal_DE$symbol <- mapIds(org.Hs.eg.db, keys=row.names(tumor_v_normal_DE), column="SYMBOL", keytype="ENSEMBL", multiVals="first")
tumor_v_normal_DE$entrez <- mapIds(org.Hs.eg.db, keys=row.names(tumor_v_normal_DE), column="ENTREZID", keytype="ENSEMBL", multiVals="first")
tumor_v_normal_DE$name <- mapIds(org.Hs.eg.db, keys=row.names(tumor_v_normal_DE), column="GENENAME", keytype="ENSEMBL", multiVals="first")
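
The effect of keeping only the first identifier can be illustrated with a toy lookup table in base R. This mimics the outcome of multiVals="first" rather than showing how mapIds() is implemented, and every identifier below is made up.

```r
# Toy mapping table: ENSG_B maps to two Entrez ids (one-to-many).
# All identifiers here are made up for illustration.
id_map <- data.frame(
  ensembl = c("ENSG_A", "ENSG_B", "ENSG_B", "ENSG_C"),
  entrez  = c("101", "202", "203", "304"),
  stringsAsFactors = FALSE
)
keys <- c("ENSG_B", "ENSG_C", "ENSG_D")

# match() returns the position of the first hit for each key, which
# mimics keeping only the first identifier; unmatched keys give NA.
first_entrez <- id_map$entrez[match(keys, id_map$ensembl)]
names(first_entrez) <- keys
print(first_entrez)
```

Note that the second Entrez id for ENSG_B is silently dropped and that ENSG_D, which has no mapping, comes back as NA. Both cases occur in real annotation data and are worth counting before moving on.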

Preparing DESeq2 results for gage

Before we perform the actual pathway analysis we need to format our differential expression results into a format suitable for the gage package. Basically this means obtaining the log2 fold changes and assigning entrez gene identifiers to these values.

# grab the log fold changes for everything
tumor_v_normal_DE.fc <- tumor_v_normal_DE$log2FoldChange
names(tumor_v_normal_DE.fc) <- tumor_v_normal_DE$entrez
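
Because some Ensembl ids have no Entrez mapping and some map to the same Entrez id, the named vector can contain NA or duplicated names. The tutorial does not clean these up; the sketch below shows one optional way to do so on toy values, keeping the entry with the largest absolute fold change per id, which is our own arbitrary choice.

```r
# Toy fold-change vector named by Entrez id; values and ids are invented.
fc <- c(1.8, -0.4, 2.1, 0.3, -1.2)
names(fc) <- c("101", NA, "202", "202", "304")

# Drop entries without an Entrez id.
fc <- fc[!is.na(names(fc))]

# Where one id appears more than once, keep the entry with the largest
# absolute fold change (one possible choice, not the only one).
fc <- fc[order(abs(fc), decreasing = TRUE)]
fc <- fc[!duplicated(names(fc))]
print(fc)
```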

Running pathway analysis

We can now use the gage() function to obtain the significantly perturbed pathways from our differential expression experiment. By default the gage package performs this analysis while taking into account up and down regulation. Setting same.dir=FALSE will capture pathways perturbed without taking into account direction. This is generally not recommended for the GO groups as most genes within these gene sets are regulated in the same direction, however the same is not true for KEGG pathways and using this parameter may produce informative results in such cases.

Note on the abbreviations below: “bp” refers to biological process, “mf” refers to molecular function, and “cc” refers to cellular component. These are the three main categories of gene ontology terms/annotations.

# Run enrichment analysis on all log fc
fc.kegg.sigmet.p <- gage(tumor_v_normal_DE.fc, gsets = kegg.sigmet.gs)
fc.kegg.dise.p <- gage(tumor_v_normal_DE.fc, gsets = kegg.dise.gs)
fc.go.bp.p <- gage(tumor_v_normal_DE.fc, gsets = go.bp.gs)
fc.go.mf.p <- gage(tumor_v_normal_DE.fc, gsets = go.mf.gs)
fc.go.cc.p <- gage(tumor_v_normal_DE.fc, gsets = go.cc.gs)

# convert the kegg results to data frames
fc.kegg.sigmet.p.up <- as.data.frame(fc.kegg.sigmet.p$greater)
fc.kegg.dise.p.up <- as.data.frame(fc.kegg.dise.p$greater)

fc.kegg.sigmet.p.down <- as.data.frame(fc.kegg.sigmet.p$less)
fc.kegg.dise.p.down <- as.data.frame(fc.kegg.dise.p$less)

# convert the go results to data frames
fc.go.bp.p.up <- as.data.frame(fc.go.bp.p$greater)
fc.go.mf.p.up <- as.data.frame(fc.go.mf.p$greater)
fc.go.cc.p.up <- as.data.frame(fc.go.cc.p$greater)

fc.go.bp.p.down <- as.data.frame(fc.go.bp.p$less)
fc.go.mf.p.down <- as.data.frame(fc.go.mf.p$less)
fc.go.cc.p.down <- as.data.frame(fc.go.cc.p$less)

Which genes are in > 30% of significant pathways in the upregulated GO biological process results (q <= .05)?

Two genes are: ATM and CCNB1. Here is an Rscript to get the correct answer.
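
The general approach to this question can be sketched on toy data without the real results. The gene sets, gene names, and q-values below are invented. With the real objects you would take the significant pathway names from the q-value column of fc.go.bp.p.up and look up their members in go.bp.gs, where genes are Entrez ids rather than names.

```r
# Toy gene sets and q-values; all names and numbers are invented.
gene_sets <- list(
  pathway_1 = c("geneA", "geneC", "geneF"),
  pathway_2 = c("geneA", "geneB"),
  pathway_3 = c("geneC", "geneA", "geneD"),
  pathway_4 = c("geneE", "geneC"),
  pathway_5 = c("geneG", "geneD")
)
q_values <- c(pathway_1 = 0.01, pathway_2 = 0.03, pathway_3 = 0.04,
              pathway_4 = 0.05, pathway_5 = 0.20)

# Keep only the pathways passing the q-value cutoff.
sig <- names(q_values)[q_values <= 0.05]

# Count in how many significant pathways each gene occurs
# (unique() guards against a gene listed twice in one set).
gene_counts <- table(unlist(lapply(gene_sets[sig], unique)))

# Report genes present in more than 30% of the significant pathways.
hits <- names(gene_counts)[as.vector(gene_counts) / length(sig) > 0.3]
print(hits)
```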