Introduction to waterfall plots+
After sequencing a set of samples a common question to ask is what mutations are present in the sample/patient cohort. This type of data is often displayed in a heatmap like structure with rows and columns denotating genes and samples with these variables ordered in a hierarchical pattern based on recurrence. Such plots make viewing mutually exclusive and co-occurring mutational events a trivial task. The waterfall() function from the GenVisR package makes it easy to create these types of charts while also allowing additional data in the context of sample and gene data to be added to the plot. In this tutorial we will use the waterfall() function to re-create panel C of figure 2 from the paper “A Phase I Trial of BKM120 (Buparlisib) in Combination with Fulvestrant in Postmenopausal Women with Estrogen Receptor-Positive Metastatic Breast Cancer.”.
Installing GenVisR and loading Data
To begin let’s install and load the GenVisR library from bioconductor. We will also need to load the mutation data from Supplemental Table 3, a version that has been converted to a tab delimited format is available for download here. We also need some supplemental information regarding the clinical data which is available here and mutation burden data available here.
# Install and load GenVisR source("https://bioconductor.org/biocLite.R") biocLite("GenVisR") library(GenVisR) # Set your working directory - change this to where you downloaded your files setwd("~/Downloads") # Load relevant data from the manuscript mutationData <- read.delim("BKM120_Mutation_Data.tsv") clinicalData <- read.delim("BKM120_Clinical.tsv") mutationBurden <- read.delim("BKM120_MutationBurden.tsv")
Creating the initial plot
The waterfall() function is designed to work with specific file types read in as data frames, the default being MAF files, however the option exists to use custom file types as long as the column names “sample”, “gene”, and “variant_class” are present. This is done by setting
fileType="Custom". Let’s go ahead and re-format our mutation data and create an initial plot using this parameter. Note that when using a custom file type the priority of mutation types must be specified with the
variant_class_order parameter which accepts a character vector. We’ll explain what this does in the next section, for now just give it a character vector containing all mutation types present in
# Reformat the mutation data for waterfall() mutationData <- mutationData[,c("patient", "gene.name", "trv.type", "amino.acid.change")] colnames(mutationData) <- c("sample", "gene", "variant_class", "amino.acid.change") # Create a vector to save mutation priority order for plotting mutation_priority <- as.character(unique(mutationData$variant_class)) # Create an initial plot waterfall(mutationData, fileType = "Custom", variant_class_order=mutation_priority)
Setting the mutation hierarchy
As you can see from the previous plot, waterfall() plots mutations in a method similar to a heatmap, however the astute user may be asking what happens in cases where there are two different mutations types for the same cell in the heatmap. The answer is waterfall() picks one based on a mutation hierarchy. For a known file format such as MAF this hierarchy is pre set such that the more deleterious mutations are higher on the hierarchy. However, for a “custom” file type there is no apriori knowledge of what this order should be and thus must be set by the user. As mentioned, this is done with the
variant_class_order parameter which accepts a character vector of all mutation types within the mutation data input in order of most to least deleterious. The effect this parameter has is best exemplified in the example below.
# Make sure seed is set to 426 to reproduce! set.seed(426) # Install and load the gridExtra package #install.packages("gridExtra") library(gridExtra) # Create a data frame of random elements to plot inputData <- data.frame(sample = sample(letters[1:5], 20, replace = TRUE), gene = sample(letters[1:5], 20, replace = TRUE), variant_class = sample(c("x", "y", "z"), 20, replace = TRUE)) # Choose the most deleterious to plot with y being defined as the most # deleterious most_deleterious <- c("y", "z", "x") # Plot the data with waterfall using the 'Custom' parameter p1 <- waterfall(inputData, fileType = "Custom", variant_class_order = most_deleterious, mainXlabel = TRUE, out="grob") # Change the most deleterious order p2 <- waterfall(inputData, fileType = "Custom", variant_class_order = rev(most_deleterious), mainXlabel = TRUE, out="grob") # Arrange the two plots side by side grid.arrange(p1, p2, ncol=2) # Summarize the mutation types for a given sample/gene inputData[inputData$sample=="e" & inputData$gene=="a",]
Notice that in the figure above, the top left cell (sample: e, gene: a) has mutations of two types (y and z). Between the two plots we reversed the hierarchy of the mutations specified in
variant_class_order causing mutation of type “z” to have a higher precedence in the right most plot. As we would expect waterfall() then displays the color for mutation type “z” instead of “y” in this cell. Let’s go ahead and set a
variant_class_order that makes sense for the breast cancer plot we’re working on. Remember this must be a character vector and contain all mutation types in the data frame
# Define a mutation hierarchy mutationHierarchy <- c("nonsense", "frame_shift_del", "frame_shift_ins", "in_frame_del", "splice_site_del", "splice_site", "missense", "splice_region", "rna") # Create waterfall plot waterfall(mutationData, fileType = "Custom", variant_class_order=mutationHierarchy)
Changing the color of tiles
Often it is desirable to change the colors of the plotted cells, either for purely aesthetic reasons, or to group similar mutation types. The waterfall() parameter which allows this is
mainPalette which expects a character vector mapping mutation types to acceptable R colors. Let’s go ahead and match the colors in our plot to the one in the paper.
# define colours for all mutations mutationColours <- c("nonsense"='#4f00A8', "frame_shift_del"='#A80100', "frame_shift_ins"='#CF5A59', "in_frame_del"='#ff9b34', "splice_site_del"='#750054', "splice_site"='#A80079', "missense"='#009933', "splice_region"='#ca66ae', "rna"='#888811') # create waterfall plot waterfall(mutationData, fileType = "Custom", variant_class_order=mutationHierarchy, mainPalette=mutationColours)
Adding a custom mutation burden
You might notice that while the mutation burden between the manuscript plot and our plot are very similar they are not exactly the same. The waterfall() function aproximates the mutation burden via the formula
(# of mutations)/(coverage space) * 1,000,000 where the coverage space is controlled by the
coverageSpace parameter and takes an integer giving the number of base pairs adequately covered in the experiment. This is only an aproximation as the coverage per sample can fluctuate, in situations such as this the user has the option of providing user defined mutation burden calculation for each sample for which to plot. This is done with the
mutBurden parameter which takes a data frame with two columns, “sample” (which should match the samples in
mutationData) and “mut_burden” (giving the actual value to plot). We’ve downloaded what the actual values are and stored them in the
mutationBurden data frame we created above. You’ll notice the samples between
mutationData don’t quite match. Specifically it looks like an identifier “WU0” or “W00” has been added to each sample in the
mutationBurden data frame. Let’s fix this using a regular expression with gsub and use these mutation burden values for our plot.
# Find which samples are not in the mutationBurden data frame # First, let's look at the sample names in the mutationData and mutationBurden mutationData$sample mutationBurden$sample # Now, determine the non-overlap between these values sampleVec <- unique(mutationData$sample) sampleVec[!sampleVec %in% mutationBurden$sample] # Fix mutationBurden to match mutationData mutationBurden$sample <- gsub("^WU(0)+", "", mutationBurden$sample) # Check for non-overlap again sampleVec[!sampleVec %in% mutationBurden$sample] # Create the waterfall plot waterfall(mutationData, fileType = "Custom", variant_class_order=mutationHierarchy, mainPalette=mutationColours, mutBurden=mutationBurden)
At this stage it is appropriate to talk about the subsetting functions within waterfall(). We could subset
mutationData to include, for example, just some specific genes that we wish to plot. However, doing so would reduce the accuracy of the mutation burden top sub-plot if not using custom values. The waterfall() function contains a number of parameters for limiting the genes and mutations plotted without affecting the mutation burden calculation. These parameters are
How would you create a waterfall plot showing only genes which are mutated in 15% of samples?
set the mainRecurCutoff parameter to .15
Adding clinical data
As stated previously waterfall() can display additional data in the bottom sub-plot. In order to do this a data frame in long format with column names “sample”, “variable”, “value” must be given to the parameter
clinData. As with the
mutBurden parameter, the samples in both data frames must match. Let’s go ahead and reproduce the clinical sub-plot from the manuscript figure. We will also use the
clinVarOrder parameters to specify the number of columns, the colours, and the order of variables for the legend respectively. We also apply the
section_heights parameter which takes a numeric vector providing the ratio of each verticle plot. In this situation we have a total of three verticle plots, the mutation burden, main, and clinical plot so we need a numeric vector of length 3.
# reformat clinical data to long format library(reshape2) clinicalData_2 <- clinicalData[,c(1,2,3,5)] colnames(clinicalData_2) <- c("sample", "Months on Study", "Best Response", "Treatment Setting") clinicalData_2 <- melt(data=clinicalData_2, id.vars=c("sample")) # find which samples are not in the mutationBurden data frame sampleVec <- unique(mutationData$sample) sampleVec[!sampleVec %in% clinicalData$sample] # fix mutationBurden to match mutationData clinicalData_2$sample <- gsub("^WU(0)+", "", clinicalData_2$sample) # create the waterfall plot waterfall(mutationData, fileType = "Custom", variant_class_order=mutationHierarchy, mainPalette=mutationColours, mutBurden=mutationBurden, clinData=clinicalData_2, clinLegCol=3, clinVarCol=c('0-6'='#ccbadc', '6.1-12'='#9975b9', '12.1+'='#663096', 'Partial Response'='#c2ed67', 'Progressive Disease'='#E63A27', 'Stable Disease'='#e69127', '1'='#90ddee', '2'='#649aa6', '3+'='#486e77'), clinVarOrder=c('1', '2', '3+', 'Partial Response', 'Stable Disease', 'Progressive Disease', '0-6', '6.1-12', '12.1+'), section_heights=c(1, 5, 1))
Adding cell labels
You might have noticed when we subsetted our data frame at the begining of this tutorial we kept an extra column named “amino.acid.change”. This was done so that we could use this column to add cell labels to our plot as the authors of the manuscript figure did. In order to do this we only need specify the column in
mutationData from which to pull labels from, we do this with
mainLabelCol="amino.acid.change". Further we adjust the size of the labels so that they fit in the cells with
mainLabelSize = 3. You can also change the angle of labels with the parameter
mainLabelAngle which may be useful if the the cells on a plot are not perfectly square.
# create the waterfall plot waterfall(mutationData, fileType = "Custom", variant_class_order=mutationHierarchy, mainPalette=mutationColours, mutBurden=mutationBurden, clinData=clinicalData_2, clinLegCol=3, clinVarCol=c('0-6'='#ccbadc', '6.1-12'='#9975b9', '12.1+'='#663096', 'Partial Response'='#c2ed67', 'Progressive Disease'='#E63A27', 'Stable Disease'='#e69127', '1'='#90ddee', '2'='#649aa6', '3+'='#486e77'), clinVarOrder=c('1', '2', '3+', 'Partial Response', 'Stable Disease', 'Progressive Disease', '0-6', '6.1-12', '12.1+'), section_heights=c(1, 5, 1), mainLabelCol="amino.acid.change", mainLabelSize = 3)
What is the most commonly mutated gene in this cohort?
PIK3CA is the most commonly mutated, in 50% (8/16) samples.
Are there any recurrently mutated amino acid positions in this gene?
Yes. E542K, H1047R, and E545K are each mutated twice.
If yes, are they known mutation hotspots? How would you determine this?
Yes. E542K, E545K and H1047R are all well known hotspots of mutation in PIK3CA, especially in human breast cancer. You could determine this by exploring PIK3CA mutation frequencies in COSMIC or ProteinPaint.
Re-arranging genes and samples
In some cases it may be desireable to forgo the default hierarchical ordering of cells in favor of a more custom approach. We can see in the authors original figure they elected to do this such that samples were grouped by months on study and gene were grouped by pathway. We can achieve this with the parameters
sampOrder both of which take a character vector giving the order in which to plot row and column cells respectively.
# Create a sample ordering sample_ordering <- c("19", "5", "31", "22", "12", "2", "32", "8", "28", "18", "4", "24", "23", "17", "11", "14") # Create a gene ordering gene_ordering <- c("CDH1", "MALAT1", "RUNX1", "NCOR1", "GATA3", "FOXA1", "ESR1", "CBFB", "TBX3", "TAB1", "MED12", "XBP1", "TP53", "RB1CC1", "BRCA2", "ATM", "SMARCD1", "MLL3", "MLL2", "ARID1A", "FBXW7", "CAV1", "MAP3K1", "MAP2K4", "NOTCH4", "PDGFRA", "ERBB3", "ERBB2", "RELN", "MAGI3", "MTOR", "AKT2", "AKT1", "PTEN", "PIK3CA") # Create a gene ordering waterfall(mutationData, fileType = "Custom", variant_class_order=mutationHierarchy, mainPalette=mutationColours, mutBurden=mutationBurden, clinData=clinicalData_2, clinLegCol=3, clinVarCol=c('0-6'='#ccbadc', '6.1-12'='#9975b9', '12.1+'='#663096', 'Partial Response'='#c2ed67', 'Progressive Disease'='#E63A27', 'Stable Disease'='#e69127', '1'='#90ddee', '2'='#649aa6', '3+'='#486e77'), clinVarOrder=c('1', '2', '3+', 'Partial Response', 'Stable Disease', 'Progressive Disease', '0-6', '6.1-12', '12.1+'), section_heights=c(1, 5, 1), mainLabelCol="amino.acid.change", mainLabelSize=3, sampOrder=sample_ordering, geneOrder=gene_ordering)