
Introduction to ggplot2 | Griffith Lab

Genomic Visualization and Interpretations

Introduction to ggplot2

There are three primary graphics systems available within the R environment. Base R graphics are installed by default and provide a simple mechanism to quickly create graphs. lattice is another graphics package that attempts to improve on base R graphics by providing better defaults and the ability to easily display multivariate relationships. In particular, the package supports the creation of trellis graphs - graphs that display a variable or the relationship between variables, conditioned on one or more other variables. Finally, ggplot2 is a graphics package based on the grammar of graphics philosophy, and will be the primary focus of this course.

In this module, we will explore basic use of ggplot2 to plot genomic data. For illustration, we will use a set of mutation data from Supplemental Table S5 of the paper “Recurrent somatic mutations affecting B-cell receptor signaling pathway genes in follicular lymphoma”. You can download a cleaned up version of Supplemental Table S5 at http://genomedata.org/gen-viz-workshop/intro_to_ggplot2/ggplot2ExampleData.tsv. These data come in TSV format, and consist of ~5000 lines, each with a somatic tumor genome variant and various annotations describing the variant (e.g. affected individual, predicted consequence, read counts, etc.).

Introducing ggplot2 syntax

ggplot is based on a system of layering graphical objects to create a final plot. We will start by installing and loading the ggplot2 library. Next, it is important to know that ggplot expects the data passed to it to be of class data.frame. After importing our data (‘ggplot2ExampleData.tsv’), we will modify this data frame to include a ‘coverage’ (tumor_COV) variable. Then we can pass the variantData data frame to the ggplot() function and compare the coverage variable to the variant allele frequency (tumor_VAF).

# install the ggplot2 library and load it
# install.packages("ggplot2")
library(ggplot2)

# load Supplemental Table S5
# note that in the following example we are loading directly from a URL (instead of downloading it to the instance first)
variantData <- read.delim("http://genomedata.org/gen-viz-workshop/intro_to_ggplot2/ggplot2ExampleData.tsv", stringsAsFactors=TRUE)

# familiarize yourself with the data in this file by looking at the 'head' (top) of the file
head(variantData)

# look at a selection of columns
variantData[1:5, c(1:9,16,18)]

# make a coverage column since this doesn't exist yet
variantData$tumor_COV <- variantData$tumor_ref_count + variantData$tumor_var_count

# start making the plot to visualize the relationship between variant allele frequencies and read coverage
p1 <- ggplot(data=variantData, aes(x=tumor_VAF, y=tumor_COV))

Geometric objects and aesthetic mapping

The object stored in variable p1 will generate a blank plot in the bottom right “Plots” window of Rstudio. We invoked ggplot with the function ggplot() and specified the data frame we are trying to plot. We then supplied aesthetic mappings with aes(). In essence, this specifies which columns ggplot should assign to the geometric object’s aesthetics. In this specific case, we are telling ggplot that the data are in the data frame “variantData”, the column tumor_VAF should be plotted along the x-axis, and tumor_COV should be plotted along the y-axis. ggplot has determined the axis scales, given the ranges in the data supplied. However, you will notice that nothing is actually plotted. At this point we have specified what data we want plotted, but we have not specified how it should be plotted. This is what the various geometric objects in ggplot are used for (e.g. geom_point() for scatterplots, geom_bar() for bar charts, etc.). Geometric objects are added as plot layers to the ggplot() base command using a +.

# add a point geom object to the plot (method 1)
p1 <- p1 + geom_point()

# the following is equivalent to above (method 2)
p2 <- ggplot() + geom_point(data=variantData, aes(x=tumor_VAF, y=tumor_COV))

Both plot p1 and plot p2 generate a scatter plot, comparing the column tumor_COV on the y-axis to the column tumor_VAF on the x-axis. While the plots generated by the p1 and p2 variables may appear identical, we should briefly address their differences. In method 1 (plot p1), we invoke ggplot() with a data frame (variantData) and an aesthetic mapping of tumor_VAF and tumor_COV for x and y respectively. In this method the information is passed to all subsequent geometric objects and is used as appropriate in those objects. In this case, the only geometric object we include is geom_point(). The geom_point() layer is then added using the information passed from ggplot(). Conversely, in method 2 (plot p2), we invoke ggplot() without defining the data or aesthetic mapping. This information is specified directly in the geom_point() layer. If any additional geometric objects were added as layers to the plot, we would specifically have to define the data and aesthetics within each additional layer. This is especially useful when plotting multiple datasets on the same plot (we will explore this later on).

We should also note that geometric objects can behave differently depending upon whether the plotted variables are continuous or discrete. In the example below (plot p3), we can see that the points have been binned by chromosome name on the x-axis, while the numeric values stored in the column “tumor_VAF” are plotted along a continuous y-axis scale. The jitter position adjustment (specified by position="jitter" in the geom_point() object) shifts the points apart horizontally to provide better resolution. A complete list of available geoms within ggplot is available here.

# illustrating the difference between continuous and discrete data
p3 <- ggplot() + geom_point(data=variantData, aes(x=chromosome_name, y=tumor_VAF), position="jitter")

Axis scaling and manipulation

Going back to our original example (plot p1), the majority of points look like they have a coverage < 500x. However, there are outliers in this data set causing the majority of points to take up a relatively small portion of the plot. We can provide more resolution to this by the following methods:

  1. limiting the y-axis scale using scale_y_continuous() or ylim()
  2. transforming the numeric values by log2() on the y-axis
  3. transforming the y-axis to a log2 scale by specifying trans within the scale_y_continuous() layer

Note that these transformations can be applied to the x-axis as well (scale_x_continuous(), xlim(), etc.), as long as the x-axis is mapped to data appropriate for a continuous scale.

# method 1, set y limits for the plot
p4 <- p1 + scale_y_continuous(limits=c(0, 500))
# alternatively, the following is a shortcut method for the same
p4 <- p1 + ylim(c(0, 500))

# method 2, transform the actual values in the data frame
p5 <- ggplot() + geom_point(data=variantData, aes(x=tumor_VAF, y=log2(tumor_COV)))

# method 3, transform the axis using ggplot
p6 <- p1 + scale_y_continuous(trans="log2")

Note that adjusting the scale_y_continuous() layer will plot only the points within the specified limits; ylim() is a shortcut that achieves the same thing. You’ll see a warning when doing this, stating that rows containing values outside of the specified range have been removed from the data frame. There is an “out of bounds” (oob) parameter within scale_y_continuous() to control what happens with these points, but this isn’t necessarily the best method for this particular plot. In method 2 (plot p5), we actually transform the data values themselves by applying a log2 transform. This method allows us to better visualize all of the points, but it is not intuitive to interpret the log2 of a value (tumor coverage). Alternatively, method 3 (plot p6) does not transform the values, but adjusts the scale the points are plotted on, and does a better job of displaying all of our data without forcing us to mentally convert log2 values.

Discussion: What are the pros and cons of the three approaches above? Are there other approaches we might consider?

# method 4, show all the data on a linear scale, but use a density plotting function to better see where the bulk of the data points are
# also add some additional annotations, customize the color palette, etc.
# note: geom_hex() requires the hexbin package to be installed
p6a <- ggplot(data=variantData, aes(x=tumor_VAF, y=tumor_COV)) +
  geom_hex(bins=75) +
  scale_fill_continuous(type="viridis") +
  theme_bw() +
  xlab("Variant allele fraction (VAF)") +
  ylab("Sequence depth (aka Read coverage)") +
  ggtitle("Somatic tumor variants - VAF vs Coverage")

Applying different aesthetics

While these plots look pretty good, we can make them more aesthetically pleasing by defining the color of the points within the aesthetic. We can specify a color either by its hex code (hex codes explained) or by naming it from R’s internal color palette, a full list of which is available here. Alternatively, you can list the colors by typing colors() in the R terminal.

# list colors in R
colors()

# obtain the hex code for a specific color, e.g. "darkorchid"
rgb(t(col2rgb("darkorchid")), maxColorValue=255)

# what happens when we try to add color within the aesthetic?
p7 <- ggplot() + geom_point(data=variantData, aes(x=tumor_VAF, y=tumor_COV, color="#9932CC")) + scale_y_continuous(trans="log2")

# and what happens when we try to add color within the geom?
p8 <- ggplot() + geom_point(data=variantData, aes(x=tumor_VAF, y=tumor_COV), color="#9932CC") + scale_y_continuous(trans="log2")

Above we chose “darkorchid”, which has a hex value of “#9932CC”. However, the points in the first plot (p7) are red, not the expected purple. Our points are miscolored because of how ggplot interprets the aesthetic mapping. When the color aesthetic is specified inside aes(), geom_point() expects a variable (typically a factor) by which to color points. If we wanted to, for example, color all missense variants with one color, nonsense variants with another color, and so on, we could supply a factor for variant type to the color aesthetic in this way. But when we supplied a quoted hex code, ggplot assumed we wanted to create such a factor with all values equal to the provided text string. It did this for us and then used its internal color scheme to color all points according to the single category in that factor. By specifying the color outside the aesthetic mapping, geom_point() knows to apply the color ‘darkorchid’ to all of the points in the geom_point() layer (p8).

The syntax used in p8 makes sense if we want to display our points in a single color that we specify. The syntax used in p7 doesn’t make sense as used above, but something similar could be used if we really did want to color each point according to an actual factor in the data. For example, coloring points by the ‘dataset’, ‘type’, or ‘variant’ variables could be informative. Try one of these now.

# color each point according to the 'dataset' of the variant
p7a <- ggplot() + geom_point(data=variantData, aes(x=tumor_VAF, y=tumor_COV, color=dataset)) + scale_y_continuous(trans="log2")

# color each point according to the 'type' of the variant
p7b <- ggplot() + geom_point(data=variantData, aes(x=tumor_VAF, y=tumor_COV, color=type)) + scale_y_continuous(trans="log2")

Building on the above concepts, we can now try using the colour aesthetic to visualize our data as a density plot. For example, what if we wanted to know whether the ‘discovery’ or ‘extension’ cohort within our data (specified by the ‘dataset’ variable) has a higher tumor purity? We will use geom_density() to plot a density kernel of the tumor VAF values, but colour the curves based upon the dataset subsets. As described above, we will supply a factor to the colour aesthetic.

# get a density curve of tumor vafs
p9 <- ggplot() + geom_density(data=variantData, aes(x=tumor_VAF, color=dataset))

# let's add a bit more detail
p10 <- ggplot() + geom_density(data=variantData, aes(x=tumor_VAF, fill=dataset), alpha=.75, adjust=.5)

# and let's change the colors some more
p11 <- p10 + scale_fill_manual(values=c("discovery"="#a13242", "extension"="#1a2930"))

In the p9 plot, we told the geom_density() layer to differentiate the data based upon the ‘dataset’ column using the colour aesthetic. We see that our result contains two density curves, with two different colored lines distinguishing our datasets. In p10, we tell our geom_density() layer to differentiate the datasets using the “fill” aesthetic instead. We globally assign the fill transparency (alpha=0.75). In addition, we use the adjust parameter to reduce the smoothing geom_density() uses when computing its estimate. Now, our datasets are specified by the fill (or filled-in color) of each density curve. In p11 (shown above), we append the scale_fill_manual() layer to manually define the fill colours we would like to appear in the plot.

As an exercise, try manually changing the line colors in p9 using a similar method as that used in p11.

Get a hint!

look at scale_colour_manual()


ggplot() + geom_density(data=variantData, aes(x=tumor_VAF, color=dataset)) + scale_color_manual(values=c("discovery"="#a13242", "extension"="#1a2930"))

Note that when you use the “color” aesthetic you modify the choice of line colors with scale_color_manual. When you use the “fill” aesthetic you modify the choice of fill colors with scale_fill_manual. If you would like to customize both the line and fill colors, you will need to define both the “color” and “fill” aesthetic.

Try it. Use four different colors for the two line and two fill colors so that it is easy to see if it worked.
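One possible solution, as a sketch (assuming variantData is still loaded as above): map dataset to both the colour and fill aesthetics, then supply a separate manual scale for each. The four hex values below are arbitrary choices made to be easily distinguishable.

```r
# map dataset to both colour (line) and fill, with four distinct colors
ggplot() +
  geom_density(data=variantData, aes(x=tumor_VAF, color=dataset, fill=dataset), alpha=.75) +
  scale_color_manual(values=c("discovery"="#e69f00", "extension"="#56b4e9")) +
  scale_fill_manual(values=c("discovery"="#a13242", "extension"="#1a2930"))
```

If it worked, each curve's outline color should differ from its fill color.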


Depending on the geometric object used, there are up to 9 ways to map an aesthetic to a variable: the x-axis, y-axis, fill, colour, shape, alpha, size, labels, and facets.
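As a quick sketch of combining several of these mappings at once (assuming variantData is loaded as above; the particular choice of aesthetics here is purely illustrative):

```r
# map four aesthetics simultaneously: x, y, color, and shape
# alpha is set globally (outside aes) rather than mapped to a variable
ggplot(variantData, aes(x=tumor_VAF, y=tumor_COV, color=dataset, shape=type)) +
  geom_point(alpha=0.5) +
  scale_y_continuous(trans="log2")
```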

Faceting in ggplot allows us to quickly create multiple related plots at once with a single command. There are two facet commands: facet_wrap() creates a 1-dimensional sequence of panels based on a one-sided linear formula, while facet_grid() creates a 2-dimensional grid of panels. Let’s try to answer a few quick questions about our data using facets.

# what is the most common mutation type among SNPs?
p12 <- ggplot(variantData[variantData$type == "SNP",]) + geom_bar(aes(x=trv_type))

# use theme() rotate the labels for readability (more on themes below)
p12a <- ggplot(variantData[variantData$type == "SNP",]) + geom_bar(aes(x=trv_type)) + theme(axis.text.x = element_text(angle = 90))

# what is the relation of tiers to mutation type?
p13 <- ggplot(variantData[variantData$type == "SNP",]) + geom_bar(aes(x=trv_type, fill=tier)) + theme(axis.text.x = element_text(angle = 90))

# which reference base is most often mutated? compare p13 and p14 to understand what facet_wrap is doing here
p14 <- p13 + facet_wrap(~reference)

# which transitions and transversions occur most frequently?
p15 <- p13 + facet_grid(variant ~ reference)

What do we see in this last plot? Which base changes are most common in this data set? Do we expect a random/uniform distribution of base changes?

Note that the variant bases in this plot are along the y-axis, and the reference bases are along the x-axis. Thus the first row of panels is A->A, C->A, G->A, and T->A variants. Overall, the most common mutations are G->A and C->T. In other words, we are seeing more transitions than transversions: G->A (purine -> purine transition) and C->T (pyrimidine -> pyrimidine transition). This is what we expect, in part because spontaneous deamination of methylated cytosines produces C->T changes (seen as G->A on the opposite strand).

Also note how we are selecting a subset of the “variantData” data above. Try the following commands to break down how this works:

# a logical vector indicating which rows are SNPs
variantData$type == "SNP"
# use that logical vector to keep only the SNP rows
x = variantData[variantData$type == "SNP",]

ggplot Themes

Almost every aspect of a ggplot object can be altered. We’ve already gone over how to alter the display of data, but what if we want to alter the display of the non-data elements? Fortunately, there is a function for that called theme(). You’ll notice in the previous plot that some of the x-axis names collide with one another; let’s fix that and alter some of the theme parameters to make the plot more visually appealing.

# recreate plot p13 if it's not in your environment
p16 <- ggplot(variantData[variantData$type == "SNP",]) + geom_bar(aes(x=trv_type, fill=tier)) + facet_grid(variant ~ reference)

# load in a theme with a few presets set
p17 <- p16 + theme_bw()

# put the x-axis labels on a 45 degree angle
p18 <- p17 + theme(axis.text.x=element_text(angle=45, hjust=1))

# altering a few more visual apects (put the legend at top and make the base change labels white on a black background)
p19 <- p18 + theme(legend.position="top", strip.text=element_text(colour="white"), strip.background=element_rect(fill="black"))

# let's remove the main x-axis label as well
p20 <- p19 + theme(axis.title.x=element_blank())

Let’s take a few minutes to discuss what is going on here. In p17 we used theme_bw(), a function that simply changes a series of values from the basic default theme(). There are many such “complete themes” in ggplot, and a few external packages, such as ggthemes, contain additional complete themes. In p18, we alter the axis.text.x parameter; we can see from the documentation that axis.text.x inherits from element_text(), which is just saying that any parameters of element_text() also apply to axis.text.x. In this specific case we set the angle of the text to 45 degrees and set the horizontal justification to the right. In p19 we change the position of the legend, the colour of the strip text, and the strip background. Finally, in p20 (shown above), we remove the x-axis label with element_blank(), which draws nothing.

Changing the order of aesthetic mappings

In ggplot, the order in which a variable is plotted is determined by the levels of the factor for that variable. We can view the levels of a column within a data frame with the levels() command, and we can subsequently change the order of those levels with the factor() command. We can then use these to change the order in which aesthetic mappings, such as plot facets and discrete axis variables, are plotted. Let’s look at an example using the previous faceted plots we made (p20).

# recreate plot p20 if it's not in your environment
p20 <- ggplot(variantData[variantData$type == "SNP",]) + geom_bar(aes(x=trv_type, fill=tier)) + facet_grid(variant ~ reference) + theme_bw() + theme(axis.text.x=element_text(angle=45, hjust=1), legend.position="top", strip.text=element_text(colour="white"), strip.background=element_rect(fill="black"), axis.title.x=element_blank())

# view the order of levels in the reference and trv_type columns
levels(variantData$reference)
levels(variantData$trv_type)

# reverse the order of the levels
variantData$reference <- factor(variantData$reference, levels=rev(levels(variantData$reference)))
variantData$trv_type <- factor(variantData$trv_type, levels=rev(levels(variantData$trv_type)))

# view the updated order of levels for the trv_type column
levels(variantData$trv_type)

# view the same plot as p20 but with the new order of variables
p21 <- ggplot(variantData[variantData$type == "SNP",]) + geom_bar(aes(x=trv_type, fill=tier)) + facet_grid(variant ~ reference) + theme_bw() + theme(axis.text.x=element_text(angle=45, hjust=1), legend.position="top", strip.text=element_text(colour="white"), strip.background=element_rect(fill="black"), axis.title.x=element_blank())

We can see that reversing the order of levels in the reference column has subsequently reversed the reference facets (right side of plot). Similarly, reversing the order of the trv_type column levels has reversed the labels on the x-axis.

Manually fix some factor names, and add main x and y axis labels

# reverse the order of the trv_type levels back to original state
variantData$trv_type <- factor(variantData$trv_type, levels=rev(levels(variantData$trv_type)))

# if we want to modify the name of some of the levels manually (e.g. to make shorter versions of some really long names) we can do the following
levels(variantData$trv_type)[levels(variantData$trv_type)=="3_prime_untranslated_region"] <- "3p_utr"
levels(variantData$trv_type)[levels(variantData$trv_type)=="5_prime_untranslated_region"] <- "5p_utr"
levels(variantData$trv_type)[levels(variantData$trv_type)=="5_prime_flanking_region"] <- "5p_flank"

# update the plot yet again
p22 <- ggplot(variantData[variantData$type == "SNP",]) + geom_bar(aes(x=trv_type, fill=tier)) + facet_grid(variant ~ reference) + theme_bw() + theme(axis.text.x=element_text(angle=45, hjust=1), legend.position="bottom", strip.text=element_text(colour="white"), strip.background=element_rect(fill="black"), axis.title.x=element_blank()) + ylab("variant count")

# add some space around the margins of the plot
p22 <- p22 +  theme(plot.margin = unit(c(1.5,1.5,0.2,0.2), "cm"))

# add main x and y labels (grid.text() comes from the grid package)
library(grid)
grid.text(unit(0.5,"npc"), unit(0.95,"npc"), label = "reference base", rot = 0, gp=gpar(fontsize=11))
grid.text(unit(0.97,"npc"), 0.56, label = "variant base", rot = 270, gp=gpar(fontsize=11))

Saving ggplot2 plots

To save a plot or any graphical object in R, you first have to initialize a graphics device; this can be done with pdf(), png(), svg(), tiff(), or jpeg(). You can then print the plot as you normally would and close the graphics device using dev.off(). Alternatively, ggplot2 has a function specifically for saving ggplot2 graphical objects, called ggsave(). A helpful tip to keep in mind when saving plots is to allow enough space for the plot to be drawn; if the plot titles look truncated, for example, try increasing the size of the graphics device. Changes in the aspect ratio between a graphics device’s height and width will also change the final appearance of plots.

# save the last plot we made
pdf(file="p22.pdf", height=8, width=11)
print(p22)
dev.off()

# alternatively ggsave will save the last plot made
ggsave("p22.pdf", device="pdf")

# note the current working directory where this file will have been saved
getwd()

Wide vs long format

In some cases, ggplot may expect data to be in ‘long’ instead of ‘wide’ format. This simply means that instead of each non-id variable having its own column, there should be columns designating key/value pairs. The long format is generally required when grouping variables, for example in stacked bar charts. We can convert between wide and long formats with the dcast() and melt() functions from the reshape2 package.

Consider the following example. The Orange dataset that is preloaded in your R install is in wide format and we can create a scatterplot of the data with the code below.

ggplot(Orange, aes(x=age, y=circumference)) + geom_point()

However if we want to group variables (stratify by some factor), such as when variables share a mapping aesthetic (i.e. using color to group variables age and circumference) we must convert the data to long format.

# load the reshape2 package for melt() and dcast()
library(reshape2)
Orange2 <- melt(data=Orange, id.vars=c("Tree"))
ggplot(Orange2, aes(x=value, fill=variable)) + geom_density()
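Going the other way, dcast() converts long data back to wide using a formula of the form rows ~ columns. Because melting dropped each original row's identity, recasting Orange2 directly requires an aggregation function; mean() is used here purely to illustrate the syntax, not to recover the original table.

```r
# cast back to wide format; with multiple values per Tree/variable
# combination, an aggregation function is required
dcast(Orange2, Tree ~ variable, fun.aggregate=mean)
```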

Extra tips and tricks

Almost done! In this last section, we will mention a couple of tips that you might find useful. We’ll use the Orange2 dataset from above to illustrate. By default, R will tend to display large integers, such as genomic coordinates, in scientific notation. Many people dislike this; you can commify axis values using the comma function from the scales package, as illustrated in the plot below. You will need to load the scales library for this to work.

library(scales)
ggplot(Orange2, aes(x=value, fill=variable)) + geom_density() + scale_x_continuous(labels=comma)

We’ve gone over how to set axis limits, but what if you have a faceted plot and you want very specific axis limits for each facet? Unfortunately, applying an xlim() layer would apply to all facets. One way around this is to use “invisible data”. For the plot below, we add an invisible layer to alter the x-limits of the age facet.

ggplot(Orange2, aes(x=value, fill=variable)) + geom_density() + geom_point(data=data.frame(Tree=1, variable="age", value=3500), aes(x=value, y=0), alpha=0) + facet_wrap(~variable, scales="free")

Data Munging with Data.table | Griffith Lab


Data Munging with Data.table

In the previous section we went through the basics of R. That will get you pretty far, but often you will need to manipulate your data in some way to answer a biological question. In the R ecosystem there are three well-known competing libraries to help in this regard. The first is the set of base R functions, which we touched on a bit previously, such as [] to manipulate a data.frame and aggregate() to apply some function to the data. A second option is dplyr, part of the tidyverse, which is much faster than the base R functions and has what many find to be a more intuitive syntax. The third option is the data.table library, which is what we will be going over in this section. It is extremely memory efficient and is consistently among the fastest solutions. Throughout this section the base R way will be shown alongside data.table for comparison of syntax; these lines will be denoted with #!.

Let’s start by loading the library into R and reading in some data. To do this we will use fread() instead of the usual read.delim() function.

# load package
library(data.table)

# read data
#! varDataDF <- read.delim("http://genomedata.org/gen-viz-workshop/intro_to_ggplot2/ggplot2ExampleData.tsv")
varDataDT <- fread("http://genomedata.org/gen-viz-workshop/intro_to_ggplot2/ggplot2ExampleData.tsv")

If you were paying attention when you loaded the data.table library, you might have noticed that data.table automatically recognized that your computer has multiple cores and registered them. So, when possible, data.table will use multi-threading without you having to do anything. You’ll also note that the class of varDataDT is both data.table and data.frame. A nice thing about the data.table package is that, due to inheritance, all functions that work with data.frames should also work with the data.table class.

Before we really get started, we should go over the overarching theme of data.table syntax: DT[i, j, by]. The i argument, similar to data frames, specifies which rows to act on. The j argument specifies what to do with columns; unlike data frames, we can do more than just select them. Finally, the by argument specifies what to group by, something unique to data.tables. As we move along, this syntax should start to make more sense.

Now let’s go over how to subset rows from the data.table structure. This works mostly like subsetting a data.frame, with a couple of key differences: 1) data.table will automatically assume you want to subset rows if there is no comma in the square brackets, and 2) you can supply just the column name to data.table to subset via a boolean expression. Let’s have a look at some examples.

# subset by row index
#! varDataDF[1:5,]
varDataDT[1:5]

# extract all rows where the chromosome is 22
#! varDataDF[varDataDF$chromosome_name == 22,]
varDataDT[chromosome_name == 22]

We can also select columns via methods similar to those we’re used to with data.frames, using character vectors or column indices; however, the best practice is to simply use a list. The .() function in data.table is an alias for list().

# select out the chromosome_name column
#! varDataDF[,2]
#! varDataDF[,"chromosome_name"]
varDataDT[,.(chromosome_name)]

In addition to selecting column names we can compute results directly inside a data.table from within the .() list of column names.

# find the mean tumor vaf
#! mean(varDataDF$tumor_VAF)
varDataDT[,mean(tumor_VAF)]

# equivalent to above but we put the result in a named column
#! data.frame("tumor_vaf"=mean(varDataDF$tumor_VAF))
varDataDT[,.(tumor_vaf = mean(tumor_VAF))]

# find the mean tumor vaf and mean tumor coverage
#! data.frame("tumor_mean_vaf"=mean(varDataDF$tumor_VAF), "tumor_mean_cov"=mean(varDataDF$tumor_var_count + varDataDF$tumor_ref_count))
varDataDT[,.(tumor_mean_vaf = mean(tumor_VAF), tumor_mean_cov=mean(tumor_var_count + tumor_ref_count))]

# we can apply this to select rows as well
#! data.frame("tumor_vaf"=mean(varDataDF[varDataDF$sample=="H_ML-08-0075-001-1127127","tumor_VAF"]))
varDataDT[sample=="H_ML-08-0075-001-1127127",.(tumor_vaf = mean(tumor_VAF))]

At this time it is appropriate to introduce a special symbol within the data.table package: .N. It holds the number of rows in the current group (with no grouping, the total number of rows in the data.table, essentially equivalent to the nrow() function). We can use this symbol to count filtered results.

# find how many entries chromosome 1 has
#! table(varDataDF[varDataDF$chromosome_name == "1","chromosome_name"], exclude = NULL)
varDataDT[chromosome_name=="1",.N, by=.(chromosome_name)]

Thus far we have dealt with topics that have some similarity to data.frames; however, let’s now dive into something unique to the data.table structure: grouping within data.tables. Essentially, we can perform computations on columns by groups of values. For those familiar with base R, this replaces the aggregate() function and can be used within a data.table call. Let’s look at some examples!

# count the number of entries per chromosome
#! as.data.frame(table(varDataDF[,"chromosome_name"], exclude = NULL))
varDataDT[,.N,by=.(chromosome_name)]

# count the number of entries for strand and chromosome
#! as.data.frame(table(varDataDF[,c("chromosome_name", "strand")], exclude = NULL))
varDataDT[,.N,by=.(chromosome_name, strand)]

# count the number of entries per chromosome for the discovery set
#! as.data.frame(table(varDataDF[varDataDF$dataset=="discovery",c("chromosome_name")], exclude = NULL))
varDataDT[dataset=="discovery",.N,by=.(chromosome_name)]

We’ve seen some examples of manipulating data.table objects, so at this point let’s talk about a best-practice feature of data.table: the ability to update a data.table object by reference. In current versions of R (3.X.X at the time of this writing), when manipulating a column of a data frame, the entire column gets copied into internal memory, updated, and then added back into the original data frame. This means that if you had a single-column data frame taking up 2GB of memory and wanted to update it, that operation would use 4GB. This is part of the reason R has a reputation as a memory hog. Fortunately, data.table has the option of updating columns by reference with the := operator. When using this, no copy of the data is made; the values are modified in place. Let’s go over some examples of how it works. We will use the gc() function for a rough estimate of memory usage, as the usual methods don’t account for garbage collection, which would skew results in this case. See this link for an explanation as to why.

# create a data.table and data.frame for illustration purposes
myDF <- data.frame("V1"=c(1:10000), "V2"=c(1:10000))
myDT <- data.table("V1"=c(1:10000), "V2"=c(1:10000))

# profile the data.frame and data.table memory ussage for adding two columns
gc(myDF$V3 <- myDF$V1 + myDF$V2)
gc(myDT[,"V3" := V1 + V2])

# modify just a single value and profile
gc(myDF[1,"V1"] <- 100)
gc(myDT[1,"V1" := 100])

# did I mention you can modify by reference multiple columns at once
myDT[,c("rev_v1", "rev_v2") := .(rev(V1), rev(V2))]

Now that we’ve gone over the very basics of data.table, i.e. the DT[i,j,by] syntax and how to assign columns on the fly, let’s introduce some more concepts to get a feel for exactly how powerful data.table can be. To begin, let’s introduce another of the special variables in data.table: .GRP. .GRP holds the grouping id from the by argument. Let’s say, for the sake of argument, that a bug was introduced in the code causing the variants on even chromosomes for the extension cohort to be 1 base off. How could you fix this succinctly in base R? I actually don’t have a succinct answer, but with data.table it’s fairly straightforward.

# see below for an explanation
varDataDT[order(chromosome_name, dataset), "group":=.GRP, by=.(chromosome_name, dataset)][group %% 4 == 0,"new_start":=start + 1]

This is unlikely to happen, but it does allow us to introduce a couple of concepts, some of which are new. To start things off, we first act on rows by ordering by “chromosome_name” and “dataset” via order(chromosome_name, dataset). We also tell data.table to group by “chromosome_name” and “dataset” with by=.(chromosome_name, dataset), and to assign that grouping to a new column called “group” with "group":=.GRP. At this point, every group which is divisible by 4 contains the values we wish to adjust, so we chain the data.table we just created to another data.table expression simply by adding [] brackets. From there we can filter to groups which are divisible by 4 using the modulus operator, group %% 4 == 0, and make a new column increasing the start position by 1 with "new_start":=start + 1.
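The grouping-then-chaining pattern is easier to see on a tiny made-up data.table (the column names below are invented purely for illustration):

```r
library(data.table)

# toy data: two "chromosomes" crossed with two "datasets"
dt <- data.table(chr=rep(c("1","2"), each=2), set=c("a","b","a","b"), start=c(10,20,30,40))

# assign a group id per (chr, set) combination, then chain a second
# expression that shifts start by 1 in the even-numbered groups only
dt[order(chr, set), grp := .GRP, by=.(chr, set)][grp %% 2 == 0, start := start + 1]
dt
```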

Okay, I know, an unrealistic example, but you get the idea; try doing what we did above with base R code. Let’s go over another special variable with a more realistic application. The .SD variable stands for “subset of data” and essentially stores the subsets of data from the by argument as data.tables. It is commonly used with the .SDcols variable, which specifies the columns to return for the data subsets. Let’s imagine that for each sample you need to return the first variant in varDataDT. Here’s the data.table way to do it.

varDataDT[order(chromosome_name, start), .SD[1], by=.(Simple_name), .SDcols=c("Simple_name", "chromosome_name", "start")]
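For a self-contained illustration of the same .SD idiom on a built-in dataset, here is a sketch using mtcars: return the single highest-mpg row per cylinder count.

```r
library(data.table)

# convert mtcars to a data.table, keeping the row names as a 'car' column
mt <- as.data.table(mtcars, keep.rownames="car")

# order by decreasing mpg, then take the first row of each cyl group,
# returning only the columns named in .SDcols
mt[order(-mpg), .SD[1], by=.(cyl), .SDcols=c("car", "mpg")]
```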