Welcome to the blog

# Posts

My thoughts and ideas

Introduction to loss of heterozygosity plots | Griffith Lab

# Introduction to loss of heterozygosity plots

We’ve gone through visualizations of point mutations and copy number changes using GenVisR, another type of genomic alteration that is often usefull to visualize are “Loss of Heterozygosity” events. In a diploid cell there are pairs of chromosomes each containing a single copy of the genome. These pairs each come from haploid gametes and are slightly different from each other leading to heterozygosity throughout much of the genome. Situations can arise however where this inherit heterozygosity of the genome is loss, commonly this is through deletions of a parental copy within a chromosome region also known as hemizygosity. Viewing deletions however will not give a complete picture of LOH as events can arise that will lead to copy-neutral LOH, for example if a parental copy was deleted but the then remaining copy underwent an amplification. In this section we will use the GenVisR function lohSpec() created specifically for viewing loh events.

### How lohSpec works

The function lohSpec() works by using a sliding window approach to calculate the difference in the variant allele fractions (VAF) between heterozygous variants in matched tumor normal samples. Essentially, a window of a specified size (defined by the parameter window_size) will obtain the absolute difference between the tumor VAF and the normal VAF, which is assumed to be .5, for each genomic position within the window’s position. All of these points are then averaged to obtain a measure of LOH within the window. This window will then move forward by a length specified by the parameter step, and again cauclate this absolute mean tumor-normal VAF difference metric. This LOH value across all overlapping windows is averaged.

### Introduction to demonstration dataset

For this section we will be starting from a file of variants generated from the variant calling algorithm varscan. These variants originate from the breast cancer cell line HCC1395 aligned to “hg19” and were lightly processed to limit to only events called as “Germline” or “LOH” by varscan. You can download the variant file from this link. Once you have the file we’ll need to do some minor reformatting as described below to make it compatiable with lohSpec().

# read in the varscan data

# grab only those columns which are required and name them
lohData <- lohData[,c("V1", "V2", "V7", "V11")]
colnames(lohData) <- c("chromosome", "position", "n_vaf", "t_vaf")

lohData$sample <- "HCC1395" # convert the normal and tumor vaf columns to fractions lohData$n_vaf <- as.numeric(gsub("%", "", lohData$n_vaf))/100 lohData$t_vaf <- as.numeric(gsub("%", "", lohData$t_vaf))/100 # limit to just the genome of interest lohData <- lohData[grepl("^\\d|X|Y", lohData$chromosome),]


The function lohSpec() requires a data frame of heterozygous variant calls with column names “chromosome”, “position”, “n_vaf”, “t_vaf”, and “sample” as it’s primary input. Above we read in our germline variant calls from varscan and subset the data to just the required columns renaming them with colnames() and addin in a “sample” column. If we look at the the documentation we can see that lohSpec() expects proportions for the VAF columns and not percentages. We coerce these columns into this format by using gsub() to remove the “%” symbol and dividing the resultant numeric value from the call as.numeric() by 100. We also subset our data using grepl() to remove anything that is not a canonical chromosome as the chromosomes in the primary data must match those specified in the genome parameter we will use later.

# Creating an initial plot

Like it’s counterpart cnSpec(), lohSpec() needs genomic boundaries to be specified in addition to the primary data in order to ensure that the entire genome is plotted when creating the graphic. This can be done via the parameter genome which expects a character string specifying one of “hg19”, “hg38”, “mm9”, “mm10”, or “rn5”. Alternatively a custom genome can be supplied with the y parameter. These parameters are both identical to their cnSpec() counterparts, please refer to the cnSpec tutorial for a more detailed explanation. Now that we have all of the required data let’s go ahead and make a standard lohSpec() plot.

# Create an inital plot
lohSpec(lohData, genome="hg19")


We have our first plot however you might have noticed a warning message along the lines of “Detected values with a variant allele fraction either above .6 or below .4 in the normal. Please ensure variants supplied are heterozygous in the normal!”. Admittedly this warning message is accurate, let’s take a minute to consider why biologically lohSpec() is producing this warning message.

Get a hint!

Refer to the "How lohSpec works" section above

A heterozygous call in a diploid genome should have a VAF of .5, any deviation from this is either noise due to insufficient coverage, or more problematically indicative that the site is not completely heterozygous. A reasonable explanation could be a focal amplification at that site on one allele. In any case it is inadvisable to use such sites when constructing this type of plot

Now let’s fix the issue and reproduce our plot, note that as expected the warning message has disappeared.

# Obtain variants with a VAF greater than 0.4 and below 0.6.
lohData <- lohData[lohData$n_vaf > 0.4 & lohData$n_vaf < 0.6,]

# run lohSpec
lohSpec(x=lohData)


Now that we have a plot let’s go over what it is showing. First off what is a high LOH value. Remember the formula used to calculate LOH for a single variant is |normal vaf - tumor vaf| and that these values are averaged for a given window. Consider this example of LOH then |.5 - 1 | = .5 and |.5 - 0| = .5. Conversely consider this example of non LOH, | .5 - .5 | = 0. So we can see that the closer we get to the value .5 the more evidence there is that an LOH event exists. We have to take noise into account but we can see that the majority of this genome has some sort of LOH, probably not surprising as this data is derived from an imortalized cell line.

### Altering the step and window size

At this point it is appropriate to talk about the trade off between speed and accuracy from setting the parameters step and window_size. We have already briefly discussed these parameters and what they do. What has so far gone unsaid is that these parameters really control the amount of smoothing the data undergoes and as such altering one will alter the trade off between the algorithms speed and an accurate representation of the data, this is especially true for the step parameter. We can view the effect of this using the microbenchmark package, increasing the step by a factor of 2/3 will decrease the computation time by almost half. Reasonable defaults have been chosen for the human genome however one should keep these parameters in mind when using a custom genome or when attempting to plot many samples.

# install and load a benchmarking package
# install.packages("microbenchmark")
library(microbenchmark)

# run benchmark tests
microbenchmark(lohSpec(x=lohData, window_size = 2500000, step = 1000000), lohSpec(x=lohData, window_size = 2500000, step = 1500000), times = 5L)


### Exercises

There may be situations in which you would want to view only specific regions within the genome instead of the whole genome itself. This can be achieved by supplying a custom genome to the parameter y and making sure your input data is limited to only that region. The gene PTEN is commonly lost in breast cancer, take a look to see if it’s lost in this cell line. Limit your data to only chromosome 10 and use plotLayer to highlight the q23.31 cytogenetic band on which PTEN resides (chr10:89500000-92900000). Your plot should look something like the one below.

Get a hint!

You'll need to create a custom genome that is only chromosome 10, and you'll need to subset your input data as well.

Get a hint!

Look at the messages lohSpec() outputs, it automatically prepends chr to the input to x so your custom genome will need chr prepended as well

Get a hint!

Try using geom_vline() to highlight q23.1

The solution is supplied in this file.

At times it may be desireable to alter how the plot looks, lohSpec() has a few helpfull parameters to aid in this but it may also be necessary to add additional plot layers as well via the parameter plotLayer. try to recreate the plot below using a combination of parameters in the lohSpec() documentation and adding additional layers via plotLayer. You will need to alter the plot colours, add a title, and change the facets to take up an equal share of the plot.

Get a hint!

look at the parameter "colourScheme"

Get a hint!

you will need to add layers for facet_grid() and ggtitle()

Get a hint!

Remember from earlier that when multiple layers are supplied they must be as a list!

lohSpec(lohData, colourScheme = "viridis", plotLayer=list(facet_grid(.~chromosome, scales="free"), ggtitle("Loss of Heterozygosity")))

When plotting these types of plots, particular care must be taken when dealing with allosomes. The parameter gender is usefull in this regard however try and think why it is necessary to consider for these plots and why lohSpec() does not plot allosomes by default.

Depending on the gender, the allosomes of a sample could be diploid or haploid, if the later is true LOH could be deceptively displayed when there is not any.

Introduction to copy number spectrum plots | Griffith Lab

# Introduction to copy number spectrum plots

A common task in any bioinformatic analysis of next generation sequencing data is the the determination of copy number gains and losses. The cnSpec() function, short for “copy number spectrum”, from the GenVisR package is designed to provide a view of copy number calls for a cohort of cases. It is very similar to the cnFreq() function discussed in the previous section but displays per sample copy number changes in the form of a heatmap instead of summarizing calls. In this section we will use cnSpec() to explore the copy number calls from the same data set of her2 positive breast cancer samples used in the previous section.

The data we will be working with was generated with the R package Copycat2. The output of this program consists of a file containing segmented copy number calls. You can find these files on http://genomedata.org/gen-viz-workshop/GenVisR/, go ahead and download all files with a “.cc2.tsv” extension. Once these files are downloaded we will need to read them into R and coerce them into a single data frame with column names “chromosome”, “start”, “end”, “segmean”, and “sample”. As a first step install and load the stringr package we’ll need this to make some of the string manipulation we’ll be doing easier. Once stringr is loaded run through the rest of the R code below the details of which can be found in the previous section.

# load extra libraries
# install.packages("stringr")
library("stringr")

# get locations of all cn files
files <- Sys.glob("~/Desktop/*cc2.tsv")

# create function to read in and format data
a <- function(x){
# read data and set column names
colnames(data) <- c("chromosome", "start", "end", "probes", "segmean")

# get the sample name from the file path
sampleName <- str_extract(x, "H_OM.+cc2")
sampleName <- gsub(".cc2", "", sampleName)
genomeBoundaries$chromStart <- 0 colnames(genomeBoundaries) <- c("chromosome", "end", "start") # create the plot cnSpec(cnData, y=genomeBoundaries)  # The genome parameter These plots are fairly straightforward, but it is helpful to know what the genome and y parameters actually do. As eluded to in the previous section only one of these is required, as they both do the same thing, define genome boundaries. This is done to ensure that if you only had data for one section of a chromosome the entire chromosome space is still plotted. To get a sense of what is actually happening let’s add a flank to genomeBoundaries and see what happens. You’ll see in the plot below that all the chromosomes plotted below now have a padding where no data is plotted. # add a flank to the genomic coordinates genomeBoundaries_2 <- genomeBoundaries genomeBoundaries_2$start <- genomeBoundaries_2$start - 1e8 genomeBoundaries_2$end <- genomeBoundaries_2$end + 1e8 # create a plot cnSpec(cnData, y=genomeBoundaries_2)  ### Exercises There are times when a genome wide view of copy number changes is not necessary or desireable, for example we may want to view just chromosome 17 to look for ERBB2 amplifications as these are HER2+ breast cancer samples. We can achieve this by creating a custom genome to input into the y parameter. Try using what you know to only plot chromosome 17 and use geom_vline() to highlight were ERBB2 is located. You can pass additional plot layers to cnSpec() with the param plotLayer. Your plot should look something like the one below. Get a hint! You'll need to create a custom genome that is only chromosome 17, you'll need to subset your input data as well. Answer cnSpec(cnData[cnData$chromosome=="17",], y=genomeBoundaries[genomeBoundaries\$chromosome=="chr17",], plotLayer=geom_vline(xintercept = 39709170, colour="seagreen4", size=1, linetype=2))

You could of course add whichever layers you wish to change any aspect of the plot’s we’ve been producing, however you will often find parameters within GenVisR functions in order to make the most common changes. Try reading the documentation for cnSpec() and create a version of the plot below.

Get a hint!

One of the parameters is "CN_Loss_colour"