Welcome to the blog

# Posts

My thoughts and ideas

Introduction to copy number frequency plots | Griffith Lab

# Introduction to copy number frequency plots

A common task in any bioinformatic analysis of next generation sequencing data is the the determination of copy number gains and losses. The cnFreq() function from the GenVisR package is designed to provide a summary level view of copy number calls for a cohort of cases. It uses a barchart like plot to display the proportion or frequency of CN gains or losses observed across the genome. In this section we will use cnFreq() to explore the copy number frequencies and proportions for 10 her2 positive breast cancer samples from the manuscript “Genomic characterization of HER2-positive breast cancer and response to neoadjuvant trastuzumab and chemotherapy-results from the ACOSOG Z1041 (Alliance) trial.”

The data we will be working with was generated with the R package Copycat2. The output of this program consists of a file containing segmented copy number calls. You can find these files on http://genomedata.org/gen-viz-workshop/GenVisR/, go ahead and download all files with a .cc2.tsv extension. Once these files are downloaded we will need to read them into R and coerce them into a single data frame with column names “chromosome”, “start”, “end”, “segmean”, and “sample”. As a first step install and load the stringr package we’ll need this to make some of the string manipulation we’ll be doing easier. Once stringr is loaded run through the rest of the R code below, we’ll explain step by step what’s going on in the paragraph below.

# load extra libraries
# install.packages("stringr")
library("stringr")

# get locations of all cn files
files <- Sys.glob("~/Desktop/*cc2.tsv")

# create function to read in and format data
a <- function(x){
# read data and set column names
colnames(data) <- c("chromosome", "start", "end", "probes", "segmean")

# get the sample name from the file path
sampleName <- str_extract(x, "H_OM.+cc2")
sampleName <- gsub(".cc2", "", sampleName)
datasample <- sampleName # return the data return(data) } # run the anonymous function defined above cnData <- lapply(files, a) # turn the list of data frames into a single data frame cnData <- do.call("rbind", cnData)  After the stringr package is loaded we need to create a vector of all copy number files we downloaded, for this we use the function Sys.glob(). Once we have our vector of copy number files we can create an anonymous function, a, to read data into R from a file path. In this function we first use read.delim() to read in our data, remember to set the parameter header=F as these files have no header. With the file read in as a data frame we can then use the function colnames() to set the appropriate column names. All of the required column names are already present save for one, the sample name. We use the str_extract() function from the stringr package along with a call to gsub() to pull this information from the file name and add this as another column. We use the lapply() function to run the anonymous function we defined on each file path giving us a list of data frames with copy number data read in. Finally we use the function do.call() to execute the function rbind() each data frame in the list giving us a single data frame of copy number values. ### Creating an initial plot Now that we have our copy number calls loaded into R we can call the cnFreq() on our data set. We will also need to give a genome assembly via the parameter genome. This expects a character string specifying one of “mm9”, “mm10”, “hg19”, “hg38”, and “rn5” and is used to ensure that the entire chromosomes for the genome is plotted. In the next version of GenVisR this function will be expanded to allow input of custom genome boundaries if needed. # call the cnFreq function cnFreq(cnData, genome="hg19")  As we can see from the plot above our cohort exhibits a strong amplification signal across much of the genome, particularly chromosomes 8, 17, and 20 where over half of samples have some sort of copy number gain. You might be asking what in cnFreq() constitutes as a copy number gain or loss. This is controlled with the parameters CN_low_cutoff and CN_high_cutoff with calls ≤ 1.5 and ≥ 2.5 being evaluated as losses and gains respectively. This setting is reasonable for most situations in which the genome is diploid however if the data is noisy or has a different ploidy these parameters may need to be altered. We can observe the effect this has by setting these cutoffs such that all copy number calls would be considered amplifications. Go ahead and add CN_low_cutoff = 0, and CN_high_cutoff = .1, as expected we can see that almost the entire plot shows amplifications with a few exceptions where samples are simply misssing data for that region. # change the CN cutoffs cnFreq(cnData, genome="hg19", CN_low_cutoff = 0, CN_high_cutoff = .1)  The cnFreq() function is much faster when all copy number segments have the same coordinates across all samples however this condition is not a requirement to run cnFreq(). It is important to talk about why and what goes on behind the scenes when this condition is not met as it may affect how we interpret the results. When segment coordinates are not identical cnFreq() will perform a disjoin operation and attempt to split segment coorinates so that they are identical across samples populating each coordinate with the appropriate copy number call. This is best illustrated in the two sample figure below. This operation may come with a loss of accuracy however as copy number calls are ussually made using the mean of many calls within that segment and splitting these segment up can therefore alter what the mean value of the copy number call would have been. ### highlighting plot regions Let’s go back to our initial plot, what if we wanted to view a specific region of the genome such as ERBB2 (chr17:39,687,914-39,730,426) as these are HER2+ breast cancers. the function cnFreq() has the ability to zoom into specific chromosomes with the parameter plotChr and we have the ability to add additional layers onto our plot with the parameter plotLayer. Let’s go ahead and view just chromosome 17 and add in a vertical line where ERBB2 resides. As expected our vertical line intersects at a region where over 75% of samples are amplified. We can verify this by outputing the actual data calculations instead of the plot with the parameter out="data". # highlight ERBB2 library(ggplot2) layer1 <- geom_vline(xintercept=c(39709170)) cnFreq(cnData, genome="hg19", plotChr="chr17", plotLayer=layer1)  ### Exercises In almost all GenVisR functions the data frames which are used to produce plots can be extracted using the parameter out=data. This feature is not only usefull for looking at specific data values but also for infering what ggplot2 function calls or happening within GenVisR behind the scenes. With a bit of knowledge regarding ggplot2 we can overwrite core plot layers to do something specific. In the plots above we can tell that the function facet_grid() is being called just from looking at the plot. We can also see that facet_grid() is setting scales for each facet based on the size of each chromosome. Try overwriting this layer to remove this behavior with the parameter plotLayer. Your plot should look something like the one below. Get a hint! The parameter you will need to alter in facet_grid() is "scales=" Answer facet_grid(. ~ chromosome, scales="free") Introduction to gene coverage plots+ | Griffith Lab ## Genomic Visualization and Interpretations # Introduction to gene coverage plots+ Often is is usefull to view coverage of a specific region of the genome in the context of specific samples. During initial stages of analysis this can be done with a genome browser such as IGV however when preparing a publication more fine grain control is usefull. For example you may wish to change the coverage scale, reduce the size of introns, or visualize many samples at once. GenVisR provides a function for this aptly named genCov(). ### Introduction to datasets In this section we will be using coverage data derived from two mouse samples from the study “Truncating Prolactin Receptor Mutations Promote Tumor Growth in Murine Estrogen Receptor-Alpha Mammary Carcinomas”. We will be showing that the knockout of the STAT1 gene described in the manuscript was successful. In order to obtain the preliminary data we used the command bedtools multicov -bams M_CA-TAC245-TAC245_MEC.prod-refalign.bam -bed stat1.bed to obtain coverage values for the wildtype TAC245 sample and the tumor free knockout TAC265 sample. Go ahead and download the output of bedtools multicov from here and load it into R. # read the coverage data into R covData <- read.delim("STAT1_mm9_coverage.tsv")  ### Formating coverage data Before we get started we need to do some preliminary data preparation to use genCov(). First off genCov() expects coverage data to be in the form of a named list of data frames with list names corresonding to sample id’s and column names “chromosome”, “end”, and “cov”. Further the chromosome column should be of the format “chr1” instead of “1”, we’ll explain why a bit later. Below we rename our data frame columns with colnames() and create an anonymous function, a, to go through and split the data frame up into a list of data frames by sample. We then use the function names() to assign samples names to our list. # rename the columns colnames(covData) <- c("chromosome", "start", "end", "TAC245", "TAC265") # create a function to split the data frame into lists of data frames samples <- c("TAC245", "TAC265") a <- function(x, y){ col_names <- c("chromosome", "end", x) y <- y[,col_names] colnames(y) <- c("chromosome", "end", "cov") return(y) } covData <- lapply(samples, a, covData) names(covData) <- samples  # loading a BSgenome and TxDb object The next set of data we need is an object of class BSgenome, short for biostrings genome. These object are held in bioconductor annotation packages and store reference sequences in a way that is effecient for searching a reference. To view the available genomes maintained by bioconductor you can either install the BSgenome package and use the function available.genomes() or you can use biocViews on bioconductors website. We also need to load in some transcription meta data corresponding to our data. This type of data is also stored on bioconductor and is made available through annotation packages as TxDb objects. You can view the available TxDb annotation packages using biocViews. In our situation the coverage data for our experiment corresponds to the mm9 reference assembly, so we will load the BSgenome.Mmusculus.UCSC.mm9 and TxDb.Mmusculus.UCSC.mm9.knownGene libraries. It is important to note that the chromosomes between our data sets must match, UCSC appends “chr” to chromosome names and because we are using these libraries derived from UCSC data we must do the same to our coverage input data. # install and load the BSgenome package and list available genomes # source("https://bioconductor.org/biocLite.R") # biocLite("BSgenome") library("BSgenome") available.genomes() # install and load the mm9 BSgenome from UCSC # source("https://bioconductor.org/biocLite.R") # biocLite("BSgenome.Mmusculus.UCSC.mm9") library("BSgenome.Mmusculus.UCSC.mm9") genomeObject <- BSgenome.Mmusculus.UCSC.mm9 # Install and load a TxDb object for the mm9 genome # source("https://bioconductor.org/biocLite.R") # biocLite("TxDb.Mmusculus.UCSC.mm9.knownGene") library("TxDb.Mmusculus.UCSC.mm9.knownGene") TxDbObject <- TxDb.Mmusculus.UCSC.mm9.knownGene  # Creating a Granges object The final bit of required data we need is an object of class GRanges, short for “genomic ranges”. This class is a core class maintained by bioconductor and is the preferred way for storing information regarding genomic positions. In order to use this class we need to install the GenomicRanges package from bioconductor and use the GRanges() constructor function. Let’s go ahead and define a genomic location within this class corresponding to our coverage data. # get the chromosome, and the start and end of our coverage data chromosome <- as.character(unique(covData[[1]]chromosome))
start <- as.numeric(min(covData[[1]]$end)) end <- as.numeric(max(covData[[1]]$end))

# define the genomic range
grObject <- GRanges(seqnames=c("chr1"), ranges=IRanges(start=start, end=end))


# Creating an initial coverage plot

Now that we have all the basic inforamation we need let’s go ahead and pass all of this information to genCov() and get an initial coverage plot. We will also use the parameter cov_plotType so that the coverage is shown as a line graph, this is a matter of personal preference and may not be suitable is the coverage values vary widly from base to base. Other values cov_plotType accepts are “bar” and “point”.

# create an initial plot
genCov(x=covData, txdb=TxDbObject, gr=grObject, genome=genomeObject, cov_plotType="line")


We now have an initial plot, let’s start talking about what is actually being displayed. The first thing you might have noticed is the gene track. Here we have representations for every genomic feature from the TxDb object encompassed by the genomic ranges we specified in grObject. We can see that there are 6 STAT1 isoforms, we can further view the gc content proportion for each feature in this region. You might have noticed a warning message as well, something along the lines of Removed 3 rows containing missing values (geom_text)., and we can see that not all of our isoforms are labled. If you’re wondering what’s going on we specified a genomic range right up to the gene boundaries for a few of these isoforms. Because genCov() is running out of plotting space to place these labels it, or rather ggplot2 removes them. Let’s go ahead and fix that by adding a flank to our genomic range, don’t worry about changing covData, genCov() will compensate and make sure everything still aligns. Let’s also only plot the canonical STAT1 isoform using the gene_isoformSel parameter, looking at the genome browser from UCSC we can see that this is “uc007aya.1”.

# add a flank and create another plot
grObject <- GRanges(seqnames=c("chr1"), ranges=IRanges(start=start-500, end=end+500))
genCov(x=covData, txdb=TxDbObject, gr=grObject, genome=genomeObject, cov_plotType="line", gene_isoformSel="uc007aya.1")


The second thing you may have noticed regarding these plots is that the space between types of genomic features (“cds”, “utr”, “introns”) seems off. In order to plot what is likely the most relevant data genCov() is scaling the space of each of these feature types. By default “Intron”, “CDS”, and “UTR” are scaled by a log factor of 10, 2, and 2 respectively. This can be changed by altering the base and transform parameters which correspond to each other. Alternatively setting these parameters to NA will remove the scaling entirely. Let’s go ahead and do that now just to get a sense of how things look without scaling.

# adjust compression of genomic features
genCov(x=covData, txdb=TxDbObject, gr=grObject, genome=genomeObject, cov_plotType="line", gene_isoformSel="uc007aya.1", base=NA, transform=NA)


### Exercises

From the plots we produced above we can clearly see the first 3 exons of this isoform have been knocked out successfully in the “TAC265” sample and the “TAC245 Wild Type” sample is unchanged. The genCov() function also has many parameters to customize how the plot looks. Try viewing the documentation for genCov() and use these parameters to produce the plot below, remember documentation can change between version so type ?genCov in your R terminal for documentation specific to the version you have installed.

Get a hint!

Have a look at the following parameters: label_bgFill, label_txtFill, cov_colour, gene_colour