Welcome to the blog

# Posts

My thoughts and ideas

Introduction to ProteinPaint | Griffith Lab

# Introduction to ProteinPaint

ProteinPaint is a tool made available as part of the PeCan Data Portal. The principle goal of this data portal is to facilitate exploration of childhood cancer genomics data. However, some tools, such as ProteinPaint are generally useful for visualizing the recurrence of any set of variants in a gene in the context of protein domains and other information.

This section will provide a brief introduction to ProteinPaint’s features and demonstrate its use with a few examples and exercises.

The tool is entirely web based. First navigate to the tool’s homepage: ProteinPaint. Note that it has its own tutorials.

### Guided tour of pre-loaded data

Go through the following exercise to explore the functionality of this resource:

• Choose an example gene to explore (e.g. EGFR). Enter it in the search box and select it from the drop down of suggestions.
• Select the gene name (top left) to view a summary of the gene.
• Use the CONFIG option (right side) to change between display modes by selecting switch display mode. Try the Splicing RNA option. Note these options are also available from the gene summary view mentioned above.
• Use the Tracks button to view the list of available tracks and turn on the RefGene track.
• Use the CONFIG menu to change switch display mode back to Protein. Also turn off the RefGene track.

• Select pre-loaded data to display (Pediatric, COSMIC, and ClinVar). Select COSMIC for this example. Use the ABOUT popup to learn more about each data source.
• Zoom In on a mutation hotspot and navigate around that region. Once done, Zoom Out x50 to return to a view of the entire gene.
• Under the More menu, try Select a region to highlight. Select a region containing a hotspot mutation. Change the highlight color using the same menu.
• Clear the highlighted region. Now use the + add protein domain button (bottom left) to add a custom domain of interest. e.g. MUTATION HOTSPOT ; 700 900; red
• Mouse over the gene diagram to view details on individual domains and amino acid positions.
• Select specific variants to learn more about what diseases they occur in. For example, load ERBB2 and the COSMIC data track. Two of the most common mutations are S310F and L755S. Select both of these. Which is more common in breast cancer? Use the List option to learn more about individual variant observations.

### Importing custom data

• Suppose we want to load a custom data set of our own variants into ProteinPaint. For illustration, we will use CIViCs list of VHL variants. To get that list go to www.civicdb.org, select SEARCH, select the Variants tab, create a query where Gene is VHL, and hit the Search button. Once the query completes, use the Get Data option to Download CSV. Save this file. Rename it to CIViC-VHL-Variants.csv. We’ll need to reformat the data before inporting it into ProteinPaint.

• Open this file in R and perform the clean-up as follows:

# Load the data downloaded from CIViC
dir()
x = read.csv(file = "CIViC-VHL-Variants.csv", as.is=1:4)

# Store only the variant names that we will parse for protein coordinates
vhl_variants1 = x[,2]

# Tidy up the names to remove the c. notations
vhl_variants2 = gsub("\\s+\$$.*\$$", "", vhl_variants1, perl=TRUE)

# Remove complex expressions beyond the "fs" in some variants
vhl_variants3 = gsub("fs.*", "fs", vhl_variants2, perl=TRUE)

# Limit to only those variants with a format like: L184P
vhl_variants4 = vhl_variants3[grep("^\\w+\\d+\\w+\$", vhl_variants3, ignore.case = TRUE, perl=TRUE)]

# Remove variants with an underscore
vhl_variants5 = vhl_variants4[grep("_", vhl_variants4, ignore.case = TRUE, perl=TRUE, invert = TRUE)]

# Store the variant names for later
vhl_variant_names = vhl_variants5

# Extract the amino acid position numbers
vhl_variant_positions = gsub("\\D+(\\d+)\\D+", "\\1", vhl_variants5, perl=TRUE)

# Create a variant types list
# M=MISSENSE, N=NONSENSE, F=FRAMESHIFT, I=PROTEININS, D=PROTEINDEL, S=SILENT
types = vector(mode = "character", length = length(vhl_variant_names))
types[1:length(vhl_variant_names)] = "M"
types[grep("\\*", vhl_variant_names, ignore.case = TRUE, perl=TRUE)] = "N"
types[grep("fs", vhl_variant_names, ignore.case = TRUE, perl=TRUE)] = "F"
types[grep("ins", vhl_variant_names, ignore.case = TRUE, perl=TRUE)] = "I"
types[grep("del", vhl_variant_names, ignore.case = TRUE, perl=TRUE)] = "D"

# Store the values we care about in a new data frame
vhl_variants_final = data.frame(vhl_variant_names, vhl_variant_positions, types)

# Create the final format strings requested for ProteinPaint of the form: R200W;200;M
format_string = function(x){
t = paste (x["vhl_variant_names"], x["vhl_variant_positions"], x["types"], sep = ";")
return(t)
}
output = apply(vhl_variants_final, 1, format_string)

# Write the output to a file
write(output, file="CIViC-VHL-Variants.formatted.csv")

• Open the resulting file (CIViC-VHL-Variants.formatted.csv) in a text editor.
• Then, in ProteinPaint load the VHL gene.
• Use the + button to add data (top center), and choose the SNV/Indel option.
• Then paste the formatted variant lines from the R cleanup exercise into that box and hit submit.
• Compare the VHL variants from CIViC to those in ClinVar.

### ProteinPaint practice exercises

What are the three most recurrent mutation in PIK3CA according to COSMIC?

Get a hint!

Load PIK3CA, activate the COSMIC track, and look for the mutations with highest patient counts

H1047R, E545K, and E542K are the most recurrent mutations in PIK3CA according to COSMIC

What is the top tissue of origin observed for each of these three mutations?

Get a hint!

Click on the circle for each mutation and examine the tissue distribution plot

H1047R (breast), E545K (large intestine), and E542K (large intestine)

Load the Pediatric data for RUNX1T1. (A) What special kind of variant is indicated? (B) Load the RNA-seq plot for these data. Mouse over the RUNX1 variant. What interesting pattern do you observe? (C) Highlight the top 25 samples in the RNA-seq expression plot. What type of cancer dominates?

Get a hint!

Load RUNX1T1, make sure the Pediatric data track is activated, and make sure the RNA-seq gene expression panel is open

(A) RNA gene fusion variants. (B) The RUNX1-RUNX1T1 (aka AML-ETO) fusion variant corresponds to samples with very high RUNX1T1 expression. (C) AML cancer dominates the top 25 samples with highest RUNX1T1 expression.

Repeat the exercise above where we extract variants from CIViC for KRAS, create a clean version of these data, and load them into ProteinPaint.

Get a hint!

You should be able to do almost exactly what we did with VHL, but for KRAS instead

Advanced exercise. Identify a set of variants from your own gene for a single gene. Repeat the exercise above using these variants. If they are not human variants, it may be possible to first identify the closest human ortholog, and second to do “lift over”” of the coordinates.

Get a hint!

Depending on the form of the variants, some different kind of parsing and reformatting may be needed. If you need to convert the variants from one species to another you will learn more about tools for identifying orthologs and performing liftovers in later sections of this workshop. You may want to come back to this exercise...

Introduction to liftover tools | Griffith Lab

# Introduction to liftover tools

A common analysis task is to convert genomic coordinates between different assemblies. Probably the most common situation is that you have some coordinates for a particular version of a reference genome and you want to determine the corresponding coordinates on a different version of the reference genome for that species. For example, you have a bed file with exon coordinates for human build GRC37 (hg19) and wish to update to GRCh38. Many resources exist for performing this and other related tasks. In this section we will go over a few tools to perform this type of analysis, in many cases these tools can be used interchangeably. This post is inspired by this BioStars post (also created by the authors of this workshop).

### Reference assemblies

First let’s go over what a reference assembly actually is. A reference assembly is a complete (as much as possible) representation of the nucleotide sequence of a representative genome for a specific species. These assemblies provide a powerful shortcut when mapping reads as they can be mapped to the assembly, rather than each other, to piece the genome of a new individual together. This has a number of benefits, the most obvious of which is that it is far more effecient than attempting to build a genome from scratch. By its very nature however using this approach means there is no perfect reference assembly for an individual due to polymorphisms (i.e. snps, hla-type, etc.). Furthermore, due to the presence of repetitive structural elements such as duplications, inverted repeats, tandem repeats, etc. a given assembly is almost always incomplete, and is constantly being improved upon. This leads to the publication of new assembly versions every so often such as grch37 (Feb. 2009) and grch38 (Dec. 2013) for the Human Genome Project. It is also important to be aware that different organizations can publish different reference assemblies, for example grch37 (NCBI) and hg19 (UCSC) are identical save for a few minor differences such as in the mitochondria sequence and naming of chromosomes (1 vs chr1). For a nice summary of genome versions and their release names refer to the Assembly Releases and Versions FAQ.

### Resources available

There are many resources available to convert coordinates from one assemlby to another. We will go over a few of these. However, below you will find a more complete list. The UCSC liftOver tool is probably the most popular liftover tool, however choosing one of these will mostly come down to personal preference.

• UCSC liftOver: This tool is available through a simple web interface or it can be downloaded as a standalone executable. To use the executable you will also need to download the appropriate chain file. Each chain file describes conversions between a pair of genome assemblies. Liftover can be used through Galaxy as well. There is a python implementation of liftover called pyliftover that does conversion of point coordinates only.

• NCBI Remap: This tool is conceptually similar to liftOver in that it manages conversions between a pair of genome assemblies but it uses different methods to achieve these mappings. It is also available through a simple web interface or you can use the API for NCBI Remap.

• The Ensembl API: The final example I described above (converting between coordinate systems within a single genome assembly) can be accomplished with the Ensembl core API. Many examples are provided within the installation, overview, tutorial and documentation sections of the Ensembl API project. In particular, refer to these sections of the tutorial: ‘Coordinates’, ‘Coordinate systems’, ‘Transform’, and ‘Transfer’.

• Assembly Converter: Ensembl also offers their own simple web interface for coordinate conversions called the Assembly Converter.

• rtracklayer: For R users, Bioconductor has an implementation of UCSC liftOver in the rtracklayer package.

• CrossMap: A standalone open source program for convenient conversion of genome coordinates (or annotation files) between different assemblies. It supports most commonly used file formats including SAM/BAM, Wiggle/BigWig, BED, GFF/GTF, VCF. CrossMap is designed to liftover genome coordinates between assemblies. It’s not a program for aligning sequences to reference genome. Not recommended for converting genome coordinates between species.

• Flo: A liftover pipeline for different reference genome builds of the same species. It describes the process as follows: “align the new assembly with the old one, process the alignment data to define how a coordinate or coordinate range on the old assembly should be transformed to the new assembly, transform the coordinates.”

### Using UCSC liftOver

Sex linkage was first discovered by Thomas Hunt Morgan in 1910 when he observed that the eye color of Drosophila melanogaster did not follow typical Mendelian inheritance. This was discovered to be caused by the white gene located on chromosome X at coordinates 2684762-2687041 for assembly dm3. Let’s use UCSC liftOver to determine where this gene is located on the latest reference assembly for this species, dm6. First navigate to the liftOver site at https://genome.ucsc.edu/cgi-bin/hgLiftOver and set both the original and new genomes to the appropriate species, “D. melanogaster”.

We want to transfer our coordinates from the dm3 assembly to the dm6 assembly so let’s make sure the original and new assemblies are set appropriately as well.

Finally we can paste our coordinates to transfer or upload them in bed format (chrX 2684762 2687041).

The page will refresh and a results section will appear where we can download the transferred cordinates in bed format.

Try and compare the old and new coordinates in the UCSC genome browser for their respective assemblies, do they match the same gene?

Yes, both coordinates match the coding sequence for the w gene from transcript CG2759-RA

### Using rtracklayer

In another situation you may have coordinates of a gene and wish to determine the corresponding coordinates in another species. This is a common situation in evolutionary biology where you will need to find coordinates for a conserved gene across species to perform a phylogenetic analysis. Let’s use the rtracklayer package on bioconductor to find the coordinates of the H3F3A gene located at chr1:226061851-226071523 on the hg38 human assembly in the canFam3 assembly of the canine genome. To start install the rtracklayer package from bioconductor, as mentioned this is an R implementation of the UCSC liftover.

# install and load rtracklayer
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("rtracklayer", version = "3.8")

library(rtracklayer)


The function we will be using from this package is liftover() and takes two arguments as input. The first of these is a GRanges object specifying coordinates to perform the query on. This class is from the GenomicRanges package maintained by bioconductor and was loaded automatically when we loaded the rtracklayer library. The second item we need is a chain file, which is a format which describes pairwise alignments between sequences allowing for gaps. The UCSC website maintains a selection of these on it’s genome data page. Navigate to this page and select “liftOver files” under the hg38 human genome, then download and extract the “hg38ToCanFam3.over.chain.gz” chain file.

Next all we need to do is to create our GRanges object to contain the coordinates chr1:226061851-226071523 and import our chain file with the function [import.chain()]. We can then supply these two parameters to liftover().

# specify coordinates to liftover
grObject <- GRanges(seqnames=c("chr1"), ranges=IRanges(start=226061851, end=226071523))

# import the chain file
chainObject <- import.chain("hg38ToCanFam3.over.chain")

# run liftOver
results <- as.data.frame(liftOver(grObject, chainObject))


How many different regions in the canine genome match the human region we specified?

210, these return the ranges mapped for the corresponding input element

### Exercises

Try to perform the same task we just complete with the web version of liftOver, how are the results different?