Genome Browsing and Visualization - UCSC | Griffith Lab

Genomic Visualization and Interpretations

Genome Browsing and Visualization - UCSC

The UCSC genome browser is a powerful web application for exploring the genomes of a variety of organisms in the context of a rich set of annotation tracks. Let’s start by navigating to the UCSC genome browser homepage at http://genome.ucsc.edu and clicking on Genome Browser. Over the course of this tutorial we will highlight features with transparent (pink or green) text boxes. Please pay attention to these, as they are relevant to the discussion.

Selecting an organism and assembly

This will take us to the browser gateway, where we can select the organism we wish to view as well as the assembly for that organism. For this tutorial we will be using the February 2009 assembly of the human genome (GRCh37/hg19). Let’s go ahead and select that from the Human Assembly dropdown menu and click on “GO”.

We are now in the genome browser for our chosen reference assembly. There is a lot of information here, but let’s start with the basics: navigating around. We can jump to a position or gene by entering it in the highlighted text box below. Let’s jump to PIK3CA, which has coordinates chr3:178,866,311-178,952,497. After jumping to PIK3CA we can see in the image below that PIK3CA is 86,187 base pairs long and resides on the q arm of chromosome 3 (green text boxes). We can shift our viewing window with the arrow buttons on the top left, and we can zoom in and out using the zoom buttons on the right (red text boxes). In addition, clicking and dragging inside the coordinate track within the browser will zoom to the window highlighted by that action. Performing the same action within any of the other tracks will shift the viewing window left or right. These are analogous to using the buttons mentioned above but allow for more user control.
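The length quoted above follows directly from the coordinates, because the positions shown in the browser are 1-based and inclusive. The following is a minimal Python sketch of that arithmetic. The helper names are hypothetical and not part of any UCSC tool, and the link it builds assumes the browser's `hgTracks` page accepts `db` and `position` query parameters.

```python
from urllib.parse import urlencode


def parse_position(position):
    """Split a browser position such as 'chr3:178,866,311-178,952,497'."""
    chrom, span = position.split(":")
    start_text, end_text = span.replace(",", "").split("-")
    return chrom, int(start_text), int(end_text)


def region_length(start, end):
    # Both ends are included in the region, hence the +1.
    return end - start + 1


def browser_url(db, chrom, start, end):
    # Assumed URL pattern: hgTracks with 'db' and 'position' parameters.
    query = urlencode({"db": db, "position": f"{chrom}:{start}-{end}"})
    return f"https://genome.ucsc.edu/cgi-bin/hgTracks?{query}"


chrom, start, end = parse_position("chr3:178,866,311-178,952,497")
length = region_length(start, end)
url = browser_url("hg19", chrom, start, end)
print(chrom, length)  # chr3 86187
print(url)
```

Pasting the printed link into a web browser should open the same PIK3CA view without typing the coordinates by hand.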

Genome browser tracks

One of the features that makes the genome browser so useful is the set of tracks displayed in the context of the genome assembly. Users can add and remove tracks by using the buttons below the genome browser. Clicking on the name of a track allows more fine-grained control of that track and gives additional information about it, such as source, citation and version numbers. A number of tracks are turned on by default, but let’s go ahead and hide all tracks to get a better sense of what’s going on. Click on the “hide all” button below the genome browser.

The browser should now be empty of tracks. Let’s go ahead and add the Ensembl gene track: find Ensembl Genes under “Genes and Gene Predictions” and click on the link. You should see a page similar to the one below. Go ahead and set the Display mode to “full”, turn on “Color track by codons” and then hit “Submit”. This will turn on Ensembl gene annotations using Ensembl version 75.

As shown below, we can now see the full list of Ensembl transcripts of PIK3CA based on the Ensembl version 75 annotations, of which there are 5. You can add as many tracks to the viewer as you want, but note that the available tracks will differ between species and even between assemblies.

Turn on the GTEx Gene track. Which tissue/cell type has the highest expression of PIK3CA?

Get a hint!

Try clicking on the track in the genome browser once it's enabled

Answer

Cells - EBV-transformed lymphocytes have the highest average expression at 11.3 RPKM.

BLAST-like alignment tool

Our discussion of the genome browser would not be complete without mentioning the BLAST-like alignment tool, commonly referred to as BLAT. As the name suggests, BLAT works similarly to BLAST; however, it works in conjunction with the UCSC genome browser and aligns only to the specified genome assembly. Let’s use BLAT to look at which section of the hg19 genome assembly the first primer pair in Supplemental Table S2 is amplifying, from the paper “Non-exomic and synonymous variants in ABCA4 are an important cause of Stargardt disease”. The forward primer used in this experiment is TTTCTGAAATTGGGATGCAG and the reverse primer is GTTTTCCCAGGCAGAACAGA. Go ahead and input these primers into BLAT in FASTA format. Make sure to select the human hg19 reference assembly, then go ahead and click on submit.
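For readers unfamiliar with FASTA format, each sequence is preceded by a header line starting with `>`. The short Python sketch below formats the two primers this way so the result can be pasted into the BLAT text box. The function name and the record labels `forward_primer` and `reverse_primer` are arbitrary choices made for this illustration.

```python
def to_fasta(records, width=60):
    """Format (name, sequence) pairs as FASTA text, wrapping long sequences."""
    lines = []
    for name, sequence in records:
        lines.append(f">{name}")
        for i in range(0, len(sequence), width):
            lines.append(sequence[i:i + width])
    return "\n".join(lines)


# Primer sequences from the text; the record names are arbitrary labels.
primers = [
    ("forward_primer", "TTTCTGAAATTGGGATGCAG"),
    ("reverse_primer", "GTTTTCCCAGGCAGAACAGA"),
]
fasta_text = to_fasta(primers)
print(fasta_text)
```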

BLAT will search the genome and output a table of possible matches between the genome assembly and the primer sequences. Unsurprisingly, there are only two entries, one for each primer, each with 100% identity. Go ahead and click on “browser” to view one of these in the UCSC genome browser. The second primer should be close by, so try zooming out to find it.
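The two primers of a PCR pair anneal to opposite strands, so when reading the reference sequence in the browser, one primer is expected to appear as written and the other as its reverse complement. A small Python sketch of that conversion, using the reverse primer from above (the function name is hypothetical):

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(COMPLEMENT)[::-1]


reverse_primer = "GTTTTCCCAGGCAGAACAGA"
as_seen_on_other_strand = reverse_complement(reverse_primer)
print(as_seen_on_other_strand)  # TCTGTTCTGCCTGGGAAAAC
```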

Which gene do these primers attempt to amplify?

Answer

ABCA4

UCSC Table Browser

BLAT allowed us to get to a region based on a sequence, but what if we want to do the reverse and get a sequence based on a region? Fortunately, the UCSC genome browser has a tool for that as well, called the UCSC Table Browser. Let’s get the DNA sequence for the gene PRLR using this tool. First navigate to the UCSC Table Browser (under Tools). There are a lot of options here, but we only need to concern ourselves with a few. First make sure that the proper assembly is specified; in our case this should be hg19 (red boxes). Next, because we want to return the sequence of the PRLR gene, “Genes and Gene Predictions” should be selected under group. Let’s go ahead and look for this gene in the “UCSC Genes” track (green boxes). Finally, the table should be set to “knownGene” (blue box).

Next, click on “paste list” under “identifiers (names/accessions)” (see arrow in the figure above) and specify that PRLR is the gene we want the sequence for, and click “Submit”.

Back at the main table browser view, change the “output format” to “sequence” and click on “get output”. The table browser will ask what type of sequence we want for PRLR. Let’s go ahead and get the protein sequence.
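Once the sequence output has been saved, it can be summarized with a few lines of Python rather than by eye. The sketch below assumes the output is plain FASTA text with one `>` header per record. The two records in `example_output` are invented placeholders, not real Table Browser output for PRLR, and the function name is hypothetical.

```python
def read_fasta(text):
    """Return a dict mapping each header (without '>') to its full sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines may be wrapped, so join them back together.
            records[header] += line
    return records


# Invented placeholder records; real headers and sequences will differ.
example_output = """>record_one
MKENV
ASATV
>record_two
MKE
"""

records = read_fasta(example_output)
print(len(records))  # number of records in the output
for header, sequence in records.items():
    print(header, len(sequence))
```

Counting records and comparing sequence lengths in this way is one option for answering the exercises that follow.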

How many transcripts are there for PRLR in the UCSC known genes track? Use the results from the table browser.

Answer

9

Try using BLAT on the protein sequence for entry “uc003jjm.3”. Which Ensembl transcript does this protein most closely correspond to?

Hint

Remember that since we output and BLAT aligned a protein sequence, only the protein coding part is expected to match an Ensembl transcript annotation.

Hint

You may need to zoom in to each end of the BLAT alignment to see how it lines up with the CDS/UTR boundaries of specific Ensembl transcripts.

Answer

ENST00000382002 encodes the same 622 amino acid sequence as uc003jjm.3.

Genome Browsing and Visualization - Ensembl | Griffith Lab

Genome Browsing and Visualization - Ensembl

The Ensembl Genome Browser provides a portal to sequence data, gene annotations/predictions, and other types of data hosted in the various Ensembl databases. Many consider Ensembl to be the most comprehensive and systematic gene annotation resource in the world. Ensembl supports a large number of species and makes data available through a powerful web portal as well as through direct database downloads and APIs. Their excellent Help & Documentation pages provide instruction on using the website, data access, APIs, and their procedures for gene annotation and prediction. Their outreach team have put together extensive teaching materials that are available for free online. Rather than duplicate effort, we have linked to some of their instructional videos below. We will review these and then perform some simple exercises to familiarize ourselves with the Ensembl Genome browser.

Introduction to genome browsers using Ensembl


What is a scaffold?

A scaffold is a long stretch of genomic sequence that has been assembled but not necessarily yet assigned to a chromosome. A scaffold is typically made up of contigs and gaps assembled into a single sequence with known order and orientation between the contigs.
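To make the relationship between contigs, gaps and scaffolds concrete, here is a toy Python sketch. It assumes the common convention of writing gaps as runs of `N`. The function name and the tiny sequences are made up for illustration and do not come from any real assembly.

```python
def build_scaffold(parts):
    """Join ordered parts into one scaffold sequence.

    Each part is ('contig', sequence) or ('gap', length_in_bases).
    Gaps are written as runs of 'N' (an assumed, commonly used convention).
    """
    pieces = []
    for kind, value in parts:
        if kind == "contig":
            pieces.append(value)
        elif kind == "gap":
            pieces.append("N" * value)
        else:
            raise ValueError(f"unknown part type: {kind}")
    return "".join(pieces)


# Made-up contigs in a known order and orientation, separated by sized gaps.
toy_scaffold = build_scaffold([
    ("contig", "ACGTACGT"),
    ("gap", 5),
    ("contig", "TTGCA"),
])
print(toy_scaffold)  # ACGTACGTNNNNNTTGCA
```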

Does Ensembl produce its own genome assembly?

No. Ensembl imports genome assemblies from other sources (e.g., the Genome Reference Consortium) and then annotates genes and other features to the same reference as available elsewhere (UCSC, NCBI, etc).

When do transcripts belong to the same gene?

Transcripts that share exons transcribed from the same strand are considered to belong to the same gene locus in Ensembl.

What are the two main types of transcripts annotated in Ensembl?

The two main types of transcripts annotated in Ensembl are (protein) coding and non-coding.


The Ensembl Genome Browser: an overview


What is a stable identifier in Ensembl?

Stable identifiers are IDs for features (gene, transcript, protein, exon, etc.) that should not change even when underlying data and meta-data for those features change. Examples include ENSG..., ENST..., ENSP... for human Ensembl gene, transcript, and proteins. Other species will have modified prefixes but follow the same conventions. For example, Ensembl dog genes are named ENSCAFG...
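The naming convention described above can be checked with a regular expression. The Python sketch below is a simplification made for this example: it assumes `ENS`, an optional three-letter species prefix (empty for human), a single feature letter (`G`, `T`, `P` or `E`), eleven digits, and an optional `.version` suffix. Identifiers that fall outside this pattern are reported as unrecognized rather than handled, and the function name is hypothetical.

```python
import re

# Assumed pattern: ENS + optional 3-letter species prefix + feature letter
# + 11 digits + optional ".version". This is a simplification.
STABLE_ID = re.compile(
    r"^ENS(?P<species>[A-Z]{3})?(?P<feature>[GTPE])"
    r"(?P<number>\d{11})(?:\.(?P<version>\d+))?$"
)

FEATURE_NAMES = {"G": "gene", "T": "transcript", "P": "protein", "E": "exon"}


def parse_stable_id(identifier):
    """Return the parts of a stable ID, or None if it does not match."""
    match = STABLE_ID.match(identifier)
    if match is None:
        return None
    version = match.group("version")
    return {
        "species_prefix": match.group("species") or "",  # empty for human
        "feature": FEATURE_NAMES[match.group("feature")],
        "version": int(version) if version is not None else None,
    }


print(parse_stable_id("ENSG00000141510"))  # human gene ID from this tutorial
print(parse_stable_id("ENST00000382002"))  # human transcript ID from this tutorial
# Made-up number combined with the dog prefix mentioned in the text.
print(parse_stable_id("ENSCAFG00000000001"))
```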

How many protein coding transcripts does human CDKN2B have?

Ensembl has two protein coding transcripts annotated for human CDKN2B.

Which species has a CDKN2B orthologous gene that most closely matches human?

The chimpanzee CDKN2B is 100.00% similar to its human counterpart. This can be determined under the Comparative Genomics Gene Tree or Orthologues section of the human CDKN2B gene page.


Data Visualization with Ensembl

An excellent way to explore the data visualization possibilities with Ensembl is to use their Find a Data Display page. This is linked directly from the Ensembl home page (see red box below). From this page, you can choose a gene, region or variant and then browse a selection of relevant visualisations.

Navigate to the Find a Data Display page. To illustrate, select ‘Species’ -> ‘Human’, ‘Feature Type’ -> ‘Genes’, and then ‘Identifier’ -> ‘TP53’. You will be presented with a number of possible matches. Select the exact ‘TP53’ match and select ‘Go’. The results page, at time of writing, returned a comprehensive set of 47 views for TP53 (ENSG00000141510) associated with: Sequence & Structure, Expression & Regulation, Transcripts & Proteins, Comparative Genomics, and Variants. We will display just a few examples here and then explore others through exercises.
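The same gene can also be looked up programmatically through the Ensembl APIs mentioned earlier. The Python sketch below only builds request URLs, assuming the public REST server at `rest.ensembl.org` and its `lookup/symbol` and `lookup/id` endpoints. The helper names are hypothetical, and `fetch_json` is defined but not called here because it needs network access.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

# Assumed public Ensembl REST server.
REST_SERVER = "https://rest.ensembl.org"


def lookup_symbol_url(species, symbol):
    """URL for looking up a gene by symbol, e.g. homo_sapiens / TP53."""
    return f"{REST_SERVER}/lookup/symbol/{quote(species)}/{quote(symbol)}"


def lookup_id_url(stable_id):
    """URL for looking up a feature by its stable identifier."""
    return f"{REST_SERVER}/lookup/id/{quote(stable_id)}"


def fetch_json(url):
    """Fetch a URL and decode the JSON reply (needs network; not called here)."""
    request = Request(url, headers={"Content-Type": "application/json"})
    with urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


symbol_url = lookup_symbol_url("homo_sapiens", "TP53")
id_url = lookup_id_url("ENSG00000141510")
print(symbol_url)
print(id_url)
```

Calling `fetch_json(symbol_url)` on a machine with internet access should return a record describing the gene, which can be compared with what the web pages show.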

Select the ‘Splice Variants’ view and scroll down the page a little. You will see a graphical representation (see below) of all known and predicted transcripts for TP53, and how their exons line up with each other and with other features such as protein domains.


How many different protein domains are annotated for TP53 according to the Pfam database?

Pfam reports three domains for TP53: tetramerisation, DNA-binding, and transactivation domains.

Which domain is most consistently conserved across the many different isoforms of TP53?

It appears that all or part of the DNA-binding domain is the most consistently conserved across the isoforms of TP53 based solely on which exons are included in each isoform.


Next, examine the ‘Gene Gain/Loss Tree’ for TP53. Use your browser back button (or the instructions above) to go back to the data display views for TP53 and then select ‘Gene Gain/Loss Tree’. If this does not work, you can also select ‘Gene Gain/Loss tree’ from the side bar of the ‘Gene-based displays’ -> ‘Comparative Genomics’ menu on any gene page.


Which species has the most significant increase in TP53 gene copy number?

The TP53 gene family has 14 members for elephant compared to 2, 3 or 4 for all other Ensembl species. In fact, a study in 2016 reported finding at least 20 copies of TP53 in the elephant genome and suggested that this explains why elephants do not have increased risk of cancer despite their larger body size.


Ensembl Data Display Exercise

Using your knowledge of tissue-specific expression for a specific species/gene, explore the Gene Expression views in Ensembl. Does the available data confirm your knowledge of these genes? For example, considering human genes, we might investigate MSLN (Mesothelin), which is normally present on the mesothelial cells lining the pleura, peritoneum and pericardium and is over-expressed in several cancers. Other interesting human/cancer tissue markers include KLK3 (PSA), EPCAM, SCGB2A2 (Mammaglobin), CD19, etc. Below you can see an example for PSA.


What is the tissue expression pattern for PSA?

PSA is expressed almost exclusively in the prostate gland and only at low levels in a few other tissues.


Ensembl Genomes - Extending Ensembl across the taxonomic space

The EnsemblGenomes site hosts genome-scale data from ~52,000 species, most of which are not available through the core Ensembl. Data are organized into five taxonomic categories: bacteria (n=50364), protists (n=200), fungi (n=1802), plants (n=63), and metazoa (n=74). Each generally provides at least a preliminary genome assembly, gene annotations, and to varying degrees includes: variation data, pan compara data, genome alignments, peptide alignments, and other alignments. If your species is not in Ensembl it is worth checking whether it is available in EnsemblGenomes.
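As a quick check of the figures above, the per-category counts can be summed in Python. The numbers are copied from the text and the variable names are arbitrary.

```python
# Species counts per taxonomic category, as quoted in the text.
species_counts = {
    "bacteria": 50364,
    "protists": 200,
    "fungi": 1802,
    "plants": 63,
    "metazoa": 74,
}

total = sum(species_counts.values())
bacteria_share = species_counts["bacteria"] / total
print(total)  # 52503
print(f"{bacteria_share:.1%} of the species are bacteria")
```

The total is consistent with the rounded figure of ~52,000 species, and it shows that the great majority of them are bacteria.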