Welcome to the blog

Posts

My thoughts and ideas

Introduction to BioMart | Griffith Lab

Genomic Visualization and Interpretations

Introduction to BioMart

Ensembl BioMart is a powerful web tool (with API) for performing complex querying and filtering of the various Ensembl databases (Ensembl Genes, Mouse Strains, Ensembl Variation, and Ensembl Regulation). It is often used for ID mapping and feature extraction. Almost any data that is viewable in the Ensembl genome browser can be accessed systematically from BioMart.

Browsing

To browse Ensembl BioMart you can simply navigate to the main page directly, by google or from any Ensembl page.

To do anything, you will first need to select a database from the CHOOSE DATABASE menu. Currently, this can be any of: Ensembl Genes, Mouse Strains, Ensembl Variation, or Ensembl Regulation. Note, the database version is included in the name (e.g., Ensembl Genes 95). The Ensembl Genes dataset is probably the most commonly desired choice, depending on your purpose. Next, you will select a dataset from the CHOOSE DATASET menu, which for Genes means selecting a species (e.g., Human genes (GRCh38.p10)). For illustration, we will walk through some examples using the Ensembl Genes 95 database and Human genes dataset. Once a database and dataset have been selected, you have the option to apply Filters and desired Attributes from the left panel. Note that two attributes (Gene stable ID and Transcript stable ID) have been pre-selected for us.

If we were to select Results now you would get the complete Gene and Transcript stable IDs for all genes in the Human Ensembl Genes (v95) database (see below). By default, 10 rows of results in HTML format (including hyperlinks to Ensembl pages) are shown. Selecting Count would give us a numerical summary of all records. In this case there are 63,967 results (Ensembl genes) available. We have the option to view or export in HTML or plain text (TSV or CSV) and also to export as XLS. Depending on what attributes are selected for display, there may be redundant results. For example, if we ask for only HGNC symbol as an attribute, BioMart will return the symbol for every Ensembl gene. Some Ensembl genes have the same HGNC symbol. Selecting Unique results only will filter out any redundant result rows from the display or exported file. Selecting New would reset our BioMart query.

Basic Filtering

The Filters option (left panel) allows for very powerful filtering. Again, using the Human Genes database/dataset as an example, this allows filtering of human Ensembl genes (and their attributes) by: Region, Gene, Phenotype, Gene Ontology, Multi Species Comparisons, Protein Domains and Families, and Variants. For example, suppose that we want to determine all of the genes on chromosome 17 between positions 39,600,000 and 39,800,000. We can specify a filter with Chromosome/scaffold=17, Coordinates Start=39600000 and End=39800000 (Note: do not include commas in the coordinates). For Attributes we might specify Gene stable ID, HGNC symbol, Chromosome/scaffold name, Gene start (bp), and Gene end (bp). The last three will help us verify that BioMart is returning genes in the requested region.

By selecting Count and then Results, limiting to Unique results only and increasing the View to 20, we can see that there are 12 Ensembl genes in this region, including 11 with HGNC symbols. This region includes the genes commonly amplified in HER2+ (ERBB2+) breast cancer. Note that one gene (IKZF3) starts within the filter region but its Gene end extends beyond it. This tells us something important about how BioMart applies coordinate filtering. If we wanted only genes entirely contained within the region we would need to apply our own additional filters.

Gene ID Mapping

Another very common use of BioMart is for gene ID mapping. In the Gene section of Filters it is possible to limit to only Ensembl genes that have at least one associated external gene ID or microarray probe(set) from a specific source. A very large number of such external gene ID sources (e.g., CCDS, Entrez, GO, HGNC, LRG, RefSeq, Unigene, etc) or microarray platforms (e.g., Affymetrix, Agilent, Illumina) are available. It is also possible to input a list of individual external IDs or probe(set) IDs of interest and get back the matching Ensembl gene records. For example, lets suppose that we had the list of HGNC gene symbols for the HER2-amplified region mentioned above (NEUROD2, PPP1R1B, STARD3, TCAP, PNMT, PGAP3, ERBB2, MIR4728, MIEN1, GRB7, IKZF3) and we wanted to know the corresponding Ensembl gene records and relevant attributes. Create a New query and again select the Ensembl Genes 95 database and Human genes dataset. Go to Filters -> GENE, check Input external references ID list, select HGNC symbol(s) and enter the above genes in the box (one per line). Now go to Attributes -> Features and choose GENE: Ensembl - Gene stable ID, Chromosome, Gene start, Gene end, Gene name; EXTERNAL: External References - HGNC symbol; Select Results. We now have direct mapping of HGNC symbols to Ensembl Gene IDs and associated attributes.

Note: It is important to remember that this procedure for mapping only works for genes that are represented as Ensembl gene annotations and is also dependent on their internal mapping between identifiers. It is possible that a valid protein might exist in another system (e.g., UniProt) and might have a valid link to another system (e.g., Entrez Gene) according to some (Entrez or UniProt) mapping process. But, if this gene was not successfully annotated in Ensembl or not successfully linked to both of these identifiers then it would be impossible to use BioMart to map from one to the other. Gene ID mapping is complex with multiple types of underlying analysis (methods for sequence or coordinate comparison) to determine equivalence and there may be one-to-many or many-to-many relationships. Ensembl and BioMart provide a valuable tool for dealing with this problem. However edge cases exist where it may not suit your purpose. It is always a good idea to determine which genes have failed to map and determine whether this is acceptable to you.

Also note: In the above query we asked for both the Ensembl Gene name and HGNC symbol. In many cases these are the same but not always. In some cases where an HGNC symbol has not yet (or recently) been assigned, Ensembl may choose another source or convention for its Gene name.

Getting Sequence Attributes

Another powerful application of BioMart is for the retrieval of Sequence attributes for specific genes or transcripts. Suppose that we would like all peptide sequences for protein-coding transcripts of TP53. Create a New query and again select the Ensembl Genes 95 database and Human genes dataset. Go to Filters -> GENE, check Input external references ID list, select HGNC symbol(s) and enter TP53 in the box. Also, select Transcript type = protein_coding. Now go to Attributes -> Sequences and choose SEQUENCES: Sequences - Peptide; HEADER INFORMATION: Gene Information - Gene stable ID, Gene name, UniProtKB/Swiss-Prot ID; Transcript Information - Transcript stable ID, Protein stable ID, Transcript type; Select Results.

Here (above) we see peptide sequences in fasta format for all protein-coding Ensembl transcripts. Where available, the UniProt ID is listed along with Ensembl gene name, and gene, transcript, and protein ids. Download the fasta file by selecting Export all results, File, FASTA, Unique results only, and Go. Open the file in a text editor/viewer. Note that for some transcripts the peptides do not start with a methionine (e.g., ENST00000576024) or end with a stop codon (e.g., ENST00000503591). In some cases these transcripts are predictions from the Ensembl automated annotation pipeline and have limited biological evidence or support.

Ensembl BioMart practice exercises

How many transcripts are there (in Human Ensembl v95) for the gene TP53 that are: (A) protein-coding; (B) supported by at least one non-suspect mRNA (Transcript Support Level = TSL:1); but (C) have peptide sequences that still do not have a proper start or stop codon?

Get a hint!

To learn more about Ensembl Transcript Support Levels you can read their definitions in the Ensembl Glossary or review the transcript table for TP53.

Get a hint!

Sometimes, two queries are required to get a specific answer from BioMart. In this case you have been asked to limit to only transcripts with a certain Transcript support level (TSL). Unfortunately, this is not available under Filters. However, it is available under Attributes. Also note, Ensembl Transcript IDs can be used as a filter with the Filters -> GENE -> Input external references ID list.

Answer

There are is one TP53 transcript with TSL1 that does not have a start (ENST00000576024) and another that does not have a stop codon(ENST00000514944).

Solution

Create a New query and select the Ensembl Genes 95 database and Human genes dataset. Go to Filters -> GENE, check Input external references ID list, select HGNC symbol(s) and enter TP53 in the box. Also, select Transcript type = protein_coding. Select Attributes -> Features and choose GENE: Ensembl - Transcript stable ID and Transcript support level (TSL). Export the results to file (e.g., XLS), open in Excel (or similar), sort on TSL, and extract the Ensembl Transcript IDs for all tsl1 transcripts. Start a new query. Go to Filters -> GENE, check Input external references ID list, select Transcript stable ID(s) and enter the list of Ensembl transcripts from above in the box. Now go to Attributes -> Sequences and choose SEQUENCES: Sequences - Peptide; HEADER INFORMATION: Gene Information - Gene stable ID, Gene name, UniProtKB/Swiss-Prot ID; Transcript Information - Transcript stable ID, Protein stable ID, Transcript type; Select Results. Look for peptide sequences that do not start with M or end with a stop.

Using older versions of Ensembl BioMart

By default, Ensembl BioMart only presents data for the latest, most current version of Ensembl. Older versions can be accessed by navigating to an Ensembl Archive Site (linked from the bottom right of every Ensembl page) and then following the BioMart link (top left of every Ensembl Archive page). For example, the last version of Ensembl for the human GRCh37 (hg19) build was v75 (February 2014). The Archive EnsEMBL release 75 was available at http://feb2014.archive.ensembl.org and the corresponding BioMart at http://feb2014.archive.ensembl.org/biomart/martview.

Using the Ensembl BioMart API in R

In some cases it may be desirable to obtain data from Ensembl programmatically. This can be done in several ways. First, entire databases can be downloaded from the Ensembl FTP site in a variety of formats, from flat files to MySQL dumps. Second, Ensembl provides direct access to their databases via APIs. There are two main options: (1) the Ensembl Perl API; and (2) the Ensembl REST API. The Perl API has a longer history of use, supports many legacy scripts, and might be more comprehensive in terms of the number and complexity of queries it enables. It also supports any database version currently hosted on the web or locally installed. The REST API is more modern and allows you access to Ensembl data using any programming language. However, it appears to support only the most current database version (and version 75 for the human GRCh37 assembly). Finally, Ensembl BioMart also provides APIs for programmatic access. Again, there are several options including: (1) The BioMart Perl API; (2) BioMart RESTful access (via Perl and wget); and (3) The BiomaRt Bioconductor R package. The first two options are convenient because for any query you have configured in the BioMart website, you may simply select the Perl or XML buttons and you will have all of the code needed to execute a Perl API or RESTful API request via the command line. However, for those working in R or with less linux/Perl experience, the R Bioconductor may be preferred. We will demonstrate this final option here.

For illustration, we will recreate the Gene ID Mapping example from above. In RStudio or at an R prompt, execute the following commands:

# Install and load the BioMart library
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("biomaRt", version = "3.8")
library("biomaRt")

# Output all available databases for use from BioMart
databases <- listEnsembl()
databases

# Output all available datasets for the "Ensembl Genes" database
ensembl <- useEnsembl(biomart="ensembl")
datasets <- listDatasets(ensembl)
datasets

# Connect to the live Ensembl Gene Human dataset
ensembl <- useEnsembl(biomart="ensembl", dataset="hsapiens_gene_ensembl")

# Output all attributes to see which are available and how they are named
attributes <- listAttributes(ensembl)
attributes

# Output all filters to see which are available and how they are named
filters <- listFilters(ensembl)
filters

# Get Ensembl gene records and relevant attributes for a list of HGNC symbols
hgnc_symbols <- c("NEUROD2", "PPP1R1B", "STARD3", "TCAP", "PNMT", "PGAP3", "ERBB2", "MIR4728", "MIEN1", "GRB7", "IKZF3")
attributes <- c("ensembl_gene_id","chromosome_name","start_position","end_position","external_gene_name","hgnc_symbol")
annotations_ENSG <- getBM(attributes=attributes, filter="hgnc_symbol", values=hgnc_symbols, mart=ensembl)
annotations_ENSG

Note that the output is identical to what we retrieved earlier from the BioMart web interface. This is just a simple illustration of how (arbitrarily complex) Ensembl BioMart queries can be incorporated into an R script for analysis, visualization and interpretation.

Genome Browsing and Visualization - IGV | Griffith Lab

Genomic Visualization and Interpretations

Genome Browsing and Visualization - IGV

It is often necessary to examine sequencing data aligned to specific regions of the genome in order to obtain a clearer picture of genomic events. One of the most popular tools for this is the Integrative Genomics Viewer. After this lab you should be able to perform the following tasks:

  1. Visualize a variety of genomic data
  2. Quickly navigate around the genome
  3. Visualize read alignments
  4. Validate SNP/SNV calls and structural re-arrangements by eye

Installing IGV

Java is necessary to run IGV, you can download the java runtime environment (JRE) for your operating system here. To determine if this step is necessary type java -version at a command prompt, if the program is not >= 1.7 you’ll need to upgrade it. IGV can be downloaded here. This tutorial will make use of IGV version 2.3 or later, we strongly recommend that you upgrade IGV if you have an older version installed.

Data Set for IGV

We will be using publicly available Illumina sequence data from the HCC1143 cell line. The HCC1143 cell line was generated from a 52 year old caucasian woman with breast cancer. Additional information on this cell line can be found here: (tumor, TNM stage IIA, grade 3, primary ductal carcinoma) and HCC1143/BL (matched normal EBV transformed lymphoblast cell line). Reads within these cell lines have been filtered to Chromosome 21: 19,000,000-20,000,000 in order to reduce file sizes.

Visualization Part 1: Getting familiar with IGV

We will be visualizing read alignments using IGV, a popular visualization tool for HTS data.

First, lets familiarize ourselves with it.

Load a Genome and some Data Tracks

By default, IGV loads Human (hg19). If you work with another version of the human genome, or another organism altogether, you can change the genome by clicking the drop down menu in the upper-left. For this lab, we will be using Human (hg19).

We will also load additional tracks from the IGV Server using (File -> Load from Server...):

Load hg19 genome and additional data tracks

You should see a listing of chromosomes for this reference genome. Choose 1, for chromosome 1.

Chromosome chooser

Navigate to chr1:10,000-11,000 by entering this into the location field (in the top-left corner of the interface) and clicking Go. This shows a window of chromosome 1 that is 1,000 base pairs wide and beginning at position 10,000.

Navigition using Location text field. Sequence displayed as thin coloured rectangles.

IGV displays the sequence of letters in a genome as a sequence of colours (e.g. A = green, C = blue, etc.). This makes repetitive sequences, like the ones found at the start of this region, easy to identify. Zoom in a bit more using the + button (top right) to see the individual bases of the reference genome sequence.

You can navigate to a gene of interest by typing it into the same box that the genomic coordinates are in and pressing Enter/Return. Try it for your favourite gene, or BRCA1 if you can not decide.

Gene model

Genes are represented as lines and boxes. Lines represent intronic regions, and boxes represent exonic regions. The arrows indicate the direction/strand of transcription for the gene. When an exon box become narrower in height, this indicates a UTR.

When loaded, tracks are stacked on top of each other. You can identify which track is which by consulting the label to the left of each track.

Region Lists

Sometimes, it is useful to save where you are, or to load regions of interest. For this purpose, there is a Region Navigator in IGV. To access it, click Regions > Region Navigator. While you browse around the genome, you can save some bookmarks by pressing the Add button at any time.

Bookmarks in IGV

Loading Read Alignments

We will be using the breast cancer cell line HCC1143 to visualize alignments. For speed, only a small portion of chr21 will be loaded (19M:20M).

HCC1143 Alignments to hg19:

Copy the files to your local drive, and in IGV choose File > Load from File..., select the bam file, and click OK. Note that the bam and index files must be in the same directory for IGV to load these properly. Alternatively, you can copy the link location and load File > Load from URL....

Load BAM track from File

Visualizing read alignments

Navigate to a narrow window on chromosome 21: chr21:19,480,041-19,480,386.

To start our exploration, right click on the track-name, and select the following options:

Experiment with the various settings by right clicking the read alignment track and toggling the options. Think about which would be best for specific tasks (e.g. quality control, SNP calling, CNV finding).

Changing how read alignments are sorted, grouped, and colored

You will see reads represented by grey or white bars stacked on top of each other, where they were aligned to the reference genome. The reads are pointed to indicate their orientation (i.e. the strand on which they are mapped). Mouse over any read and notice that a lot of information is available. To toggle read display from hover to click, select the yellow box and change the setting.

Changing how read information is shown (i.e. on hover, click, never)

Once you select a read, you will learn what many of these metrics mean, and how to use them to assess the quality of your datasets. At each base that the read sequence mismatches the reference, the colour of the base represents the letter that exists in the read (using the same colour legend used for displaying the reference).

Viewing read information for a single aligned read

Visualization Part 2: Inspecting SNPs, SNVs, and SVs

In this section we will be looking in detail at 8 positions in the genome, and determining whether they represent real events or artifacts.

Two neighbouring SNPs

Example1. Good quality SNVs/SNPs

Notes:

What does Shade base by quality do?

This will change the opacity of the base in IGV based on how confident the sequencer was in calling that base using the phred score. This is beneficial in determining if a called variant is real or artifactual.

How does Color by read strand help?

Coloring by read strand will indicate if the DNA fragment sequenced was on the positive or negative strand. A variant occurring on only one strand could indicate an artifact.

Homopolymer region with indel

Navigate to position chr21:19,518,412-19,518,497

Example 2a

Example 2b

Notes:

Coverage by GC

Navigate to position chr21:19,611,925-19,631,555. Note that the range contains areas where coverage drops to zero in a few places.

Example 3

Why are there blue and red reads throughout the alignments?

The reads are colored by insert size, in paired data a blue read indicates the insert size is smaller than expected indicating a deletion. Conversely a red read indicates the insert size is larger than expected indicating an insertion.

Heterozygous SNPs on different alleles

Navigate to region chr21:19,666,833-19,667,007

Example 4

Note:

Low mapping quality

Navigate to region chr21:19,800,320-19,818,162

Load repeats

Example 5

Notes:

Homozygous deletion

Navigate to region chr21:19,324,469-19,331,468

Example 6

Notes:

Mis-alignment

Navigate to region chr21:19,102,154-19,103,108

Example 7

Notes:

Translocation

Navigate to region chr21:19,089,694-19,095,362

Example 8

Notes:

A Standard Operating Procedure (SOP) for Manual Review of Variants

As illustrated above, manual review of sequence alignments is often necessary to confirm/validate called variants. Despite the widespread usage of manual review, published studies rarely document the procedure and best practices for this review. To address this, Barnell et al published a “Standard operating procedure for somatic variant refinement of sequencing data with paired tumor and normal samples”. This SOP documents methods and examples of how to categorize variants into four different classes (Somatic, Germline, Ambiguous or Fail) with 19 tags for common sources of sequence or analysis artifacts. This SOP supports and enhances manual somatic variant detection by improving reviewer accuracy while reducing the inter-reviewer variability for variant calling and annotation.

Visualization Part 3: Automating Tasks in IGV

We can use the Tools menu to invoke running a batch script. Batch scripts are described on the IGV website:

Download the batch script and the attribute file for our dataset:

Now run the file from the Tools menu:

Automation

Notes: