My thoughts and ideas
A common analysis task is to convert genomic coordinates between different assemblies. Probably the most common situation is that you have some coordinates for a particular version of a reference genome and you want to determine the corresponding coordinates on a different version of the reference genome for that species. For example, you have a BED file with exon coordinates for human build GRCh37 (hg19) and wish to update to GRCh38. Many resources exist for performing this and other related tasks. In this section we will go over a few tools to perform this type of analysis; in many cases these tools can be used interchangeably. This post is inspired by this BioStars post (also created by the authors of this workshop).
First let’s go over what a reference assembly actually is. In essence it’s just a representation of the nucleotide sequence from a cohort. These assemblies allow for a shortcut when mapping reads, as the reads can be mapped to the assembly, rather than to each other, to piece the genome of an individual together. This has a number of benefits, the most obvious of which is that it is far more efficient than attempting to build a genome from scratch. By its very nature, however, this approach means there is no perfect reference assembly for an individual, due to polymorphism (e.g., SNPs, HLA type, etc.). Further, due to the presence of repetitive structural elements such as duplications, inverted repeats, and tandem repeats, a given assembly is almost always incomplete and is constantly being improved upon. This leads to the publication of new assembly versions every so often, such as GRCh37 (Feb. 2009) and GRCh38 (Dec. 2013). It is also good to be aware that different organizations can publish different reference assemblies. For example, GRCh37 (NCBI) and hg19 (UCSC) are identical save for a few minor differences, such as in the mitochondrial sequence and the naming of chromosomes (1 vs chr1). For a nice summary of genome versions and their release names refer to the Assembly Releases and Versions FAQ.
There are many resources available to convert coordinates from one assembly to another. We will go over a few of these; a more complete list is given below. The UCSC liftOver tool is one of the more popular, but choosing among them will mostly come down to personal preference.
UCSC liftOver: This tool is available through a simple web interface or it can be downloaded as a standalone executable. To use the executable you will also need to download the appropriate chain file. Each chain file describes conversions between a pair of genome assemblies. liftOver can be used through Galaxy as well. There is also a Python implementation of liftOver called pyliftover that converts point coordinates only.
NCBI Remap: This tool is conceptually similar to liftOver in that it manages conversions between a pair of genome assemblies, but it uses different methods to achieve these mappings. It is also available through a simple web interface or you can use the API for NCBI Remap.
The Ensembl API: A related task, converting between coordinate systems within a single genome assembly, can be accomplished with the Ensembl core API. Many examples are provided within the installation, overview, tutorial and documentation sections of the Ensembl API project. In particular, refer to these sections of the tutorial: ‘Coordinates’, ‘Coordinate systems’, ‘Transform’, and ‘Transfer’.
Assembly Converter: Ensembl also offers their own simple web interface for coordinate conversions called the Assembly Converter.
CrossMap: A standalone open source program for convenient conversion of genome coordinates (or annotation files) between different assemblies. It supports most commonly used file formats including SAM/BAM, Wiggle/BigWig, BED, GFF/GTF, and VCF. CrossMap is designed to lift over genome coordinates between assemblies. It is not a program for aligning sequences to a reference genome, and it is not recommended for converting genome coordinates between species.
Flo: A liftover pipeline for different reference genome builds of the same species. It describes the process as follows: “align the new assembly with the old one, process the alignment data to define how a coordinate or coordinate range on the old assembly should be transformed to the new assembly, transform the coordinates.”
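The idea shared by these tools (define how old coordinates correspond to new ones, then transform them) can be sketched in a few lines of plain Python. This is a toy illustration, not how liftOver, CrossMap or Flo are actually implemented: the block table `BLOCKS` and the function `lift_point` are hypothetical names, and the offsets are invented.

```python
from bisect import bisect_right

# Hypothetical aligned blocks between an "old" and a "new" assembly:
# (old_start, new_start, length), 0-based, sorted by old_start.
# The numbers are invented for illustration only.
BLOCKS = [
    (0, 0, 1000),       # identical stretch
    (1000, 1200, 500),  # 200 bp were inserted in the new assembly before this block
    (1600, 1700, 400),  # old positions 1500-1599 have no counterpart
]


def lift_point(pos, blocks=BLOCKS):
    """Return the new-assembly position for pos, or None if pos is unmapped."""
    starts = [block[0] for block in blocks]
    i = bisect_right(starts, pos) - 1  # last block starting at or before pos
    if i < 0:
        return None
    old_start, new_start, length = blocks[i]
    if pos >= old_start + length:  # pos falls in the gap after this block
        return None
    return new_start + (pos - old_start)


print(lift_point(1000))  # first base of the second block
print(lift_point(1550))  # inside a gap, so unmapped
```

Positions that fall between blocks come back as unmapped, which is the same situation the real tools report as failed conversions.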
Sex linkage was first discovered by Thomas Hunt Morgan in 1910 when he observed that the eye color of Drosophila melanogaster did not follow typical Mendelian inheritance. This was discovered to be caused by the white gene located on chromosome X at coordinates 2684762-2687041 for assembly dm3. Let’s use UCSC liftOver to determine where this gene is located on the latest reference assembly for this species, dm6. First navigate to the liftOver site at https://genome.ucsc.edu/cgi-bin/hgLiftOver and set both the original and new genomes to the appropriate species, “D. melanogaster”.
We want to transfer our coordinates from the dm3 assembly to the dm6 assembly so let’s make sure the original and new assemblies are set appropriately as well.
Finally we can paste our coordinates to transfer or upload them in bed format.
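If you choose to upload a BED file, remember that BED uses 0-based, half-open intervals, so the start of a 1-based position string has to be shifted by one. A minimal Python sketch of that conversion follows. The helper name `region_to_bed` is hypothetical, and it assumes the coordinates quoted above are 1-based and inclusive and that the chromosome is named chrX in UCSC style.

```python
def region_to_bed(region, name="."):
    """Convert 'chrom:start-end' (1-based, inclusive) to a tab-separated BED line."""
    chrom, span = region.split(":")
    start, end = (int(part.replace(",", "")) for part in span.split("-"))
    if start < 1 or end < start:
        raise ValueError(f"invalid region: {region}")
    # BED start is 0-based; BED end is exclusive, so it equals the 1-based end
    return f"{chrom}\t{start - 1}\t{end}\t{name}"


print(region_to_bed("chrX:2684762-2687041", name="w"))
```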
The page will refresh and a results section will appear where we can download the transferred coordinates in BED format.
Try comparing the old and new coordinates in the UCSC genome browser for their respective assemblies. Do they match the same gene?
Yes, both coordinates match the coding sequence for the w gene from transcript CG2759-RA
In another situation you may have coordinates of a gene and wish to determine the corresponding coordinates in another species. This is a common situation in evolutionary biology where you will need to find coordinates for a conserved gene across species to perform a phylogenetic analysis. Let’s use the rtracklayer package from Bioconductor to find the coordinates of the H3F3A gene, located at chr1:226061851-226071523 on the hg38 human assembly, in the canFam3 assembly of the canine genome. To start, install the rtracklayer package from Bioconductor; it provides an R implementation of the UCSC liftOver.
    # install and load rtracklayer
    # if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager")
    # BiocManager::install("rtracklayer")
    library("rtracklayer")
The function we will be using from this package is liftOver(), which takes two arguments as input. The first of these is a GRanges object specifying the coordinates to perform the query on. This class is from the GenomicRanges package maintained by Bioconductor and was loaded automatically when we loaded the rtracklayer library. The second item we need is a chain file, a format that describes pairwise alignments between sequences, allowing for gaps. The UCSC website maintains a selection of these on its genome data page. Navigate to this page and select “liftOver files” under the hg38 human genome, then download and extract the “hg38ToCanFam3.over.chain.gz” chain file.
Next all we need to do is to create our GRanges object to contain the coordinates chr1:226061851-226071523 and import our chain file with the function import.chain(). We can then supply these two parameters to liftOver().
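To make the chain format less abstract, here is a minimal Python sketch that reads chain-format text into ungapped blocks. It follows the UCSC layout (a `chain` header line, then `size dt dq` lines, ending with a single `size` line), but the chain in `TOY_CHAIN` is invented, `parse_chain` is a hypothetical helper, and reverse-strand coordinates are recorded without being interpreted.

```python
TOY_CHAIN = """\
chain 1000 chrA 5000 + 100 1000 chrB 6000 + 300 1450 1
400 100 50
200 0 300
200
"""


def parse_chain(text):
    """Parse chain-format text into chains holding (t_start, q_start, size) blocks."""
    chains, current = [], None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "chain":
            # chain score tName tSize tStrand tStart tEnd qName qSize qStrand qStart qEnd id
            current = {
                "t_name": fields[2], "q_name": fields[7], "q_strand": fields[9],
                "t_pos": int(fields[5]), "q_pos": int(fields[10]), "blocks": [],
            }
            chains.append(current)
        else:
            size = int(fields[0])
            current["blocks"].append((current["t_pos"], current["q_pos"], size))
            current["t_pos"] += size
            current["q_pos"] += size
            if len(fields) == 3:  # gap before the next block: dt (target), dq (query)
                current["t_pos"] += int(fields[1])
                current["q_pos"] += int(fields[2])
    return chains


for block in parse_chain(TOY_CHAIN)[0]["blocks"]:
    print(block)
```

In a liftOver chain file the target columns refer to the assembly you are converting from and the query columns to the assembly you are converting to.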
    # specify coordinates to liftover
    grObject <- GRanges(seqnames=c("chr1"), ranges=IRanges(start=226061851, end=226071523))

    # import the chain file
    chainObject <- import.chain("hg38ToCanFam3.over.chain")

    # run liftOver
    results <- as.data.frame(liftOver(grObject, chainObject))
How many different regions in the canine genome match the human region we specified?
210; these are the ranges mapped for the corresponding input element.
Try to perform the same task we just completed with the web version of liftOver. How are the results different?
Both methods provide the same overall range. However, the rtracklayer result is not simplified and contains multiple ranges corresponding to the alignment blocks in the chain file.
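The relationship between many small mapped ranges and one overall range can be sketched in plain Python. The ranges below are invented, not the real liftOver output, and `overall_span` is a hypothetical helper that simply takes the minimum start and maximum end per chromosome.

```python
def overall_span(ranges):
    """Collapse (chrom, start, end) ranges into one min-to-max span per chromosome."""
    spans = {}
    for chrom, start, end in ranges:
        lo, hi = spans.get(chrom, (start, end))
        spans[chrom] = (min(lo, start), max(hi, end))
    return spans


# Invented example ranges standing in for the many pieces a liftOver call may return
pieces = [("chr7", 100, 250), ("chr7", 300, 420), ("chr7", 90, 120)]
print(overall_span(pieces))
```

Collapsing to a single span hides the gaps between pieces, so keep the individual ranges when those gaps matter for your analysis.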
Ensembl BioMart is a powerful web tool (with API) for performing complex querying and filtering of the various Ensembl databases (Ensembl Genes, Mouse Strains, Ensembl Variation, and Ensembl Regulation). It is often used for ID mapping and feature extraction. Almost any data that is viewable in the Ensembl genome browser can be accessed systematically from BioMart.
To browse Ensembl BioMart you can navigate to the main page directly, find it with a Google search, or follow the BioMart link from any Ensembl page.
To do anything, you will first need to select a database from the CHOOSE DATABASE menu. Currently, this can be any of: Ensembl Genes, Mouse Strains, Ensembl Variation, or Ensembl Regulation. Note, the database version is included in the name (e.g., Ensembl Genes 90). The Ensembl Genes dataset is probably the most commonly desired choice, depending on your purpose. Next, you will select a dataset from the CHOOSE DATASET menu, which for Genes means selecting a species (e.g., Human genes (GRCh38.p10)). For illustration, we will walk through some examples using the Ensembl Genes 90 database and Human genes dataset. Once a database and dataset have been selected, you have the option to apply Filters and desired Attributes from the left panel. Note that two attributes (Gene stable ID and Transcript stable ID) have been pre-selected for us.
If we were to select Results now we would get the complete Gene and Transcript stable IDs for all genes in the Human Ensembl Genes (v90) database (see below). By default, 10 rows of results in HTML format (including hyperlinks to Ensembl pages) are shown. Selecting Count would give us a numerical summary of all records. In this case there are 63,967 results (Ensembl genes) available. We have the option to view or export in HTML or plain text (TSV or CSV) and also to export as XLS. Depending on what attributes are selected for display, there may be redundant results. For example, if we ask for only HGNC symbol as an attribute, BioMart will return the symbol for every Ensembl gene, and some Ensembl genes have the same HGNC symbol. Selecting Unique results only will filter out any redundant result rows from the display or exported file. Selecting New would reset our BioMart query.
The Filters option (left panel) allows for very powerful filtering. Again, using the Human Genes database/dataset as an example, this allows filtering of human Ensembl genes (and their attributes) by: Region, Gene, Phenotype, Gene Ontology, Multi Species Comparisons, Protein Domains and Families, and Variants. For example, suppose that we want to determine all of the genes on chromosome 17 between positions 39,600,000 and 39,800,000. We can specify a filter with Chromosome/scaffold=17, Coordinates Start=39600000 and End=39800000. For Attributes we might specify Gene stable ID, HGNC symbol, Chromosome/scaffold name, Gene start (bp), and Gene end (bp). The last three will help us verify that BioMart is returning genes in the requested region.
By selecting Count and then Results, limiting to Unique results only and increasing the View to 20, we can see that there are 12 Ensembl genes in this region, including 11 with HGNC symbols. This region includes the genes commonly amplified in HER2+ (ERBB2+) breast cancer. Note that one gene (IKZF3) starts within the filter region but its Gene end extends beyond it. This tells us something important about how BioMart applies coordinate filtering. If we wanted only genes entirely contained within the region we would need to apply our own additional filters.
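The difference between the overlap-style filtering seen here and a stricter "entirely contained" filter is easy to apply to exported results. A minimal Python sketch follows; the gene names and coordinates in `GENES` are hypothetical placeholders, not real BioMart output.

```python
REGION = (39_600_000, 39_800_000)  # the filter region used above

# Hypothetical (name, gene_start, gene_end) records for illustration only
GENES = [
    ("GENE_A", 39_610_000, 39_650_000),  # fully inside the region
    ("GENE_B", 39_590_000, 39_605_000),  # starts before the region
    ("GENE_C", 39_760_000, 39_870_000),  # ends after the region
    ("GENE_D", 39_900_000, 39_950_000),  # entirely outside
]


def overlaps(start, end, lo, hi):
    """True if [start, end] shares at least one base with [lo, hi]."""
    return start <= hi and end >= lo


def contained(start, end, lo, hi):
    """True if [start, end] lies entirely within [lo, hi]."""
    return start >= lo and end <= hi


overlapping = [name for name, s, e in GENES if overlaps(s, e, *REGION)]
inside = [name for name, s, e in GENES if contained(s, e, *REGION)]
print(overlapping)
print(inside)
```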
Another very common use of BioMart is for gene ID mapping. In the Gene section of Filters it is possible to limit to only Ensembl genes that have at least one associated external gene ID or microarray probe(set) from a specific source. A very large number of such external gene ID sources (e.g., CCDS, Entrez, GO, HGNC, LRG, RefSeq, Unigene, etc.) or microarray platforms (e.g., Affymetrix, Agilent, Illumina) are available. It is also possible to input a list of individual external IDs or probe(set) IDs of interest and get back the matching Ensembl gene records. For example, let’s suppose that we had the list of HGNC gene symbols for the HER2-amplified region mentioned above (NEUROD2, PPP1R1B, STARD3, TCAP, PNMT, PGAP3, ERBB2, MIR4728, MIEN1, GRB7, IKZF3) and we wanted to know the corresponding Ensembl gene records and relevant attributes. Create a New query and again select the Ensembl Genes 90 database and Human genes dataset. Go to Filters -> GENE, check Input external references ID list, select HGNC symbol(s) and enter the above genes in the box (one per line). Now go to Attributes -> Features and choose GENE: Ensembl - Gene stable ID, Chromosome, Gene start, Gene end, Gene name; EXTERNAL: External References - HGNC symbol; Select Results. We now have a direct mapping of HGNC symbols to Ensembl Gene IDs and associated attributes.
Note: It is important to remember that this procedure for mapping only works for genes that are represented as Ensembl gene annotations and is also dependent on their internal mapping between identifiers. It is possible that a valid protein might exist in another system (e.g., UniProt) and might have a valid link to another system (e.g., Entrez Gene) according to some (Entrez or UniProt) mapping process. But, if this gene was not successfully annotated in Ensembl or not successfully linked to both of these identifiers then it would be impossible to use BioMart to map from one to the other. Gene ID mapping is complex with multiple types of underlying analysis (methods for sequence or coordinate comparison) to determine equivalence and there may be one-to-many or many-to-many relationships. Ensembl and BioMart provide a valuable tool for dealing with this problem. However edge cases exist where it may not suit your purpose. It is always a good idea to determine which genes have failed to map and determine whether this is acceptable to you.
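One way to check which genes failed to map, and which map to more than one record, is to compare your input list against the exported table. The Python sketch below assumes the results have been reduced to (query ID, mapped ID) pairs; `audit_mapping` is a hypothetical helper and the IDs are placeholders, not real identifiers.

```python
from collections import defaultdict


def audit_mapping(query_ids, rows):
    """Return (unmapped query IDs, {query ID: [mapped IDs]} for one-to-many cases)."""
    mapped = defaultdict(set)
    for query_id, mapped_id in rows:
        if mapped_id:  # skip rows where the mapped column is empty
            mapped[query_id].add(mapped_id)
    unmapped = [q for q in query_ids if q not in mapped]
    one_to_many = {q: sorted(ids) for q, ids in mapped.items() if len(ids) > 1}
    return unmapped, one_to_many


# Placeholder symbols and IDs for illustration only
queries = ["SYM1", "SYM2", "SYM3"]
rows = [("SYM1", "ID0001"), ("SYM2", "ID0002"), ("SYM2", "ID0003")]
unmapped, one_to_many = audit_mapping(queries, rows)
print(unmapped)
print(one_to_many)
```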
Also note: In the above query we asked for both the Ensembl Gene name and HGNC symbol. In many cases these are the same but not always. In some cases where an HGNC symbol has not yet (or recently) been assigned, Ensembl may choose another source or convention for its Gene name.
Another powerful application of BioMart is for the retrieval of Sequence attributes for specific genes or transcripts. Suppose that we would like all peptide sequences for protein-coding transcripts of TP53. Create a New query and again select the Ensembl Genes 90 database and Human genes dataset. Go to Filters -> GENE, check Input external references ID list, select HGNC symbol(s) and enter TP53 in the box. Also, select Transcript type = protein_coding. Now go to Attributes -> Sequences and choose SEQUENCES: Sequences - Peptide; HEADER INFORMATION: Gene Information - Gene stable ID, Gene name, UniProtKB/Swiss-Prot ID; Transcript Information - Transcript stable ID, Protein stable ID, Transcript type; Select Results.
Here (above) we see peptide sequences in fasta format for all protein-coding Ensembl transcripts. Where available, the UniProt ID is listed along with Ensembl gene name, and gene, transcript, and protein ids. Download the fasta file by selecting Export all results, File, FASTA, Unique results only, and Go. Open the file in a text editor/viewer. Note that for some transcripts the peptides do not start with a methionine (e.g., ENST00000576024) or end with a stop codon (e.g., ENST00000503591). In some cases these transcripts are predictions from the Ensembl automated annotation pipeline and have limited biological evidence or support.
How many transcripts are there (in Human Ensembl v90) for the gene TP53 that are: (A) protein-coding; (B) supported by at least one non-suspect mRNA (Transcript Support Level = TSL:1); but (C) have peptide sequences that still do not have a proper start or stop codon?
Get a hint!
Sometimes, two queries are required to get a specific answer from BioMart. In this case you have been asked to limit to only transcripts with a certain Transcript support level (TSL). Unfortunately, this is not available under Filters. However, it is available under Attributes. Also note, Ensembl Transcript IDs can be used as a filter with the Filters -> GENE -> Input external references ID list.
There is one TP53 transcript with TSL1 that does not have a start codon (ENST00000576024) and another that does not have a stop codon (ENST00000514944).
Create a New query and select the Ensembl Genes 90 database and Human genes dataset. Go to Filters -> GENE, check Input external references ID list, select HGNC symbol(s) and enter TP53 in the box. Also, select Transcript type = protein_coding. Select Attributes -> Features and choose GENE: Ensembl - Transcript stable ID and Transcript support level (TSL). Export the results to file (e.g., XLS), open in Excel (or similar), sort on TSL, and extract the Ensembl Transcript IDs for all tsl1 transcripts. Start a new query. Go to Filters -> GENE, check Input external references ID list, select Transcript stable ID(s) and enter the list of Ensembl transcripts from above in the box. Now go to Attributes -> Sequences and choose SEQUENCES: Sequences - Peptide; HEADER INFORMATION: Gene Information - Gene stable ID, Gene name, UniProtKB/Swiss-Prot ID; Transcript Information - Transcript stable ID, Protein stable ID, Transcript type; Select Results. Look for peptide sequences that do not start with M or end with a stop.
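The final inspection step can also be scripted instead of done by eye. The Python sketch below parses FASTA text and flags peptides that do not start with M or do not end with a stop symbol. It assumes the exported peptides mark the stop codon with a trailing `*`; the records in `TOY_FASTA` are invented and `flag_incomplete` is a hypothetical helper.

```python
TOY_FASTA = """\
>tx1
MEEPQ
SDPSV*
>tx2
EEPQSDPSV*
>tx3
MEEPQSD
"""


def read_fasta(text):
    """Return {header: sequence} from FASTA-formatted text."""
    records, header, chunks = {}, None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records


def flag_incomplete(records):
    """Return {header: [problems]} for peptides lacking a start M or a stop '*'."""
    flags = {}
    for header, seq in records.items():
        problems = []
        if not seq.startswith("M"):
            problems.append("no start")
        if not seq.endswith("*"):
            problems.append("no stop")
        if problems:
            flags[header] = problems
    return flags


print(flag_incomplete(read_fasta(TOY_FASTA)))
```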
By default, Ensembl BioMart only presents data for the latest, most current version of Ensembl. Older versions can be accessed by navigating to an Ensembl Archive Site (linked from the bottom right of every Ensembl page) and then following the BioMart link (top left of every Ensembl Archive page). For example, the last version of Ensembl for the human GRCh37 (hg19) build was v75 (February 2014). The Archive EnsEMBL release 75 was available at http://feb2014.archive.ensembl.org and the corresponding BioMart at http://feb2014.archive.ensembl.org/biomart/martview.
In some cases it may be desirable to obtain data from Ensembl programmatically. This can be done in several ways. First, entire databases can be downloaded from the Ensembl FTP site in a variety of formats, from flat files to MySQL dumps. Second, Ensembl provides direct access to their databases via APIs. There are two main options: (1) the Ensembl Perl API; and (2) the Ensembl REST API. The Perl API has a longer history of use, supports many legacy scripts, and might be more comprehensive in terms of the number and complexity of queries it enables. It also supports any database version currently hosted on the web or locally installed. The REST API is more modern and allows you to access Ensembl data using any programming language. However, it appears to support only the most current database version (and version 75 for the human GRCh37 assembly). Finally, Ensembl BioMart also provides APIs for programmatic access. Again, there are several options including: (1) the BioMart Perl API; (2) BioMart RESTful access (via Perl and wget); and (3) the biomaRt Bioconductor R package. The first two options are convenient because for any query you have configured in the BioMart website, you may simply select the Perl or XML buttons and you will have all of the code needed to execute a Perl API or RESTful API request via the command line. However, for those working in R or with less Linux/Perl experience, the biomaRt Bioconductor package may be preferred. We will demonstrate this final option here.
For illustration, we will recreate the Gene ID Mapping example from above. In RStudio or at an R prompt, execute the following commands:
    # Load the BioMart library
    library("biomaRt")

    # Output all available databases for use from BioMart
    databases <- listEnsembl()
    databases

    # Output all available datasets for the "Ensembl Genes" database
    ensembl <- useEnsembl(biomart="ensembl")
    datasets <- listDatasets(ensembl)
    datasets

    # Connect to the live Ensembl Gene Human dataset
    ensembl <- useEnsembl(biomart="ensembl", dataset="hsapiens_gene_ensembl")

    # Output all attributes to see which are available and how they are named
    attributes <- listAttributes(ensembl)
    attributes

    # Output all filters to see which are available and how they are named
    filters <- listFilters(ensembl)
    filters

    # Get Ensembl gene records and relevant attributes for a list of HGNC symbols
    hgnc_symbols <- c("NEUROD2", "PPP1R1B", "STARD3", "TCAP", "PNMT", "PGAP3", "ERBB2", "MIR4728", "MIEN1", "GRB7", "IKZF3")
    annotations_ENSG <- getBM(attributes=c("ensembl_gene_id", "chromosome_name", "start_position", "end_position", "external_gene_name", "hgnc_symbol"),
                              filters="hgnc_symbol", values=hgnc_symbols, mart=ensembl)
    annotations_ENSG
Note that the output is identical to what we retrieved earlier from the BioMart web interface. This is just a simple illustration of how (arbitrarily complex) Ensembl BioMart queries can be incorporated into an R script for analysis, visualization and interpretation.
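For readers who prefer the XML route mentioned earlier, the same kind of query can be assembled without R. The Python sketch below only builds the XML and a request URL; it does not contact Ensembl. Treat the details as assumptions to verify against the XML button on the BioMart site: the `martservice` URL, the `Query` attributes chosen here, and the helper name `build_biomart_query` are not taken from this post.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Assumed endpoint; confirm it against the current BioMart documentation
MARTSERVICE = "https://www.ensembl.org/biomart/martservice"


def build_biomart_query(dataset, filters, attributes):
    """Build a BioMart-style XML query string (assumed layout, not verified here)."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default", "formatter": "TSV",
        "header": "0", "uniqueRows": "1", "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    for name, values in filters.items():
        # multiple filter values are passed as one comma-separated string
        ET.SubElement(ds, "Filter", {"name": name, "value": ",".join(values)})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


QUERY_XML = build_biomart_query(
    "hsapiens_gene_ensembl",
    {"hgnc_symbol": ["ERBB2", "GRB7"]},
    ["ensembl_gene_id", "hgnc_symbol"],
)
URL = MARTSERVICE + "?" + urlencode({"query": QUERY_XML})
print(QUERY_XML)
```

The dataset, filter and attribute names reuse those from the R example, so the two approaches can be compared side by side.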