author: MRC Clinical Sciences Centre date: http://mrccsc.github.io/training.html autosize: true author: "MRC CSC Bioinformatics Core Team" date:http://mrccsc.github.io/training.html width: 1440 height: 1100 autosize: true font-import: font-family: 'Slabo 27px', serif; css:style.css
- The Course
- Importance of Visualising Genomics Data.
- Reminder of file types
- Reminder of data types
- Materials
- Visualising genomics data in R
- Plotting genome axis
- Plotting genome data
- Plotting genome annotation
- Plotting genome sequence
- Plotting genomic alignments
- Plotting from external databases
id: vizdata
It is an essential step in genomics data analysis to visualise your data. This allows you to review data for both known or unexpected data characteristics and potential artefacts.
While we have discussed using IGV to review genomics data, now we will discuss how to do this while still working with in the R.
id: vizdataR
In complement to our IGV genome browser course where we reviewed visualising genomics data in a browser, here we will use R/Bioconductor to produce publication quality graphics programatically.
Much of the material will require some familiarity with R and Bioconductor (you can revisit our courses on those here) and these will be used in tight conjunction with tools introduced today such as the Bioconductor package, Gviz.
id: filetypes
In this session we will be dealing with a range of data types. For more information on file types you can revisit our material.
For more information on visualising genomics data in browsers you can visit our IGV course.
- IGV.
id: datatypes
We will also encounter and make use of many data structures and data types which we have seen throughout our courses on HTS data. You can revisit this material to refresh on HTS data analysis in Bioconductor and R below.
id: materials
All material for this course can be found on github.
Or can be downloaded as a zip archive from here.
Once the zip file in unarchived. All presentations as HTML slides and pages, their R code and HTML practical sheets will be available in the directories underneath.
- presentations/ Presentations as an HTML slide show.
- presentations/exercises/ Some tasks/examples to work through.
All data to run code in the presentations and in the practicals is available in the zip archive. This includes coverage as bigWig files, aligned reads as BAM files and genomic intervals stored as BED files.
All data can be found under the Data directory
Data/
We also include some RData files containing precompiled results from querying database (in case of external server downtime). All RData files can be found in the RData directory
RData/
Before running any of the code in the practicals or slides we need to set the working directory to the folder we unarchived.
You may navigate to the unarchived VisualisingGenomicsData folder in the Rstudio menu
Session -> Set Working Directory -> Choose Directory
or in the console.
setwd("/PathToMyDownload/VisualisingGenomicsData/")
# e.g. setwd("~/Downloads/VisualisingGenomicsData")
Genomics data can often be visualised in genome browsers such as the user friendly IGV genome browser.
This allows for the visualisation of our processed data in its genomic context.
In Genomics (and most likely any Omics), it is important to review our data/results and hypotheses in a browser to identify patterns or potential artefacts discovered or missed within our analysis.
We have already discussed on using the IGV browser to review our data and get access to online data repositories.
IGV is quick, user friendly GUI to perform the essential task of review genomics data in its context.
For more information see our course on IGV here.
Using a genome browser to review sites of interest across the genome is a critical first step.
Using processed and often indexed genomics data files, IGV offers a method to rapidly interrogate genomics data along the linear genome.
IGV does its job well and should always be an immediate early step in data review. By being good at this however it does not offer the flexibility in displaying data we wish to achieve, more so when expecting to review a large number of sites.
id:VizinR
The Gviz packages offers methods to produce publication quality plots of genomics data at genomic features of interest.
To get started using Gviz in some biological examples, first we need to install the package.
## try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
biocLite("Gviz")
id: genomeaxis
Gviz provides methods to plot many genomics data types (as with IGV) over genomic features and genomic annotation within a linear genomic reference.
The first thing we can do then is set up our linear axis representing positions on genomes.
For this we use our first function from Gviz, GenomeAxisTrack(). Here we use the name parameter to set the name to be "myAxis".
library(Gviz)
genomeAxis <- GenomeAxisTrack(name="MyAxis")
genomeAxis
Genome axis 'MyAxis'
Now we have created a GenomeAxisTrack track object we can display the object using plotTracks function.
In order to display a axis track we need to set the limits of the plot (otherwise where would it start and end?).
plotTracks(genomeAxis,from=100,to=10100)
It is fairly straightforward to create and render this axis. Gviz offers a high degree of flexibility in the way these tracks can be plotted with some very useful plotting configurations included.
A useful feature is to add some information on the direction of the linear genome represented in this GenomeAxisTrack.
We can add labels for the 5' to 3' direction for the positive and negative strand by using the add53 and add35 parameters.
plotTracks(genomeAxis,from=100,to=10100,
add53=T,add35=T)
We can also configure the resolution of the axis (albeit rather bluntly) using the littleTicks parameter.
This will add additional axis tick marks between those shown by default.
plotTracks(genomeAxis,from=100,to=10100,
littleTicks = TRUE)
By default the plot labels for the genome axis track are alternating below and above the line.
We can further configure the axis labels using the labelPos parameter.
Here we set the labelPos to be always below the axis
plotTracks(genomeAxis,from=100,to=10100,
labelPos="below")
In the previous plots we have produced a genomic axis which allows us to consider the position of the features within the linear genome.
In some contexts we may be more interested in relative distances around and between the genomic features being displayed.
We can configure the axis track to give us a relative representative of distance using the scale parameter.
plotTracks(genomeAxis,from=100,to=10100,
scale=1,labelPos="below")
We may want to add only a part of the scale (such as with Google Maps) to allow the reviewer to get a sense of distance.
We can specify how much of the total axis we wish to display as a scale using a value of 0 to 1 representing the proportion of scale to show.
plotTracks(genomeAxis,from=100,to=10100,
scale=0.3)
We can also provide numbers greater than 1 to the scale parameter which will determine, in absolute base pairs, the size of scale to display.
plotTracks(genomeAxis,from=100,to=10100,
scale=2500)
Previously we have seen how to highlight regions of interest in the scale bar for IGV.
These "regions of interest" may be user defined locations which add context to the scale and the genomics data to be displayed (e.g. Domain boundaries such as topilogically associated domains)
We can add "regions of interest" to the axis plotted by Gviz as we have done with IGV.
To do this we will need to define some ranges to signify the positions of "regions of interest" in the linear context of our genome track.
Since the plots have no apparent context for chromosomes (yet), we will use a IRanges object to specify "regions of interest" as opposed to the genome focused GRanges.
You can see our material here on Bioconductor objects for more information on IRanges and GRanges.
To create an IRanges object we will load the IRanges library and specify vectors of start and end parameters to the IRanges constructor function.
library(IRanges)
regionsOfInterest <- IRanges(start=c(140,5140),end=c(2540,7540))
names(regionsOfInterest) <- c("ROI_1","ROI_2")
regionsOfInterest
IRanges object with 2 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
ROI_1 140 2540 2401
ROI_2 5140 7540 2401
Now we have our IRanges object representing our regions of interest we can include them in our axis.
We will have to recreate our axis track to allow us to include these regions of interest.
Once we have updated our GenomeAxisTrack object we can plot the axis with regions of interest included.
genomeAxis <- GenomeAxisTrack(name="MyAxis",
range = regionsOfInterest)
plotTracks(genomeAxis,from=100,to=10100)
We include the names specified in the IRanges for the regions of interest within the axis plot by specify the showID parameter to TRUE.
plotTracks(genomeAxis,from=100,to=10100,
range=regionsOfInterest,
showId=T)
id:datatracks
Now we have some fine control of the axis, it follows that we may want some to display some actual data along side our axis and/or regions of interest.
Gviz contains a general container for data tracks which can be created using the DataTrack() constructor function and associated object, DataTrack.
Generally DataTrack may be used to display most data types with some work but best fits ranges with associated signal as a matrix (multiple regions) or vector (single sample).
Lets update our IRanges object to have some score columns in the metadata columns. We can do this with the mcols function as shown in our Bioconductor material.
mcols(regionsOfInterest) <- data.frame(Sample1=c(30,20),Sample2=c(20,200))
regionsOfInterest <- GRanges(seqnames="chr5",ranges = regionsOfInterest)
regionsOfInterest
GRanges object with 2 ranges and 2 metadata columns:
seqnames ranges strand | Sample1 Sample2
<Rle> <IRanges> <Rle> | <numeric> <numeric>
ROI_1 chr5 [ 140, 2540] * | 30 20
ROI_2 chr5 [5140, 7540] * | 20 200
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
Now we have the data we need, we can create a simple DataTrack object.
dataROI <- DataTrack(regionsOfInterest)
plotTracks(dataROI)
As we have seen, DataTrack objects make use of IRanges/GRanges which are the central workhorse of Bioconductors HTS tools.
This means we can take advantage of the many manipulations available in the Bioconductor tool set.
Lets make use of rtracklayer's importing tools to retrieve coverage from a bigWig as a GRanges object
library(rtracklayer)
allChromosomeCoverage <- import.bw("Data/small_Sorted_SRR568129.bw",as="GRanges")
allChromosomeCoverage
GRanges object with 249 ranges and 1 metadata column:
seqnames ranges strand | score
<Rle> <IRanges> <Rle> | <numeric>
[1] chrM [1, 16571] * | 0
[2] chr1 [1, 249250621] * | 0
[3] chr2 [1, 243199373] * | 0
[4] chr3 [1, 198022430] * | 0
[5] chr4 [1, 191154276] * | 0
... ... ... ... . ...
[245] chr20 [1, 63025520] * | 0
[246] chr21 [1, 48129895] * | 0
[247] chr22 [1, 51304566] * | 0
[248] chrX [1, 155270560] * | 0
[249] chrY [1, 59373566] * | 0
-------
seqinfo: 25 sequences from an unspecified genome
Now we have our coverage as a GRanges object we can create our DataTrack object from this.
Notice we specify the chromsome of interest in the chromosome parameter.
accDT <- DataTrack(allChromosomeCoverage,chomosome="chr5")
accDT
DataTrack 'DataTrack'
| genome: NA
| active chromosome: chrM
| positions: 1
| samples:1
| strand: *
There are 248 additional annotation features on 24 further chromosomes
chr1: 1
chr10: 1
chr11: 1
chr12: 1
chr13: 1
...
chr7: 1
chr8: 1
chr9: 1
chrX: 1
chrY: 1
Call seqlevels(obj) to list all available chromosomes or seqinfo(obj) for more detailed output
Call chromosome(obj) <- 'chrId' to change the active chromosome
To plot data now using the plotTracks() function we will set the regions we wish to plot by specifying the chromsomes, start and end using the chromosome, from and to parameters.
By default we will get a similar point plot to that seen before.
plotTracks(accDT,
from=134887451,to=134888111,
chromosome="chr5")
We can adjust the type of plots we want using the type argument. Here as with standard plotting we can specify "l" to get a line plot.
plotTracks(accDT,
from=134887451,to=134888111,
chromosome="chr5",type="l")
Many other types of plots are available for the DataTracks.
Including smoothed plots using "smooth".
plotTracks(accDT,
from=134887451,to=134888111,
chromosome="chr5",type="smooth")
Histograms by specifying "h".
plotTracks(accDT,
from=134887451,to=134888111,
chromosome="chr5",type="h")
Or filled/smoothed plots using "mountain".
plotTracks(accDT,
from=134887451,to=134888111,
chromosome="chr5",type="mountain")
and even a Heatmap using "heatmap".
Notice that Gviz will automatically produce the appropriate Heatmap scale.
plotTracks(accDT,
from=134887451,to=134888111,
chromosome="chr5",type="heatmap")
As with all plotting functions in R, Gviz plots are highly customisable.
Simple features such as point size and colour are easily set as for standard R plots using cex and col paramters.
plotTracks(accDT,
from=134887451,to=134888111,
chromosome="chr5",
col="red",cex=4)
Now we have shown how to construct a data track and axis track we can put them together in one plot.
To do this we simply provide the GenomeAxisTrack and DataTrack objects as vector the plotTracks() function.
plotTracks(c(accDT,genomeAxis),
from=134887451,to=134888111,
chromosome="chr5"
)
The order of tracks in the plot is simply defined by the order they are placed in the vector passed to plotTracks()
plotTracks(c(genomeAxis,accDT),
from=134887451,to=134888111,
chromosome="chr5"
)
By default, Gviz will try and provide sensible track heights for your plots to best display your data.
The track height can be controlled by providing a vector of relative heights to the sizes paramter of the plotTracks() function.
If we want the axis to be 50% of the height of the Data track we specify the size for axis as 0.5 and that of data as 1. The order of sizes must match the order of objects they relate to.
plotTracks(c(genomeAxis,accDT),
from=134887451,to=134888111,
chromosome="chr5",
sizes=c(0.5,1)
)
Time for exercises! Link here
Time for solutions! Link here
id:Annotation
Genomic annotation, such as Gene/Transcript models, play an important part of visualising genomics data in context.
Gviz provides many routes for constructing genomic annotation using the AnnotationTrack() constructor function. In contrast to the DataTrack, AnnotationTrack allows for the specification of feature groups.
First lets create a GRanges object with some more regions
toGroup <- GRanges(seqnames="chr5",
IRanges(
start=c(10,500,550,2000,2500),
end=c(300,800,850,2300,2800)
))
names(toGroup) <- seq(1,5)
toGroup
GRanges object with 5 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
1 chr5 [ 10, 300] *
2 chr5 [ 500, 800] *
3 chr5 [ 550, 850] *
4 chr5 [2000, 2300] *
5 chr5 [2500, 2800] *
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
Now we can create the AnnotationTrack object using the constructor.
Here we also provide a grouping to the group parameter in the AnnotationTrack() function.
annoT <- AnnotationTrack(toGroup,
group = c("Ann1",
"Ann1",
"Ann2",
"Ann3",
"Ann3"))
plotTracks(annoT)
We can see the features are displayed grouped by lines.
But if we want to see the names we must specify the group parameter by using the groupAnnotation argument.
plotTracks(annoT,groupAnnotation = "group")
When we created the GRanges used here we did not specify any strand information.
strand(toGroup)
factor-Rle of length 5 with 1 run
Lengths: 5
Values : *
Levels(3): + - *
When plotting annotation without strand a box is used to display features as seen in previous slides
Now we can specify some strand information for the GRanges and replot.
Arrows now indicate the strand which the features are on.
strand(toGroup) <- c("+","+","*","-","-")
annoT <- AnnotationTrack(toGroup,
group = c("Ann1",
"Ann1",
"Ann2",
"Ann3",
"Ann3"))
plotTracks(annoT, groupingAnnotation="group")
In the IGV course we saw how you could control the display density of certain tracks.
Annotation tracks are often stored in files such as the general feature format (see our previous course).
IGV allows us to control the density of these tracks in the view options by setting to "collapsed", "expanded" or "squished".
Whereas "squished" and "expanded" maintains much of the information within the tracks, "collapsed" flattens overlapping features into a single displayed feature.
Here we have the same control over the display density of our annotation tracks.
By default the tracks are stacked using the "squish" option to make best use of the available space.
plotTracks(annoT, groupingAnnotation="group",stacking="squish")
By setting the stacking parameter to "dense", all overlapping features have been collapsed/flattened
plotTracks(annoT, groupingAnnotation="group",stacking="dense")
AnnotationTrack objects may also hold information on feature types.
For gene models we may be use to feature types such as mRNA, rRNA, snoRNA etc.
Here we can make use of feature types as well.
We can set any feature types within our data using the feature() function. Here they are unset so displayed as unknown.
feature(annoT)
[1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
We can set our own feature types for the AnnotationTrack object using the same feature() function.
We can choose any feature types we wish to define.
feature(annoT) <- c(rep("Good",4),rep("Bad",2))
feature(annoT)
[1] "Good" "Good" "Good" "Good" "Bad" "Bad"
Now we have defined our feature types we can use this information within our plots.
In GViz, we can directly specify attributes for individual feature types within our AnnotationTrack, in this example we add attributes for colour to be displayed.
We specify the "Good" features as blue and the "Bad" features as red.
plotTracks(annoT, featureAnnotation = "feature",
groupAnnotation = "group",
Good="Blue",Bad="Red")
id:grtrack
We have seen how we can display complex annotation using the AnnotationTrack objects.
For gene models Gviz contains a more specialised object, the GeneRegionTrack object.
The GeneRegionTrack object contains additional parameters and display options specific for the display of gene models.
Lets start by looking at the small gene model set stored in the Gviz package.
data(geneModels)
head(geneModels)
chromosome start end width strand feature gene
1 chr7 26591441 26591829 389 + lincRNA ENSG00000233760
2 chr7 26591458 26591829 372 + lincRNA ENSG00000233760
3 chr7 26591515 26591829 315 + lincRNA ENSG00000233760
4 chr7 26594428 26594538 111 + lincRNA ENSG00000233760
5 chr7 26594428 26596819 2392 + lincRNA ENSG00000233760
6 chr7 26594641 26594733 93 + lincRNA ENSG00000233760
exon transcript symbol
1 ENSE00001693369 ENST00000420912 AC004947.2
2 ENSE00001596777 ENST00000457000 AC004947.2
3 ENSE00001601658 ENST00000430426 AC004947.2
4 ENSE00001792454 ENST00000457000 AC004947.2
5 ENSE00001618328 ENST00000420912 AC004947.2
6 ENSE00001716169 ENST00000457000 AC004947.2
chromosome start end width strand feature gene
1 chr7 26591441 26591829 389 + lincRNA ENSG00000233760
2 chr7 26591458 26591829 372 + lincRNA ENSG00000233760
3 chr7 26591515 26591829 315 + lincRNA ENSG00000233760
4 chr7 26594428 26594538 111 + lincRNA ENSG00000233760
5 chr7 26594428 26596819 2392 + lincRNA ENSG00000233760
6 chr7 26594641 26594733 93 + lincRNA ENSG00000233760
exon transcript symbol
1 ENSE00001693369 ENST00000420912 AC004947.2
2 ENSE00001596777 ENST00000457000 AC004947.2
3 ENSE00001601658 ENST00000430426 AC004947.2
4 ENSE00001792454 ENST00000457000 AC004947.2
5 ENSE00001618328 ENST00000420912 AC004947.2
6 ENSE00001716169 ENST00000457000 AC004947.2
We can see that this data.frame contains information on start, end , chromosome and strand of feature needed to position features in a linear genome.
Also included are a feature type column named "feature" and columns containing additional metadata to group by - "gene","exon","transcript","symbol".
We can define a GeneRegionTrack as we would all other tracktypes. Here we provide a genome name, chromosome of interest and a name for the track.
grtrack <- GeneRegionTrack(geneModels, genome = "hg19",
chromosome = "chr7",
name = "smallRegions")
plotTracks(grtrack)
plotTracks(grtrack)
We can see that features here are rendered slightly differently to those in an AnnotationTrack object.
Here direction is illustrated by arrows in introns and unstranslated regions are shown as narrower boxes.
Since gene models typically contain exon, transcript and gene level annotation we can specify how we wish to annotate our plots by using the transcriptAnnotation and exonAnnotation parameters.
To label all transcripts by the gene annotation we specify the gene column to the transcriptAnnotation parameter.
plotTracks(grtrack,transcriptAnnotation="gene")
Similarly we can label transcripts by their individual transcript names.
plotTracks(grtrack,transcriptAnnotation="transcript")
Or we can label using the transcriptAnnotation object by any arbitary column where there is one level per transcript.
plotTracks(grtrack,transcriptAnnotation="symbol")
As with transcripts we can label individual features using the exonAnnotation parameter by any arbitary column where there is one level per feature/exon.
plotTracks(grtrack,exonAnnotation="exon",from=26677490,to=26686889,cex=0.5)
We saw that we can control the display density when plotting AnnotationTrack objects.
We can control the display density of GeneRegionTracks in the same manner.
plotTracks(grtrack, stacking="dense")
However, since the GeneRegionTrack object is a special class of the AnnotationTrack object we have special parameter for dealing with display density of transcripts.
The collapseTranscripts parameter allows us a finer degree of control than that seen with stacking parameter.
Here we set collapseTranscripts to be true inorder to merge all overlapping transcripts.
plotTracks(grtrack, collapseTranscripts=T,
transcriptAnnotation = "symbol")
Collapsing using the collapseTranscripts has summarised our transcripts into their respective gene boundaries.
We have however lost information on the strand of transcripts. To retain this information we need to specify a new shape for our plots using the shape parameter. To capture direction we use the "arrow" shape
plotTracks(grtrack, collapseTranscripts=T,
transcriptAnnotation = "symbol",
shape="arrow")
The collapseTranscripts function also allows us some additional options by which to collapse our transcripts.
These methods maintain the intron information in the gene model and so get us closer to reproducing the "collapsed" feature in IGV.
Here we may collapse transcripts to the "longest".
plotTracks(grtrack, collapseTranscripts="longest",
transcriptAnnotation = "symbol")
Or we may specify to collapseTranscripts function to collapse by "meta".
The "meta" option shows us a composite, lossless illustration of the gene models closest to that seen in "collapsed" IGV tracks.
Here importantly all exon information is retained.
plotTracks(grtrack, collapseTranscripts="meta",
transcriptAnnotation = "symbol")
We have seen in previous material how gene models are organised in Bioconductor using the TxDB objects.
Gviz may be used in junction with TxDB objects to construct the GeneRegionTrack objects.
We saw in the Bioconductor and ChIPseq course that many genomes have pre-build gene annotation within the respective TxDB libraries. Here we will load a TxDb for hg19 from the TxDb.Hsapiens.UCSC.hg19.knownGene library.
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
txdb
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: UCSC
# Genome: hg19
# Organism: Homo sapiens
# Taxonomy ID: 9606
# UCSC Table: knownGene
# Resource URL: http://genome.ucsc.edu/
# Type of Gene ID: Entrez Gene ID
# Full dataset: yes
# miRBase build ID: GRCh37
# transcript_nrow: 82960
# exon_nrow: 289969
# cds_nrow: 237533
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2015-10-07 18:11:28 +0000 (Wed, 07 Oct 2015)
# GenomicFeatures version at creation time: 1.21.30
# RSQLite version at creation time: 1.0.0
# DBSCHEMAVERSION: 1.1
Now we have loaded our TxDb object and assigned it to txdb. We can use this TxDb object to construct our GeneRegionTrack. Here we focus on chromosome 7 again.
customFromTxDb <- GeneRegionTrack(txdb,chromosome="chr7")
head(customFromTxDb)
GeneRegionTrack 'GeneRegionTrack'
| genome: hg19
| active chromosome: chr7
| annotation features: 6
With our new GeneRegionTrack we can now reproduce the gene models using the Bioconductor TxDb annotation.
Here the annotation is different but transcripts overlapping uc003syc are our SKAP2 gene.
plotTracks(customFromTxDb,
from=26591341,to=27034958,
transcriptAnnotation="gene")
Now by combining the ability to create our own TxDb objects from GFFs we can create a very custom GeneRegionTrack from a GFF file.
library(GenomicFeatures)
txdbFromGFF <- makeTxDbFromGFF(file = "~/Downloads/tophat2.gff")
customFromTxDb <- GeneRegionTrack(txdbFromGFF,chromosome="chr7")
plotTracks(customFromTxDb,
from=26591341,to=27034958,
transcriptAnnotation="gene")
Time for exercises! Link here
Time for solutions! Link here
id:seqtrack
When displaying genomics data it can be important to illustrate the underlying sequence for the genome being viewed.
Gviz uses SequenceTrack objects to handle displaying sequencing information.
First we need to get some sequence information for our genome of interest to display. Here we will use one of the BSgenome packages specific for hg19 - BSgenome.Hsapiens.UCSC.hg19. This contains the full sequence for hg19 as found in UCSC
library(BSgenome.Hsapiens.UCSC.hg19)
BSgenome.Hsapiens.UCSC.hg19[["chr7"]]
159138663-letter "DNAString" instance
seq: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
We can create a SequenceTrack object straight from this BSgenome object using the SequenceTrack() constructor.
We can then plot this SequenceTrack, as with all tracks, using the plotTracks() functions. Here we specify the from, to and chromosome parameters to select a region to display.
sTrack <- SequenceTrack(Hsapiens)
plotTracks(sTrack,from=134887024,to=134887074,
chromosome = "chr7",cex=2.5)
We can also specify a DNAstringset object which we have encountered in the Bioconductor and ChIP-seq courses.
dsSet <- DNAStringSet(Hsapiens[["chr7"]])
names(dsSet) <- "chr7"
sTrack <- SequenceTrack(dsSet)
plotTracks(sTrack,from=134887024,to=134887074,
chromosome = "chr7",cex=2.5)
We can also create our custom SequenceTrack from a Fasta file.
Here we use an example containing only the sequence around the region we are looking at to save space. Since the sequence is only of the region of interest we need specify the sequence limits for the from and to arguments. With completer fasta files, from and to would be set as for other SequenceTrack examples.
sTrack <- SequenceTrack("Data/chr7Short.fa")
plotTracks(sTrack,from=1,to=50,
chromosome = "chr7",cex=3)
As with IGV, the sequence can be displayed as its complement. This is performed here by setting the complement argument to the plotTracks() function to TRUE/T.
sTrack <- SequenceTrack("Data/chr7Short.fa")
plotTracks(sTrack,from=1,to=50,
chromosome = "chr7",complement=T,cex=3)
We can also add 5' to 3' direction as we have for plotting GenomeAxisTrack objects using the add53 parameter. This allows for a method to illustrate the strand of the sequence being diplayed.
sTrack <- SequenceTrack("Data/chr7Short.fa")
plotTracks(sTrack,from=1,to=50,
chromosome = "chr7",complement=F,
add53=T,cex=2.5)
Notice the 5' and 3' labels have swapped automatically when we have specified the complement sequence.
sTrack <- SequenceTrack("Data/chr7Short.fa")
plotTracks(sTrack,from=1,to=50,
chromosome = "chr7",complement=T,
add53=T,cex=2.5)
We can control the size of bases with the cex parameter, as with the standard R plotting.
An interesting feature of this is that when plotted bases overlap, Gviz will provide a colour representation of bases instead of the bases' characters.
plotTracks(sTrack,from=1,to=50,
chromosome = "chr7",cex=2.5)
plotTracks(sTrack,from=1,to=50,
chromosome = "chr7",
cex=5)
id:alignments
So far we have displayed summarised genomics data using GRange objects or GRanges with associated metadata.
A prominent feature of Gviz is that it can work with genomic alignments, providing methods to generate graphical summaries on the fly.
Genomic alignments are stored in Gviz within the AlignmentsTrack object.
Here we can read genomic alignments in from a BAM file, see our file formats course material, by specifying its location.
peakReads <- AlignmentsTrack("Data/small_Sorted_SRR568129.bam")
peakReads
ReferenceAlignmentsTrack 'AlignmentsTrack'
| genome: NA
| active chromosome: chrNA
| referenced file: Data/small_Sorted_SRR568129.bam
| mapping: id=id, cigar=cigar, mapq=mapq, flag=flag, isize=isize, groupid=groupid, status=status, md=md, seq=seq
The AlignmentsTrack object can be plotted in the same manner as tracks using plotTracks() function.
Since the BAM file may contain information from all chromosomes we need to specify a chromsome to plot in the chromosome parameter and here we specify the from and to parameters too.
plotTracks(peakReads,
chromosome="chr5",
from=135312577,
to=135314146)
plotTracks(peakReads,
chromosome="chr5",
from=135312577,
to=135314146)
By default AlignmentTracks are rendered as the both the reads themselves and the calculated coverage/signal depth from these reads.
Reads, as with AnnotationTrack objects, show the strand of the aligned read by the direction of the arrow.
The type of plot/plots produced can be controlled by the type argument as we have done for DataTrack objects.
The valid types of plots for AlignmentsTrack objects are "pileup", "coverage" and "sashimi" (We've come across sashimi plots before).
The type "pileup" displays just the reads.
plotTracks(peakReads,
chromosome="chr5",
from=135312577,
to=135314146,
type="pileup")
The type "coverage" displays just the coverage (depth of signal over genomic positions) calculated from the genomic alignments.
plotTracks(peakReads,
chromosome="chr5",
from=135312577,
to=135314146,
type="coverage")
As we have seen the default display is a combination of "pileup" and "coverage".
We can provide multiple type arguments to the plotTracks() function as a vector of valid types. The order in vector here does not affect the display order in panels.
We have seen sashimi plots in IGV when reviewing RNA-seq data.
Sashimi plots display the strength of signal coming from reads spanning splice junctions and so can act to illustrate changes in exon usage between samples.
In IGV, we previous made use of the BodyMap data to show alternative splicing of an exon between heart and liver.
To recapitulate this plot, we have retrieved the subsection of BodyMap data as BAM files.
First we must create two AlignmentsTrack objects, one for each tissue's BAM file of aligned reads.
In this case since we are working with paired-end reads we must specify this by setting the isPaired parameter to TRUE
heartReads <- AlignmentsTrack("Data/heart.bodyMap.bam",
isPaired = TRUE)
liverReads <- AlignmentsTrack("Data/liver.bodyMap.bam",
isPaired = TRUE)
liverReads
ReferenceAlignmentsTrack 'AlignmentsTrack'
| genome: NA
| active chromosome: chrNA
| referenced file: Data/liver.bodyMap.bam
| mapping: id=id, cigar=cigar, mapq=mapq, flag=flag, isize=isize, groupid=groupid, status=status, md=md, seq=seq
As with DataTrack objects we can combine the AlignmentTracks as a vector for plotting with the plotTracks() function.
By default we will display the reads and calculated coverage. Here the paired reads and split reads are illustrated by thick and thin lines respectively.
plotTracks(c(heartReads,liverReads),
chromosome="chr12",
from=98986825,to=98997877)
To reproduce a plot similar to that in IGV we can simply include the "sashimi" type in the type parameter vector, here alongside "coverage"
plotTracks(c(heartReads,liverReads),
chromosome="chr12",
from=98986825,
to=98997877,
type=c("coverage","sashimi"))
The AlignmentTrack object allows for specific parameters controlling how reads are displayed to be passed to the plotTracks() function.
A few useful parameters are col.gaps and col.mates or lty.gap and lty.mates which will allow us to better distinguish between gapped alignments (split reads) and gaps between read pairs respectively.
plotTracks(c(liverReads),
chromosome="chr12",
from=98986825,to=98997877,
col.gap="Red",col.mate="Blue")
Similarly using lty.gap and lty.mate parameters.
plotTracks(c(liverReads),
chromosome="chr12",
from=98986825,to=98997877,
lty.gap=2,lty.mate=1)
Line width may also be controlled with lwd.gap and lwd.mate parameters continuing the similarities to Base R plotting.
A common purpose in visualising alignment data in broswers is review information relating to mismatches to the genome which may be related to SNPs.
In order to highlight mismatches to the genome reference sequence we must first provide some information on the reference sequence.
One method for this is to attach sequence information to the AlignmentsTrack itself by providing a SequenceTrack object to referenceSequence parameter in the AlignmentsTrack() constructor. Here we can use the SequenceTrack object we made earlier.
sTrack <- SequenceTrack(Hsapiens)
heartReads <- AlignmentsTrack("Data/heart.bodyMap.bam",
isPaired = TRUE,
referenceSequence=sTrack)
Now when we can replot the pileup of reads where mismatches in the reads are highlighted.
plotTracks(heartReads,
chromosome="chr12",
from=98987800,to=98987977,
type="pileup")
We could also specify the SequenceTrack in the plotTracks() function as shown for the liver reads example here. Here we simply include the relevant SequenceTrack object as a track to be plotted alongside the BAM.
plotTracks(c(liverReads,sTrack),
chromosome="chr12",
from=98987800,
to=98987977,
type="pileup")
Time for exercises! Link here
Time for solutions! Link here
id: externaldata
Gviz has functions to allow us to import data from external repositories and databases.
As in the IGV course, visualising genomics data in the context of additional genome information and external data held at these repositories provides a deeper insight into our own data.
In this course we will look at two main methods of querying external databases-
- The BiomartGeneRegionTrack object and constructor.
- The UcscTrack object and constructor
We have previously seen how we can use the biomaRt Bioconductor package to programatically query various Biomarts (see our previous material).
Gviz allows us to both query Biomarts and automatically create a GeneRegionTrack using the BiomartGeneRegionTrack objects and BiomartGeneRegionTrack() constructor.
Here we construct a simple BiomartGeneRegionTrack object using the parameters to define locations of interest - "chromsome", "start","end","genome" as well as the Biomart to use, in this case Ensembl by setting the name parameter.
bgrTrack <- BiomartGeneRegionTrack(genome="hg19",
start=26591341,
end=27034958,
chromosome = "chr7",
name="ENSEMBL")
We can then plot the BiomartGeneRegionTrack as we have previous GeneRegionTracks.
plotTracks(bgrTrack)
We can also specify filters in the BiomartGeneRegionTrack() constructor using the filter parameter.
Gviz uses the BiomaRt Bioconductor package to query the Biomarts so we can apply the same filters as in BiomaRt (which we saw in our earlier material).
library(biomaRt)
mart = useMart("ensembl", dataset="hsapiens_gene_ensembl")
listFilters(mart)
name
1 chromosome_name
2 start
3 end
4 band_start
5 band_end
6 marker_start
7 marker_end
8 encode_region
9 strand
10 chromosomal_region
11 with_hgnc
12 with_hgnc_transcript_name
13 with_ox_arrayexpress
14 with_ccds
15 with_chembl
16 with_ox_clone_based_ensembl_gene
17 with_ox_clone_based_ensembl_transcript
18 with_ox_clone_based_vega_gene
19 with_ox_clone_based_vega_transcript
20 with_dbass3
21 with_dbass5
22 with_ens_hs_gene
23 with_ens_hs_translation
24 with_ens_hs_transcript
25 with_ens_lrg_gene
26 with_ens_lrg_transcript
27 with_epd
28 with_embl
29 with_entrezgene
30 with_entrezgene_transcript_name
31 with_genedb
32 with_go_id
33 with_go_go
34 with_goslim_goa
35 with_hpa
36 with_merops
37 with_mim_gene
38 with_mim_morbid
39 with_mirbase
40 with_mirbase_transcript_name
41 with_ottg
42 with_ottt
43 with_ottp
44 with_pdb
45 with_protein_id
46 with_reactome
47 with_reactome_gene
48 with_reactome_transcript
49 with_rfam
50 with_rfam_transcript_name
51 with_refseq_mrna
52 with_refseq_mrna_predicted
53 with_refseq_ncrna
54 with_refseq_ncrna_predicted
55 with_refseq_peptide
56 with_refseq_peptide_predicted
57 with_rnacentral
58 with_ucsc
59 with_unigene
60 with_uniparc
61 with_uniprot_genename
62 with_uniprotswissprot
63 with_uniprotsptrembl
64 with_wikigene
65 ensembl_gene_id
66 ensembl_transcript_id
67 ensembl_peptide_id
68 ensembl_exon_id
69 hgnc_id
70 hgnc_symbol
71 hgnc_transcript_name
72 external_gene_name
73 arrayexpress
74 ccds
75 chembl
76 clone_based_ensembl_gene_name
77 clone_based_ensembl_transcript_name
78 clone_based_vega_gene_name
79 clone_based_vega_transcript_name
80 dbass3_name
81 dbass5_name
82 embl
83 ens_hs_transcript
84 ens_hs_translation
85 ens_lrg_gene
86 ens_lrg_transcript
87 entrezgene
88 entrezgene_transcript_name
89 epd
90 go_id
91 goslim_goa_accession
92 hpa
93 genedb
94 merops
95 mim_gene_accession
96 mim_morbid
97 mirbase_id
98 mirbase_accession
99 mirbase_transcript_name
100 pdb
101 protein_id
102 reactome
103 reactome_gene
104 reactome_transcript
105 refseq_mrna
106 refseq_mrna_predicted
107 refseq_ncrna
108 refseq_ncrna_predicted
109 refseq_peptide
110 refseq_peptide_predicted
111 rfam
112 rfam_transcript_name
113 rnacentral
114 ucsc
115 uniprot_sptrembl
116 uniprot_swissprot
117 unigene
118 uniprot_genename
119 uniparc
120 ottg
121 ottt
122 ottp
123 wikigene_id
124 wikigene_name
125 with_affy_huex_1_0_st_v2
126 with_affy_hc_g110
127 with_affy_hg_focus
128 with_affy_u133_x3p
129 with_affy_hg_u133a
130 with_affy_hg_u133a_2
131 with_affy_hg_u133_plus_2
132 with_affy_hg_u133b
133 with_affy_hg_u95a
134 with_affy_hg_u95av2
135 with_affy_hg_u95b
136 with_affy_hg_u95c
137 with_affy_hg_u95d
138 with_affy_hg_u95e
139 with_affy_hugenefl
140 with_affy_hugene_1_0_st_v1
141 with_affy_hugene_2_0_st_v1
142 with_affy_hta_2_0
143 with_affy_primeview
144 with_agilent_cgh_44b
145 with_efg_agilent_wholegenome_4x44k_v1
146 with_efg_agilent_wholegenome_4x44k_v2
147 with_efg_agilent_sureprint_g3_ge_8x60k
148 with_efg_agilent_sureprint_g3_ge_8x60k_v2
149 with_codelink_codelink
150 with_phalanx_onearray
151 with_illumina_humanwg_6_v1
152 with_illumina_humanwg_6_v2
153 with_illumina_humanwg_6_v3
154 with_illumina_humanht_12_v3
155 with_illumina_humanht_12_v4
156 with_illumina_humanref_8_v3
157 with_illumina_humanmethylation27
158 with_illumina_humanmethylation450
159 affy_hc_g110
160 affy_hg_focus
161 affy_hg_u95a
162 affy_hg_u95av2
163 affy_hg_u95b
164 affy_hg_u95c
165 affy_hg_u95d
166 affy_hg_u95e
167 affy_hg_u133a_2
168 affy_hg_u133a
169 affy_hg_u133b
170 affy_hg_u133_plus_2
171 affy_hta_2_0
172 affy_hugenefl
173 affy_hugene_1_0_st_v1
174 affy_hugene_2_0_st_v1
175 affy_huex_1_0_st_v2
176 affy_primeview
177 affy_u133_x3p
178 codelink
179 agilent_cgh_44b
180 efg_agilent_sureprint_g3_ge_8x60k
181 efg_agilent_sureprint_g3_ge_8x60k_v2
182 efg_agilent_wholegenome_4x44k_v1
183 efg_agilent_wholegenome_4x44k_v2
184 illumina_human_methylation_27
185 illumina_human_methylation_450
186 illumina_humanht_12_v3
187 illumina_humanht_12_v4
188 illumina_humanref_8_v3
189 illumina_humanwg_6_v1
190 illumina_humanwg_6_v2
191 illumina_humanwg_6_v3
192 phalanx_onearray
193 transcript_count
194 biotype
195 transcript_biotype
196 source
197 transcript_source
198 status
199 transcript_status
200 transcript_tsl
201 transcript_gencode_basic
202 transcript_appris
203 phenotype_description
204 phenotype_source
205 go_evidence_code
206 go_parent_term
207 go_parent_name
208 with_paralog_hsap
209 with_homolog_vpac
210 with_homolog_pfor
211 with_homolog_acar
212 with_homolog_dnov
213 with_homolog_gmor
214 with_homolog_ogar
215 with_homolog_cele
216 with_homolog_fcat
217 with_homolog_amex
218 with_homolog_ggal
219 with_homolog_ptro
220 with_homolog_psin
221 with_homolog_cint
222 with_homolog_csav
223 with_homolog_lcha
224 with_homolog_sara
225 with_homolog_btau
226 with_homolog_cfam
227 with_homolog_ttru
228 with_homolog_apla
229 with_homolog_dmel
230 with_homolog_lafr
231 with_homolog_mfur
232 with_homolog_falb
233 with_homolog_trub
234 with_homolog_nleu
235 with_homolog_ggor
236 with_homolog_cpor
237 with_homolog_eeur
238 with_homolog_ecab
239 with_homolog_dord
240 with_homolog_pmar
241 with_homolog_etel
242 with_homolog_mmul
243 with_homolog_cjac
244 with_homolog_olat
245 with_homolog_pvam
246 with_homolog_mluc
247 with_homolog_mmus
248 with_homolog_mmur
249 with_homolog_onil
250 with_homolog_panu
251 with_homolog_mdom
252 with_homolog_pabe
253 with_homolog_amel
254 with_homolog_sscr
255 with_homolog_opri
256 with_homolog_xmac
257 with_homolog_oana
258 with_homolog_ocun
259 with_homolog_rnor
260 with_homolog_pcap
261 with_homolog_oari
262 with_homolog_chof
263 with_homolog_locu
264 with_homolog_itri
265 with_homolog_gacu
266 with_homolog_tsyr
267 with_homolog_shar
268 with_homolog_tnig
269 with_homolog_tbel
270 with_homolog_mgal
271 with_homolog_csab
272 with_homolog_meug
273 with_homolog_xtro
274 with_homolog_scer
275 with_homolog_tgut
276 with_homolog_drer
277 with_coil
278 with_gene3d
279 with_hamap
280 with_hmmpanther
281 with_interpro
282 with_low_complexity
283 with_protein_feature_pfam
284 with_profile
285 with_pirsf
286 with_protein_feature_prints
287 with_prosite
288 with_smart
289 with_superfamily
290 with_tigrfam
291 with_signalp
292 with_tmhmm
293 with_blastprodom
294 tigrfam
295 gene3d
296 hamap
297 superfamily
298 smart
299 pirsf
300 family
301 pfam
302 hmmpanther
303 prints
304 profile
305 prosite
306 interpro
307 germ_line_variation_source
308 somatic_variation_source
309 with_validated_snp
310 so_parent_name
description
1 Chromosome name
2 Gene Start (bp)
3 Gene End (bp)
4 Band Start
5 Band End
6 Marker Start
7 Marker End
8 Encode region
9 Strand
10 Chromosome Regions (e.g 1:100:10000:-1,1:100000:200000:1)
11 with HGNC ID(s)
12 with HGNC transcript name(s)
13 with ArrayExpress ID(s)
14 with CCDS ID(s)
15 with ChEMBL ID(s)
16 with clone based Ensembl gene ID(s)
17 with clone based Ensembl transcript ID(s)
18 with clone based VEGA gene ID(s)
19 with clone based VEGA transcript ID(s)
20 with DBASS3 ID(s)
21 with DBASS5 ID(s)
22 with Ensembl Human Gene IDs
23 with Ensembl Human Translation IDs
24 with Ensembl Human Transcript IDs
25 with Ensembl LRG gene ID(s)
26 with Ensembl LRG transcript ID(s)
27 with EPD ID(s)
28 with EMBL ID(s)
29 with EntrezGene ID(s)
30 with EntrezGene Transcript Name(s)
31 with GeneDB ID(s)
32 with GO Term Accession(s)
33 with GO ID(s)
34 with GOSlim GOA(s)
35 with Human Protein Atlas ID(s)
36 with MEROPS ID(s)
37 with MIM gene ID(s)
38 with MIM MORBID ID(s)
39 with miRBase ID(s)
40 with miRBase transcript name(s)
41 with VEGA gene ID(s) (OTTG)
42 with VEGA transcript ID(s) (OTTT)
43 with VEGA protein ID(s) (OTTP)
44 with PDB ID(s)
45 with protein (Genbank) ID(s)
46 with Reactome ID(s)
47 with Reactome gene ID(s)
48 with Reactome transcript ID(s)
49 with Rfam ID(s)
50 with Rfam transcript name(s)
51 with RefSeq mRNA ID(s)
52 with RefSeq mRNA predicted ID(s)
53 with RefSeq ncRNA ID(s)
54 with RefSeq ncRNA predicted ID(s)
55 with RefSeq protein ID(s)
56 with RefSeq predicted protein ID(s)
57 with RNACentral ID(s)
58 with UCSC ID(s)
59 with UniGene ID(s)
60 with UniParc ID(s)
61 with UniProt Gene Name(s)
62 with UniProt/SwissProt Accession(s)
63 with UniProt/TrEMBL Accession(s)
64 with WikiGene ID(s)
65 Ensembl Gene ID(s) [e.g. ENSG00000139618]
66 Ensembl Transcript ID(s) [e.g. ENST00000380152]
67 Ensembl protein ID(s) [e.g. ENSP00000369497]
68 Ensembl exon ID(s) [e.g. ENSE00001508081]
69 HGNC ID(s) [e.g. HGNC:8030]
70 HGNC symbol(s) [e.g. NTN3]
71 HGNC transcript name(s) [e.g. QRSL1P2-001]
72 Associated Gene Name(s) [e.g. BRCA2]
73 ArrayExpress ID(s) [e.g. ENSG00000241328]
74 CCDS ID(s) [e.g. CCDS10187]
75 ChEMBL ID(s) ID(s) [e.g. CHEMBL1075092]
76 Clone based Ensembl gene name(s) [e.g. AL162430.1]
77 Clone based Ensembl transcript name(s) [e.g. AL162430.1-201]
78 Clone based VEGA gene name(s) [e.g. RP11-815M8.1]
79 Clone based VEGA transcript name(s) [e.g. RP5-859I17.3-001]
80 DBASS3 Gene Name [e.g. PDE6B]
81 DBASS5 Gene Name [e.g. HCN2]
82 EMBL ID(s) [e.g. AY495257]
83 Ensembl Human Transcript IDs [e.g. ENST00000225964]
84 Ensembl Human Translation IDs [e.g. ENSP00000376544]
85 LRG to Ensembl link gene IDs [e.g. LRG_3]
86 LRG to Ensembl link transcript IDs [e.g. LRG_226t1]
87 EntrezGene ID(s) [e.g. 115286]
88 EntrezGene transcript name ID(s) [e.g. CTD-2350J17.1-002]
89 Eukaryotic Promoter Database (EPD) ID(s) [e.g. 11050]
90 GO Term Accession(s) [e.g. GO:0005515]
91 GOSlim GOA Accessions(s) [e.g. GO:0005623]
92 Human Protein Atlas Antibody ID [e.g. HPA002549]
93 GeneDB ID [e.g. LmjF.10.1080:pep]
94 MEROPS ID(s) [e.g. C19.028]
95 MIM Gene Accession(s) [e.g. 611882]
96 MIM MORBID ID(s) [e.g. 100100]
97 miRBase ID(s) [e.g. hsa-mir-137]
98 miRBase Accession(s) [e.g. MI0000454]
99 miRBase transcript name [e.g. hsa-mir-6724-1.3-201]
100 PDB ID(s) [e.g. 1J47]
101 Protein (Genbank) ID(s) [e.g. ACU09872]
102 Reactome ID(s) [e.g. R-HSA-392499]
103 Reactome gene ID(s) [e.g. R-HSA-381070]
104 Reactome transcript ID(s) [e.g. R-HSA-5368287]
105 RefSeq mRNA ID(s) [e.g. NM_001195597]
106 RefSeq Predicted mRNA ID(s) [e.g. XM_006724158]
107 RefSeq ncRNA ID(s) [e.g. NR_125810]
108 RefSeq Predicted ncRNA ID(s) [e.g. XR_251015]
109 RefSeq protein ID(s) [e.g. NP_001005353]
110 RefSeq predicted protein ID(s) [e.g. XP_011520427]
111 Rfam ID(s) [e.g. RF00432]
112 Rfam transcript name(s) [e.g. Y_RNA.837-201]
113 RNACentral ID(s) [e.g. URS000019B707]
114 UCSC ID(s) [e.g. uc002cqj.3]
115 UniProt/TrEMBL Accession(s) [e.g. U5Z754]
116 UniProt/Swissprot Accession(s) [e.g. Q13068]
117 UniGene ID(s) [e.g. Hs.146092]
118 UniProt Accession ID(s) [e.g. P03886]
119 UniParc ID(s) [e.g. UPI0000000AA1]
120 VEGA Gene ID(s) (OTTG) [e.g. OTTHUMG00000036159]
121 VEGA Transcript ID(s) (OTTT) [e.g. OTTHUMT00000088063]
122 VEGA Protein ID(s) (OTTP) [e.g. OTTHUMP00000277309]
123 WikiGene ID(s) [e.g. 115286]
124 WikiGene Name(s) [e.g. SLC25A26]
125 with Affymetrix Microarray huex 1 0 st v2 probeset ID(s)
126 with Affymetrix Microarray hc g110 probeset ID(s)
127 with Affymetrix Microarray hg Focus probeset ID(s)
128 with Affymetrix Microarray u133 x3p probeset ID(s)
129 with Affymetrix Microarray hg u133a probeset ID(s)
130 with Affymetrix Microarray hg u133a 2 probeset ID(s)
131 with Affymetrix Microarray hg u133 plus 2 probeset ID(s)
132 with Affymetrix Microarray hg u133b probeset ID(s)
133 with Affymetrix Microarray hg u95a probeset ID(s)
134 with Affymetrix Microarray hg u95av2 ID(s) probeset
135 with Affymetrix Microarray hg u95b probeset ID(s)
136 with Affymetrix Microarray hg u95c probeset ID(s)
137 with Affymetrix Microarray hg u95d probeset ID(s)
138 with Affymetrix Microarray hg u95e probeset ID(s)
139 with Affymetrix Microarray HuGeneFL probeset ID(s)
140 with Affymetrix Microarray hugene 1 0 st v1 probeset ID(s)
141 with Affymetrix Microarray hugene 2 0 st v1 ID(s)
142 with Affymetrix Microarray HTA-2_0 probeset ID(s)
143 with Affymetrix Microarray primeview ID(s)
144 with Agilent CGH 44b probe ID(s)
145 with Efg agilent wholegenome 4x44k v1 ID(s)
146 with Efg agilent wholegenome 4x44k v2 ID(s)
147 with Efg agilent sureprint g3 ge 8x60k ID(s)
148 with Efg agilent sureprint g3 ge 8x60k v2 ID(s)
149 with Codelink probe ID(s)
150 with Phalanx onearray probe ID(s)
151 with Illumina HumanWG 6 v1 probe ID(s)
152 with Illumina HumanWG 6 v2 probe ID(s)
153 with Illumina HumanWG 6 v3 probe ID(s)
154 with Illumina Human HT 12 v3 probe ID(s)
155 with Illumina Human HT 12 v4 probe ID(s)
156 with Illumina Human HT 8 v3 probe ID(s)
157 with Illumina human methylation 27 probe ID(s)
158 with Illumina human methylation 450 probe ID(s)
159 Affy hc g110 probeset ID(s) [e.g. 113_i_at]
160 Affy hg focus probeset ID(s) [e.g. 201612_at]
161 Affy hg u95a probeset ID(s) [e.g. 32647_at]
162 Affy hg u95av2 probeset ID(s) [e.g. 32647_at]
163 Affy hg u95b probeset ID(s) [e.g. 53925_at]
164 Affy hg u95c probeset ID(s) [e.g. 61056_r_at]
165 Affy hg u95d probeset ID(s) [e.g. 79632_at]
166 Affy hg u95e probeset ID(s) [e.g. 79965_at]
167 Affy hg u133a 2 probeset ID(s) [e.g. 200874_s_at]
168 Affy hg u133a probeset ID(s) [e.g. 200874_s_at]
169 Affy hg u133b probeset ID(s) [e.g. 227057_at]
170 Affy hg u133 plus 2 probeset ID(s) [e.g. 241843_at]
171 Affy HTA_2_0 probeset ID(s) [e.g. TC04000093.hg]
172 Affy HuGene FL probeset ID(s) [e.g. M58525_s_at]
173 Affy HuGene 1_0 st v1 probeset ID(s) [e.g. 8065566]
174 Affy HuGene 2_0 st v1 probeset ID(s) [e.g. 16964973]
175 Affy HuEx 1_0 st v2 probeset ID(s) [e.g. 4033465]
176 Affymetrix Microarray Primeview ID(s) [e.g. 11763890_at]
177 Affy u133 x3p probeset ID(s) [e.g. Hs2.205326.1.A1_3p_at]
178 Codelink probe ID(s) [e.g. GE550734]
179 Agilent CGH 44b probe ID(s) [e.g. A_14_P131077]
180 Agilent Sureprint G3 GE 8x60k probe ID(s) [e.g. A_33_P3356022]
181 Agilent Sureprint G3 GE 8x60k v2 probe ID(s) [e.g. A_24_P182122]
182 Agilent WholeGenome 4x44k v1 probe ID(s) [e.g. A_32_P196615]
183 Agilent WholeGenome 4x44k v2 probe ID(s) [e.g. A_33_P3356022]
184 Illumina Human methylation 27 probe ID(s) [e.g. cg20103550]
185 Illumina Human methylation 450 probe ID(s) [e.g. cg26891645]
186 Illumina Human HT 12 v3 probe ID(s) [e.g. ILMN_2079225]
187 Illumina Human HT 12 v4 probe ID(s) [e.g. ILMN_2079225]
188 Illumina Human Ref 8 v3 probe ID(s) [e.g. ILMN_1768251]
189 Illumina HumanWG 6 V1 probe ID(s) [e.g. 0000940471]
190 Illumina HumanWG 6 V2 probe ID(s) [e.g. ILMN_1748182]
191 Illumina HumanWG 6 v3 probe ID(s) [e.g. ILMN_2103362]
192 Phalanx OneArray probe ID(s) [e.g. PH_hs_0031946]
193 Transcript count >=
194 Type
195 Transcript Type
196 Source (gene)
197 Source (transcript)
198 Status (gene)
199 Status (transcript)
200 Transcript Support Level (TSL)
201 GENCODE basic annotation
202 APPRIS annotation
203 Phenotype description
204 Phenotype source
205 GO Evidence code
206 Parent term accession
207 Parent term name
208 Paralogous Human Genes
209 Orthologous Alpaca Genes
210 Orthologous Amazon molly Genes
211 Orthologous Anole Lizard Genes
212 Orthologous Armadillo Genes
213 Orthologous Atlantic Cod Genes
214 Orthologous Bushbaby Genes
215 Orthologous Caenorhabditis elegans Genes
216 Orthologous Cat Genes
217 Orthologous Cave fish Genes
218 Orthologous Chicken Genes
219 Orthologous Chimpanzee Genes
220 Orthologous Chinese softshell turtle Genes
221 Orthologous Ciona intestinalis genes
222 Orthologous Ciona savignyi Genes
223 Orthologous Coelacanth Genes
224 Orthologous Common Shrew Genes
225 Orthologous Cow Genes
226 Orthologous Dog Genes
227 Orthologous Dolphin Genes
228 Orthologous Duck Genes
229 Orthologous Drosophila Genes
230 Orthologous Elephant Genes
231 Orthologous Ferret Genes
232 Orthologous Flycatcher Genes
233 Orthologous Fugu Genes
234 Orthologous Gibbon Genes
235 Orthologous Gorilla Genes
236 Orthologous Guinea Pig Genes
237 Orthologous Hedgehog Genes
238 Orthologous Horse Genes
239 Orthologous Kangaroo Rat Genes
240 Orthologous Lamprey Genes
241 Orthologous Lesser hedgehog tenrec Genes
242 Orthologous Macaque Genes
243 Orthologous Marmoset Genes
244 Orthologous Medaka Genes
245 Orthologous Megabat Genes
246 Orthologous Microbat Genes
247 Orthologous Mouse Genes
248 Orthologous Mouse Lemur Genes
249 Orthologous Nile tilapia Genes
250 Orthologous Olive baboon Genes
251 Orthologous Opossum Genes
252 Orthologous Orangutan Genes
253 Orthologous Panda Genes
254 Orthologous Pig Genes
255 Orthologous Pika Genes
256 Orthologous Platyfish Genes
257 Orthologous Platypus Genes
258 Orthologous Rabbit Genes
259 Orthologous Rat Genes
260 Orthologous Rock Hyrax Genes
261 Orthologous Sheep Genes
262 Orthologous Sloth Genes
263 Orthologous Spotted gar Genes
264 Orthologous Squirrel Genes
265 Orthologous Stickleback Genes
266 Orthologous Tarsier Genes
267 Orthologous Tasmanian Devil Genes
268 Orthologous Tetraodon Genes
269 Orthologous Tree Shrew Genes
270 Orthologous Turkey Genes
271 Orthologous Vervet-AGM Genes
272 Orthologous Wallaby Genes
273 Orthologous Xenopus Genes
274 Orthologous Yeast Genes
275 Orthologous Zebra Finch Genes
276 Orthologous Zebrafish Genes
277 with coiled coil (ncoils)
278 with Gene3D ID(s)
279 with HAMAP ID(s)
280 with HMMPanther ID(s)
281 with InterPro ID(s)
282 with low complexity (SEG)
283 with Pfam ID(s)
284 with Pfscan ID(s)
285 with PIRSF ID(s)
286 with PRINTS ID(s)
287 with ScanProsite ID(s)
288 with SMART ID(s)
289 with SUPERFAMILY ID(s)
290 with TIGRFAM ID(s)
291 with signal peptide
292 with Transmembrane domain (tmhmm)
293 with Protein feature blastprodom ID(s)
294 TIGRfam ID(s) [e.g. TIGR00172]
295 Gene3D ID(s) [e.g. 1.20.210.10]
296 HAMAP Accession ID(s) [e.g. MF_01209]
297 Superfamily ID(s) [e.g. SSF47095]
298 SMART ID(s) [e.g. SM00398]
299 PIRSF ID(s) [e.g. PIRSF037653]
300 Ensembl Protein Family ID(s) [e.g. PTHR10000_SF7]
301 PFAM ID(s) [e.g. PF00046]
302 HMMPanther ID(s) [e.g. PTHR12369]
303 PRINTS ID(s) [e.g. PR00194]
304 PROFILE ID(s) [e.g. PS50855]
305 PROSITE ID(s) [e.g. PS00668]
306 Interpro ID(s) [e.g. IPR007087]
307 limit to genes with germline variant data sources
308 limit to genes with somatic variant sources
309 Variant supporting evidence
310 Parent term name
Here we select only genes which have been annotated by both havana and ensembl (so called Golden Transcripts)
bgrTrack <- BiomartGeneRegionTrack(genome="hg19",
start=26591341,
end=27034958,
chromosome = "chr7",
name="ENSEMBL",
filter=list(source="ensembl_havana"))
Once we have retrieved our filtered gene models we can plot them as before.
plotTracks(bgrTrack)
A well known browser and source of genomic data and annotation is the UCSC genome browser. Gviz can create track directly from UCSC tables using the functionality from rtracklayer Bioconductor package.
The Ucsctrack() constructor and object allow for the query and track construction of a variety of data types. The Ucsctrack() function therefore requires us to specify the track type we expect using the trackType parameter as well as the required UCSC table using the track parameter.
To understand which tables are available we can query the rtracktables package to identify track and table names.
library(rtracklayer)
session <- browserSession()
genome(session) <- "hg19"
trackNames(session)
Base Position Alt Haplotypes
"ruler" "altLocations"
Assembly BAC End Pairs
"gold" "bacEndPairs"
BU ORChID Chromosome Band
"wgEncodeBuOrchid" "cytoBand"
deCODE Recomb ENCODE Pilot
"decodeRmap" "encodeRegions"
FISH Clones Fosmid End Pairs
"fishClones" "fosEndPairs"
Gap GC Percent
"gap" "gc5Base"
GRC Incident GRC Map Contigs
"grcIncidentDb" "ctgPos2"
GRC Patch Release Hg18 Diff
"altSeqComposite10" "hg19ContigDiff"
Hg38 Diff Hi Seq Depth
"hg38ContigDiff" "hiSeqDepth"
INSDC LRG Regions
"ucscToINSDC" "lrg"
Map Contigs Mappability
"ctgPos" "wgEncodeMapability"
Recomb Rate Restr Enzymes
"recombRate" "cutters"
Short Match STS Markers
"oligoMatch" "stsMap"
UCSC Genes RefSeq Genes
"knownGene" "refGene"
AceView Genes Augustus
"acembly" "augustusGene"
CCDS Ensembl Genes
"ccdsGene" "ensGene"
EvoFold Exoniphy
"evofold" "exoniphy"
GENCODE... Geneid Genes
"wgEncodeGencodeSuper" "geneid"
Genscan Genes H-Inv 7.0
"genscan" "hinv70Composite"
IKMC Genes Mapped lincRNAs...
"hgIkmc" "lincRNAs"
LRG Transcripts MGC Genes
"lrgTranscriptAli" "mgcFullMrna"
N-SCAN Old UCSC Genes
"nscanGene" "knownGeneOld6"
ORFeome Clones Other RefSeq
"orfeomeMrna" "xenoRefGene"
Pfam in UCSC Gene Retroposed Genes
"ucscGenePfam" "ucscRetroAli5"
SGP Genes SIB Genes
"sgpGene" "sibGene"
sno/miRNA TransMap...
"wgRna" "transMap"
tRNA Genes UCSC Alt Events
"tRNAs" "knownAlt"
UniProt Vega Genes
"spUniprot" "vegaGeneComposite"
Yale Pseudo60 Publications
"pseudoYale60" "pubs"
ClinGen CNVs ClinVar Variants
"iscaComposite" "clinvar"
Coriell CNVs COSMIC
"coriellDelDup" "cosmic"
DECIPHER Development Delay
"decipher" "cnvDevDelay"
GAD View GeneReviews
"gad" "geneReviews"
GWAS Catalog HGMD Variants
"gwasCatalog" "hgmd"
Lens Patents LOVD Variants
"patSeq" "lovd"
MGI Mouse QTL OMIM AV SNPs
"jaxQtlMapped" "omimAvSnp"
OMIM Genes OMIM Pheno Loci
"omimGene2" "omimLocation"
RGD Human QTL RGD Rat QTL
"rgdQtl" "rgdRatQtl"
UniProt Variants Web Sequences
"spMut" "pubsBingBlat"
Human mRNAs Spliced ESTs
"mrna" "intronEst"
CGAP SAGE Gene Bounds
"cgapSage" "rnaCluster"
H-Inv Human ESTs
"HInvGeneMrna" "est"
Human RNA Editing Other ESTs
"darned" "xenoEst"
Other mRNAs Poly(A)
"xenoMrna" "polyA"
PolyA-Seq SIB Alt-Splicing
"polyASeqSites" "sibTxGraph"
UniGene GTEx
"uniGene_3" "gtexGene"
Affy Exon Array Affy GNF1H
"affyExonArray" "affyGnf1h"
Affy RNA Loc Affy U133
"wgEncodeAffyRnaChip" "affyU133"
Affy U133Plus2 Affy U95
"affyU133Plus2" "affyU95"
Allen Brain Burge RNA-seq
"allenBrainAli" "burgeRnaSeqGemMapperAlign"
CSHL Small RNA-seq ENC Exon Array...
"wgEncodeCshlShortRnaSeq" "wgEncodeExonArraySuper"
ENC ProtGeno... ENC RNA-seq...
"wgEncodeProtGenoSuper" "wgEncodeRnaSeqSuper"
GIS RNA PET GNF Atlas 2
"wgEncodeGisRnaPet" "gnfAtlas2"
GWIPS-viz Riboseq Illumina WG-6
"gwipsvizRiboseq" "illuminaProbes"
PeptideAtlas qPCR Primers
"peptideAtlas2014" "qPcrPrimers"
RIKEN CAGE Loc Sestan Brain
"wgEncodeRikenCage" "sestanBrainAtlas"
ENCODE Regulation... CD34 DnaseI
"wgEncodeReg" "eioJcviNAS"
CpG Islands... ENC Chromatin...
"cpgIslandSuper" "wgEncodeChromSuper"
ENC DNA Methyl... ENC DNase/FAIRE...
"wgEncodeDnaMethylSuper" "wgEncodeDNAseSuper"
ENC Histone... ENC RNA Binding...
"wgEncodeHistoneSuper" "wgEncodeRbpSuper"
ENC TF Binding... FSU Repli-chip
"wgEncodeTfBindingSuper" "wgEncodeFsuRepliChip"
Genome Segments NKI Nuc Lamina...
"wgEncodeAwgSegmentation" "laminB1Super"
ORegAnno Stanf Nucleosome
"oreganno" "wgEncodeSydhNsome"
SUNY SwitchGear SwitchGear TSS
"wgEncodeSunySwitchgear" "switchDbTss"
TFBS Conserved TS miRNA sites
"tfbsConsSites" "targetScanS"
UCSF Brain Methyl UMMS Brain Hist
"ucsfBrainMethyl" "uMassBrainHistone"
UW Repli-seq Vista Enhancers
"wgEncodeUwRepliSeq" "vistaEnhancers"
Conservation Cons 46-Way
"cons100way" "cons46way"
Cons Indels MmCf Evo Cpg
"consIndelsHgMmCanFam" "evoCpg"
GERP phastBias gBGC
"allHg19RS_BW" "phastBias"
Primate Chain/Net Placental Chain/Net
"primateChainNet" "placentalChainNet"
Vertebrate Chain/Net Gorilla Chain/Net
"vertebrateChainNet" "chainNetGorGor3"
5% Lowest S H-C Coding Diffs
"ntSssTop5p" "ntHumChimpCodingDiff"
Neandertal Methyl Neandertal Seq
"neandertalMethylation" "ntSeqReads"
S SNPs Sel Swp Scan (S)
"ntSssSnps" "ntSssZScorePMVar"
Denisova Methyl Denisova Seq
"denisovaMethylation" "dhcBamDenisova"
Denisova Variants Mod Hum Variants
"dhcVcfDenisovaPinky" "dhcVcfModern"
Modern Derived Common SNPs(146)
"dhcHumDerDenAnc" "snp146Common"
1000G Ph1 Accsbl 1000G Ph1 Vars
"tgpPhase1Accessibility" "tgpPhase1"
1000G Ph3 Accsbl 1000G Ph3 Vars
"tgpPhase3Accessibility" "tgpPhase3"
All SNPs(138) All SNPs(141)
"snp138" "snp141"
All SNPs(142) All SNPs(144)
"snp142" "snp144"
All SNPs(146) Common SNPs(138)
"snp146" "snp138Common"
Common SNPs(141) Common SNPs(142)
"snp141Common" "snp142Common"
Common SNPs(144) DGV Struct Var
"snp144Common" "dgvPlus"
EVS Variants ExAC
"evsEsp6500" "exac"
Flagged SNPs(138) Flagged SNPs(141)
"snp138Flagged" "snp141Flagged"
Flagged SNPs(142) Flagged SNPs(144)
"snp142Flagged" "snp144Flagged"
Flagged SNPs(146) Genome Variants
"snp146Flagged" "pgSnp"
GIS DNA PET HAIB Genotype
"wgEncodeGisDnaPet" "wgEncodeHaibGenotype"
HapMap SNPs HGDP Allele Freq
"hapmapSnps" "hgdpGeo"
Mult. SNPs(138) Mult. SNPs(142)
"snp138Mult" "snp142Mult"
Mult. SNPs(144) Mult. SNPs(146)
"snp144Mult" "snp146Mult"
NumtS Sequence SNP/CNV Arrays
"numtSeq" "genotypeArrays"
RepeatMasker Interrupted Rpts
"rmsk" "nestedRepeats"
Microsatellite Segmental Dups
"microsat" "genomicSuperDups"
Self Chain Simple Repeats
"chainSelf" "simpleRepeat"
query <- ucscTableQuery(session, "Ensembl Genes",
GRangesForUCSCGenome("hg19", "chr7",
IRanges(26591341,27034958)))
tableNames(query)
[1] "ensGene" "ccdsInfo" "ensGtp"
[4] "ensPep" "ensemblSource" "ensemblToGeneName"
[7] "knownToEnsembl"
query <- ucscTableQuery(session, "Ensembl Genes",
GRangesForUCSCGenome("hg19", "chr7",
IRanges(26591341,27034958)))
tableNames(query)
[1] "ensGene" "ccdsInfo" "ensGtp"
[4] "ensPep" "ensemblSource" "ensemblToGeneName"
[7] "knownToEnsembl"
ucscTrack <- UcscTrack(genome = "hg19",
chromosome = "chr7",
track = "ensGene",
from = 26591341,
to = 27034958,
trackType = "GeneRegionTrack",
rstarts = "exonStarts",
rends = "exonEnds",
gene ="name",
symbol = "name2",
transcript = "name",
strand = "strand"
)
To build the UCSC annotation as a GeneRegionTrack we must specify some information specific to GeneRegionTrack objects. This includes the "rstarts" and "rends". You can consult the help for GeneRegionTrack() (?GeneRegionTrack to see from in R) to see full parameters required for UcscTrack objects.
Now we can compare the Ensembl gene builds from the two different sources.
Notable differences in the annotation include the absense of some transcipts due to the additional filter applied in our BiomartGeneRegionTrack object creation.
plotTracks(c(bgrTrack,ucscTrack),
from = 26591341,to = 27034958)
By the same method we can take advantage of other types of UCSC data.
In this example we capture the Conservation in the phyloP100wayAll table over and around our previously investigated ChIP-seq reads peak.
Here we specify the data to be returned as a DataTrack object and the display type to be "hist". Here we are creating a DataTrack so would consult DataTrack() help (?DataTrack) to get full parameter list.
conservationTrack <- UcscTrack(genome = "hg19", chromosome = "chr5",track = "Conservation", table = "phyloP100wayAll",from = 135313003, to = 135313570, trackType = "DataTrack",start = "start", end = "end", data = "score",type = "hist", window = "auto", col.histogram = "darkblue",fill.histogram = "darkblue", ylim = c(-3.7, 4),name = "Conservation")
With the inclusion of conservation alongside the coverage from CTCF peaks we can see a spike in conservation around the CTCF peak summit. We include a relative scale and increase the size of text for completeness.
plotTracks(c(conservationTrack,peakReads,genomeAxis),
from=135313003,
to=135313570,
chromosome = "chr5",
type = c("hist","coverage"),
sizes = c(1,1,0.2),
cex=2)
Time for exercises! Link here
Time for solutions! Link here