Skip to content

Latest commit

 

History

History
142 lines (121 loc) · 5.12 KB

HowToMakeGencodeTxDbPkg.md

File metadata and controls

142 lines (121 loc) · 5.12 KB

How to Make a TxDb for Genome Build hg38

The aim of this document is to show how to make Bioconductor Transcripts database (TxDb) and package for Gencode track. Generally there are many ways to do it, and here I like to show how to extrack the current Gencode track from UCSC genome browser, which might be tricky, especially when Bionconductor

Build Gencode TxDb From UCSC Genome Brower

Here, I would like to show how to build a hg 38 TxDb annotation package from Gencode v24 track available on UCSC Genome Browser. Let's start with loading two major libraries and open a browser session on R workspace.

library(rtracklayer)
library(GenomicFeatures)
session <- browserSession()
genome(session) <- "hg38"

To check the track names and table names of Gencode on UCSC Genome Browser, you can do

rtracklayer::trackNames(session)

Currently on on UCSC Genome Brower (9/26/2016), the track name for Gencode is Gencdoe v24 and the table name is knownGene. If you have tried

txdb <- makeTxDbFromUCSC(genome="hg38", tablename="knownGene")

and if it works, please ignore the rest of this document.

If it doesn't work, that means there are some errors inhibit in the GenomicFeatures pakcage. You will need to take a detour and can use the following code as an example.

tablename <- "knownGene"
track <- "GENCODE v24"
ucsc_txtable <-
    getTable(ucscTableQuery(session, track=track, table=tablename))
mapdef <- GenomicFeatures:::.howToGetTxName2GeneIdMapping(tablename)
txname2geneid <- GenomicFeatures:::.fetchTxName2GeneIdMappingFromUCSC(session, 
           track, tablename, mapdef)
transcript_ids = NULL
txdb <- GenomicFeatures:::.makeTxDbFromUCSCTxTable(ucsc_txtable,
        txname2geneid$genes, 
        genome="hg38", tablename, track, txname2geneid$gene_id_type, 
        full_dataset = is.null(transcript_ids), circ_seqs = DEFAULT_CIRC_SEQS,
        taxonomyId = NA, miRBaseBuild = NA)

The TxDb instance looks like

txdb
## TxDb object:
## # Db type: TxDb
## # Supporting package: GenomicFeatures
## # Data source: UCSC
## # Genome: hg38
## # Organism: Homo sapiens
## # Taxonomy ID: 9606
## # UCSC Table: knownGene
## # UCSC Track: GENCODE v24
## # Resource URL: http://genome.ucsc.edu/
## # Type of Gene ID: Entrez Gene ID
## # Full dataset: yes
## # miRBase build ID: NA
## # transcript_nrow: 197782
## # exon_nrow: 581036
## # cds_nrow: 293052
## # Db created by: GenomicFeatures package from Bioconductor
## # Creation time: 2016-09-28 12:09:44 -0700 (Wed, 28 Sep 2016)
## # GenomicFeatures version at creation time: 1.24.5
## # RSQLite version at creation time: 1.0.0
## # DBSCHEMAVERSION: 1.1

Once you have a TxDb instance, you can build a TxDb package. Below is an example:

makeTxDbPackage(txdb, version = "1.24.0",
                maintainer = "Chao-Jen Wong <cwon2@fredhutch.org>",
                author = "Chao-Jen Wong", destDir = "~/tapscott/hg38")

Ensembl From BioMart

Want to use the most recent annotation on Ensembl? See example below:

library(biomaRt)
library(GenomicFeatures)
listMarts(host="www.ensembl.org")
datasets <- listDatasets(biomaRt::useMart(
    biomart="ENSEMBL_MART_ENSEMBL", host="www.ensembl.org"))
ensembl.txdb <- makeTxDbFromBiomart(biomart="ENSEMBL_MART_ENSEMBL",
                                    dataset="hsapiens_gene_ensembl",
                                    host="www.ensembl.org")
makeTxDbPackage(ensembl.txdb, version="1.85.0", author="Chao-Jen Wong",
                maintainer="Chao-Jen Wong <cwon2@fredhutch.org>")

Session Information

sessionInfo()
## R Under development (unstable) (2016-09-25 r71358)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.3 LTS
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] GenomicFeatures_1.24.5 AnnotationDbi_1.34.4   Biobase_2.32.0        
##  [4] rtracklayer_1.32.2     GenomicRanges_1.24.2   GenomeInfoDb_1.8.7    
##  [7] IRanges_2.6.1          S4Vectors_0.10.3       BiocGenerics_0.18.0   
## [10] knitr_1.14             BiocInstaller_1.22.3  
## 
## loaded via a namespace (and not attached):
##  [1] XVector_0.12.1             magrittr_1.5              
##  [3] zlibbioc_1.18.0            GenomicAlignments_1.8.4   
##  [5] BiocParallel_1.6.6         stringr_1.1.0             
##  [7] tools_3.4.0                SummarizedExperiment_1.2.3
##  [9] DBI_0.5-1                  formatR_1.4               
## [11] bitops_1.0-6               biomaRt_2.28.0            
## [13] RCurl_1.95-4.8             RSQLite_1.0.0             
## [15] evaluate_0.9               stringi_1.1.1             
## [17] Biostrings_2.40.2          Rsamtools_1.24.0          
## [19] XML_3.98-1.4