Results of the GENCODE pipeline to produce a collection of full length high quality transcripts

guigolab/gencode-cls-master-table

Gencode - CLS Master Table

Here we specifically report on the results of the GENCODE pipeline to produce a collection of full length high quality transcripts.

Background

We employed CLS to target genomic regions with apparently weak, but potentially relevant, transcriptional activity. These include, among others, regions predicted to encode lncRNAs, enhancers, precursors of small RNAs, or RNAs with predicted structural motifs; regions that host non-coding GWAS hits; and regions that show evolutionary characteristics of protein-coding genes or are evolutionarily conserved. Probes were designed on the human genome (assembly hg38) using GENCODE v27 as the reference annotation, and on the mouse genome (assembly mm10) using GENCODE vM16. RNA was then captured in multiple matched adult and embryonic tissues in both organisms. All long-read RNA-seq data were processed with LyRic, using short-read RNA-seq data to support the transcript models derived from long reads.

Data

Tables

Supplementary Tables

Enhanced Annotation

The enhanced annotation has been obtained by adding to specific Gencode genesets the spliced, intergenic, non-artefactual CLS models.

  • Human v43
  • Mouse [vM16](coming soon)

Attributes specifics

Each feature in the table is associated with gene_id and transcript_id attributes, which hold the unique identifiers generated by LyRic. Transcript features additionally carry the following attributes:

| # | Attribute | Description |
|---|---|---|
| 1 | target | Comma-separated list of the genomic regions targeted by the pipeline. For each target we report, in order, the source database, the identifier, the chromosome, the start and end coordinates, and the strand. |
| 2 | endSupport | One of polyAOnlySupported, cageOnlySupported, cagePolyASupported, noCageNoPolyASupported, indicating the type of support available for the transcript model. |
| 3 | spliced | Either spliced or unspliced, indicating whether the transcript model is composed of multiple exons or is unspliced. |
| 4 | refCompare | Result of gffcompare against Gencode annotation v27 for human and vM16 for mouse. The original codes have been further collapsed into the following categories: Antisense (corresponding gffcompare codes s, x), Equal (=), Extends (k), Included (c), Intergenic (u), Intronic (i), revIntronic (y), Overlaps (j, e, m, o, n), runOn (p). |
| 5 | currentCompare | Result of gffcompare against Gencode annotation v43 for human and vM31 for mouse, same categories as above. |
| 6 | sampleN | Integer number of transcripts, across all samples, merged into the transcript model. |
| 7 | samplesMetadata | List of mnemonic sample IDs in which the collapsed transcripts were found. See Samples Metadata. |
| 8 | expression | Decimal values giving the expression level of the transcript in RPM. The order of the values matches the order of the samples listed in samplesMetadata. |
| 9 | artifact | List of tags reflecting whether the model is deemed artefactual and why. See Artefacts Tags. |

Samples Metadata

The sample IDs are constructed to encode as much metadata as possible. Each name is composed as follows:

| Position | Field | Values |
|---|---|---|
| 0 | Fixed prefix | SID |
| 1 | Single letter code for the organism | H (human), M (mouse) |
| 2 | Two letter code indicating the tissue | See Tissue codes |
| 3 | Single letter code for the stage | A (adult), E (embryo), P (placenta) |
| 4 | Single letter code for the sequencing technology | O (ont), P (PacBio) |
| 5 | Single letter code for capture status | P (pre-capture), C (post-capture) |
| 6 | Two digit code for the biological replicate | - |
| 7 | Two digit code for the technical replicate | - |

Artefacts Tags

| # | Tag | Description |
|---|---|---|
| 0 | no | Genuine model. |
| 1 | oppStrandMismap | Opposite-strand mismapping. Models mapping on the opposite strand of annotated coding loci. |
| 2 | polyASJdisag | PolyA - splice junction disagreement. Highlights possible problems during the mapping steps, where the polyA and the splice junctions provide contradictory information about the read strand. |
| 3 | pseudogeneOverlap | Models contained within an annotated pseudogene locus. These are supposedly generated from the parent gene but wrongly assigned due to the polyA stretch at the 3' end. |
| 4 | recountSlt50 | Models in which at least one splice junction does not meet the minimum threshold of recount support (50). |
| 5 | spliceSiteMisalign | Splice site misalignments. Highlights uncertainty in splice junction placement at mapping. |
| 6 | tRepeatOverlap | Tandem repeat overlapping. |
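The tags above can be used to separate genuine models from artefactual ones. The sketch below assumes that the artifact attribute holds a comma-separated list of tags and that a genuine model carries the single tag `no`; the separator is an assumption, so check it against the actual GTF. The function names are hypothetical.

```python
def split_artefact_tags(value):
    """Split an artifact attribute value into individual tags.

    Assumes the tags are comma-separated (an assumption, not confirmed
    by the documentation above).
    """
    return [tag.strip() for tag in value.split(",") if tag.strip()]


def is_genuine(value):
    """Return True only when the model carries the single tag 'no'."""
    return split_artefact_tags(value) == ["no"]
```

Under these assumptions, `is_genuine("no")` is true, while a value such as `"recountSlt50,tRepeatOverlap"` marks the model as artefactual.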

Tissue Codes

| Tissue | Code |
|---|---|
| Brain | Br |
| Heart | He |
| Liver | Li |
| WBCs | Wb |
| ESC | Wb |
| iPSC | Wb |
| Testis | Te |
| Placenta | Pl |
| Tpool | Tp |
| Cpool | Cp |

Some examples of aliases and their meaning

  • SIDMBrEPP0101: mouse brain embryo PacBio pre-capture biologicalReplicate01 technicalReplicate01
  • SIDHWbAOC0103: human whiteBlood adult ont post-capture biologicalReplicate01 technicalReplicate03
  • SIDHPlPPP0202: human placenta placenta PacBio pre-capture biologicalReplicate02 technicalReplicate02
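The naming scheme can be decoded mechanically. The following is a minimal sketch based only on the composition rules and examples above; `decode_sample_id` is a hypothetical helper, not part of the repository. The tissue is returned as its raw two-letter code, since several tissues share a code in the Tissue Codes table.

```python
import re

# One named group per field of the sample ID, in the documented order.
SAMPLE_ID_RE = re.compile(
    r"^SID"
    r"(?P<organism>[HM])"
    r"(?P<tissue>[A-Za-z]{2})"
    r"(?P<stage>[AEP])"
    r"(?P<technology>[OP])"
    r"(?P<capture>[PC])"
    r"(?P<bio_rep>\d{2})"
    r"(?P<tech_rep>\d{2})$"
)

ORGANISM = {"H": "human", "M": "mouse"}
STAGE = {"A": "adult", "E": "embryo", "P": "placenta"}
TECHNOLOGY = {"O": "ont", "P": "PacBio"}
CAPTURE = {"P": "pre-capture", "C": "post-capture"}


def decode_sample_id(sample_id):
    """Decode a sample ID such as 'SIDMBrEPP0101' into its metadata fields."""
    match = SAMPLE_ID_RE.match(sample_id)
    if match is None:
        raise ValueError(f"not a valid sample ID: {sample_id!r}")
    return {
        "organism": ORGANISM[match["organism"]],
        "tissue_code": match["tissue"],  # kept as the raw two-letter code
        "stage": STAGE[match["stage"]],
        "technology": TECHNOLOGY[match["technology"]],
        "capture": CAPTURE[match["capture"]],
        "biological_replicate": int(match["bio_rep"]),
        "technical_replicate": int(match["tech_rep"]),
    }
```

For example, `decode_sample_id("SIDHWbAOC0103")` yields a human, adult, ont, post-capture sample with tissue code `Wb`, biological replicate 1 and technical replicate 3.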

Quickstart

The following script is readily available to extract the tags from the GTF file.
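For readers who prefer to parse the file directly, the sketch below shows one way to read the attributes of transcript features. It is not the repository's script. It assumes the standard tab-separated GTF layout, with the feature type in the third column and `key "value";` pairs in the ninth; the function names are hypothetical, and a repeated key keeps only its last value.

```python
import re

# Matches `key "value"` and also unquoted `key value` pairs.
ATTRIBUTE_RE = re.compile(r'(\w+)\s+(?:"([^"]*)"|([^\s;"]+))')


def parse_attributes(attribute_field):
    """Turn the ninth GTF column into a dict of attribute name -> string value."""
    attributes = {}
    for key, quoted, bare in ATTRIBUTE_RE.findall(attribute_field):
        # Exactly one of `quoted` / `bare` is filled in for each match.
        attributes[key] = quoted or bare
    return attributes


def iter_transcript_attributes(lines):
    """Yield the attribute dict of every transcript feature in GTF lines."""
    for line in lines:
        # Skip blank lines and comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        yield parse_attributes(fields[8])
```

An open file handle can be passed as `lines`, for instance `for attrs in iter_transcript_attributes(open(path))`, where `path` points to the downloaded GTF. Attributes such as sampleN or expression come back as strings and need converting by the caller.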
