Gencode - CLS Master Table

Here we report the results of the GENCODE pipeline used to produce a collection of full-length, high-quality transcripts.

Table of Contents

Background
Data
Attributes specifics
Quickstart

Background

We employed CLS to target genomic regions with apparently weak, but potentially relevant, transcriptional activity. These include, among others, regions predicted to encode lncRNAs, enhancers, precursors of small RNAs, or RNAs predicted to contain structural motifs, regions that host non-coding GWAS hits, and regions that show evolutionary characteristics of protein-coding genes or are evolutionarily conserved. Probes were designed on the human genome (assembly version hg38) using GENCODE v27 as the reference annotation, and on the mouse genome (assembly version mm10) using GENCODE vM16. RNA was then captured in multiple matched adult and embryonic tissues in both organisms. All long-read RNA-seq data were processed with LyRic, using short-read RNA-seq data to support the transcript models derived from long reads.

Data

Tables

Supplementary Tables

Human
- Target design
- Transcript Models Sequences
Mouse
- Target design
- Transcript Models Sequences

Enhanced Annotation

The enhanced annotation was obtained by adding the spliced, intergenic, non-artefactual CLS models to specific GENCODE gene sets.

Human v43
Mouse vM16 (coming soon)

Attributes specifics

Each feature in the table carries a gene_id and a transcript_id attribute, holding the unique identifiers generated by LyRic. Transcript features additionally carry the following attributes:


| # | Attribute | Description |
|---|---|---|
| 1 | target | Comma-separated list of the genomic regions targeted by the pipeline. For each target we report, in order, the source database, the identifier, the chromosome, the start and end coordinates, and the strand. |
| 2 | endSupport | One of polyAOnlySupported, cageOnlySupported, cagePolyASupported, or noCageNoPolyASupported, indicating the type of end support available for the transcript model. |
| 3 | spliced | Either spliced or unspliced, indicating whether the transcript model is composed of multiple exons or remains unspliced. |
| 4 | refCompare | Result of gffcompare against GENCODE annotation v27 for human and vM16 for mouse. The original class codes are collapsed into the following categories: Antisense (gffcompare codes s, x), Equal (=), Extends (k), Included (c), Intergenic (u), Intronic (i), revIntronic (y), Overlaps (j, e, m, o, n), runOn (p). |
| 5 | currentCompare | Result of gffcompare against GENCODE annotation v43 for human and vM31 for mouse, using the same categories as refCompare. |
| 6 | sampleN | Integer giving the number of transcripts, across all samples, merged into the transcript model. |
| 7 | samplesMetadata | List of mnemonic sample IDs in which the collapsed transcripts were found. See Samples Metadata. |
| 8 | expression | Decimal values giving the expression level of the transcript in RPM. The values are listed in the same order as the samples in samplesMetadata. |
| 9 | artifact | List of tags indicating whether the model is deemed artefactual and why. See Artefacts Tags. |
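
The collapsing of gffcompare class codes described for refCompare and currentCompare amounts to a simple lookup. The Python sketch below is illustrative only: the names `CATEGORY_CODES` and `collapse_class_code` are hypothetical, and the mapping is transcribed from the attribute description above rather than taken from the pipeline's own code.

```python
# Collapsed categories and the gffcompare class codes they group,
# transcribed from the refCompare description in this README.
CATEGORY_CODES = {
    "Antisense": ["s", "x"],
    "Equal": ["="],
    "Extends": ["k"],
    "Included": ["c"],
    "Intergenic": ["u"],
    "Intronic": ["i"],
    "revIntronic": ["y"],
    "Overlaps": ["j", "e", "m", "o", "n"],
    "runOn": ["p"],
}

# Invert the table into a class code -> category lookup.
CODE_TO_CATEGORY = {
    code: category
    for category, codes in CATEGORY_CODES.items()
    for code in codes
}


def collapse_class_code(code: str) -> str:
    """Return the collapsed category for a gffcompare class code (hypothetical helper)."""
    if code not in CODE_TO_CATEGORY:
        raise ValueError(f"class code {code!r} is not listed in the table")
    return CODE_TO_CATEGORY[code]


print(collapse_class_code("j"))  # Overlaps
print(collapse_class_code("u"))  # Intergenic
```

Codes that the table does not list raise an error instead of being assigned to a guessed category.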

Samples Metadata

Sample IDs are built to keep track of as much metadata as possible. Each ID is composed of the following fields, in order:


| Position | Field | Values |
|---|---|---|
| 0 | Fixed prefix | `SID` |
| 1 | Single-letter code for the organism | `H` (human), `M` (mouse) |
| 2 | Two-letter code indicating the tissue | See Tissue Codes |
| 3 | Single-letter code for the stage | `A` (adult), `E` (embryo), `P` (placenta) |
| 4 | Single-letter code for the sequencing technology | `O` (ont), `P` (PacBio) |
| 5 | Single-letter code for capture status | `P` (pre-capture), `C` (post-capture) |
| 6 | Two-digit code for the biological replicate | - |
| 7 | Two-digit code for the technical replicate | - |

Artefacts Tags


| # | Tag | Description |
|---|---|---|
| 0 | no | Genuine model. |
| 1 | oppStrandMismap | Opposite-strand mismapping. Models mapping on the opposite strand of annotated coding loci. |
| 2 | polyASJdisag | PolyA / splice junction disagreement. Flags possible problems at the mapping step, where the polyA tail and the splice junctions give contradictory information about the read strand. |
| 3 | pseudogeneOverlap | Models contained within an annotated pseudogene locus. These are presumably generated from the parent gene but wrongly assigned because of the polyA stretch at the 3' end. |
| 4 | recountSlt50 | Models in which at least one splice junction does not meet the minimum threshold of recount support (50). |
| 5 | spliceSiteMisalign | Splice site misalignment. Flags uncertainty in splice junction placement at the mapping step. |
| 6 | tRepeatOverlap | Tandem repeat overlap. |
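
As an illustration of how these tags combine with the other attributes, the Python sketch below selects the spliced, intergenic, non-artefactual models mentioned under Enhanced Annotation. It rests on assumptions that this README does not confirm: attributes are available as a plain dictionary of strings, the artifact value is a comma-separated list, and the intergenic status is read from currentCompare (the `compare_key` parameter lets you use refCompare instead). The function name `is_enhanced_candidate` is hypothetical.

```python
def is_enhanced_candidate(attrs: dict, compare_key: str = "currentCompare") -> bool:
    """Return True for spliced, intergenic models whose only artefact tag is "no".

    ASSUMPTIONS: `attrs` maps attribute names to string values, and the
    artifact attribute is a comma-separated list of tags.
    """
    artefact_tags = [tag for tag in attrs.get("artifact", "").split(",") if tag]
    return (
        attrs.get("spliced") == "spliced"
        and attrs.get(compare_key) == "Intergenic"
        and artefact_tags == ["no"]
    )


# Made-up attribute dictionaries, for illustration only.
genuine = {"spliced": "spliced", "currentCompare": "Intergenic", "artifact": "no"}
flagged = {"spliced": "spliced", "currentCompare": "Intergenic",
           "artifact": "tRepeatOverlap,recountSlt50"}

print(is_enhanced_candidate(genuine))  # True
print(is_enhanced_candidate(flagged))  # False
```

A model with a missing artifact attribute is rejected here, which is the conservative choice when the tag cannot be checked.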

Tissue Codes


| Tissue | Code |
|---|---|
| Brain | `Br` |
| Heart | `He` |
| Liver | `Li` |
| WBCs | `Wb` |
| ESC | `Wb` |
| iPSC | `Wb` |
| Testis | `Te` |
| Placenta | `Pl` |
| Tpool | `Tp` |
| Cpool | `Cp` |

Some examples of sample IDs and their meaning:

SIDMBrEPP0101: mouse brain embryo PacBio pre-capture biologicalReplicate01 technicalReplicate01
SIDHWbAOC0103: human whiteBlood adult ont post-capture biologicalReplicate01 technicalReplicate03
SIDHPlPPP0202: human placenta placenta PacBio pre-capture biologicalReplicate02 technicalReplicate02
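
The naming scheme can also be decoded programmatically. The Python sketch below splits an ID into the fields listed under Samples Metadata. The names `SAMPLE_ID` and `decode_sample_id` are hypothetical, and the tissue is returned as its raw two-letter code instead of being translated.

```python
import re

ORGANISM = {"H": "human", "M": "mouse"}
STAGE = {"A": "adult", "E": "embryo", "P": "placenta"}
TECHNOLOGY = {"O": "ont", "P": "PacBio"}
CAPTURE = {"P": "pre-capture", "C": "post-capture"}

# SID + organism (1) + tissue (2) + stage (1) + technology (1)
# + capture status (1) + biological replicate (2) + technical replicate (2)
SAMPLE_ID = re.compile(
    r"SID(?P<organism>[HM])(?P<tissue>[A-Z][a-z])(?P<stage>[AEP])"
    r"(?P<technology>[OP])(?P<capture>[PC])(?P<bio_rep>\d{2})(?P<tech_rep>\d{2})"
)


def decode_sample_id(sample_id: str) -> dict:
    """Split a sample ID into its metadata fields (hypothetical helper)."""
    match = SAMPLE_ID.fullmatch(sample_id)
    if match is None:
        raise ValueError(f"{sample_id!r} does not follow the naming scheme")
    return {
        "organism": ORGANISM[match["organism"]],
        "tissue_code": match["tissue"],
        "stage": STAGE[match["stage"]],
        "technology": TECHNOLOGY[match["technology"]],
        "capture": CAPTURE[match["capture"]],
        "biological_replicate": int(match["bio_rep"]),
        "technical_replicate": int(match["tech_rep"]),
    }


print(decode_sample_id("SIDHWbAOC0103"))
```

For `SIDHWbAOC0103` this yields human, tissue code `Wb`, adult, ont, post-capture, biological replicate 1, technical replicate 3, matching the second example above.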

Quickstart

The following script is readily available to extract the tags from the GTF file.
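
For readers who prefer to parse the file themselves, the Python sketch below shows one way the tags could be read from the attribute column (the ninth column) of a GTF line. It is not the repository's script. The function names are hypothetical, the example line is made up, and the code assumes that samplesMetadata and expression are stored as comma-separated lists of equal length.

```python
import re

# One `key "value";` pair; the value may also be an unquoted token such as a number.
ATTRIBUTE = re.compile(r'(\w+)\s+(?:"([^"]*)"|([^\s";]+))\s*;')


def parse_attributes(column: str) -> dict:
    """Turn a GTF attribute column into a {key: value} dictionary of strings."""
    return {key: quoted or bare for key, quoted, bare in ATTRIBUTE.findall(column)}


def expression_by_sample(attrs: dict) -> dict:
    """Pair each sample ID with its RPM value (ASSUMES comma-separated lists)."""
    samples = attrs["samplesMetadata"].split(",")
    values = [float(value) for value in attrs["expression"].split(",")]
    if len(samples) != len(values):
        raise ValueError("samplesMetadata and expression differ in length")
    return dict(zip(samples, values))


# Made-up attribute column, for illustration only.
example = (
    'gene_id "G1"; transcript_id "T1"; spliced "spliced"; sampleN 2; '
    'samplesMetadata "SIDHBrAOC0101,SIDHHeAOC0101"; expression "1.5,0.25";'
)
attrs = parse_attributes(example)
print(attrs["transcript_id"])       # T1
print(expression_by_sample(attrs))  # {'SIDHBrAOC0101': 1.5, 'SIDHHeAOC0101': 0.25}
```

If the real file uses a different list separator, only the two `split(",")` calls need to change.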

guigolab/gencode-cls-master-table