Generate a metric oddities file based on the mikado mikado.loci.metrics.tsv, mikado.subloci.metrics.tsv and mikado.monoloci.metrics.tsv #28

swarbred · 2020-06-06T10:38:53Z

Mikado pick generates a metrics file for the final models (mikado.loci.metrics.tsv) and for the input models (mikado.subloci.metrics.tsv and mikado.monoloci.metrics.tsv). The mono and sunbloci files describe spliced and single exon models respectively. NOTE we are currently not generating the monosubloci file so this option just needs to be added to the pick command that is run.

It would be useful to create a file that gives counts of transcripts with metrics that might suggest a problematic or incorrectly annotated gene model, i.e. models that are biologically unusual or lack evidence support for junctions (based on the portcullis results).

Below are the metrics which I feel would be useful to extract and summarise

Oddities (derived from the mikado monoloci, subloci and loci metrics files): provide a count of transcripts with

five_utr_length >=10000
five_utr_num >=5
three_utr_length >=10000
three_utr_num >=4
is_complete = False
has_start_codon = False
has_stop_codon = False
max_exon_length >=10000
max_intron_length >=500000
min_exon_length <=5
min_intron_length <=5
selected_cds_fraction <=0.3
canonical_intron_proportion != 1
non_verified_introns_num >=1
only_non_canonical_splicing = False
proportion_verified_introns <=0.5
suspicious_splicing = True
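A few of the thresholds above could be applied to a metrics table roughly as follows. This is only a minimal sketch, not an existing implementation: it assumes the metrics file is tab-separated with one row per transcript, that the column names match the metric names in the list, and that booleans are written as `True`/`False`. The names `ODDITIES`, `parse_value` and `count_oddities` are hypothetical, and the example rows are made up.

```python
import csv
import io
import operator

# Hypothetical subset of the thresholds listed above: metric -> (comparison, value).
ODDITIES = {
    "five_utr_length": (operator.ge, 10000),
    "max_intron_length": (operator.ge, 500000),
    "min_exon_length": (operator.le, 5),
    "canonical_intron_proportion": (operator.ne, 1.0),
    "has_start_codon": (operator.eq, False),
    "suspicious_splicing": (operator.eq, True),
}


def parse_value(text):
    """Convert a TSV cell to bool or float (assumes 'True'/'False' for booleans)."""
    if text in ("True", "False"):
        return text == "True"
    return float(text)


def count_oddities(handle):
    """Count, per metric, the transcripts whose value meets the oddity condition."""
    counts = dict.fromkeys(ODDITIES, 0)
    for row in csv.DictReader(handle, delimiter="\t"):
        for metric, (compare, threshold) in ODDITIES.items():
            if compare(parse_value(row[metric]), threshold):
                counts[metric] += 1
    return counts


# Tiny made-up table standing in for a real metrics file.
example = io.StringIO(
    "tid\tfive_utr_length\tmax_intron_length\tmin_exon_length\t"
    "canonical_intron_proportion\thas_start_codon\tsuspicious_splicing\n"
    "t1\t12000\t300\t50\t1.0\tTrue\tFalse\n"
    "t2\t150\t600000\t3\t0.5\tFalse\tTrue\n"
)
counts = count_oddities(example)
```

Keeping the thresholds in one table means they can be adjusted per species without touching the counting logic.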

This would be for the final set of models using mikado.loci.metrics.tsv (note this file will contain some models we have excluded from the final gene set through the classification, so we should exclude those models when determining these counts).

From mikado.subloci.metrics.tsv and mikado.monoloci.metrics.tsv we would need to break this down based on label (similar to the busco results) so that we can generate a table with the above metrics as rows and the different gene sets (i.e. final and input gene sets) as columns.

The values chosen for the metrics should highlight potential issues, though there will be genuine exceptions. Intron size is variable between species: for example, there will be genuine mammalian introns over 500000 bp, but you would still expect these to be small in number, and for many species these will be artefacts.

cschu · 2020-06-09T11:30:47Z

> The mono and sunbloci files

I like the Freudian slip here. Longing for sunny vacation? :D

cschu · 2020-06-09T11:43:56Z

@swarbred
only_non_canonical_splicing = False, shouldn't this be True? (since we're reporting on oddities?)

swarbred · 2020-06-09T11:53:12Z

> shouldn't this be True? (since we're reporting on oddities?)

Correct

cschu · 2020-06-09T16:47:46Z

Will the mono/subloci also be filtered by presence in the release set?

cschu · 2020-06-09T16:59:22Z

Feature is available in minos-1.6, three tables will be produced in the results folder mikado:{loci,monoloci,subloci}.metrics_oddities.tsv (082d711)

swarbred · 2020-06-09T17:13:15Z

> Will the mono/subloci also be filtered by presence in the release set?

No, this should be all models, irrespective of whether they were picked and then classified. These are the stats of the input models to pick, so all being well (and for some of the metrics this is bound to be the case) they should look "better" in the final selected gene set.

swarbred · 2020-06-09T17:29:09Z

> Feature is available in minos-1.6, three tables will be produced in the results folder mikado:{loci,monoloci,subloci}.metrics_oddities.tsv

I'm going to be a pain, as what we really want here is to have this as one file so that the gene sets can be easily compared.

The mono and sub loci genes can be brought together (these are just the spliced and single exon subsets of the input models)

i.e. based on this run /ei/workarea/group-pb/CB-PPBFX-811_Annotation_of_Lathyrus_sativus/Analysis/gmc/mikado-2.0rc6_d094f99_CBG/GMC-1.3_run1/results

Col1 (Metric) Col2 (LATSA3860_EIv1), Col3 (LATSA3860_run1_wRNA), Col4 (LATSA3860_run2_woRNA), Col5 (LATSA3860_run3_woRNA), Col6 (mikado_transcript_run), Col7 (mikado_protein_run)

Col 1: the metrics
Col 2: derived from loci, with the additional filtering, so that this gives the counts for the final set of selected models
Col 3-7: the labels for the input models; for each label you can combine the counts from the mono and subloci, so we have the total for each input set.
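The column layout described above could be assembled along these lines. This is a sketch under assumptions: the per-label counts for monoloci and subloci are taken as already computed, and `combine_counts`, the nested dictionaries and all numbers are hypothetical illustrations rather than output from a real run.

```python
import csv
import io


def combine_counts(final_counts, mono_counts, sub_counts, metrics):
    """Build rows: metric, final-set count, then mono+sub count for each input label."""
    labels = sorted(set(mono_counts) | set(sub_counts))
    header = ["metric", "final_set"] + labels
    rows = []
    for metric in metrics:
        row = [metric, final_counts.get(metric, 0)]
        for label in labels:
            mono = mono_counts.get(label, {}).get(metric, 0)
            sub = sub_counts.get(label, {}).get(metric, 0)
            row.append(mono + sub)  # total for this input set
        rows.append(row)
    return header, rows


# Made-up counts: {label: {metric: count}} for the inputs, {metric: count} for the final set.
metrics = ["is_complete", "max_exon_length"]
final_counts = {"is_complete": 2}
mono_counts = {"mikado_protein_run": {"is_complete": 1}}
sub_counts = {
    "mikado_protein_run": {"is_complete": 4, "max_exon_length": 3},
    "mikado_transcript_run": {"max_exon_length": 7},
}

header, rows = combine_counts(final_counts, mono_counts, sub_counts, metrics)

# Write the combined table as a single TSV so the gene sets sit side by side.
out = io.StringIO()
writer = csv.writer(out, delimiter="\t", lineterminator="\n")
writer.writerow(header)
writer.writerows(rows)
table = out.getvalue()
```

Missing metric/label combinations default to 0, so every row has the same number of columns.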

swarbred · 2020-06-14T08:23:44Z

@cschu
The 1.6 run completed, I've done some spot / sanity checks and the output looks correct.

For the oddities file I think we need to break this down further (for the final Minos models) so that we provide numbers for the high confidence genes as well as the full set of final Minos models. Adding this will help us when we write up the koala use case, as the most appropriate comparison will be between the high confidence Minos models and the other gene sets (currently these numbers look worse for Minos as we have a large number of low confidence models that derive from the transcript assemblies).

We have 4 distinct biotypes and two confidence classifications. We don't need to break this down into each combination; what most people are interested in are the full set of models (i.e. what we have now) and just the high confidence protein coding (biotype).

Could we add an additional column giving the counts for "protein_coding_gene High" ?

metric | final_set | final_set (protein_coding_gene, High) |
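The two requested columns could be derived from the same set of flagged transcripts, as in the sketch below. It assumes a lookup from transcript ID to (biotype, confidence) is available for the final models only; `count_flagged`, `HI_CONF`, the transcript IDs and the `other_biotype` value are hypothetical placeholders.

```python
# The subset requested above: protein coding biotype with High confidence.
HI_CONF = ("protein_coding_gene", "High")


def count_flagged(flagged_tids, classification):
    """Return (count over all final models, count over the high-confidence subset)."""
    # Transcripts absent from the classification are not in the final set and are skipped.
    final = [tid for tid in flagged_tids if tid in classification]
    hi_conf = [tid for tid in final if classification[tid] == HI_CONF]
    return len(final), len(hi_conf)


# Made-up classification of final models: tid -> (biotype, confidence).
classification = {
    "t1": ("protein_coding_gene", "High"),
    "t2": ("protein_coding_gene", "Low"),
    "t3": ("other_biotype", "High"),
}
flagged = ["t1", "t2", "t3", "t4"]  # t4 was excluded from the final gene set

final_set, hi_conf = count_flagged(flagged, classification)
```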

swarbred · 2020-06-15T21:28:42Z

@cschu

> > only_non_canonical_splicing = False, shouldn't this be True? (since we're reporting on oddities?)
>
> Correct

It looks like this hasn't been changed and is still counting the number of transcripts marked as False rather than True. In the oddities.tsv it's labelled as "not only_non_canonical_splicing"; it should be the count of True and labelled as only_non_canonical_splicing.

cschu · 2020-06-15T21:50:21Z

Argh. This can be controlled in the config.yaml, just remove the not from "not {only_non_canonical_splicing}" in the report_metric_oddities. I will change that in the default config.yaml.
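To illustrate why the `not` flips the count: the config string suggests that metric names in braces are replaced with each transcript's value before the expression is evaluated. The sketch below assumes exactly that and is not taken from the minos code; `evaluate` and the example rows are hypothetical.

```python
def evaluate(expression, row):
    """Substitute metric values into the expression and evaluate it as a boolean."""
    filled = expression.format(**row)
    # eval is tolerable here only because the expressions come from a trusted config file.
    return bool(eval(filled, {"__builtins__": {}}, {}))


# Made-up transcripts: one uses only non-canonical splicing, two do not.
rows = [
    {"only_non_canonical_splicing": True},
    {"only_non_canonical_splicing": False},
    {"only_non_canonical_splicing": False},
]

# With "not", the ordinary transcripts are counted; without it, the odd one is.
with_not = sum(evaluate("not {only_non_canonical_splicing}", r) for r in rows)
without_not = sum(evaluate("{only_non_canonical_splicing}", r) for r in rows)
```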

cschu · 2020-06-15T22:00:20Z

Addressed in c740516

swarbred added the enhancement New feature or request label Jun 6, 2020

swarbred assigned cschu Jun 6, 2020

cschu added a commit that referenced this issue Jun 9, 2020

Counts of transcripts with metric oddities are now reported in the results folder (issue #28)

082d711

cschu added a commit that referenced this issue Jun 9, 2020

Metric oddities are now reported in a combined table (issue #28)

17a0a0f

cschu added a commit that referenced this issue Jun 14, 2020

Metrics oddities now includes hi_conf_protein_coding column (#28)

154ebed
