Generate a metric oddities file based on the mikado mikado.loci.metrics.tsv, mikado.subloci.metrics.tsv and mikado.monoloci.metrics.tsv #28

Open
swarbred opened this issue Jun 6, 2020 · 11 comments
Assignees
Labels
enhancement New feature or request

Comments

@swarbred

swarbred commented Jun 6, 2020

Mikado pick generates a metrics file for the final models (mikado.loci.metrics.tsv) and for the input models (mikado.subloci.metrics.tsv and mikado.monoloci.metrics.tsv). The mono and sunbloci files describe spliced and single exon models respectively. NOTE we are currently not generating the monosubloci file so this option just needs to be added to the pick command that is run.

It would be useful to create a file that gives counts of transcripts with metrics that might suggest a problematic or incorrectly annotated gene model, i.e. one that is biologically unusual or lacks evidence support for its junctions (based on the portcullis results).

Below are the metrics which I feel would be useful to extract and summarise

Oddities (derived from the mikado monoloci, subloci and loci metrics files): provide a count of transcripts with

five_utr_length >=10000
five_utr_num >=5
three_utr_length >=10000
three_utr_num >=4
is_complete = False
has_start_codon = False
has_stop_codon = False
max_exon_length >=10000
max_intron_length >=500000
min_exon_length <=5
min_intron_length <=5
selected_cds_fraction <=0.3
canonical_intron_proportion != 1
non_verified_introns_num >=1
only_non_canonical_splicing = False
proportion_verified_introns <=0.5
suspicious_splicing = True
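
As a rough illustration of the counting described above, here is a minimal Python sketch. It assumes the metrics file is tab-separated with one row per transcript, that the column names match the metric names listed, and that booleans are written as the strings True/False. The `ODDITIES` table and the `count_oddities` and `parse_value` functions are hypothetical names, not part of Mikado or Minos. It uses only_non_canonical_splicing = True, following the correction agreed later in this thread.

```python
import csv
import operator

# Hypothetical threshold table: metric -> (comparison, value).
# Values are the ones proposed in this issue.
ODDITIES = {
    "five_utr_length": (operator.ge, 10000),
    "five_utr_num": (operator.ge, 5),
    "three_utr_length": (operator.ge, 10000),
    "three_utr_num": (operator.ge, 4),
    "is_complete": (operator.eq, False),
    "has_start_codon": (operator.eq, False),
    "has_stop_codon": (operator.eq, False),
    "max_exon_length": (operator.ge, 10000),
    "max_intron_length": (operator.ge, 500000),
    "min_exon_length": (operator.le, 5),
    "min_intron_length": (operator.le, 5),
    "selected_cds_fraction": (operator.le, 0.3),
    "canonical_intron_proportion": (operator.ne, 1),
    "non_verified_introns_num": (operator.ge, 1),
    "only_non_canonical_splicing": (operator.eq, True),
    "proportion_verified_introns": (operator.le, 0.5),
    "suspicious_splicing": (operator.eq, True),
}


def parse_value(text):
    """Turn a TSV cell into bool or float; return None if it is neither."""
    if text in ("True", "False"):
        return text == "True"
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


def count_oddities(handle, rules=ODDITIES):
    """Count rows (transcripts) that trip each rule in a metrics TSV."""
    counts = dict.fromkeys(rules, 0)
    for row in csv.DictReader(handle, delimiter="\t"):
        for metric, (compare, threshold) in rules.items():
            value = parse_value(row.get(metric))
            if value is None:
                continue  # column missing or not parseable: skip it
            if compare(value, threshold):
                counts[metric] += 1
    return counts
```

Calling `count_oddities(open("mikado.loci.metrics.tsv"))` would then give one count per metric. The thresholds sit in a single table so they can be tuned per species.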

This would be for the final set of models using mikado.loci.metrics.tsv (note this file will contain some models we have excluded from the final gene set through the classification, so we should exclude those models when determining these counts).

For mikado.subloci.metrics.tsv and mikado.monoloci.metrics.tsv we would need to break this down by label (as for the BUSCO results) so that we can generate a table with the above metrics as rows and the different gene sets (i.e. final and input gene sets) as columns.

The values chosen for the metrics should highlight potential issues, though there will be genuine exceptions. Intron size varies between species; for example, there will be genuine mammalian introns over 500000 bp, but you would still expect these to be few in number, and for many species they will be artefacts.

@swarbred swarbred added the enhancement New feature or request label Jun 6, 2020
@cschu
Collaborator

cschu commented Jun 9, 2020

The mono and sunbloci files

I like the Freudian slip here. Longing for sunny vacation? :D

@cschu
Collaborator

cschu commented Jun 9, 2020

@swarbred
only_non_canonical_splicing = False, shouldn't this be True? (since we're reporting on oddities?)

@swarbred
Author

swarbred commented Jun 9, 2020

shouldn't this be True? (since we're reporting on oddities?)
Correct

@cschu
Collaborator

cschu commented Jun 9, 2020

Will the mono/subloci also be filtered by presence in the release set?

cschu added a commit that referenced this issue Jun 9, 2020
@cschu
Collaborator

cschu commented Jun 9, 2020

Feature is available in minos-1.6, three tables will be produced in the results folder mikado:{loci,monoloci,subloci}.metrics_oddities.tsv (082d711)

@swarbred
Author

swarbred commented Jun 9, 2020

Will the mono/subloci also be filtered by presence in the release set?

No, this should be all models, irrespective of whether they were picked and then classified. These are the stats of the input models to pick, so all being well (and for some of the metrics this is bound to be the case) they should look "better" in the final selected gene set.

@swarbred
Author

swarbred commented Jun 9, 2020

Feature is available in minos-1.6, three tables will be produced in the results folder mikado:{loci,monoloci,subloci}.metrics_oddities.tsv

I'm going to be a pain as what we really want here is to have this as one file so these can be easily compared.

The mono and sub loci genes can be brought together (these are just the spliced and single exon subsets of the input models)

i.e. based on this run /ei/workarea/group-pb/CB-PPBFX-811_Annotation_of_Lathyrus_sativus/Analysis/gmc/mikado-2.0rc6_d094f99_CBG/GMC-1.3_run1/results

Col1 (Metric), Col2 (LATSA3860_EIv1), Col3 (LATSA3860_run1_wRNA), Col4 (LATSA3860_run2_woRNA), Col5 (LATSA3860_run3_woRNA), Col6 (mikado_transcript_run), Col7 (mikado_protein_run)

Col 1: the metrics
Col 2: derived from loci, with the additional filtering, so that this gives the counts for the final set of selected models
Col 3-7: the labels for the input models; for each label you can combine the counts from the mono and subloci, so we have the total for each input set.
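
To make the layout above concrete, here is a minimal Python sketch of the merge step. It assumes the per-label counts have already been computed separately for the monoloci and subloci files and are held as `label -> {metric: count}` dictionaries. `combine_input_counts` and `write_oddities_table` are hypothetical helper names, not existing Minos functions.

```python
import csv
from collections import Counter


def combine_input_counts(monoloci, subloci):
    """Sum per-label counts from the monoloci and subloci files.

    Both arguments map label -> {metric: count}.
    """
    combined = {}
    for per_label in (monoloci, subloci):
        for label, counts in per_label.items():
            # Counter.update adds counts rather than replacing them.
            combined.setdefault(label, Counter()).update(counts)
    return combined


def write_oddities_table(handle, metrics, final_name, final_counts, input_counts):
    """Write one TSV: metric, final gene set, then one column per input label."""
    labels = sorted(input_counts)
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["metric", final_name] + labels)
    for metric in metrics:
        row = [metric, final_counts.get(metric, 0)]
        row += [input_counts[label].get(metric, 0) for label in labels]
        writer.writerow(row)
```

With the final set in Col 2 and one column per input label, the gene sets can be compared row by row in a single file.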

@swarbred
Author

@cschu
The 1.6 run completed, I've done some spot / sanity checks and the output looks correct.

For the oddities file I think we need to break this down further (for the final Minos models) so that we provide numbers for the high confidence genes as well as the full set of final Minos models. Adding this will help us when we write up the koala use case, as the most appropriate comparison will be between the high confidence Minos models and the other gene sets (currently these numbers look worse for Minos as we have a large number of low confidence models that derive from the transcript assemblies).

We have 4 distinct biotypes and two confidence classifications. We don't need to break this down into each combination; what most people are interested in are the full set of models (i.e. what we have now) and just the high confidence protein coding (biotype).

Could we add an additional column giving the counts for "protein_coding_gene High" ?

metric | final_set | final_set (protein_coding_gene, High) |
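
A minimal Python sketch of how the two final_set columns could be counted in one pass. It assumes each metrics row carries a transcript ID in a `tid` column and that a `classification` mapping from transcript ID to (biotype, confidence) is available from the final gene set. The function name and the shape of that mapping are hypothetical. Transcripts missing from the mapping are treated as excluded from the final gene set and skipped, in line with the filtering asked for in the original request.

```python
def count_final_set(rows, is_odd, classification):
    """Return (count in full final set, count in protein_coding_gene/High).

    rows:           iterable of dicts, each with a 'tid' key (assumed column name)
    is_odd:         predicate deciding whether a row trips the metric
    classification: tid -> (biotype, confidence) for released models only
    """
    total = 0
    high_confidence_pc = 0
    for row in rows:
        cls = classification.get(row["tid"])
        if cls is None:
            continue  # not in the final gene set
        if not is_odd(row):
            continue
        total += 1
        if cls == ("protein_coding_gene", "High"):
            high_confidence_pc += 1
    return total, high_confidence_pc
```

Running this once per metric gives the values for the `final_set` and `final_set (protein_coding_gene, High)` columns.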

@swarbred
Author

@cschu

only_non_canonical_splicing = False, shouldn't this be True? (since we're reporting on oddities?)

Correct

Looks like this hasn't been changed: it is counting the number of transcripts marked as False rather than True. In the oddities.tsv it's labelled as "not only_non_canonical_splicing"; it should be the count of True and labelled as only_non_canonical_splicing.

@cschu
Collaborator

cschu commented Jun 15, 2020

Argh. This can be controlled in the config.yaml, just remove the not from "not {only_non_canonical_splicing}" in the report_metric_oddities. I will change that in the default config.yaml.
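
For readers unfamiliar with this style of config entry, here is a hedged Python sketch of how an expression such as "not {only_non_canonical_splicing}" could be evaluated against a metrics row. This is not taken from the Minos source. The `matches` function is hypothetical, and it assumes the placeholder is filled with the cell text (e.g. "True" or "False") before evaluation.

```python
def matches(expression, row):
    """Fill {metric} placeholders from a row, then evaluate the result.

    Illustration only: eval() must never be used on untrusted config.
    Builtins are emptied so only literals and operators are available.
    """
    filled = expression.format(**row)
    return bool(eval(filled, {"__builtins__": {}}, {}))


row = {"only_non_canonical_splicing": "False", "five_utr_num": "6"}

# With the "not", transcripts marked False are the ones counted.
counted_with_not = matches("not {only_non_canonical_splicing}", row)

# Without it, only transcripts marked True are counted.
counted_without_not = matches("{only_non_canonical_splicing}", row)
```

Under these assumptions, removing the "not" flips which transcripts are counted, which is the change described above.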

@cschu
Copy link
Collaborator

cschu commented Jun 15, 2020

Addressed in c740516


2 participants