Generate a metric oddities file based on the mikado mikado.loci.metrics.tsv, mikado.subloci.metrics.tsv and mikado.monoloci.metrics.tsv #28
Comments
I like the Freudian slip here. Longing for sunny vacation? :D
@swarbred
Will the mono/subloci also be filtered by presence in the release set?
Feature is available in minos-1.6; three tables will be produced in the results folder: mikado.{loci,monoloci,subloci}.metrics_oddities.tsv (082d711)
No, this should be all models, irrespective of whether they were picked and then classified. These are the stats of the input models to pick, so all being well (and for some of the metrics this is bound to be the case) they should look "better" in the final selected gene set.
I'm going to be a pain, as what we really want here is to have this as one file so these can be easily compared. The mono and sub loci genes can be brought together (these are just the spliced and single exon subsets of the input models), i.e. based on this run:
/ei/workarea/group-pb/CB-PPBFX-811_Annotation_of_Lathyrus_sativus/Analysis/gmc/mikado-2.0rc6_d094f99_CBG/GMC-1.3_run1/results
Col1 (Metric), Col2 (LATSA3860_EIv1), Col3 (LATSA3860_run1_wRNA), Col4 (LATSA3860_run2_woRNA), Col5 (LATSA3860_run3_woRNA), Col6 (mikado_transcript_run), Col7 (mikado_protein_run)
Col1 holds the metrics.
@cschu For the oddities file I think we need to break this down further (for the final Minos models) so that we provide numbers for the high confidence genes as well as the full set of final Minos models. Adding this will help us when we write up the koala use case, as the most appropriate comparison will be between the high confidence Minos models and the other gene sets (currently these numbers look worse for Minos as we have a large number of low confidence models that derive from the transcript assemblies). We have 4 distinct biotypes and two confidence classifications. We don't need to break this down into each combination; what most people are interested in are the full set of models (i.e. what we have now) and just the high confidence protein coding (biotype). Could we add an additional column giving the counts for "protein_coding_gene High"?
metric | final_set | final_set (protein_coding_gene, High)
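A minimal sketch of the extra column requested above, counting one oddity over the full final set and over the "protein_coding_gene High" subset. Everything here is illustrative: the function name, the `classification` mapping from transcript id to (biotype, confidence), and the `tid` column name are assumptions, not how minos stores or exposes this information.

```python
def count_full_and_high_confidence(rows, is_odd, classification):
    """Return (count in full set, count in protein_coding_gene/High subset).

    rows: iterable of dicts, one per transcript, with a "tid" key (assumed).
    is_odd: predicate deciding whether a row shows the oddity.
    classification: hypothetical mapping tid -> (biotype, confidence).
    """
    total = 0
    high_pc = 0
    for row in rows:
        if not is_odd(row):
            continue
        total += 1
        if classification.get(row["tid"]) == ("protein_coding_gene", "High"):
            high_pc += 1
    return total, high_pc


# Toy data, invented for illustration only.
rows = [
    {"tid": "t1", "max_intron_length": 600000},
    {"tid": "t2", "max_intron_length": 700000},
    {"tid": "t3", "max_intron_length": 1200},
]
classification = {
    "t1": ("protein_coding_gene", "High"),
    "t2": ("protein_coding_gene", "Low"),
    "t3": ("protein_coding_gene", "High"),
}
result = count_full_and_high_confidence(
    rows, lambda r: r["max_intron_length"] >= 500000, classification
)
```

With this shape, each metric row simply gains a second number, so the existing final_set column stays unchanged.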
It looks like this hasn't been changed: it is counting the number of transcripts marked as False rather than True. In the oddities.tsv it is labelled as "not only_non_canonical_splicing"; it should be the count of True and labelled as only_non_canonical_splicing.
Argh. This can be controlled in the config.yaml: just remove the "not" from "not {only_non_canonical_splicing}" in report_metric_oddities. I will change that in the default config.yaml.
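To illustrate why the leading "not" flips the count, here is a small sketch of how a brace-templated expression could be evaluated per transcript. This is an assumption about the general mechanism, not minos's actual code; `matches` and `count_matches` are hypothetical names, and the rows are made up.

```python
def matches(expression, row):
    """Fill metric values into the template, then evaluate the result."""
    filled = expression.format(**row)  # e.g. "not {x}" -> "not False"
    # Evaluate with no builtins and no names; only literals and operators work.
    return bool(eval(filled, {"__builtins__": {}}, {}))


def count_matches(expression, rows):
    """Number of rows for which the filled-in expression is true."""
    return sum(matches(expression, row) for row in rows)


# Toy rows: two transcripts are False, one is True.
rows = [
    {"only_non_canonical_splicing": "False"},
    {"only_non_canonical_splicing": "False"},
    {"only_non_canonical_splicing": "True"},
]

with_not = count_matches("not {only_non_canonical_splicing}", rows)
without_not = count_matches("{only_non_canonical_splicing}", rows)
```

Under these assumptions the expression with "not" counts the False transcripts, and the expression without it counts the True ones, which is the behaviour asked for.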
Addressed in c740516
Mikado pick generates a metrics file for the final models (mikado.loci.metrics.tsv) and for the input models (mikado.subloci.metrics.tsv and mikado.monoloci.metrics.tsv). The mono and sunbloci files describe spliced and single exon models respectively. NOTE we are currently not generating the monosubloci file so this option just needs to be added to the pick command that is run.
It would be useful to create a file that gives counts of transcripts with metrics that might suggest a problematic or incorrectly annotated gene model, i.e. models that are biologically unusual or lack evidence support for junctions (based on the portcullis results).
Below are the metrics which I feel would be useful to extract and summarise
Oddities (derived from the mikado mono loci, sunbloci and loci metrics file), provide count of transcripts with
five_utr_length >=10000
five_utr_num >=5
three_utr_length >=10000
three_utr_num >=4
is_complete = False
has_start_codon = False
has_stop_codon = False
max_exon_length >=10000
max_intron_length >=500000
min_exon_length <=5
min_intron_length <=5
selected_cds_fraction <=0.3
canonical_intron_proportion != 1
non_verified_introns_num >=1
only_non_canonical_splicing = False
proportion_verified_introns <=0.5
suspicious_splicing = True
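The list above can be sketched as a table of conditions applied to each row of a metrics file. This is a minimal sketch, assuming the file is tab-separated with a header row whose column names match the metric names, and that booleans are written as True/False. The names `ODDITIES`, `parse` and `count_oddities` are hypothetical and not part of Mikado or minos.

```python
import csv
import io
import operator

OPS = {">=": operator.ge, "<=": operator.le, "==": operator.eq, "!=": operator.ne}

# (metric, comparison, threshold), transcribed from the list above.
ODDITIES = [
    ("five_utr_length", ">=", 10000),
    ("five_utr_num", ">=", 5),
    ("three_utr_length", ">=", 10000),
    ("three_utr_num", ">=", 4),
    ("is_complete", "==", False),
    ("has_start_codon", "==", False),
    ("has_stop_codon", "==", False),
    ("max_exon_length", ">=", 10000),
    ("max_intron_length", ">=", 500000),
    ("min_exon_length", "<=", 5),
    ("min_intron_length", "<=", 5),
    ("selected_cds_fraction", "<=", 0.3),
    ("canonical_intron_proportion", "!=", 1),
    ("non_verified_introns_num", ">=", 1),
    # Listed as False above; the comments on this issue settle on counting True.
    ("only_non_canonical_splicing", "==", True),
    ("proportion_verified_introns", "<=", 0.5),
    ("suspicious_splicing", "==", True),
]


def parse(value):
    """Turn a TSV cell into bool or float; None if missing or unparsable."""
    if value in ("True", "False"):
        return value == "True"
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def count_oddities(handle, oddities=ODDITIES):
    """Count rows of a tab-separated metrics file that satisfy each condition."""
    counts = {f"{m} {op} {t}": 0 for m, op, t in oddities}
    for row in csv.DictReader(handle, delimiter="\t"):
        for metric, op, threshold in oddities:
            value = parse(row.get(metric))
            # Rows with a missing or unparsable value are skipped, not counted.
            if value is not None and OPS[op](value, threshold):
                counts[f"{metric} {op} {threshold}"] += 1
    return counts


# Toy input, invented for illustration; a real file has many more columns.
example = io.StringIO(
    "tid\tfive_utr_length\tis_complete\n"
    "t1\t12000\tFalse\n"
    "t2\t50\tTrue\n"
)
example_counts = count_oddities(example)
```

Keeping the thresholds in one table makes it easy to adjust them per species, since several of the cut-offs are judgment calls.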
This would be for the final set of models using mikado.loci.metrics.tsv (note this file will contain some models we have excluded from the final gene set through the classification, so we should exclude those models when determining these counts).
From mikado.subloci.metrics.tsv and mikado.monoloci.metrics.tsv we would need to break this down based on label (similar to the busco results) so that we can generate a table with the above rows and the columns as the different gene sets (i.e. final and input gene sets).
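A minimal sketch of the table layout described here, with one row per metric and one column per gene set. It assumes the per-set counts have already been computed and are held in plain dictionaries; the function name `oddities_table` and the gene-set names in the example are hypothetical.

```python
def oddities_table(counts_by_set):
    """Build a TSV string: rows are metrics, columns are gene sets.

    counts_by_set maps a gene-set name to a {metric: count} dictionary.
    """
    set_names = list(counts_by_set)

    # Collect metrics in first-seen order so the row order is stable.
    metrics = []
    for counts in counts_by_set.values():
        for metric in counts:
            if metric not in metrics:
                metrics.append(metric)

    lines = ["\t".join(["metric"] + set_names)]
    for metric in metrics:
        # "NA" marks a metric that was not computed for a given set.
        cells = [str(counts_by_set[name].get(metric, "NA")) for name in set_names]
        lines.append("\t".join([metric] + cells))
    return "\n".join(lines) + "\n"


# Toy counts, invented for illustration only.
table = oddities_table(
    {
        "final_set": {"max_intron_length >= 500000": 3, "is_complete == False": 10},
        "input_label_A": {"max_intron_length >= 500000": 42},
    }
)
```

Writing "NA" rather than 0 for an absent metric avoids implying that a count was measured when it was not.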
The values chosen for the metrics should highlight potential issues, though there will be genuine exceptions. Intron size varies between species: for example, there will be genuine mammalian introns over 500000 bp, but you would still expect these to be few in number, and for many species they will be artefacts.