On the inconsistency of Taxid and BERTax taxonomy labels and the calculation of evaluation metrics for AveP. #14

Open
NickShanyt opened this issue Dec 7, 2023 · 8 comments

Comments

@NickShanyt

Hi!
I'm interested in your work and I'm trying to reproduce the results on the data you released, but I'm having some problems.

1. The released sequence data contains taxids. I used NCBI to map these taxids to taxonomic classifications and obtained the taxonomic ranks for each sequence. However, many of the resulting labels do not correspond to the labels in the BERTax model (5 superkingdoms, 44 phyla, 156 genera), and I have corrected some of them manually.

Although I have made these corrections in the final dataset, correcting the genus level is difficult in the similar and non-similar datasets. Is this a genuine problem, and is there a possible solution?

2. Are the Accuracy and AveP metrics mentioned in the paper the usual accuracy and precision? Can the same metrics be calculated with sklearn.metrics.accuracy_score and sklearn.metrics.precision_score?

Thank you for your work.

@f-kretschmer
Collaborator

Sorry for the late answer.

  1. The NCBI taxonomy has likely changed since we did our evaluations. Here is a taxdump for the version we used: https://upload.uni-jena.de/data/656deff28d9cd2.73093822/taxdump.tar.gz. It can be used with ete3 (https://github.com/f-kretschmer/bertax_training/blob/master/utils/tax_entry.py).
  2. The average precision was calculated from micro-averaged precision-recall curves (sklearn.metrics.average_precision_score). For the accuracy, we used a balanced version because the data is unbalanced: the mean over all superkingdom classes, as described in the paper. Confusion matrices for everything are also available here: https://github.com/f-kretschmer/bertax/tree/master/confusion_matrices.

Hope this helps!
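
For readers reproducing the numbers, here is a minimal sketch of a micro-averaged average precision as described in point 2. It assumes per-class scores (for example softmax outputs) are available for every sequence. The class list, the toy labels and scores, and the helper name `micro_average_precision` are illustrative assumptions and are not taken from the BERTax code.

```python
import numpy as np
from sklearn.metrics import average_precision_score
from sklearn.preprocessing import label_binarize


def micro_average_precision(y_true, y_score, classes):
    """Micro-averaged AveP: pool every (sample, class) decision into one PR curve.

    y_true  : sequence of class labels, one per sample
    y_score : array of shape (n_samples, n_classes), columns ordered like `classes`
    classes : list of all class labels (assumed to contain at least three)
    """
    y_onehot = label_binarize(y_true, classes=classes)  # shape (n_samples, n_classes)
    return average_precision_score(y_onehot, np.asarray(y_score), average="micro")


# Toy data, made up for illustration only.
classes = ["Archaea", "Bacteria", "Eukaryota", "Viruses"]
y_true = ["Bacteria", "Viruses", "Eukaryota", "Archaea", "Bacteria"]
y_score = np.array([
    [0.05, 0.80, 0.10, 0.05],
    [0.10, 0.30, 0.20, 0.40],
    [0.05, 0.15, 0.70, 0.10],
    [0.60, 0.20, 0.10, 0.10],
    [0.10, 0.35, 0.45, 0.10],  # highest score is on the wrong class
])
avep = micro_average_precision(y_true, y_score, classes)
print(f"micro-averaged AveP: {avep:.3f}")
```

Note that sklearn.metrics.precision_score works on hard predictions at a single decision threshold, so it is not expected to reproduce an AveP value computed from the full precision-recall curve.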

@f-kretschmer
Collaborator

f-kretschmer commented Dec 20, 2023

Sorry, I think that taxdump.tar.gz is the wrong version. This should be the correct one: https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump_archive/new_taxdump_2021-04-01.zip
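
A minimal sketch of mapping a taxid to the three ranks discussed in this thread with ete3, assuming a local copy of an archived taxonomy dump. The file names are hypothetical. As far as I know, ete3 reads the dump as a tar archive containing names.dmp and nodes.dmp, so a zip archive would likely need to be repackaged first; treat that as an assumption to verify. The helper names are illustrative and do not come from tax_entry.py.

```python
def load_taxonomy(taxdump_path, db_path="taxa_2021-04-01.sqlite"):
    """Build an ete3 taxonomy database from a local dump (both paths are hypothetical).

    A separate `dbfile` is passed so the default ete3 database is not overwritten.
    """
    from ete3 import NCBITaxa  # imported lazily; requires `pip install ete3`
    return NCBITaxa(dbfile=db_path, taxdump_file=taxdump_path)


def lineage_by_rank(ncbi, taxid, ranks=("superkingdom", "phylum", "genus")):
    """Return {rank: scientific name or None} for the requested ranks of one taxid."""
    lineage = ncbi.get_lineage(taxid)             # ancestor taxids, root first
    rank_of = ncbi.get_rank(lineage)              # {taxid: rank}
    name_of = ncbi.get_taxid_translator(lineage)  # {taxid: scientific name}
    result = {rank: None for rank in ranks}
    for tid in lineage:
        rank = rank_of.get(tid)
        if rank in result:
            result[rank] = name_of.get(tid)
    return result


# Usage, assuming the dump has been repackaged as taxdump_2021-04-01.tar.gz:
#   ncbi = load_taxonomy("taxdump_2021-04-01.tar.gz")
#   print(lineage_by_rank(ncbi, 562))
```

Ranks that are missing from a lineage come back as `None`, which makes it easy to list the sequences whose labels cannot be matched to the model's classes.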

@yongrenr

Hello!
I'm interested in your work and I'm trying to reproduce the results on the data you released, but I'm having some problems.

"The average precision was calculated based on micro average Precision-Recall-curves (sklearn.metrics.average_precision_score). For the accuracy, we used a balanced version due to unbalanced data: taking the mean over all superkingdom classes, as described in the paper. Additionally, there are also confusion matrices for everything here: https://github.com/f-kretschmer/bertax/tree/master/confusion_matrices."
Is this accuracy calculation used only for the superkingdom classes, or is it also used for the phylum and genus classes?

@f-kretschmer
Collaborator

Hi!

Both the balanced accuracy calculation (sklearn.metrics.balanced_accuracy_score) and the average precision calculation (sklearn.metrics.average_precision_score) are used for all ranks.
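
To make the balanced accuracy concrete: it is the mean of the per-class recall, so every class counts equally regardless of how many sequences it contains. Below is a dependency-free sketch that should agree with sklearn.metrics.balanced_accuracy_score under its default arguments. The labels are toy data, and the same function can be applied separately to the labels of each rank.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes that occur in y_true."""
    total = defaultdict(int)    # number of samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Toy, deliberately unbalanced example: a model that always predicts "Bacteria".
y_true = ["Bacteria"] * 8 + ["Viruses"] * 2
y_pred = ["Bacteria"] * 10

plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain)                              # 0.8
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

Plain accuracy rewards the majority class here, while the balanced version exposes that the minority class is never predicted correctly.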

@yongrenr

Hi!

"Both the balanced accuracy calculation (sklearn.metrics.balanced_accuracy_score) and the average precision calculation (sklearn.metrics.average_precision_score) are used for all ranks."

Thank you very much for your prompt reply!
Which metrics are used in your PNAS paper? Thank you!
[attached image: a table from the paper]

@f-kretschmer
Collaborator

In this table it is Average Precision (AveP), but we also have Precision-Recall-plots, ROC-curves and balanced accuracy.

@yongrenr

"In this table it is Average Precision (AveP), but we also have Precision-Recall-plots, ROC-curves and balanced accuracy."

So comprehensive!
I have one more small question. On the Closely and Distantly datasets, the phylum performance is average. Why does phylum prediction work so well on the Final dataset? Did you change anything other than the number of attention heads?
Thank you very much!

@f-kretschmer
Collaborator

The "final" dataset has a lot more data, and the final model has an additional output layer for "genus" prediction. Everything is detailed in the section "Performance of Final BERTax Model" in the PNAS paper. See especially SFig. 2, which visualizes why adding the genus layer leads to better performance.
