nDCG of the fine-tuned MonoT5 models diverge from paper #1

Open
philipphager opened this issue Jun 28, 2023 · 2 comments


philipphager commented Jun 28, 2023

Hey all,

First of all, thank you for this interesting work (I enjoyed reading the paper a lot)! After cloning the project and executing run_monot5.py, we obtained the following results from the cached run files in monoT5/runs:

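| name | P@1 | P@5 | P@10 | nDCG@10 | nDCG@20 | RR | AP |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MonoT5 fine-tuned title+url | 0.8412 | 0.5991 | 0.3914 | 0.6858 | 0.7087 | 0.9025 | 0.7396 |
| MonoT5 fine-tuned title+url+text | 0.8581 | 0.5945 | 0.3910 | 0.7034 | 0.7268 | 0.9132 | 0.7462 |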

These nDCG values are notably higher than the ones reported in the paper (≈0.45). A student of mine also re-ran the T5 models published on Hugging Face without the run caching and reported similarly diverging values.

I just wanted to highlight this finding. Do you have any idea where these values are coming from?

Cheers,
Philipp

@philipphager changed the title from "nDCG for the fine-tuned M5 models diverge from paper" to "nDCG of the fine-tuned MonoT5 models diverge from paper" on Jun 28, 2023
@seanmacavaney
Collaborator

Hey @philipphager -- thanks for reporting!

It looks like the nDCG results reported in the paper used an old version of the qrels file that mistakenly included duplicate entries. trec_eval (which provides P@k, MAP, etc.) handles these properly, but gdeval (which provides nDCG(dcg="exp-log2")@20) doesn't: it counts the duplicates toward the ideal DCG, which inflates the normalizer and lowers the reported nDCG. When I run the evaluation of MonoT5 using the official qrels and the result files included in this repository, my results match the ones you list above.
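
To illustrate the effect, here is a minimal Python sketch of how duplicate qrels rows can depress nDCG when they are counted in the ideal ranking. This is not gdeval's actual code: the `dcg` and `ndcg` helpers, the `dedupe` flag, and the toy qrels and ranking are all hypothetical, and the only assumption carried over from the discussion is the exponential-gain, log2-discount DCG variant.

```python
import math


def dcg(gains, k):
    """DCG with exponential gain and log2 discount: sum((2^g - 1) / log2(rank + 1))."""
    return sum((2 ** g - 1) / math.log2(rank + 1)
               for rank, g in enumerate(gains[:k], start=1))


def ndcg(ranking, qrels_rows, k, dedupe):
    """ranking: list of doc ids; qrels_rows: (doc_id, grade) pairs, possibly repeated."""
    grades = dict(qrels_rows)  # one grade per document, used to score the ranking
    # Build the ideal ranking either from unique documents or from the raw rows.
    ideal_pool = list(grades.values()) if dedupe else [g for _, g in qrels_rows]
    ideal = dcg(sorted(ideal_pool, reverse=True), k)
    actual = dcg([grades.get(doc, 0) for doc in ranking], k)
    return actual / ideal if ideal > 0 else 0.0


# Hypothetical toy data: document d1 is judged twice in the qrels.
qrels_rows = [("d1", 2), ("d1", 2), ("d2", 1), ("d3", 0)]
ranking = ["d1", "d2", "d3"]  # a perfect ranking of the three unique documents

clean = ndcg(ranking, qrels_rows, k=20, dedupe=True)
with_dupes = ndcg(ranking, qrels_rows, k=20, dedupe=False)
print(f"deduplicated qrels: {clean:.4f}")
print(f"duplicated qrels:   {with_dupes:.4f}")
```

In this toy case the ranking is perfect, yet the score drops below 1 once the duplicated row is counted in the ideal DCG, because no ranking can retrieve the same document twice.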

This issue did not affect the 𝜆-Mart results in the paper, since they were all computed using the correct qrels file.

From what I can tell, this problem doesn't change the conclusions in the paper, since the nDCG values of MonoT5 are still a cut below the 𝜆-Mart results. I'll try to prepare a corrected version of Table 4 that we can put in this repository.

Does this help?

@philipphager
Author

Hey @seanmacavaney!

Thanks for the quick follow-up! Hmm, that makes sense. Agreed, the conclusions do not change, just the intuition of how far apart the two methods are. So it'd definitely make sense to publish an updated table in the repository 👍

Thanks again for all the help, cheers!
