Hey all,

First of all, thank you for this interesting work (I enjoyed reading the paper a lot)! After cloning the project and running `run_monot5.py`, we obtained the following results from the cached run files in `monoT5/runs`:
| name | P@1 | P@5 | P@10 | nDCG@10 | nDCG@20 | RR | AP |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MonoT5 fine-tuned title+url | 0.8412 | 0.5991 | 0.3914 | 0.6858 | 0.7087 | 0.9025 | 0.7396 |
| MonoT5 fine-tuned title+url+text | 0.8581 | 0.5945 | 0.3910 | 0.7034 | 0.7268 | 0.9132 | 0.7462 |
These nDCG values are notably higher than the ones reported in the paper (≈0.45). A student of mine also re-ran the T5 models published on Hugging Face without the run caching and reported similarly diverging values.
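For reference, the table can be recomputed from a run file roughly like this; a minimal sketch using the `ir_measures` library (which may differ from the evaluation tooling the repository itself uses), with hypothetical file names:

```python
# Minimal sketch: score a TREC-format run file against a qrels file.
# File paths are hypothetical placeholders, not the repo's actual names.
import ir_measures
from ir_measures import P, nDCG, RR, AP

qrels = list(ir_measures.read_trec_qrels("qrels.txt"))
run = list(ir_measures.read_trec_run("monoT5/runs/monot5_title_url.txt"))

metrics = [P@1, P@5, P@10, nDCG@10, nDCG@20, RR, AP]
print(ir_measures.calc_aggregate(metrics, qrels, run))
```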
I just wanted to highlight this finding. Do you have any idea where these values are coming from?
Cheers,
Philipp
It looks like the nDCG results reported in the paper used an old version of the qrels file that mistakenly included duplicate entries. trec_eval (which provides P@k, MAP, etc.) handles these properly, but gdeval (which provides nDCG@20 with exponential gain, i.e. `nDCG(dcg="exp-log2")@20`) does not, and counts the duplicates against the ideal DCG. When I run the evaluation of MonoT5 using the official qrels and the result files included in this repository, my results match the ones you list above.
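To illustrate the mechanism with a toy example (not the actual qrels): under gdeval's exponential-gain DCG, a duplicated judgment inflates the ideal DCG, so even a perfect ranking scores below 1:

```python
import math

def dcg_exp_log2(gains):
    """gdeval-style DCG: exponential gain (2^rel - 1), log2 rank discount."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(gains))

# Toy query with three judged documents (grades 3, 2, 1).
ranking = [3, 2, 1]       # a system that ranks them perfectly
ideal_clean = [3, 2, 1]   # ideal gains from a clean qrels file
ideal_dup = [3, 2, 2, 1]  # one judgment duplicated in the old qrels

print(dcg_exp_log2(ranking) / dcg_exp_log2(ideal_clean))  # 1.0
print(dcg_exp_log2(ranking) / dcg_exp_log2(ideal_dup))    # ~0.87, deflated
```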
This issue did not affect the λ-MART results in the paper, since they were all computed using the correct qrels file.

From what I can tell, this problem doesn't change the conclusions in the paper, since the nDCG scores of MonoT5 are still a cut below the λ-MART results. I'll try to prepare a corrected version of Table 4 that we can put in this repository.
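In the meantime, the effect can be checked by deduplicating the old qrels file before handing it to gdeval. A minimal sketch, with hypothetical file names:

```python
# Keep only the first judgment per (query, document) pair; drop duplicates.
# "qrels_old.txt" / "qrels_dedup.txt" are hypothetical placeholder names.
seen = set()
with open("qrels_old.txt") as fin, open("qrels_dedup.txt", "w") as fout:
    for line in fin:
        qid, _, docid, rel = line.split()
        if (qid, docid) not in seen:
            seen.add((qid, docid))
            fout.write(line)
```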
Thanks for the quick follow-up! Hmm, that makes sense. Agreed, the conclusions do not change, just the intuition of how far apart the two methods are. So it'd definitely make sense to publish an updated table in the repository 👍