-
There are many reasons this could be the case, but I'd need a lot more details about your setup. One simple thing to check is that you're fine-tuning with dropout enabled, as this is often quite important.
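To make the dropout tip concrete: fine-tuning on a small dataset without dropout often overfits. Below is a minimal, framework-agnostic sketch of inverted dropout in plain Python (the thread doesn't name the actual framework or config flag, so this is illustrative only):

```python
import random

def dropout(xs, p=0.1, training=True, rng=random.random):
    """Inverted dropout: during training, zero each activation with
    probability p and scale survivors by 1/(1-p); at eval time,
    pass activations through unchanged."""
    if not training or p == 0.0:
        return list(xs)
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng() < p else x * scale for x in xs]
```

In most frameworks this corresponds to making sure the model is in training mode and the dropout rate in the fine-tuning config is nonzero.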
-
What is the training_eval-accuracy that is reported in the event logs?
Here is what I am seeing during training on tweet classification task:
Training essentially converges after 1000 steps. The reported "training_eval" accuracy on the 3-way classification task is around 0.84, which would have been an amazing score, since RoBERTa Large gets around 0.74 on the same set.
However, a "real evaluation" on the same dataset (performed after saving the checkpoints in the training script) reveals a totally different situation:
The score of around 0.68 is not very impressive. I have also looked through the stored predictions, calculated accuracy/F1 manually, and can confirm that this metric is correct.
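For reference, a manual accuracy/macro-F1 check like the one described can be sketched in plain Python. The label names and the prediction/gold pairs below are made-up examples, not the repo's actual stored-prediction format:

```python
def accuracy(preds, golds):
    # Fraction of predictions matching the gold labels.
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def macro_f1(preds, golds, labels):
    # Per-class F1 from true/false positives and false negatives,
    # averaged unweighted over the label set (macro averaging).
    f1s = []
    for label in labels:
        tp = sum(p == label and g == label for p, g in zip(preds, golds))
        fp = sum(p == label and g != label for p, g in zip(preds, golds))
        fn = sum(p != label and g == label for p, g in zip(preds, golds))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)

preds = ["pos", "neg", "neutral", "pos", "neg"]
golds = ["pos", "neg", "pos", "pos", "neutral"]
print(accuracy(preds, golds))  # → 0.6
```

Comparing this number against the one in the event logs makes the discrepancy easy to pin down.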
I am trying to figure out why it is doing so badly on this task, and wanted to understand what the training_eval accuracy is really reporting.