The move towards preserving judgement disagreements in NLP requires the identification of adequate evaluation metrics. We identify a set of key properties that such metrics should have, and assess the extent to which natural candidates for soft evaluation, such as Cross Entropy, satisfy these properties. We employ a theoretical framework, supported by a visual approach, by practical examples, and by the analysis of a real-case scenario. Our results indicate that Cross Entropy can behave fairly paradoxically in some cases, whereas other measures, such as Manhattan distance and Euclidean distance, exhibit more intuitive behavior, at least for the case of binary classification.
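As a rough illustration of the three measures named above, here is a minimal sketch that scores predicted soft labels against a target soft label in the binary case. The function names, the example distributions, and the `eps` clipping are assumptions made for this illustration, not code from the paper.

```python
import math

def cross_entropy(target, pred, eps=1e-12):
    # Cross entropy of the prediction with respect to the target distribution.
    # eps is an assumed guard against log(0), not part of the definition.
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, pred))

def manhattan(target, pred):
    # L1 distance between the two distributions.
    return sum(abs(t - p) for t, p in zip(target, pred))

def euclidean(target, pred):
    # L2 distance between the two distributions.
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(target, pred)))

# Hypothetical soft label: 60% of annotators chose the positive class.
target = [0.6, 0.4]
predictions = {
    "exact match": [0.6, 0.4],
    "overconfident": [0.9, 0.1],
}

for name, pred in predictions.items():
    print(
        f"{name}: CE={cross_entropy(target, pred):.4f} "
        f"L1={manhattan(target, pred):.4f} "
        f"L2={euclidean(target, pred):.4f}"
    )
```

One general property this makes visible: for an exact match, both distances are zero, while Cross Entropy equals the entropy of the target and so is nonzero whenever annotators disagree. This is a standard fact about the measures and is not meant to reproduce the specific cases analyzed in the paper.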
If you found our work useful, please cite our paper:
Soft metrics for evaluation with disagreements: an assessment
@inproceedings{rizzi2024soft,
title={Soft metrics for evaluation with disagreements: an assessment},
author={Rizzi, Giulia and Leonardelli, Elisa and Poesio, Massimo and Uma, Alexandra and Pavlovic, Maja and Paun, Silviu and Rosso, Paolo and Fersini, Elisabetta},
booktitle={Proceedings of the 3rd Workshop on Perspectivist Approaches to NLP (NLPerspectives) @ LREC-COLING 2024},
pages={84--94},
year={2024}
}