Code implementation and datasets for the ACL 2023 paper "NLG Evaluation Metrics Beyond Correlation Analysis: An Empirical Metric Preference Checklist" (ACL Anthology).
conda create -n nlgeval_env python=3.7
conda activate nlgeval_env
conda install cudatoolkit=10.1 -c pytorch -n nlgeval_env
pip install -r requirements.txt
The datasets provided in ~/data already include the human ratings and the automatic metric scores used in this study (including human-aligned metrics).
- SummEval (Fabbri et al., 2021)
Source : Source text before it is summarized by the systems
Decoded : Systems' generation outputs
Ref-n : Ground-truth human references (11 references are provided)
Model-ID : See the Appendix of the paper or the original paper for more detailed information
Coherence : Coherence rating by human evaluators (scale 1-5)
Consistency : Consistency rating by human evaluators (scale 1-5)
Fluency : Fluency rating by human evaluators (scale 1-5)
Relevance : Relevance rating by human evaluators (scale 1-5)
Perplexity : Perplexity score for the given output (based on pretrained Language Model)
BLEU-n : BLEU score for the given output
ROUGE-n : ROUGE score for the given output
BERTScore : BERTscore for the given output (Precision, Recall, F1)
CTC : CTC scores (Method: Embedding-based, Discriminative, Regression; Aspect: Consistency, Relevance)
CtrlEval : CtrlEval scores (Aspect: Coherence)
UniEval : UniEval scores (Aspect: Coherence, Consistency, Fluency, Relevance, Overall)
- Newsroom (Grusky et al., 2018)
This dataset does not come with ground-truth references. To compute reference-based or nearly reference-less metrics, we therefore use the source (ArticleText) as the reference.
ArticleID : The unique ID of the article
ArticleText : Source text before it is summarized by the systems
SystemSummary : Systems' generation outputs
ArticleTitle : Title of the article
System : NLG system that performs the summarization task. See the Appendix of the paper or the original paper for more detailed information
CoherenceRating : Coherence rating by human evaluators (scale 1-5)
InformativenessRating : Informativeness rating by human evaluators (scale 1-5)
FluencyRating : Fluency rating by human evaluators (scale 1-5)
RelevanceRating : Relevance rating by human evaluators (scale 1-5)
Perplexity : Perplexity score for the given output (based on pretrained Language Model)
BLEU : BLEU score for the given output
ROUGE : ROUGE score for the given output
BERTScore : BERTscore for the given output (Precision, Recall, F1)
CTC : CTC scores (Method: Embedding-based, Discriminative, Regression; Aspect: Consistency, Relevance)
CtrlEval : CtrlEval scores (Aspect: Coherence, Relevance)
UniEval : UniEval scores (Aspect: Coherence, Consistency, Fluency, Relevance, Overall)
- USR-Topical Chat (Mehri and Eskenazi, 2020)
Fact : The factual context of the article
Context : The preceding conversation as the context for responses
Response : Responses from the systems or humans
Annotators : The annotator for the corresponding human ratings
Model : NLG system that performs the response generation task. See the Appendix of the paper or the original paper for more detailed information
Understandable : Understandability rating by human evaluators (binary scale 0/1, 0=not understandable, 1=understandable)
Natural : Naturalness rating by human evaluators (scale 1-3, 1=not natural, 2=somewhat/moderate, 3=good)
MaintainsContext : Rating by human evaluators for maintaining context (scale 1-3)
Engaging : Engagingness rating by human evaluators (scale 1-3)
UsesKnowledge : Rating by human evaluators for use of knowledge (binary scale 0/1)
Overall : Overall rating by human evaluators (scale 1-5)
Perplexity : Perplexity score for the given output (based on pretrained Language Model)
BLEU : BLEU score for the given output
ROUGE : ROUGE score for the given output
BERTScore : BERTscore for the given output (Precision, Recall, F1)
CTC : CTC scores (Method: Embedding-based, Discriminative, Regression; Aspect: Engagingness, Groundedness)
CtrlEval : CtrlEval scores (Aspect: Coherence, Consistency, Relevance)
UniEval : UniEval scores (Aspect: Understandability, Naturalness, Coherence, Engagingness, Groundedness, Overall)
- USR Persona Chat (Mehri and Eskenazi, 2020)
Fact : Persona context of the article
Context : The preceding conversation as the context for responses
Response : Responses from the systems or humans
Annotators : The annotator for the corresponding human ratings
Model : NLG system that performs the response generation task. See the Appendix of the paper or the original paper for more detailed information
Understandable : Understandability rating by human evaluators (binary scale 0/1, 0=not understandable, 1=understandable)
Natural : Naturalness rating by human evaluators (scale 1-3, 1=not natural, 2=neutral/moderate, 3=good)
MaintainsContext : Rating by human evaluators for maintaining context (scale 1-3)
Engaging : Engagingness rating by human evaluators (scale 1-3)
UsesKnowledge : Rating by human evaluators for use of knowledge (binary scale 0/1)
Overall : Overall rating by human evaluators (scale 1-5)
Perplexity : Perplexity score for the given output (based on pretrained Language Model)
BLEU : BLEU score for the given output
ROUGE : ROUGE score for the given output
BERTScore : BERTscore for the given output (Precision, Recall, F1)
CTC : CTC scores (Method: Embedding-based, Discriminative, Regression; Aspect: Engagingness, Groundedness)
CtrlEval : CtrlEval scores (Aspect: Coherence, Consistency, Relevance)
UniEval : UniEval scores (Aspect: Understandability, Naturalness, Coherence, Engagingness, Groundedness, Overall)
- UBER-PPLM (Dathathri et al., 2020)
This dataset covers an open-ended task (no ground-truth references).
Prefix : One or two words at the beginning of the sentence, given as a cue for the language model to continue and complete into a sentence or full text
Text : Systems' generation outputs
Domain : Topic category as a control attribute
Annotator : The annotator for the corresponding human ratings
Model : NLG system used as the text generator. See the Appendix of the paper or the original paper for more detailed information
Pairtxt : Model pair given to the annotators
Fluency : Fluency rating by human evaluators (scale 1-5)
Relevance : Relevance rating by human evaluators (binary scale 0/1)
Perplexity : Perplexity score for the given output (based on pretrained Language Model)
BERTScore : BERTscore for the given output (Precision, Recall, F1)
CTC : CTC scores (Method: Embedding-based, Discriminative, Regression; Aspect: Consistency, Relevance)
CtrlEval : CtrlEval scores (Aspect: Coherence, Consistency, Relevance)
UniEval : UniEval scores (Aspect: Coherence, Consistency, Fluency, Relevance, Overall)
- CTRL (Keskar et al., 2019)
This dataset covers an open-ended task (no ground-truth references).
Prefix : One or two words at the beginning of the sentence, given as a cue for the language model to continue and complete into a sentence or full text
Text : Systems' generation outputs
Domain : Topic category as a control attribute
Annotator : The annotator for the corresponding human ratings
Model : NLG system used as the text generator. See the Appendix of the paper or the original paper for more detailed information
Pairtxt : Model pair given to the annotators
Fluency : Fluency rating by human evaluators (scale 1-5)
Relevance : Relevance rating by human evaluators (binary scale 0/1)
Perplexity : Perplexity score for the given output (based on pretrained Language Model)
BERTScore : BERTscore for the given output (Precision, Recall, F1)
CTC : CTC scores (Method: Embedding-based, Discriminative, Regression; Aspect: Consistency, Relevance)
CtrlEval : CtrlEval scores (Aspect: Coherence, Consistency, Relevance)
UniEval : UniEval scores (Aspect: Coherence, Consistency, Fluency, Relevance, Overall)
- CTRL-Eval (Ke et al., 2022)
This dataset covers an open-ended task (no ground-truth references).
Prefix : One or two words at the beginning of the sentence, given as a cue for the language model to continue and complete into a sentence or full text
Text : Systems' generation outputs
Attribute : Topic category as a control attribute
Coherence : Coherence rating by human evaluators (scale 1-5)
Consistency : Consistency rating by human evaluators (scale 1-5)
Relevance : Relevance rating by human evaluators (binary scale 0/1)
Perplexity : Perplexity score for the given output (based on pretrained Language Model)
BERTScore : BERTscore for the given output (Precision, Recall, F1)
CTC : CTC scores (Method: Embedding-based, Discriminative, Regression; Aspect: Consistency, Relevance)
CtrlEval : CtrlEval scores (Aspect: Coherence, Consistency, Relevance)
UniEval : UniEval scores (Aspect: Coherence, Consistency, Fluency, Relevance, Overall)
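Because each dataset stores human ratings and automatic metric scores side by side, the two can be compared directly. The following is a minimal sketch, not the code used in the paper: it computes a Spearman rank correlation between a human rating column and a metric column using only the standard library. The inline sample values and the column name `UniEval-Coherence` are assumptions made for illustration; check the actual headers of the files in ~/data before adapting it.

```python
import csv
import io
import math

# Hypothetical excerpt shaped like a SummEval file. "Coherence" is documented
# above; "UniEval-Coherence" and all values are assumptions for illustration.
SAMPLE_CSV = """Model-ID,Coherence,UniEval-Coherence
M1,4.0,0.91
M1,3.0,0.50
M2,2.0,0.55
M2,5.0,0.97
M3,1.0,0.20
"""


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns NaN when either input has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm = math.sqrt(sum((a - mean_x) ** 2 for a in x) * sum((b - mean_y) ** 2 for b in y))
    return cov / norm if norm else float("nan")


def spearman(x, y):
    """Spearman correlation computed as Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
human = [float(row["Coherence"]) for row in rows]
metric = [float(row["UniEval-Coherence"]) for row in rows]
rho = spearman(human, metric)
print(f"Spearman correlation: {rho:.3f}")
```

To use a real file, replace `io.StringIO(SAMPLE_CSV)` with an opened file handle and adjust the two column names.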
We consider three human-aligned metrics. Before computing evaluation scores for the system outputs (the datasets above), the following Python implementations of these metrics need to be installed.
- CTC (Deng et al., 2021)
https://github.com/tanyuqian/ctc-gen-eval
- CTRLEval (Ke et al., 2022)
https://github.com/thu-coai/ctrleval
- UniEval (Zhong et al., 2022)
https://github.com/maszhongming/unieval
The datasets provided in ~/data already include the human ratings and the automatic metric scores used in this study (including human-aligned metrics).
However, if you would like to run the automatic metrics on your own datasets, see the example scripts below.
Before running these scripts, remember to change the environment name in each script to match your own.
Automatic Metric | Benchmark | Bash script |
---|---|---|
Perplexity, BLEU, ROUGE, BERTScore | Text Summarization | scripts/run_autom_newsroom.sh |
Perplexity, BLEU, ROUGE, BERTScore | Controlled Generation | scripts/run_autom_uber.sh |
UniEval | Text Summarization | scripts/run_unieval_summ.sh |
UniEval | Dialogue Generation | scripts/run_unieval_tc.sh |
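When a dataset has no ground-truth references, as with Newsroom above, the source text can stand in as the reference. The sketch below illustrates that pairing step only and is not taken from the scripts in the table. The field names `ArticleText` and `SystemSummary` come from the Newsroom description; `build_pairs`, `unigram_f1`, and the sample rows are hypothetical. The token-overlap F1 is a deliberately simple stand-in, not BLEU, ROUGE, or BERTScore.

```python
from collections import Counter


def build_pairs(rows, output_field="SystemSummary", source_field="ArticleText"):
    """Pair each system output with its source text, used here as the reference."""
    return [(row[output_field], row[source_field]) for row in rows]


def unigram_f1(candidate, reference):
    """Token-overlap F1 on lowercased, whitespace-split text (stand-in metric)."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens or not ref_tokens:
        return 0.0
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical rows in the shape of the Newsroom fields described above.
rows = [
    {"ArticleID": "1", "ArticleText": "the cat sat on the mat", "SystemSummary": "the cat sat"},
    {"ArticleID": "2", "ArticleText": "stocks fell sharply on monday", "SystemSummary": "markets rose"},
]

scores = [unigram_f1(candidate, reference) for candidate, reference in build_pairs(rows)]
for row, score in zip(rows, scores):
    print(row["ArticleID"], round(score, 3))
```

Because a summary is much shorter than its source, recall against the source is low by construction, so scores obtained this way are best compared across systems on the same articles rather than read as absolute values.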
\notebooks\Plot Transfer Correlation.ipynb
\notebooks\Quality-Eval.ipynb
\notebooks\System-Eval.ipynb
\notebooks\Pairwise_System_Ranking.ipynb
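The paper's framework checks whether automatic metrics are faithful to human preference when systems are compared pairwise. The following is a simplified sketch of that idea and does not reproduce the notebooks: it averages scores per system, then counts the fraction of system pairs that the human rating and the metric order in the same direction. `System` and `CoherenceRating` follow the Newsroom fields; `MetricScore`, the helper names, and the sample values are assumptions for illustration.

```python
from itertools import combinations
from statistics import mean


def system_means(rows, score_field, system_field="System"):
    """Average a score column per system."""
    by_system = {}
    for row in rows:
        by_system.setdefault(row[system_field], []).append(float(row[score_field]))
    return {name: mean(values) for name, values in by_system.items()}


def pairwise_agreement(human_means, metric_means):
    """Fraction of system pairs ranked in the same direction by both scores.

    Ties in either score count as disagreement in this simplified version.
    """
    pairs = list(combinations(sorted(human_means), 2))
    if not pairs:
        raise ValueError("need at least two systems")
    agree = 0
    for first, second in pairs:
        human_diff = human_means[first] - human_means[second]
        metric_diff = metric_means[first] - metric_means[second]
        if human_diff * metric_diff > 0:
            agree += 1
    return agree / len(pairs)


# Hypothetical per-output rows; "MetricScore" is an assumed column name.
rows = [
    {"System": "A", "CoherenceRating": "5", "MetricScore": "0.9"},
    {"System": "A", "CoherenceRating": "3", "MetricScore": "0.7"},
    {"System": "B", "CoherenceRating": "3", "MetricScore": "0.5"},
    {"System": "C", "CoherenceRating": "2", "MetricScore": "0.6"},
]

human = system_means(rows, "CoherenceRating")
metric = system_means(rows, "MetricScore")
agreement = pairwise_agreement(human, metric)
print(f"Pairwise agreement: {agreement:.2f}")
```

In this toy data the metric agrees with the human ordering on the pairs (A, B) and (A, C) but reverses (B, C). A fuller analysis would also test whether each pairwise difference is statistically significant before counting it.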
- GPU: ASUS Turbo GeForce GTX 1080 Ti ( RAM, 3584 CUDA cores, compute capability 6.1); CPU Intel Xeon Broadwell-EP 2683v4 @ 2.1GHz (64 hyperthreads, RAM: 1024GB).
- OS: Ubuntu 16.04.7 LTS (GNU/Linux 4.4.0-138-generic x86_64)
@inproceedings{nimah-etal-2023-nlg,
title = "{NLG} Evaluation Metrics Beyond Correlation Analysis: An Empirical Metric Preference Checklist",
author = "Nimah, Iftitahu and
Fang, Meng and
Menkovski, Vlado and
Pechenizkiy, Mykola",
editor = "Rogers, Anna and
Boyd-Graber, Jordan and
Okazaki, Naoaki",
booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2023",
address = "Toronto, Canada",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.acl-long.69",
doi = "10.18653/v1/2023.acl-long.69",
pages = "1240--1266",
abstract = "In this study, we analyze automatic evaluation metrics for Natural Language Generation (NLG), specifically task-agnostic metrics and human-aligned metrics. Task-agnostic metrics, such as Perplexity, BLEU, BERTScore, are cost-effective and highly adaptable to diverse NLG tasks, yet they have a weak correlation with human. Human-aligned metrics (CTC, CtrlEval, UniEval) improves correlation level by incorporating desirable human-like qualities as training objective. However, their effectiveness at discerning system-level performance and quality of system outputs remain unclear. We present metric preference checklist as a framework to assess the effectiveness of automatic metrics in three NLG tasks: Text Summarization, Dialogue Response Generation, and Controlled Generation. Our proposed framework provides access: (i) for verifying whether automatic metrics are faithful to human preference, regardless of their correlation level to human; and (ii) for inspecting the strengths and limitations of NLG systems via pairwise evaluation. We show that automatic metrics provide a better guidance than human on discriminating system-level performance in Text Summarization and Controlled Generation tasks. We also show that multi-aspect human-aligned metric (UniEval) is not necessarily dominant over single-aspect human-aligned metrics (CTC, CtrlEval) and task-agnostic metrics (BLEU, BERTScore), particularly in Controlled Generation tasks.",
}
Issues and pull requests are welcome.