LREC22_MetaEval_Tutorial

This repository shares the materials for the LREC2022 tutorial "Meta-Evaluation of Translation Evaluation Methods: a systematic up-to-date overview".

up2date new location=>

Table of Contents (lecture list): https://github.com/poethan/LREC22_MetaEval_Tutorial/blob/main/lrec22_meta-eval-tutorial_lecture.pdf

Direct Link to download our tutorial PPT slides: https://github.com/poethan/LREC22_MetaEval_Tutorial/blob/main/LREC22_Tutorial_MetaEval.pdf

A Nice Seminar Photo from our tutorial https://drive.google.com/drive/folders/1ll70UZ2jxivSYhE5S1yFKwAJ9FhvgYtt?usp=sharing

Conference Page https://lrec2022.lrec-conf.org/en/workshops-and-tutorials/tutorials-details/

Abstract: Since the 1950s, Machine Translation (MT) has been tackled with a succession of approaches: rule-based methods, example-based and statistical models (SMT), hybrid models and, in recent years, neural models (NMT). NMT has achieved a large quality improvement over conventional methods by taking advantage of the huge amount of parallel corpora available on the internet and of recently developed computational power at an acceptable cost, yet it still struggles to achieve real human parity in many domains and in most, if not all, language pairs. Along the long road of MT research and development, quality evaluation metrics have played a very important role in the advancement and evolution of MT. In this tutorial, we give an overview of traditional human judgement criteria, automatic evaluation metrics, unsupervised quality estimation models, and the meta-evaluation of the evaluation methods. We also cover very recent work in MT evaluation (MTE) that takes advantage of large pre-trained language models to customise automatic metrics for the exact language pairs and domains in which they are deployed. In addition, we introduce statistical confidence estimation of the sample size needed for human evaluation in real-practice simulation.

Cite this Tutorial

@article{hanmeta, title={Meta-Evaluation of Translation Evaluation Methods: a systematic up-to-date overview}, author={Lifeng Han and Serge Gladkoff}, url = {https://github.com/poethan/LREC22_MetaEval_Tutorial} }

Lifeng Han and Serge Gladkoff. 2022. "Meta-Evaluation of Translation Evaluation Methods: a systematic up-to-date overview." Tutorial in LREC2022.

Content/Appendices

Human evaluation methods

Traditional Human Evaluation

Human methods for evaluating machine translation quality have been developed over many years, and the field is extensive. New developments are still ongoing.

The methods are listed below, each with a brief summary.

ALPAC report

Description: seminal work on measuring translation quality. Language and Machines: Computers in Translation and Linguistics, a report by the Automatic Language Processing Advisory Committee (ALPAC), Publication 1416, National Academy of Sciences, National Research Council, Washington, D.C., 1966.

Report: https://nap.nationalacademies.org/read/9547/chapter/1

LISA QA Model

Description: developed by LISA (Localization Industry Standards Association).
Explanatory presentation: http://www.qt21.eu/launchpad/sites/default/files/QTLP%20GALA%20Webinar%203.pdf

MQM 1.0

Description: MQM 1.0 was developed as a deliverable of the EU QT21 project.
Definitions: http://www.qt21.eu/mqm-definition/definition-2015-12-30.html
Issue types: http://www.qt21.eu/mqm-definition/issues-list-2015-12-30.html
Latest version of issue types (MQM Core, see below): https://www.w3.org/community/mqmcg/2018/10/04/draft-2018-10-04/

SAE J2450

The workgroup: https://www.sae.org/standardsdev/j2450p1.htm
The MT Summit VII paper: Deploying the SAE J2450 Translation Quality Metric in MT Projects https://aclanthology.org/1999.mtsummit-1.41.pdf

ATA Translation Certification Program

Description: the ATA translation certification program features a robust translation quality evaluation metric.

Described in: "Welcome to the Real World: Professional-Level Translator Certification", The International Journal of Translation and Interpreting Research, 2013, by Geoff Koby.

Additional info: https://www.atanet.org/translation/summary-of-defining-translation-quality/

TAUS DQF

Description: an MQM-like metric that evolved out of the QT21/QTLaunchpad projects managed by DFKI, among others.

References: https://www.taus.net/resources/blog/category/dynamic-quality-framework

Now discontinued. Incorporated in MQM Core

MQM Core

Many users may be familiar with the historical antecedents of the current MQM Core error typology. The typology is based on the earlier TAUS system, DQF/MQM, which evolved out of the QT21/QTLaunchpad projects managed by DFKI, among others. Some changes appear in the top-level dimensions: DQF-Fluency is now Linguistic Conventions, and DQF-Verity has become Audience Appropriateness. In refining the typology, these designators were judged to be more transparent to most users. See the MQM website (https://themqm.org/) for further explanations. These changes were agreed upon in collaboration with TAUS and ISO 5060.
https://www.w3.org/community/mqmcg/2018/10/04/draft-2018-10-04/

MQM 2.0 (ASTM WK46396) Analytic Evaluation of Translation Quality

The latest version of the Multidimensional Quality Metrics (MQM) framework.

The work item: www.astm.org/workitem-wk46396
More detailed information: https://themqm.org/

HQuest (ASTM WK54884) - holistic system for measuring translation quality

The work item: https://www.astm.org/workitem-wk54884

ISO/DIS 5060 - Translation services — Evaluation of translation output — General guidance

The work item: https://www.iso.org/standard/80701.html

Advanced Human Evaluation methods

With the advent of neural machine translation, new human evaluation methods have emerged:

HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Professional Post-Editing Towards More Effective MT Evaluation

Description: a human evaluation metric for quickly assessing the quality of MT output.

Reference: https://arxiv.org/abs/2112.13833

Evaluating Multi-word Expression (MWE) MT Quality

MWEs are a bottleneck in many NLP tasks. They cover a broad range of expressions, including idioms, metaphors, and fixed or semi-fixed patterns.

AlphaMWE: Construction of Multilingual Parallel Corpora with MWE Annotations. In Proceedings of the Joint Workshop on Multiword Expressions and Electronic Lexicons, pages 44–57, online. Association for Computational Linguistics.

Repository: https://github.com/poethan/AlphaMWE

Automatic evaluation methods

BLEU

BLEU computes n-gram precision for n-gram sizes 1 to 4 and multiplies the combined precision by a brevity penalty. The rationale is that an output can consist mostly of correct words and still be too short (for example, when much of the meaning of the source sentence is lost). Its precision may then be very high even though it is not a good translation, so the brevity penalty lowers the final score to compensate.

First publication: https://aclanthology.org/P02-1040.pdf
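
The computation described above can be sketched in a few lines of Python. This is a simplified illustration, not a reference implementation: it scores a single sentence against a single reference, uses whitespace tokenisation and applies no smoothing, whereas BLEU as published is computed at corpus level. The function names `ngram_counts` and `sentence_bleu` are chosen here for illustration.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of size n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    """Simplified single-reference, unsmoothed sentence-level BLEU."""
    cand = candidate.split()
    ref = reference.split()
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, any zero precision gives zero
        log_precisions.append(math.log(clipped / total))

    # Brevity penalty: 1 if the candidate is longer than the reference,
    # otherwise exp(1 - r/c), which is below 1 for short candidates.
    if len(cand) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))

    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


print(sentence_bleu("the cat sat on", "the cat sat on the mat"))
```

In the usage line, every n-gram of the candidate appears in the reference, so all four precisions are 1; the score of about 0.61 comes entirely from the brevity penalty, which is the effect the paragraph above describes.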

hLEPOR (LEngth penalty, Precision, n-gram pOsition difference penalty and Recall)

First publication: https://arxiv.org/ftp/arxiv/papers/1703/1703.08748.pdf

COMET (Crosslingual Optimized Metric for Evaluation of Translation)

COMET is designed to learn to predict human judgments of MT quality. It does this by using a neural system to first map the MT-generated translation, the reference translation and the source language text into neural meaning representations. It then leverages these representations in order to learn to predict a quality score that is explicitly optimized for correlation with human judgments of translation quality.

https://resources.unbabel.com/blog/why-we-believe-high-quality-machine-translation-really-is-possible

cushLEPOR: customised hLEPOR metric using Optuna for higher agreement with human judgments or pre-trained language model LaBSE

Customised hLEPOR (cushLEPOR) uses the Optuna hyper-parameter optimisation framework to fine-tune the hLEPOR weighting parameters towards better agreement with pre-trained language models (using LaBSE) for the exact MT language pairs on which cushLEPOR is deployed. The cushLEPOR metric can also be fine-tuned to correlate with other metrics; in this work the authors optimise it towards professional human evaluation.

First publication: https://arxiv.org/abs/2108.09484
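
The idea of tuning metric weights towards human scores can be illustrated with a toy sketch. Several things here are assumptions made for illustration only: the metric is just a weighted harmonic mean of unigram precision and recall (not the full hLEPOR formula), Optuna is replaced by a plain exhaustive grid search so that the example needs no third-party packages, and the sentence pairs and human scores are invented. It is not the cushLEPOR implementation.

```python
import math
from collections import Counter


def precision_recall(candidate, reference):
    """Unigram precision and recall (assumes non-empty strings)."""
    cand, ref = candidate.split(), reference.split()
    overlap = sum((Counter(cand) & Counter(ref)).values())
    return overlap / len(cand), overlap / len(ref)


def weighted_harmonic(precision, recall, alpha, beta):
    """Weighted harmonic mean; alpha weights recall, beta weights precision."""
    if precision == 0 or recall == 0:
        return 0.0
    return (alpha + beta) / (alpha / recall + beta / precision)


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def tune_weights(pairs, human_scores, grid):
    """Grid search (a stand-in for Optuna) for the best-correlating weights."""
    pr = [precision_recall(c, r) for c, r in pairs]
    best = None
    for alpha in grid:
        for beta in grid:
            scores = [weighted_harmonic(p, r, alpha, beta) for p, r in pr]
            corr = pearson(scores, human_scores)
            if best is None or corr > best[0]:
                best = (corr, alpha, beta)
    return best


# Invented toy data: (MT output, reference) pairs with made-up human scores.
PAIRS = [
    ("the cat sat", "the cat sat on the mat"),
    ("the cat sat on the mat today again", "the cat sat on the mat"),
    ("a dog", "the cat sat on the mat"),
    ("the cat sat on the mat", "the cat sat on the mat"),
]
HUMAN = [2.0, 4.0, 1.0, 5.0]
GRID = [1, 2, 3, 5, 9]

best_corr, best_alpha, best_beta = tune_weights(PAIRS, HUMAN, GRID)
print(best_corr, best_alpha, best_beta)
```

A real setup would tune all of the metric's weighting parameters on a much larger set of professionally evaluated segments, and would check the tuned weights on held-out data to guard against overfitting.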

Classification of automatic methods of translation quality evaluation

  • N-gram Word Surface Similarity
  • Syntax and Semantics
  • Statistical Eval and Deep Learning Based Eval
  • Reference-dependent vs Reference-free (QE)

Meta Evaluation of translation quality metrics

  • Eval Model Credibility
  • Sample Size Confidence
  • Agreement Measuring
  • Correlations between AutoEval and HumanEval
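
Two of the topics above, agreement measuring and correlation between automatic and human evaluation, can be sketched with standard statistics. The sketch below uses only the Python standard library; Pearson and Spearman correlation and Cohen's kappa are common choices for these tasks, but picking them here is an assumption rather than a summary of the tutorial slides, and all scores and labels are invented toy values.

```python
import math
from collections import Counter


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators.

    Undefined (division by zero) when expected agreement is exactly 1.
    """
    n = len(labels_a)
    observed = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    labels = set(labels_a) | set(labels_b)
    expected = sum(count_a[l] * count_b[l] for l in labels) / (n * n)
    return (observed - expected) / (1 - expected)


# Invented toy values: metric scores vs. human scores for four systems.
metric_scores = [0.2, 0.4, 0.6, 0.8]
human_scores = [1, 3, 2, 4]
print(pearson(metric_scores, human_scores))
print(spearman(metric_scores, human_scores))

# Invented toy labels from two annotators on four segments.
annotator_1 = ["ok", "ok", "bad", "bad"]
annotator_2 = ["ok", "bad", "bad", "bad"]
print(cohen_kappa(annotator_1, annotator_2))
```

In practice, correlations are usually computed over many systems or segments, and a library such as SciPy would also report significance values.
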
Speakers

Lifeng Han

Serge Gladkoff

References

Survey and Overview

HumanEval Metrics

  • S Gladkoff, L Han (2022) HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Professional Post-Editing Towards More Effective MT Evaluation. LREC22. arXiv preprint arXiv:2112.13833
  • Lifeng Han, Gareth Jones, and Alan Smeaton. 2020. AlphaMWE: Construction of Multilingual Parallel Corpora with MWE Annotations. In Proceedings of the Joint Workshop on Multiword Expressions and Electronic Lexicons, pages 44–57, online. Association for Computational Linguistics. https://aclanthology.org/2020.mwe-1.6/

Auto-eval and QE

Meta-eval and Confidence

  • S Gladkoff, I Sorokina, L Han, A Alekseeva (2022) Measuring Uncertainty in Translation Quality Evaluation (TQE). LREC22. arXiv preprint arXiv:2111.07699 https://arxiv.org/abs/2111.07699

older Presentations

Cite this tutorial
  • Lifeng Han and Serge Gladkoff. 2022. Meta-Evaluation of Translation Evaluation Methods: a systematic up-to-date overview. Tutorial at LREC2022. June 20th, 2022, Marseille, France.

@article{hanmeta, title={Meta-Evaluation of Translation Evaluation Methods: a systematic up-to-date overview}, author={Han, Lifeng and Gladkoff, Serge}, Conferences={Tutorial in LREC2022}, Location={Marseille, France} }

Acknowledgement
  • We thank the NLP group at The University of Manchester for feedback and discussion on the structure and presentation of this tutorial, especially Viktor Schlegel, Nhung Nguen, Haifa, Tharindu, Abdullah, Laura.
  • We also thank the University of Manchester for funding support (via Prof Goran Nenadic's project) and the ADAPT Research Centre, DCU, for previous funding. Links: NLP at the University of Manchester https://www.cs.manchester.ac.uk/research/expertise/natural-language-processing/ and ADAPT https://www.adaptcentre.ie/
  • We thank the tutorial attendees at LREC2022 for valuable discussion and feedback on this tutorial and our work.
Contact: you are welcome to reach out