This repository shares the materials for the LREC2022 tutorial "Meta-Evaluation of Translation Evaluation Methods: a systematic up-to-date overview".
Table of Contents (Lecture List): https://github.com/poethan/LREC22_MetaEval_Tutorial/blob/main/lrec22_meta-eval-tutorial_lecture.pdf
Direct link to download our tutorial slides (PDF): https://github.com/poethan/LREC22_MetaEval_Tutorial/blob/main/LREC22_Tutorial_MetaEval.pdf
A seminar photo from our tutorial: https://drive.google.com/drive/folders/1ll70UZ2jxivSYhE5S1yFKwAJ9FhvgYtt?usp=sharing
Conference page: https://lrec2022.lrec-conf.org/en/workshops-and-tutorials/tutorials-details/
Abstract: Since the 1950s, Machine Translation (MT) has been tackled with different scientific approaches, from rule-based methods, example-based models and statistical models (SMT) to hybrid models and, in very recent years, neural models (NMT). While NMT has achieved a huge quality improvement over conventional methodologies, by taking advantage of the huge amounts of parallel corpora available from the internet and recently developed super-computational power at an acceptable cost, it still struggles to achieve real human parity in many domains and most language pairs, if not all of them. Along the long road of MT research and development, quality evaluation metrics have played very important roles in MT advancement and evolution. In this tutorial, we overview traditional human judgement criteria, automatic evaluation metrics, unsupervised quality estimation models, as well as the meta-evaluation of these evaluation methods. We also cover very recent work in the MT evaluation (MTE) field that takes advantage of large pre-trained language models to customise automatic metrics towards the exact language pairs and domains in which they are deployed. In addition, we introduce statistical confidence estimation of the sample size needed for human evaluation in real-practice simulation.
Cite this Tutorial |
---|
@article{hanmeta, title={Meta-Evaluation of Translation Evaluation Methods: a systematic up-to-date overview}, author={Han, Lifeng and Gladkoff, Serge}, note={Tutorial at LREC2022, Marseille, France}, url={https://github.com/poethan/LREC22_MetaEval_Tutorial} }
Lifeng Han and Serge Gladkoff. 2022. "Meta-Evaluation of Translation Evaluation Methods: a systematic up-to-date overview." Tutorial at LREC2022, June 20th, 2022, Marseille, France.
Content/Appendices |
---|
Traditional human evaluation methods for machine translation quality have been developed over many years, and the field has grown very extensive. New developments are also ongoing.
Here is a list of methods with brief summaries.
Description: seminal work on measuring translation quality. "Language and Machines: Computers in Translation and Linguistics", a report by the Automatic Language Processing Advisory Committee (ALPAC), Publication 1416, National Academy of Sciences, National Research Council, Washington, D.C., 1966
Report: https://nap.nationalacademies.org/read/9547/chapter/1
Description: Developed by LISA (Localization Industry Standards Association). Explanatory presentation: http://www.qt21.eu/launchpad/sites/default/files/QTLP%20GALA%20Webinar%203.pdf
Description: MQM 1.0 was developed as a deliverable of the EU QT21 project. Definitions: http://www.qt21.eu/mqm-definition/definition-2015-12-30.html Issue types: http://www.qt21.eu/mqm-definition/issues-list-2015-12-30.html Latest version of issue types (MQM Core, see below): https://www.w3.org/community/mqmcg/2018/10/04/draft-2018-10-04/
The workgroup: https://www.sae.org/standardsdev/j2450p1.htm
The MT Summit VII paper: Deploying the SAE J2450 Translation Quality Metric in MT Projects
https://aclanthology.org/1999.mtsummit-1.41.pdf
Description: the ATA translation certification program features a very interesting and robust translation quality evaluation metric.
Described in: Geoff Koby. 2013. "Welcome to the Real World: Professional-Level Translator Certification." The International Journal of Translation and Interpreting Research.
Additional info: https://www.atanet.org/translation/summary-of-defining-translation-quality/
Description: an MQM-like metric that evolved out of the QT21/QTLaunchpad projects, managed amongst others by DFKI.
References: https://www.taus.net/resources/blog/category/dynamic-quality-framework
Now discontinued; incorporated into MQM Core.
Many users may be familiar with the historical antecedents of the current MQM Core error typology.
The typology is based on the earlier TAUS system, DQF/MQM, which evolved out of the QT21/QTLaunchpad projects managed amongst others by DFKI.
Some changes appear in the top-level dimensions (DQF-Fluency is now Linguistic conventions, and DQF-Verity has become Audience Appropriateness); these designators were deemed more transparent to most users. See the MQM webpage (themqm.org) for further explanation. These changes were agreed upon in collaboration with TAUS and ISO 5060.
https://www.w3.org/community/mqmcg/2018/10/04/draft-2018-10-04/
The latest version of the Multidimensional Quality Metrics (MQM) framework.
The work item: https://www.astm.org/workitem-wk46396 More detailed information: https://themqm.org/
The work item: https://www.astm.org/workitem-wk54884
The work item: https://www.iso.org/standard/80701.html
With the advent of neural machine translation, new human evaluation methods have emerged:
HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Professional Post-Editing Towards More Effective MT Evaluation
Description: a human evaluation metric to quickly assess MT quality.
Reference: https://arxiv.org/abs/2112.13833
Multiword expressions (MWEs) are a bottleneck in many NLP tasks. MWEs have broad coverage, including idioms, metaphors, and fixed/semi-fixed patterns.
AlphaMWE: https://github.com/poethan/AlphaMWE. Published in Proceedings of the Joint Workshop on Multiword Expressions and Electronic Lexicons, pages 44–57, online. Association for Computational Linguistics.
BLEU computes the precision of n-grams of size 1 to 4, combined with a brevity-penalty coefficient. The idea behind this design is that if the output is mostly correct but too short (e.g. much of the source sentence's meaning is lost), the precision may be very high even though the translation is not good; the brevity-penalty coefficient decreases the final score to counter this.
First publication: https://aclanthology.org/P02-1040.pdf
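To make the clipped n-gram precision and the brevity penalty concrete, here is a minimal, self-contained sketch of a sentence-level BLEU computation. It is an illustration of the formula only, not the official implementation (which works at corpus level and with standard tokenisation and smoothing):

```python
# Minimal sentence-level BLEU sketch: clipped n-gram precision for n = 1..4,
# geometric mean of the precisions, and the brevity-penalty coefficient.
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(candidate, reference, max_n=4):
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # clipped ("modified") precision: a candidate n-gram is only credited
        # as many times as it occurs in the reference
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        log_precisions.append(math.log(max(overlap, 1e-9) / total))
    # brevity penalty: down-weights candidates shorter than the reference,
    # so a short but precise output cannot get an inflated score
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(log_precisions) / max_n)

print(sentence_bleu("the cat sat on the mat", "the cat sat on the red mat"))
```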
First publication: https://arxiv.org/ftp/arxiv/papers/1703/1703.08748.pdf
COMET is designed to learn to predict human judgments of MT quality. It does this by using a neural system to first map the MT-generated translation, the reference translation, and the source-language text into neural meaning representations. It then leverages these representations to learn to predict a quality score that is explicitly optimized for correlation with human judgments of translation quality.
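A hedged usage sketch of this idea, assuming the open-source unbabel-comet package and one of its published reference-based checkpoints; the model identifier and the predict() call below follow the package documentation at the time of writing and may differ across versions:

```python
# Sketch of scoring one segment with COMET via the `unbabel-comet` package
# (pip install unbabel-comet). Model name and API are assumptions that may
# vary between package versions.
from comet import download_model, load_from_checkpoint

model_path = download_model("wmt20-comet-da")   # reference-based COMET checkpoint (assumed name)
model = load_from_checkpoint(model_path)

data = [{
    "src": "Der Hund bellt.",       # source-language text
    "mt":  "The dog is barking.",   # MT output to be scored
    "ref": "The dog barks.",        # human reference translation
}]
prediction = model.predict(data, batch_size=8, gpus=0)
print(prediction)   # segment-level scores (and a system-level aggregate)
```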
cushLEPOR: customising hLEPOR metric using Optuna for higher agreement with human judgments or pre-trained language model LaBSE
Customised hLEPOR (cushLEPOR) uses the Optuna hyper-parameter optimisation framework to fine-tune the hLEPOR weighting parameters towards better agreement with pre-trained language models (using LaBSE) for the exact MT language pairs to which cushLEPOR is deployed. The cushLEPOR metric can also be fine-tuned to correlate with other scores; in this work the authors optimise it towards professional human evaluation.
First publication: https://arxiv.org/abs/2108.09484
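Below is an illustrative, runnable sketch of this tuning loop (not the authors' code): a toy word-overlap metric stands in for hLEPOR, and Optuna searches its weighting parameters to maximise Pearson correlation with a small set of target scores, which in cushLEPOR would be LaBSE similarities or professional human judgments. The metric, data and parameter names are placeholders.

```python
# Sketch of metric-parameter tuning with Optuna towards agreement with target scores.
import optuna
from scipy.stats import pearsonr

def toy_metric(hyp, ref, alpha, beta):
    """Weighted harmonic mean of unigram precision and recall (hLEPOR-like stand-in)."""
    h, r = hyp.split(), ref.split()
    common = len(set(h) & set(r))
    precision = common / max(len(h), 1)
    recall = common / max(len(r), 1)
    return (alpha + beta) * precision * recall / max(alpha * precision + beta * recall, 1e-9)

# Toy tuning data: MT outputs, references, and target scores to agree with.
hyps = ["the cat sat on the mat", "a dog barks loud", "hello world"]
refs = ["the cat is on the mat", "the dog is barking loudly", "hello there world"]
targets = [0.8, 0.3, 0.7]

def objective(trial):
    alpha = trial.suggest_float("alpha", 0.1, 5.0)   # weight on recall
    beta = trial.suggest_float("beta", 0.1, 5.0)     # weight on precision
    scores = [toy_metric(h, r, alpha, beta) for h, r in zip(hyps, refs)]
    corr, _ = pearsonr(scores, targets)
    return corr   # Optuna maximises agreement with the target scores

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params, study.best_value)
```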
- N-gram Word Surface Similarity
- Syntax and Semantics
- Statistical Eval and Deep Learning Based Eval
- Reference-dependent vs Reference-free (QE)
- Eval Model Credibility
- Sample Size and Confidence Estimation
- Agreement Measuring
- Correlations between AutoEval and HumanEval (a small sketch of these last two items follows this list)
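As a small illustration of the agreement and correlation measurements above, the sketch below (with toy numbers) computes the statistics commonly used in meta-evaluation: Pearson, Spearman and Kendall correlations between automatic-metric scores and human judgments, and Cohen's kappa for inter-annotator agreement. The scores and labels are invented for demonstration only.

```python
# Common meta-evaluation measurements on toy data.
from scipy.stats import pearsonr, spearmanr, kendalltau
from sklearn.metrics import cohen_kappa_score

auto_scores = [0.62, 0.80, 0.55, 0.91, 0.70]   # automatic metric scores (toy)
human_scores = [3.0, 4.5, 2.5, 5.0, 3.5]       # human adequacy ratings (toy)

print("Pearson r:   ", pearsonr(auto_scores, human_scores)[0])
print("Spearman rho:", spearmanr(auto_scores, human_scores)[0])
print("Kendall tau: ", kendalltau(auto_scores, human_scores)[0])

# Agreement between two annotators on categorical judgments (e.g. error / ok)
annotator_a = ["ok", "error", "ok", "ok", "error"]
annotator_b = ["ok", "error", "error", "ok", "error"]
print("Cohen's kappa:", cohen_kappa_score(annotator_a, annotator_b))
```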
Speakers |
---|
Lifeng Han
- Post-doctoral Research Associate at UoM
- PhD graduated from the ADAPT Research Centre, DCU. Thesis: 'An investigation into multi-word expressions in machine translation' https://doras.dcu.ie/26559
- MSc in Software Engineering from UM, with thesis in MT evaluation (LEPOR, hLEPOR, nLEPOR) receiving an Excellent Award. Thesis: 'LEPOR: An Augmented Machine Translation Evaluation Metric' https://arxiv.org/abs/1703.08748
- BSc in Math from HNU of China
- Research topics: Machine Translation, Evaluation, Information Extraction, Linguistics-Aware NLP
- Google Scholar page (https://scholar.google.com/citations?hl=en&user=_vf3E2QAAAAJ&view_op=list_works)
Serge Gladkoff
- CEO of Logrus Global (LSP) https://logrusglobal.com
- Head of R&D of Logrus Global AI Lab http://ai-lab.logrusglobal.com/blog/
- Technical contact of ASTM WK4636 standard
- ISO/ANSI expert from the United States
- ResearchGate page: https://www.researchgate.net/profile/Serge-Gladkoff/research
References |
---|
Survey and Overview
- Han and Wong (2016). Machine Translation Evaluation: A Survey. https://arxiv.org/abs/1605.04515 (updated in 2018; see the next entry)
- Lifeng Han (2018) Machine Translation Evaluation Resources and Methods: A Survey https://arxiv.org/abs/1605.04515v8
- Han et al. (2021) Translation Quality Assessment: A Brief Survey on Manual and Automatic Methods. https://aclanthology.org/2021.motra-1.3/
- Lifeng Han (2022) An Overview on Machine Translation Evaluation. https://arxiv.org/abs/2202.11027 (in Chinese, English update forthcoming)
HumanEval Metrics
- Serge Gladkoff and Lifeng Han (2022) HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Professional Post-Editing Towards More Effective MT Evaluation. LREC22. arXiv preprint arXiv:2112.13833
- Lifeng Han, Gareth Jones, and Alan Smeaton. 2020. AlphaMWE: Construction of Multilingual Parallel Corpora with MWE Annotations. In Proceedings of the Joint Workshop on Multiword Expressions and Electronic Lexicons, pages 44–57, online. Association for Computational Linguistics. https://aclanthology.org/2020.mwe-1.6/
Auto-eval and QE
- Lifeng Han, Irina Sorokina, Gleb Erofeev, Serge Gladkoff (2021) cushLEPOR: customising hLEPOR metric using Optuna for higher agreement with human judgments or pre-trained language model LaBSE. WMT21. https://aclanthology.org/2021.wmt-1.109/
- Gleb Erofeev, Irina Sorokina, Lifeng Han, Serge Gladkoff (2021) cushLEPOR uses LABSE distilled knowledge to improve correlation with human translation evaluations https://aclanthology.org/2021.mtsummit-up.28/
- Lifeng Han. 2014. LEPOR: An Augmented Machine Translation Evaluation Metric. MSc Thesis. University of Macau, Macao. https://arxiv.org/abs/1703.08748
- Han et al. 2014. Unsupervised Quality Estimation Model for English to German Translation and Its Application in Extensive Supervised Evaluation. https://www.hindawi.com/journals/tswj/2014/760301/ Recent Advances in Information Technology (using n-gram for precision and recall)
- Han et al. 2013. Quality Estimation for Machine Translation Using the Joint Method of Evaluation Criteria and Statistical Modeling. WMT13. https://aclanthology.org/W13-2245.pdf
- Han et al. (2013) "Phrase Tagset Mapping for French and English Treebanks and Its Application in Machine Translation Evaluation”. GSCL2013. https://link.springer.com/content/pdf/10.1007/978-3-642-40722-2_13.pdf (Using syntactic phrase structure)
- Han et al. (2013) "Language-independent Model for Machine Translation Evaluation with Reinforced Factors" in MT Summit. https://aclanthology.org/2013.mtsummit-posters.3.pdf (Using POS)
- Han et al. 2012. LEPOR: A Robust Evaluation Metric for Machine Translation with Augmented Factors. COLING12. https://aclanthology.org/C12-2044.pdf
Meta-eval and Confidence
- S Gladkoff, I Sorokina, L Han, A Alekseeva (2022) Measuring Uncertainty in Translation Quality Evaluation (TQE). LREC22. arXiv preprint arXiv:2111.07699 https://arxiv.org/abs/2111.07699
Older Presentations
- From Motra2021: video link https://drive.google.com/drive/folders/1njFi9FyHp1mURN0_5DXW1ws6szwq2RMo?usp=sharing; slides link https://drive.google.com/drive/folders/15YQDJaWoKJZiStuaWXzz-2Y-KOW1Bozh?usp=sharing
Acknowledgement |
---|
- We thank the NLP group at The University of Manchester for feedback and discussion on this tutorial's structure and presentation, especially Viktor Schlegel, Nhung Nguen, Haifa, Tharindu, Abdullah, and Laura.
- We also thank the University of Manchester for funding support (via Prof Goran Nenadic's project) and the ADAPT Research Centre, DCU, for previous funding. NLP at Manchester: https://www.cs.manchester.ac.uk/research/expertise/natural-language-processing/ and ADAPT: https://www.adaptcentre.ie/
- We thank the tutorial attendees at LREC2022 for valuable discussion and feedback on this tutorial and our work.
Contact (welcome to reach out) |
---|
- firstname.lastname@manchester.ac.uk (Firstname: Lifeng; Lastname: Han)