- This paper introduces the first dialectal NER dataset for German, BarNER, with 161K tokens annotated on Bavarian Wikipedia articles (bar-wiki) and tweets (bar-tweet).
- We illustrate differences between Bavarian and standard German in lexical distribution, syntactic construction, and entity information.
- We conduct in-domain, cross-domain, sequential, and joint experiments on two Bavarian and three German corpora and present the first comprehensive NER results on Bavarian.
- Incorporating knowledge from the larger German NER (sub-)datasets notably improves performance on bar-wiki and moderately improves it on bar-tweet.
- We additionally conduct multi-task learning between five NER and two Bavarian-German dialect identification tasks (with gold dialect labels on our Bavarian tweets) and achieve NER SOTA on bar-wiki.
- data: final and double-annotated BarNER and BarDID data
- BarNER-final: adjudicated final annotations (train/dev/test splits) for Bavarian and some German text in the tweet and wiki genres
- BarNER-double: individual documents with annotations from multiple annotators, one annotator per column
- BarDID-wiki, BarDID-tweet: dialect identification data for wiki and tweet
- MaiNLP_NER_Annotation_Guidelines.pdf
- sample_machamp_scripts: sample Machamp scripts
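
The double-annotated files store one annotator per column, so a quick token-level agreement check can be a useful sanity test before working with them. The following is a minimal sketch, not part of the released scripts. It assumes a tab-separated, one-token-per-line layout with the token in the first column, one label column per annotator after it, and blank lines between sentences; the function names `read_double_annotated` and `token_agreement` are hypothetical, and the actual column layout in BarNER-double should be checked against the files.

```python
from typing import Iterable, List, Tuple

Token = Tuple[str, List[str]]  # (token text, labels: one per annotator)


def read_double_annotated(lines: Iterable[str]) -> List[List[Token]]:
    """Parse token-per-line text into sentences.

    ASSUMPTION: each token line is `token<TAB>label_1<TAB>label_2...`,
    sentences are separated by blank lines, and lines without a tab
    (e.g. metadata) can be skipped. Adjust to the real file layout.
    """
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if "\t" not in line:
            continue  # not a token line under the assumed layout
        parts = line.split("\t")
        current.append((parts[0], parts[1:]))
    if current:
        sentences.append(current)
    return sentences


def token_agreement(sentences: List[List[Token]], a: int = 0, b: int = 1) -> float:
    """Fraction of tokens on which annotators `a` and `b` chose the same label."""
    total = 0
    same = 0
    for sentence in sentences:
        for _, labels in sentence:
            if len(labels) > max(a, b):
                total += 1
                same += int(labels[a] == labels[b])
    return same / total if total else 0.0


# Illustrative, made-up lines (not taken from the dataset).
sample = [
    "Da\tO\tO\n",
    "Wastl\tB-PER\tB-PER\n",
    "aus\tO\tO\n",
    "Minga\tB-LOC\tB-ORG\n",
    "\n",
    "Servus\tO\tO\n",
]
sentences = read_double_annotated(sample)
print(len(sentences), token_agreement(sentences))
```

To run this on a real file, pass an open file handle instead of `sample`, for example `read_double_annotated(open(path, encoding="utf-8"))`. Raw token-level agreement is only a rough indicator; span-level or chance-corrected measures are stricter.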
To appear at LREC-COLING 2024
https://arxiv.org/abs/2403.12749
@misc{peng2024sebastian,
title={Sebastian, Basti, Wastl?! Recognizing Named Entities in Bavarian Dialectal Data},
author={Siyao Peng and Zihang Sun and Huangyan Shan and Marie Kolm and Verena Blaschke and Ekaterina Artemova and Barbara Plank},
year={2024},
eprint={2403.12749},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
- This project is supported by ERC Consolidator Grant DIALECT 101043235.