This thesis attempts to correct some errors and inconsistencies in different treebanks. The inconsistencies can arise from linguistic constructions, shortcomings in the annotation guidelines, annotators misunderstanding the guidelines, or random annotator errors, among other causes. We propose a metric to assess the consistency of POS annotation across different treebanks of the same language that share the same annotation guidelines. We offer solutions to some previously identified inconsistencies within the Universal Dependencies project, and check the viability of a proposed inconsistency detection tool in a low-resource setting. The solutions discussed in the thesis are language-neutral and intended to work efficiently across multiple languages.
- Estimating POS Annotation Consistency of Different Treebanks in a Language
- conj_head: Head Identification in Coordinating Conjunctions
- Mining Errors in Low-Resource Languages by Combining LISCA and Cross-Validation
- AUX vs. VERB: Attempt at Separation of Verbs and Auxiliary Verbs
The repository contains the code and the experiment results. The thesis was finished in July 2020.
Supervisor: Dan Zeman, UFAL, Charles University, Prague
Co-Supervisor: Koldo Gojenola, Computer Languages and Systems, University of the Basque Country (UPV/EHU), Spain
Thesis Main Document (LaTeX Source)