Native Language Identification of Greek in English-Written Texts

This repo contains all the code, the data, and the thesis written for the MA in Digital Text Analysis at the University of Antwerp.

Abstract

Native Language Identification (NLI) is the task of predicting an author's native language (L1) given only texts that the author produced in a foreign language (L2). Native speakers of English are outnumbered by those who speak it as an L2, which means that a large volume of English content is created by non-native speakers. It is therefore important to devise strategies for determining the native language of non-native English writers. This study introduces the NLI task for native speakers of Greek with English as L2. The corpus is a collection of similar-topic Reddit posts written in English by native speakers of Greek, along with English posts written by native English speakers from the UK and the US. The study provides new insights into the challenges and opportunities of identifying native languages in multilingual contexts. It also compares performance across a variety of linguistic features, using both statistical and neural algorithms, to shed light on how the two approaches differ when tackling the NLI task. It lays the groundwork for extending the task to other L1s in the future, and for training classifiers and checking their robustness. We find that combining word embeddings with simple $n$-gram features over characters and Universal Part-of-Speech tags achieves a mean macro-F1 score of $82.6\%$ under 10-fold cross-validation, surpassing most previous NLI studies. Our best pre-trained BERT-based approach achieves comparably high results ($82.1\%$). A hard-voting ensemble that combines three classifiers with different linguistic and word-embedding features was also trained, but it did not improve performance. The subsequent error analysis suggests that the short-form nature of the posts and the frequent use of words found mainly in the opposite class could be among the main reasons for these results. Finally, one finding indicates that L1 word etymology may relate to L2 lexical choices, which would make an attractive case for future research in the field of NLI.

Keywords: Machine Learning, Natural Language Processing, Reddit, Thesis, Word Embeddings, Logistic Regression, Support Vector Machines, Native Language Identification, BERT model, DistilBERT model
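
The abstract mentions two ideas that are easy to misread without an example: hard voting over three classifiers and the macro-averaged F1 score. The snippet below is a minimal, standard-library-only sketch of both. It is not the code used in the thesis. The function names `hard_vote` and `macro_f1`, the class labels `"EL"` and `"EN"`, and the toy predictions are all assumptions made up for illustration.

```python
from collections import Counter


def hard_vote(predictions_per_model):
    """Return the majority label per sample across several models.

    `predictions_per_model` is a list of equal-length label lists,
    one list per model. With three models and two classes a tie
    cannot occur.
    """
    voted = []
    for labels in zip(*predictions_per_model):
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data (hypothetical, not taken from the thesis corpus).
y_true = ["EL", "EL", "EN", "EN"]
model_a = ["EL", "EN", "EN", "EN"]
model_b = ["EL", "EL", "EN", "EL"]
model_c = ["EN", "EL", "EN", "EN"]

ensemble = hard_vote([model_a, model_b, model_c])
print("ensemble predictions:", ensemble)
print("macro-F1, model_a:", round(macro_f1(y_true, model_a), 3))
print("macro-F1, ensemble:", round(macro_f1(y_true, ensemble), 3))
```

In this toy case each model makes a different mistake, so the vote corrects all of them. When the classifiers tend to make the same mistakes, voting cannot help, which is one possible reading of why the ensemble in the abstract brought no gain.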

Citation

If you use this code or the thesis, please cite it as follows:

@mastersthesis{boumparis-2023-native,
    author = {Boumparis, Dimitrios},
    title = {{Native Language Identification of Greek in English-Written Texts: A Comparison of Machine Learning Approaches}},
    school = {University of Antwerp},
    year = {2023},
    month = {January},
    url = {https://github.com/dimboump/GreekNLI},
    note = {Master's thesis}
}
