This repo contains all the code, the data, and the text of the thesis written for the MA in Digital Text Analysis at the University of Antwerp.
Native Language Identification (NLI) is the task of predicting an author's native language (L1) given only texts that the author produced in a foreign language (L2). Native speakers of English are a minority compared to those who speak it as an L2, so a large volume of English content is created by non-native speakers. It is therefore important to devise strategies for determining the native language of non-native English writers. This study introduces the NLI task for native speakers of Greek with English as L2. The corpus is a collection of similar-topic Reddit posts written in English by native speakers of Greek, along with English posts written by native English speakers from the UK and the US. The study provides new insights into the challenges and opportunities of identifying native languages in multilingual contexts. It also compares performance across a variety of linguistic features, using both statistical and neural algorithms, to shed light on how the two approaches differ when tackling the NLI task. It lays the groundwork for introducing the NLI task to other L1s in the future, training classifiers and checking their robustness, by finding that the combination of word embeddings with simple
Keywords: Machine Learning, Natural Language Processing, Reddit, Thesis, Word Embeddings, Logistic Regression, Support Vector Machines, Native Language Identification, BERT model, DistilBERT model
If you use this code or thesis, please cite it as follows:
@mastersthesis{boumparis-2023-native,
author = {Boumparis, Dimitrios},
title = {{Native Language Identification of Greek in English-Written Texts: A Comparison of Machine Learning Approaches}},
school = {University of Antwerp},
year = {2023},
month = {January},
url = {https://github.com/dimboump/GreekNLI},
note = {Master's thesis}
}