Chris Ge, Brian Le, Daria Kryvosheieva | Final project for MIT 6.8611: Quantitative Methods for Natural Language Processing
Our project aims to improve language model performance on NLP tasks in low-resource languages (LRLs) through knowledge transfer from high-resource languages (HRLs). We focus especially on HRL-LRL pairs that share many similar-sounding words but use different writing systems. To enable knowledge transfer in this scenario, we use the STILTs fine-tuning method (Phang et al., 2018) and augment our fine-tuning datasets with romanizations. We choose mBERT as an example model and Hindi-Urdu as an example HRL-LRL pair.
- Pick an NLP task. We experiment with named entity recognition (NER) and part-of-speech (POS) tagging.
- Gather a dataset for the task in both the HRL and the LRL. Retrieve the romanizations of the two datasets’ input texts using a transliterator.
- Fine-tune the language model on the NLP task in the HRL, randomly replacing a fixed proportion of the words in the input text with their romanizations (sketched below, after Figure 1).
- Further fine-tune and evaluate the resulting model on the LRL task with both text and romanization.
We use the AI4Bharat transliterator and the PAN-X and UD-POS datasets from Google’s XTREME benchmark for NER and POS tagging, respectively.
Figure 1: Our pipeline, shown here for the task of NER.
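As a rough illustration of the augmentation step, the sketch below shows token-level replacement (Hindi stage) and concatenation (Urdu stage). The function names, the `romanize` callable (a stand-in for a wrapper around the AI4Bharat transliterator), and `replace_prob` are hypothetical; the interface of our actual romanization.py may differ.

```python
import random

def augment_with_romanization(tokens, romanize, replace_prob=0.25, seed=None):
    """Randomly replace a fixed proportion of tokens with their romanizations
    (the Hindi intermediate stage). `romanize` maps a single Devanagari or
    Perso-Arabic token to its Latin-script form. Token-level NER/POS labels
    stay aligned because each token is replaced in place."""
    rng = random.Random(seed)
    return [romanize(tok) if rng.random() < replace_prob else tok for tok in tokens]

def concatenate_romanization(tokens, romanize):
    """Append the fully romanized sentence after the original tokens
    (the Urdu stage, romanizations concatenated). Labels for the appended
    tokens would need to be duplicated or masked downstream."""
    return tokens + [romanize(tok) for tok in tokens]
```

For the Hindi stage we use a replacement proportion of a quarter of the words, which is the default `replace_prob` shown here.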
For each of the PAN-X and UD-POS datasets, we fine-tune four versions of mBERT (a minimal training sketch follows Figure 2):
- mBERTtext: mBERT fine-tuned directly on the Urdu dataset (no romanizations);
- mBERTroman: mBERT fine-tuned directly on the Urdu dataset (with romanizations concatenated);
- mBERTSTILTs+text: mBERT intermediately fine-tuned on the Hindi dataset, then further fine-tuned on the Urdu dataset (no romanizations);
- mBERTSTILTs+roman: mBERT intermediately fine-tuned on the Hindi dataset (with a quarter of the words replaced with romanizations), then further fine-tuned on the Urdu dataset (with romanizations concatenated).
Figure 2: Our four fine-tuned models and relevant comparisons.
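The STILTs variants chain two fine-tuning stages, carrying the weights from the intermediate Hindi stage into the Urdu stage. The sketch below illustrates this with the Hugging Face Trainer; the dataset arguments, hyperparameters, and `NUM_LABELS` are illustrative assumptions, not the exact settings of our experiments.

```python
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

MODEL_NAME = "bert-base-multilingual-cased"  # mBERT
NUM_LABELS = 7  # e.g., the seven BIO tags of PAN-X NER

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def finetune(model, dataset, output_dir):
    """Run one fine-tuning stage. `dataset` is assumed to be a tokenized
    DatasetDict with 'train'/'validation' splits and aligned token labels."""
    args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=3,              # illustrative hyperparameters
        per_device_train_batch_size=32,
        learning_rate=2e-5,
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=dataset["train"],
        eval_dataset=dataset["validation"],
        data_collator=DataCollatorForTokenClassification(tokenizer),
    )
    trainer.train()
    return model

def stilts_finetune(hindi_dataset, urdu_dataset):
    """mBERT_STILTs: intermediate fine-tuning on Hindi, then further
    fine-tuning on Urdu, reusing the weights from the first stage."""
    model = AutoModelForTokenClassification.from_pretrained(
        MODEL_NAME, num_labels=NUM_LABELS)
    model = finetune(model, hindi_dataset, "out/hindi_stage")
    return finetune(model, urdu_dataset, "out/urdu_stage")
```

The non-STILTs variants (mBERTtext, mBERTroman) simply skip the Hindi stage and call `finetune` once on the Urdu data.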
Table 1 shows the performance of our models (measured as macro-F1 score) on the two tasks, and Table 2 shows the results of our statistical significance test (paired bootstrap resampling; sketched after Table 2). Overall, our method yielded small improvements, but none were statistically significant.
Model | POS Tagging (macro-F1) | NER (macro-F1)
---|---|---
mBERTtext | 0.8700 | 0.9770
mBERTroman | 0.8728 | 0.9780
mBERTSTILTs+text | 0.8702 | 0.9763
mBERTSTILTs+roman | 0.8735 | 0.9788

Table 1: Our results (macro-F1 scores) for POS tagging and NER.
Comparison | Task | Mean Difference | 95% Confidence Interval | P-value
---|---|---|---|---
mBERTroman - mBERTtext | POS | 0.0029 | [-0.0016, 0.0071] | 0.2180
mBERTroman - mBERTtext | NER | 0.0010 | [-0.0066, 0.0079] | 0.7680
mBERTSTILTs+roman - mBERTSTILTs+text | POS | 0.0035 | [-0.0019, 0.0092] | 0.2200
mBERTSTILTs+roman - mBERTSTILTs+text | NER | 0.0026 | [-0.0038, 0.0095] | 0.4100

Table 2: Statistical test results.
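The sketch below illustrates the paired bootstrap resampling behind Table 2, simplified to resampling over flat arrays of gold labels and system predictions; our bootstrapping.py may differ in details such as the resampling unit and the number of resamples.

```python
import numpy as np
from sklearn.metrics import f1_score

def paired_bootstrap(y_true, pred_a, pred_b, n_resamples=1000, seed=0):
    """Paired bootstrap resampling of the macro-F1 difference between two
    systems evaluated on the same test set. Returns the mean difference,
    a 95% percentile confidence interval, and a two-sided p-value."""
    rng = np.random.default_rng(seed)
    y_true, pred_a, pred_b = map(np.asarray, (y_true, pred_a, pred_b))
    n = len(y_true)
    diffs = []
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # resample test items with replacement
        f1_a = f1_score(y_true[idx], pred_a[idx], average="macro")
        f1_b = f1_score(y_true[idx], pred_b[idx], average="macro")
        diffs.append(f1_a - f1_b)
    diffs = np.array(diffs)
    lo, hi = np.percentile(diffs, [2.5, 97.5])
    # two-sided p-value: how often the direction of the difference flips
    p_value = 2 * min((diffs <= 0).mean(), (diffs >= 0).mean())
    return diffs.mean(), (lo, hi), min(p_value, 1.0)
```

For example, `paired_bootstrap(gold, preds_roman, preds_text)` yields the three quantities reported per row of Table 2.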
Our code is organized as follows:
- panx, udpos: fine-tune and evaluate models on the respective datasets
- BertViz_panx.py, BertViz_udpos.py: examine attention paid to romanizations using BertViz
- bootstrapping.py: the paired bootstrap resampling significance test
- dataloader.py: load datasets
- romanization.py: generate romanization-augmented versions of the datasets