Phoneme Bridges: Leveraging Phonetic Similarity for Low-Resource Language Understanding

Chris Ge, Brian Le, Daria Kryvosheieva | Final project for MIT 6.8611: Quantitative Methods for Natural Language Processing

Project Overview

Our project aims to improve language model performance on NLP tasks in low-resource languages (LRLs) through knowledge transfer from high-resource languages (HRLs). We focus on HRL-LRL pairs that share many similarly pronounced words but use different writing systems. To enable knowledge transfer in this setting, we use the STILTs fine-tuning method (Phang et al., 2018) and augment our fine-tuning datasets with romanizations. We use mBERT as an example model and Hindi-Urdu as an example HRL-LRL pair.

Our Pipeline

  1. Pick an NLP task. We experiment with named entity recognition (NER) and part-of-speech (POS) tagging.
  2. Gather a dataset for the task in both the LRL and the HRL. Romanize the input texts of the two datasets using a transliterator.
  3. Fine-tune the language model on the HRL task, randomly replacing a fixed proportion of the words in the input text with their romanizations.
  4. Further fine-tune the resulting model on the LRL task, using both the original text and its romanization, and evaluate it.

We use the AI4Bharat transliterator for romanization. For NER and POS tagging, we use the PAN-X and UD-POS datasets, respectively, from Google's XTREME benchmark.

Figure 1: Our pipeline, shown here for the task of NER.
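Step 3 of the pipeline can be illustrated with a minimal sketch. The function name `mix_romanizations`, its parameters, and the choice to apply the proportion per sentence are assumptions made for illustration; they are not taken from `romanization.py`. The romanized words in the example are hand-written stand-ins, not transliterator output.

```python
import random


def mix_romanizations(words, romanized, proportion=0.25, seed=None):
    """Replace a fixed proportion of words with their romanizations.

    Hypothetical helper: assumes `words` and `romanized` are aligned
    one-to-one, so per-word labels (NER or POS tags) stay valid.
    """
    if len(words) != len(romanized):
        raise ValueError("words and romanized must be aligned one-to-one")
    rng = random.Random(seed)
    n_replace = round(proportion * len(words))
    # Sample positions without replacement so exactly n_replace words change.
    positions = set(rng.sample(range(len(words)), n_replace))
    return [romanized[i] if i in positions else w for i, w in enumerate(words)]


# Illustrative Hindi sentence with hand-written romanizations.
words = ["मैं", "दिल्ली", "में", "रहता", "हूँ"]
romanized = ["main", "dilli", "mein", "rehta", "hoon"]
mixed = mix_romanizations(words, romanized, proportion=0.4, seed=0)
print(mixed)
```

Because the replacement is word-for-word, the label sequence does not need to change; only the surface form of the selected words does.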

Experiments

For each of the PAN-X and UD-POS datasets, we fine-tune four versions of mBERT:

  1. mBERTtext: mBERT fine-tuned directly on the Urdu dataset (no romanizations);
  2. mBERTroman: mBERT fine-tuned directly on the Urdu dataset (with romanizations concatenated);
  3. mBERTSTILTs+text: mBERT intermediately fine-tuned on the Hindi dataset, then further fine-tuned on the Urdu dataset (no romanizations);
  4. mBERTSTILTs+roman: mBERT intermediately fine-tuned on the Hindi dataset (with a quarter of the words replaced with romanizations), then further fine-tuned on the Urdu dataset (with romanizations concatenated).

Figure 2: Our four fine-tuned models and relevant comparisons.
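One possible reading of "with romanizations concatenated" for a token-classification task is sketched below. The README does not say how the concatenated romanized tokens are separated from the text or how they are labeled, so the separator token and the decision to mask their labels with `-100` (the default `ignore_index` of PyTorch's cross-entropy loss) are assumptions, and `concat_romanization` is a hypothetical name.

```python
def concat_romanization(words, romanized, labels,
                        sep_token="[SEP]", ignore_label=-100):
    """Append the romanized sentence after the original one.

    Assumption: romanized copies (and the separator) get `ignore_label`,
    so the loss is computed on the original-script tokens only.
    """
    if not (len(words) == len(romanized) == len(labels)):
        raise ValueError("words, romanized and labels must have equal length")
    new_words = list(words) + [sep_token] + list(romanized)
    new_labels = list(labels) + [ignore_label] * (1 + len(romanized))
    return new_words, new_labels


# Illustrative Urdu sentence; label ids are made up (0 = O, 1 = B-LOC).
words = ["میں", "لاہور", "میں", "رہتا", "ہوں"]
romanized = ["main", "lahore", "mein", "rehta", "hoon"]
labels = [0, 1, 0, 0, 0]
new_words, new_labels = concat_romanization(words, romanized, labels)
print(new_words)
print(new_labels)
```

Under this design the model can attend to the romanized tokens as extra context while being scored only on the original tokens. Other designs, such as copying the labels onto the romanized tokens, are equally plausible.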

Results

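Table 1 shows the performance of our models (measured as macro-F1 score) on the two tasks, and Table 2 shows the results of our statistical significance test (paired bootstrap resampling). The romanization-augmented models scored slightly higher than their text-only counterparts on both tasks, but none of the differences was statistically significant: every 95% confidence interval includes zero, and all p-values are above 0.2.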

| Model | POS Tagging Score | NER Score |
| --- | --- | --- |
| mBERTtext | 0.8700 | 0.9770 |
| mBERTroman | 0.8728 | 0.9780 |
| mBERTSTILTs+text | 0.8702 | 0.9763 |
| mBERTSTILTs+roman | 0.8735 | 0.9788 |

Table 1: Our results (macro-F1 scores) for POS tagging and NER.
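For reference, macro-F1 is the unweighted mean of per-class F1 scores. The sketch below computes it over flat label sequences. It is only an illustration of the metric: the README does not state whether the reported scores are computed per token or per entity span, and `macro_f1` is a hypothetical helper rather than the repository's evaluation code.

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    if not labels:
        return 0.0
    f1_scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


gold = ["NOUN", "VERB", "NOUN", "ADJ"]
pred = ["NOUN", "VERB", "ADJ", "ADJ"]
score = macro_f1(gold, pred)
print(round(score, 4))  # per-class F1: NOUN 2/3, VERB 1, ADJ 2/3 -> 7/9
```

Because every class contributes equally, rare tags influence macro-F1 as much as frequent ones.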

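| Comparison | Task | Mean Difference | 95% Confidence Interval | P-value |
| --- | --- | --- | --- | --- |
| mBERTroman - mBERTtext | POS | 0.0029 | [-0.0016, 0.0071] | 0.2180 |
| mBERTroman - mBERTtext | NER | 0.0010 | [-0.0066, 0.0079] | 0.7680 |
| mBERTSTILTs+roman - mBERTSTILTs+text | POS | 0.0035 | [-0.0019, 0.0092] | 0.2200 |
| mBERTSTILTs+roman - mBERTSTILTs+text | NER | 0.0026 | [-0.0038, 0.0095] | 0.4100 |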

Table 2: Statistical test results.
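The general shape of a paired bootstrap test is sketched below: the same resampled indices are applied to both systems' predictions, and the metric difference is recorded for each resample. This is not the code in `bootstrapping.py`. The function names, the number of resamples, the percentile confidence interval, and the two-sided p-value formula are all assumptions, and accuracy stands in for macro-F1 to keep the example short.

```python
import random


def accuracy(gold, pred):
    """Stand-in metric; any function of (gold, pred) can be used instead."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def paired_bootstrap(gold, pred_a, pred_b, metric, n_resamples=1000, seed=0):
    """Return (mean difference, 95% percentile CI, two-sided p-value)."""
    rng = random.Random(seed)
    n = len(gold)
    diffs = []
    for _ in range(n_resamples):
        # Draw one set of indices and apply it to both systems (the pairing).
        idx = [rng.randrange(n) for _ in range(n)]
        g = [gold[i] for i in idx]
        a = [pred_a[i] for i in idx]
        b = [pred_b[i] for i in idx]
        diffs.append(metric(g, a) - metric(g, b))
    diffs.sort()
    mean_diff = sum(diffs) / n_resamples
    low = diffs[int(0.025 * n_resamples)]
    high = diffs[int(0.975 * n_resamples) - 1]
    # Assumed p-value: twice the smaller tail mass at zero, capped at 1.
    frac_le_zero = sum(d <= 0 for d in diffs) / n_resamples
    frac_ge_zero = sum(d >= 0 for d in diffs) / n_resamples
    p_value = min(1.0, 2 * min(frac_le_zero, frac_ge_zero))
    return mean_diff, (low, high), p_value


# Synthetic predictions: system A errs on every 10th item, system B on every 8th.
gold = [i % 3 for i in range(200)]
pred_a = [(g + 1) % 3 if i % 10 == 0 else g for i, g in enumerate(gold)]
pred_b = [(g + 1) % 3 if i % 8 == 0 else g for i, g in enumerate(gold)]
mean_diff, (low, high), p_value = paired_bootstrap(gold, pred_a, pred_b, accuracy)
print(mean_diff, (low, high), p_value)
```

For sequence labeling, the resampled items would more naturally be whole sentences than single labels. A difference is usually called significant at the 5% level when the 95% interval excludes zero.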

Code in This Repo

  • panx, udpos: fine-tune and evaluate models on the respective datasets
  • BertViz_panx.py, BertViz_udpos.py: examine attention paid to romanizations using BertViz
  • bootstrapping.py: the paired bootstrap resampling significance test
  • dataloader.py: load datasets
  • romanization.py: generate romanization-augmented versions of the datasets