Chris Ge, Brian Le, Daria Kryvosheieva | Final project for MIT 6.8611: Quantitative Methods for Natural Language Processing
Our project aims to improve language model performance on NLP tasks in low-resource languages (LRLs) through knowledge transfer from high-resource languages (HRLs). We focus especially on HRL-LRL pairs that share many similar-sounding words but use different writing systems. To enable knowledge transfer in this scenario, we use the STILTs fine-tuning method (Phang et al., 2018) and augment our fine-tuning datasets with romanizations. We choose mBERT as an example model and Hindi-Urdu as an example HRL-LRL pair.
- Pick an NLP task. We experiment with named entity recognition (NER) and part-of-speech (POS) tagging.
- Gather a dataset for the task in both the HRL and the LRL. Retrieve the romanizations of the two datasets’ input texts using a transliterator.
- Fine-tune the language model on the NLP task in the HRL, randomly replacing a fixed proportion of the words in the input text with their romanizations (sketched below, after Figure 1).
- Further fine-tune and evaluate the resulting model on the LRL task with both text and romanization.
We use the AI4Bharat transliterator and the PAN-X and UD-POS datasets from Google’s XTREME benchmark for NER and POS tagging, respectively.
Figure 1: Our pipeline, shown here for the task of NER.
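As a rough illustration of the augmentation step, the sketch below shows token-level replacement (Hindi stage) and concatenation (Urdu stage). The function names, the `romanize` callable (a stand-in for a wrapper around the AI4Bharat transliterator), and `replace_prob` are hypothetical; the interface of our actual romanization.py may differ.

```python
import random

def augment_with_romanization(tokens, romanize, replace_prob=0.25, seed=None):
    """Randomly replace a fixed proportion of tokens with their romanizations
    (the Hindi intermediate stage). `romanize` maps a single Devanagari or
    Perso-Arabic token to its Latin-script form. Token-level NER/POS labels
    stay aligned because each token is replaced in place."""
    rng = random.Random(seed)
    return [romanize(tok) if rng.random() < replace_prob else tok for tok in tokens]

def concatenate_romanization(tokens, romanize):
    """Append the fully romanized sentence after the original tokens
    (the Urdu stage, romanizations concatenated). Labels for the appended
    tokens would need to be duplicated or masked downstream."""
    return tokens + [romanize(tok) for tok in tokens]
```

For the Hindi stage we use a replacement proportion of a quarter of the words, which is the default `replace_prob` shown here.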
For each of the PAN-X and UD-POS datasets, we fine-tune four versions of mBERT (a minimal training sketch follows Figure 2):
- mBERTtext: mBERT fine-tuned directly on the Urdu dataset (no romanizations);
- mBERTroman: mBERT fine-tuned directly on the Urdu dataset (with romanizations concatenated);
- mBERTSTILTs+text: mBERT intermediately fine-tuned on the Hindi dataset, then further fine-tuned on the Urdu dataset (no romanizations);
- mBERTSTILTs+roman: mBERT intermediately fine-tuned on the Hindi dataset (with a quarter of the words replaced with romanizations), then further fine-tuned on the Urdu dataset (with romanizations concatenated).
Figure 2: Our four fine-tuned models and relevant comparisons.
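The STILTs variants chain two fine-tuning stages, carrying the weights from the intermediate Hindi stage into the Urdu stage. The sketch below illustrates this with the Hugging Face Trainer; the dataset arguments, hyperparameters, and `NUM_LABELS` are illustrative assumptions, not the exact settings of our experiments.

```python
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

MODEL_NAME = "bert-base-multilingual-cased"  # mBERT
NUM_LABELS = 7  # e.g., the seven BIO tags of PAN-X NER

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def finetune(model, dataset, output_dir):
    """Run one fine-tuning stage. `dataset` is assumed to be a tokenized
    DatasetDict with 'train'/'validation' splits and aligned token labels."""
    args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=3,              # illustrative hyperparameters
        per_device_train_batch_size=32,
        learning_rate=2e-5,
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=dataset["train"],
        eval_dataset=dataset["validation"],
        data_collator=DataCollatorForTokenClassification(tokenizer),
    )
    trainer.train()
    return model

def stilts_finetune(hindi_dataset, urdu_dataset):
    """mBERT_STILTs: intermediate fine-tuning on Hindi, then further
    fine-tuning on Urdu, reusing the weights from the first stage."""
    model = AutoModelForTokenClassification.from_pretrained(
        MODEL_NAME, num_labels=NUM_LABELS)
    model = finetune(model, hindi_dataset, "out/hindi_stage")
    return finetune(model, urdu_dataset, "out/urdu_stage")
```

The non-STILTs variants (mBERTtext, mBERTroman) simply skip the Hindi stage and call `finetune` once on the Urdu data.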
Table 1 shows the performance of our models (measured as macro-F1 score) on the two tasks, and Table 2 shows the results of our statistical significance test (paired bootstrap resampling; sketched after Table 2). Overall, our method yielded small improvements, but none were statistically significant.
Model | POS Tagging (macro-F1) | NER (macro-F1)
---|---|---
mBERTtext | 0.8700 | 0.9770
mBERTroman | 0.8728 | 0.9780
mBERTSTILTs+text | 0.8702 | 0.9763
mBERTSTILTs+roman | 0.8735 | 0.9788

Table 1: Our results (macro-F1 scores) for POS tagging and NER.
Comparison | Task | Mean Difference | 95% Confidence Interval | P-value
---|---|---|---|---
mBERTroman - mBERTtext | POS | 0.0029 | [-0.0016, 0.0071] | 0.2180
mBERTroman - mBERTtext | NER | 0.0010 | [-0.0066, 0.0079] | 0.7680
mBERTSTILTs+roman - mBERTSTILTs+text | POS | 0.0035 | [-0.0019, 0.0092] | 0.2200
mBERTSTILTs+roman - mBERTSTILTs+text | NER | 0.0026 | [-0.0038, 0.0095] | 0.4100

Table 2: Statistical test results.
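The sketch below illustrates the paired bootstrap resampling behind Table 2, simplified to resampling over flat arrays of gold labels and system predictions; our bootstrapping.py may differ in details such as the resampling unit and the number of resamples.

```python
import numpy as np
from sklearn.metrics import f1_score

def paired_bootstrap(y_true, pred_a, pred_b, n_resamples=1000, seed=0):
    """Paired bootstrap resampling of the macro-F1 difference between two
    systems evaluated on the same test set. Returns the mean difference,
    a 95% percentile confidence interval, and a two-sided p-value."""
    rng = np.random.default_rng(seed)
    y_true, pred_a, pred_b = map(np.asarray, (y_true, pred_a, pred_b))
    n = len(y_true)
    diffs = []
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # resample test items with replacement
        f1_a = f1_score(y_true[idx], pred_a[idx], average="macro")
        f1_b = f1_score(y_true[idx], pred_b[idx], average="macro")
        diffs.append(f1_a - f1_b)
    diffs = np.array(diffs)
    lo, hi = np.percentile(diffs, [2.5, 97.5])
    # two-sided p-value: how often the direction of the difference flips
    p_value = 2 * min((diffs <= 0).mean(), (diffs >= 0).mean())
    return diffs.mean(), (lo, hi), min(p_value, 1.0)
```

For example, `paired_bootstrap(gold, preds_roman, preds_text)` yields the three quantities reported per row of Table 2.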
Our code is organized as follows:
- panx, udpos: fine-tune and evaluate models on the respective datasets
- BertViz_panx.py, BertViz_udpos.py: examine attention paid to romanizations using BertViz
- bootstrapping.py: the paired bootstrap resampling significance test
- dataloader.py: load datasets
- romanization.py: generate romanization-augmented versions of the datasets