Implementing three part-of-speech tagging algorithms—Eager, Viterbi, and Individually Most Probable Tags—and comparing their accuracy across English, Korean, and Swedish.
This project was developed as an individual assignment for the “Language and Computation” course at the University of St Andrews. The three algorithms were trained on data from the Universal Dependencies treebanks. Python, together with the conllu package and NLTK, was used to process the data and train the algorithms.
- Develop and implement three algorithms of varying complexity: Eager, Viterbi, and Individually Most Probable Tags.
- Train the algorithms on corpora from three distinct languages: English, Swedish, and Korean.
- Evaluate the part-of-speech tagging accuracy for each language using unseen test sets.
pip install conllu
pip install nltk
python3 p1.py
- CoNLL-U: Utilized for parsing and organizing corpora from the UD Treebank into training and testing sets.
- NLTK: Employed to compute emission and transition probabilities essential for training the models.
- Universal Dependencies Treebank: Provided the multilingual data used for training and testing the models.