The traditional Mongolian script uses an old pronunciation which differs from the modern Mongolian spoken language. This experiment tries to convert between the Mongolian script (old pronunciation) and the cyrillic script (modern pronunciation) using deep learning.
Facebook's fairseq project is used to train the converter which is added as a submodule. To install it, execute:
git clone --recursive https://github.com/tugstugi/mongolian-nlp.git
cd mongolian-nlp/bichig2cyrillic/fairseq
pip install -r requirements.txt
python setup.py build develop
80K lines modern Mongolian song lyrics are converted with
the Inner Mongolian University's online translator tool
to the Mongolian script and used as the training dataset.
To prepare binarized datasets to use with fairseq
, execute:
python create_dataset.py
# binarized dataset for Mongolian script to cyrillic
python fairseq/preprocess.py \
--trainpref dataset/train --validpref dataset/valid --testpref dataset/test \
--source-lang bichig --target-lang cyrillic \
--destdir bichig2cyrillic-bin
# binarized dataset for cyrillic to Mongolian script
python fairseq/preprocess.py \
--trainpref dataset/train --validpref dataset/valid --testpref dataset/test \
--source-lang cyrillic --target-lang bichig \
--destdir cyrillic2bichig-bin
For an online Colab demo, visit: Cyrillic2Bichig.ipynb
To train a transformer character model for 30 epochs, execute:
python fairseq/train.py --optimizer adam --lr 5e-5 --min-lr 5e-10 --lr-shrink 0.5 \
--max-tokens 1000 --arch transformer --clip-norm 0.5 \
--max-epoch 30 \
--save-dir checkpoints/cyrillic2bichig cyrillic2bichig-bin
After training, the training loss should be under 0.01.
Now convert some cyrillic texts into the traditional Mongolian script:
echo "Хэн хүнтэй үг ярина гэдэг\nХэрэг дээрээ тулалдаан юм\nХалуун хүйтэн ямар ч зэвсгээс\nХатуу зөөлөн үг хүчтэй" | \
python cyrillic2bichig.py --path checkpoints/cyrillic2bichig/checkpoint_best.pt cyrillic2bichig-bin
The above model outputs:
ᠬᠡᠨ ᠬᠦᠮᠦᠨ ᠲᠡᠢ ᠦᠭᠡ ᠶᠠᠷᠢᠨᠠ ᠭᠡᠳᠡᠭ
ᠬᠡᠷᠡᠭ ᠳᠡᠭᠡᠷᠡ ᠪᠡᠨ ᠲᠤᠯᠤᠯᠳᠤᠭᠠᠨ ᠶᠤᠮ
ᠬᠠᠯᠠᠭᠤᠨ ᠬᠦᠢᠲᠡᠨ ᠶᠠᠮᠠᠷ ᠴᠤ ᠵᠡᠪᠰᠡᠭ ᠡᠴᠡ
ᠬᠠᠲᠠᠭᠤ ᠵᠥᠭᠡᠯᠡᠨ ᠦᠭᠡ ᠬᠦᠴᠦᠲᠡᠢ
Similar to above. Here some conversion example from the Mongolian president's website:
echo "ᠮᠣᠩ᠋ᠭᠣᠯ ᠤᠯᠤᠰ ᠤᠨ ᠶᠡᠷᠦᠩᠬᠡᠢᠢᠯᠡᠭᠴᠢ ᠬ ∙ ᠪᠠᠲᠤᠲᠤᠯᠭᠠ ᠥᠨᠥᠳᠥᠷ ᠧᠦ᠋ᠷᠣᠫᠠ ᠢᠢᠨ ᠬᠣᠯᠪᠣᠭᠠ\nᠲᠡᠭᠦᠨ ᠦ ᠭᠡᠰᠢᠭᠦᠨ ᠣᠷᠣᠨ ᠨᠤᠭᠤᠳ ᠠᠴᠠ ᠮᠣᠩ᠋ᠭᠣᠯ ᠤᠯᠤᠰ ᠲᠤ ᠰᠠᠭᠤᠷᠢᠨ ᠪᠣᠯᠤᠨ ᠬᠠᠪᠰᠤᠷᠤᠨ \nᠰᠠᠭᠤᠭᠠ ᠣᠨᠴᠠ ᠪᠥᠭᠡᠳ ᠪᠦᠷᠢᠨ ᠡᠷᠬᠡᠲᠦ ᠡᠯᠴᠢᠨ ᠰᠠᠢᠢᠳ ᠨᠠᠷ ᠢ ᠬᠦᠯᠢᠶᠡᠨ ᠠᠪᠴᠤ ᠠᠭᠤᠯᠵᠠᠪᠠ" | \
python bichig2cyrillic.py --path checkpoints/bichig2cyrillic/checkpoint_best.pt bichig2cyrillic-bin
The output should look:
монгол улсын ерөнхийлөгч х баттулга өнөөдөр европийн холбоо
түүний гишүүн орнуудаас монгол улсад суурин болон хавсран
суугаа онц бүгээд бүрэн эрхэт элчин сайд нарыг хүлээн авч уулзав
If you try this network on the Mongolian script texts found on internet, it will fail! Because the majority of the Mongolian script texts are misspelled. See for more information: Coping with Problems of Unicoded Traditional Mongolian