pip install PersianG2p
It uses this neural net to convert Persian text (written in Arabic script) into phonemic text.
- Arabic script
- the characters have different forms depending on their position in the word
- the vowels a, e, o are often pronounced but not written; for example:
  - سس is pronounced sos but written ss
  - شش is pronounced šeš but written šš
  - من is pronounced man but written mn
  - سلام is pronounced salām but written slām
  - شما is pronounced šomā but written šmā
  - ممنون is pronounced mamnun but written mmnun
- the same symbol can be pronounced in different ways: in the word مو the symbol و is pronounced u, but in the word میوه it follows a vowel and is pronounced v; the word تو is pronounced to or tu depending on the meaning; the symbol ه (hā-ye docešm) is pronounced like a (e) in the word نه and like h in the word آنها
- no overlap of vowel sounds
- verbs come at the end of the sentence
- no grammatical gender
- no grammatical cases
- adjectives and other modifiers are placed after the noun
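
The unwritten short vowels are easy to see with a naive letter-by-letter mapping. The sketch below is only an illustration: `LETTERS` and `naive_transliterate` are hypothetical names that are not part of PersianG2p, and the table covers just the letters used in the examples above (it also assumes و is always u, which the list above shows is not true in general).

```python
# Hypothetical letter-to-symbol table; covers only the letters in the examples above.
LETTERS = {
    "س": "s", "ش": "š", "م": "m", "ن": "n",
    "ل": "l", "ا": "ā", "و": "u",
}

def naive_transliterate(word):
    """Map each written letter to one symbol, ignoring unwritten vowels."""
    return "".join(LETTERS[ch] for ch in word)

# Written word -> actual pronunciation, taken from the list above.
EXAMPLES = {"سس": "sos", "من": "man", "سلام": "salām", "ممنون": "mamnun"}

for word, spoken in EXAMPLES.items():
    # The left side lacks the short vowels that the right side contains.
    print(naive_transliterate(word), "->", spoken)
```

The naive output (ss, mn, slām, mmnun) is exactly the "written" form from the list, which is why a dictionary or a trained model is needed to recover the spoken form.
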
The package includes a dictionary of 1867 pairs of the form (Persian word, its pronunciation); you can also load a dictionary of over 48,000 words by passing use_large = True to the constructor. Some of these words (in English): water, there, feeling, use, people, throw, he, can, highway, was, hall, guarantee, production, sentence, account, god, self, they know, dollar, mind, novel, earthquake, organizing, weapons, personal, martyr, necessity, opinion, french, legal, london, deprived, people, studies, source, fruit, they take, system, the light, are, and, leg, bridge, what, done, do.
First, your text is normalized by hazm and then tokenized. Each token is handled as follows:
- If the token does not consist of Arabic-alphabet symbols, it is left unchanged.
- If the token is a word from the dictionary, its pronunciation is taken from the dictionary.
- Otherwise, the pronunciation is predicted by the neural net.
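
The three rules above can be sketched as follows. This is a minimal illustration, not the package's actual code: `TOY_DICT`, `predict_stub`, `convert_token` and `convert_text` are hypothetical names, whitespace splitting stands in for hazm normalization and tokenization, and the Arabic-script check is a rough test against the basic Arabic Unicode block.

```python
import re

# Toy dictionary; the entries come from the examples in this README.
TOY_DICT = {"سلام": "salām", "من": "man", "شما": "šomā"}

# Rough check: does the token contain a character from the basic Arabic block?
ARABIC_RE = re.compile(r"[\u0600-\u06FF]")

def predict_stub(word):
    """Hypothetical stand-in for the neural net: only marks the word as predicted."""
    return f"<predicted:{word}>"

def convert_token(token, dictionary=TOY_DICT, predict=predict_stub):
    if not ARABIC_RE.search(token):
        return token                # not Arabic script: leave unchanged
    if token in dictionary:
        return dictionary[token]    # known word: dictionary pronunciation
    return predict(token)           # unknown word: fall back to the model

def convert_text(text):
    # Assumption: plain whitespace splitting replaces hazm in this sketch.
    return " ".join(convert_token(t) for t in text.split())

print(convert_text("سلام 2024"))
```

Passing the dictionary and the predictor as parameters keeps the lookup order visible and makes it simple to swap the stub for a real model.
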
If a token was found in the dictionary, its pronunciation is written like ' t h i s ' (with spaces between the symbols and at the beginning and end of the word). If the word is written continuously, it was predicted by the neural net. You can disable this marking by setting secret = True.
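
For a single word, the spaced format can be reproduced in a few lines. The helpers below are hypothetical and only illustrate the output format described above; they are not taken from the package and do not show how it assembles whole sentences.

```python
def mark_dictionary_word(pronunciation):
    """Format a pronunciation the way dictionary hits are shown: ' t h i s '."""
    return " " + " ".join(pronunciation) + " "

def unmark(marked):
    """Remove the marking from a single marked word."""
    return marked.replace(" ", "")

print(repr(mark_dictionary_word("this")))
print(unmark(" t h i s "))
```
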
Persian symbol | sound (tidy = False) | sound (tidy = True) |
---|---|---|
آ | A | ā |
ش | S | š |
ژ | Z | ž |
چ | C | č |
ء، ع | ? | ` |
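
The relation between the two output styles in the table can be sketched with a character translation table. This is an assumption about the behaviour, based only on the rows shown above; `TIDY_TABLE` and `tidy_up` are hypothetical names, and the package may map more symbols than these five.

```python
# Mapping from the tidy = False symbols to the tidy = True symbols,
# limited to the rows of the table above.
TIDY_TABLE = str.maketrans({
    "A": "ā",
    "S": "š",
    "Z": "ž",
    "C": "č",
    "?": "`",
})

def tidy_up(raw):
    """Replace the ASCII placeholder symbols with their diacritic forms."""
    return raw.translate(TIDY_TABLE)

print(tidy_up(" m A a l A n darhAl b A z i b u d i m "))
```

Applied to the tidy = False output from the usage example in this README, the sketch gives the same string as the tidy = True output shown there.
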
Comparison with epitran
Persian word | epitran conversion | PersianG2p conversion | expected |
---|---|---|---|
سلام | slɒm | salām | salām |
ممنون | mmnvn | mamnun | mamnun |
خب | xb | xab | xāb |
ساحل | sɒhl | sāhel | sāhel |
یخ | jx | yax | yax |
لاغر | lɒɣr | lāġar | lāġar |
پسته | پsth | peste | peste |
مثلث | msls | mosles | mosles |
سال ها | sɒl hɒ | sālehā | sālhā |
لذت | lzt | lazt | lezzat |
دژ | dʒ | dož | dež |
برف | brf | barf | barf |
خدا حافظ | xdɒ hɒfz | x o d ā hāfez | xodā hāfez |
دمپایی | dmپɒjj | dampāyi | dampāyi |
نشستن | nʃstn | nešastan | nešastan |
متأسفانه | mtɒʔsfɒnh | motsafe`āne | mota’assefāne |
pip install PersianG2p
from PersianG2p import Persian_g2p_converter
PersianG2Pconverter = Persian_g2p_converter()
# or
## PersianG2Pconverter = Persian_g2p_converter(use_large = True)
PersianG2Pconverter.transliterate('ما الان درحال بازی بودیم', tidy = False)
# ' m A a l A n darhAl b A z i b u d i m '
PersianG2Pconverter.transliterate('ما الان درحال بازی بودیم')
# ' m ā a l ā n darhāl b ā z i b u d i m '
Persian_g2p_converter().transliterate( "زان یار دلنوازم شکریست با شکایت", secret = True)
# 'zān yār delnavāzam šokrist bā šekāyat'
PersianG2Pconverter.transliterate('نه تنها یک کلمه')
# ' n o h t a n h ā y e k kalame'
# calling the object directly is equivalent to calling .transliterate() with the same arguments
PersianG2Pconverter('نه تنها یک کلمه', secret = True)
# 'noh tanhA yek kalame'
This Telegram bot uses the PersianG2p package. Message it to check the results.
- Fit a better model (with other hyperparameters or a bigger dictionary)
- Add many new words to the dictionary. If you want, I will write a Python/C# script for this task or even create a Telegram bot