
Investigate "case ending" training #38

Open
ronaldtse opened this issue Aug 23, 2021 · 0 comments
Labels: help wanted (Extra attention is needed)

ronaldtse commented Aug 23, 2021

https://arxiv.org/pdf/2002.01207.pdf

From the QCRI people.

Arabic Diacritic Recovery Using a Feature-Rich biLSTM Model
Kareem Darwish, Ahmed Abdelali, Hamdy Mubarak, Mohamed Eldesouki
Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar

Diacritics (short vowels) are typically omitted when writing Arabic text, and readers have to reintroduce them to correctly pronounce words. There are two types of Arabic diacritics: the first are core-word diacritics (CW), which specify the lexical selection, and the second are case endings (CE), which typically appear at the end of the word stem and generally specify their syntactic roles. Recovering CEs is relatively harder than recovering core-word diacritics due to inter-word dependencies, which are often distant. In this paper, we use a feature-rich recurrent neural network model that uses a variety of linguistic and surface-level features to recover both core word diacritics and case endings. Our model surpasses all previous state-of-the-art systems with a CW error rate (CWER) of 2.86% and a CE error rate (CEER) of 3.7% for Modern Standard Arabic (MSA) and CWER of 2.2% and CEER of 2.5% for Classical Arabic (CA). When combining diacritized word cores with case endings, the resultant word error rate is 6.0% and 4.3% for MSA and CA respectively. This highlights the effectiveness of feature engineering for such deep neural models.
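The abstract describes a biLSTM tagger whose input concatenates embeddings for several feature channels (word identity, surface-level features, POS tags, stem templates). Below is a minimal PyTorch sketch of that general idea, not the authors' implementation; the channel names, vocabulary sizes, and tag inventory are all placeholder assumptions.

```python
# Minimal sketch of a feature-rich biLSTM tagger in the spirit of the paper.
# NOT the authors' implementation: feature channels, sizes, and the tag
# inventory are placeholder assumptions for illustration only.
import torch
import torch.nn as nn

class FeatureRichBiLSTMTagger(nn.Module):
    def __init__(self, vocab_sizes, emb_dims, hidden_dim, num_tags):
        super().__init__()
        # One embedding table per feature channel
        # (e.g. word identity, POS tag, stem template, surface feature).
        self.embeddings = nn.ModuleList(
            nn.Embedding(v, d) for v, d in zip(vocab_sizes, emb_dims)
        )
        self.bilstm = nn.LSTM(sum(emb_dims), hidden_dim,
                              batch_first=True, bidirectional=True)
        # Per-token classifier over diacritic labels (CW or CE classes).
        self.classifier = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, feature_ids):
        # feature_ids: one LongTensor of shape (batch, seq_len) per channel.
        embedded = torch.cat(
            [emb(ids) for emb, ids in zip(self.embeddings, feature_ids)],
            dim=-1,
        )
        hidden, _ = self.bilstm(embedded)
        return self.classifier(hidden)  # (batch, seq_len, num_tags)

# Toy instantiation; every number here is made up.
model = FeatureRichBiLSTMTagger(
    vocab_sizes=[50_000, 40, 200],   # words, POS tags, stem templates
    emb_dims=[128, 16, 32],
    hidden_dim=256,
    num_tags=15,                     # assumed number of case-ending classes
)
```

Training such a tagger would typically minimize a per-token cross-entropy loss over the diacritic classes; the paper reports CW and CE separately, and its actual architecture and feature set differ in detail from this sketch.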

For case endings:

MSA Results: As the results show, our baseline DNN system outperforms all state-of-the-art systems. Adding more features yielded better results overall: surface-level features produced the largest gain, followed by POS tags, and lastly stem templates. In addition, adding head and tail characters along with a list of sukun words and named entities led to further improvement. Our proposed feature-rich system has a CEER that is approximately 61% lower than any of the state-of-the-art systems.

CA Results: The results show that the POS tagging features led to the most improvement, followed by the surface features. Combining all features led to the best results, with a CEER of 2.5%. As we saw for CW diacritics, using our best MSA system to diacritize CA led to significantly worse results, with a CEER of 8.9%.
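As I read the abstract, these are per-word rates: CWER counts words whose core-word diacritics are wrong, CEER counts words whose case ending is wrong, and the combined WER counts a word as an error if either part is wrong. A small sketch of that scoring follows; the data layout and edge-case handling are assumptions, not the paper's evaluation script.

```python
# Hedged sketch of the per-word error rates referenced above.  The
# (core_diacritization, case_ending) layout and exact scoring rules are
# assumptions, not the paper's official evaluation.
def error_rates(gold, pred):
    """gold/pred: aligned lists of (core_diacritization, case_ending) per word."""
    assert gold and len(gold) == len(pred)
    n = len(gold)
    cwer = sum(g[0] != p[0] for g, p in zip(gold, pred)) / n
    ceer = sum(g[1] != p[1] for g, p in zip(gold, pred)) / n
    # Combined WER: a word counts as an error if either part is wrong.
    wer = sum(g != p for g, p in zip(gold, pred)) / n
    return cwer, ceer, wer

# Toy transliterated example: the second word gets a wrong case ending.
cwer, ceer, wer = error_rates(
    gold=[("kataba", "a"), ("kitAb", "un")],
    pred=[("kataba", "a"), ("kitAb", "a")],
)
print(cwer, ceer, wer)  # 0.0 0.5 0.5
```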

ronaldtse added the help wanted label on Aug 23, 2021