Improve training dataset #36
Apparently, Farasa uses the following datasets (from https://aclanthology.org/D19-3037.pdf):
This pointed corpus is not openly available.
Only the test set is available.
Note that this paper may correspond to the "diacritizeV2" API using "msa" (Modern Standard Arabic).
The Qatari Arabic Corpus, drawn from Qatari news with vocalized Arabic, is a fully pointed dataset. The QCRI people have also launched QASR, an Arabic speech dataset, but it is not yet available for download:
Inspired by the Bible inclusions, there are also Arabic Bibles that contain pointed text!
The Rababa models today are trained on the Tashkeela corpus.
In Tashkeela, 98% of the content comes from Shamela.
There are also other datasets that are either pointed or can be made into pointed datasets.
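For context on why pointed text matters here: a fully pointed line can serve as its own training target, since the unpointed input can be derived by stripping the marks. The following is only a minimal sketch of that idea, not the Rababa preprocessing code. The helper names `strip_diacritics` and `make_training_pair` are hypothetical, and the sketch assumes the marks of interest are the Unicode range U+064B to U+0652 plus the superscript alef U+0670.

```python
import re
from typing import Tuple

# Assumption: "pointing" means the Arabic tashkeel marks
# U+064B (fathatan) through U+0652 (sukun), plus U+0670 (superscript alef).
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def strip_diacritics(text: str) -> str:
    """Return the text with Arabic diacritic marks removed."""
    return DIACRITICS.sub("", text)


def make_training_pair(pointed_line: str) -> Tuple[str, str]:
    """Build an (unpointed input, pointed target) pair from one pointed line."""
    return strip_diacritics(pointed_line), pointed_line
```

Any new pointed source (news, Bible text, and so on) could be run through the same kind of step before being mixed with the existing corpus, though real corpora usually need further cleanup that this sketch does not attempt.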
Pointed datasets:
AlJazeera Learning also offers an Arabic diacritizer, which we can test against:
The endpoint goes to:
Apparently they have two diacritization modules that can be downloaded (Java, JAR) or used via the web:
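Testing against an external diacritizer needs some way to score two pointed versions of the same sentence. Below is a rough sketch of a per-character diacritic error rate, under stated assumptions: the function names are hypothetical, every non-diacritic character (including spaces) counts as a position, and the marks on a character must match exactly, including their order. No network call is made; fetching the output of any external tool is left out because its interface is not described here.

```python
# Assumption: diacritics are U+064B..U+0652 plus U+0670 (superscript alef).
DIACRITIC_CHARS = {chr(c) for c in range(0x064B, 0x0653)} | {"\u0670"}


def split_diacritics(text):
    """Split text into its base characters and the marks attached to each."""
    bases, marks = [], []
    for ch in text:
        if ch in DIACRITIC_CHARS:
            if bases:  # a mark with no preceding base character is dropped
                marks[-1] += ch
        else:
            bases.append(ch)
            marks.append("")
    return "".join(bases), marks


def diacritic_error_rate(reference, hypothesis):
    """Fraction of base characters whose marks differ from the reference."""
    ref_bases, ref_marks = split_diacritics(reference)
    hyp_bases, hyp_marks = split_diacritics(hypothesis)
    if ref_bases != hyp_bases:
        raise ValueError("base characters differ; texts are not comparable")
    if not ref_marks:
        return 0.0
    errors = sum(r != h for r, h in zip(ref_marks, hyp_marks))
    return errors / len(ref_marks)
```

Usage would be along the lines of `diacritic_error_rate(gold_line, tool_output)`, averaged over a held-out set. Published evaluations often differ in details such as whether case endings or unmarked letters are counted, so numbers from this sketch are not directly comparable to reported figures.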
Datasets that could potentially be pointed...: