
Improve training dataset #36

Open
ronaldtse opened this issue Aug 23, 2021 · 3 comments

Comments

@ronaldtse
Contributor

The Rababa models today are trained on the Tashkeela corpus.

In Tashkeela, 98% of the content comes from Shamela.

There are some additional datasets that are either already pointed or could be made into pointed datasets.

Pointed datasets:

AlJazeera Learning also offers an Arabic diacritizer, which we can test against.

The request goes to the following endpoint:

curl 'https://farasa-api.qcri.org/msa/webapi/diacritizeV2' \
-X 'POST' \
-H 'Accept: */*' \
-H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' \
-H 'Origin: https://quiz.aljazeera.net' \
-H 'Accept-Encoding: gzip, deflate, br' \
-H 'Host: farasa-api.qcri.org' \
-H 'Content-Length: 75' \
-H 'Accept-Language: en-us' \
-H 'Connection: keep-alive' \
--data 'text=%D8%B5%D9%81%D8%AD%D8%A9+%D8%A7%D9%84%D8%AA%D8%B4%D9%83%D9%8A%D9%84%0A'
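The curl request above can be reproduced in Python. This is a sketch, untested against the live service; the headers are copied from the curl command, and the sample text is the decoded form of the `--data` payload ("صفحة التشكيل" followed by a newline):

```python
# Sketch of the diacritizeV2 request from the curl example above.
# Only the standard library is used; the endpoint and headers come
# from the curl command, the rest is illustrative.
from urllib import parse, request

ENDPOINT = "https://farasa-api.qcri.org/msa/webapi/diacritizeV2"

def build_payload(text: str) -> bytes:
    """URL-encode the form body the same way the browser does (spaces as '+')."""
    return parse.urlencode({"text": text}).encode("ascii")

def diacritize(text: str) -> str:
    """POST the text to the diacritizeV2 endpoint and return the raw response."""
    req = request.Request(
        ENDPOINT,
        data=build_payload(text),
        headers={
            "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
            "Origin": "https://quiz.aljazeera.net",
        },
    )
    with request.urlopen(req) as resp:
        return resp.read().decode("utf-8")

# The sample text from the curl command round-trips to the same encoded body:
payload = build_payload("صفحة التشكيل\n")
```

Calling `diacritize(...)` would issue the actual POST; `build_payload` alone shows that the encoded body matches the curl `--data` string byte for byte.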

Apparently they have two diacritization modules that can be downloaded (Java, JAR) or used via the web:

Datasets that could potentially be pointed...:

@ronaldtse
Contributor Author

ronaldtse commented Aug 23, 2021

Apparently, Farasa uses the following datasets:

(From https://aclanthology.org/D19-3037.pdf)

For MSA, we used a diacritized corpus of 4.5m words for training (Darwish et al., 2017; Mubarak et al., 2019). This corpus covers different genres such as politics, economy, religion, sports, society, etc. And for testing, we used the freely available WikiNews corpus (Darwish et al., 2017) which contains 18.3k words and covers multiple genres.

This pointed corpus is not openly available.

For CA, we obtained a classical diacritized corpus of 65m words from a publisher. We used 5k random sentences (400k words) for testing, and we used the remaining words for training. We are making the test set available at: https://bit.ly/2KuOvkN.

Only the test set is available.

For DA, we used the corpora described in (Darwish et al., 2018), which is composed of two diacritized translations of the New Testament into Moroccan (DA-MA) and Tunisian (DA-TN). These corpora contain 166k and 157k words respectively. For each dialect, we split the diacritized corpora into 70/10/20 for training/validation/testing splits respectively. Our splits exactly match those of Abdelali et al. (2018). We used 5-fold cross validation.
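The 70/10/20 train/validation/test split described above can be sketched as follows. This is illustrative only: the sentence list and random seed are placeholders, not details from the paper.

```python
# Sketch of a 70/10/20 train/validation/test split, as described in the
# quoted paper. The input sentences and the seed are placeholders.
import random

def split_corpus(sentences, seed=0):
    """Shuffle and split into 70% train, 10% validation, 20% test."""
    rng = random.Random(seed)
    shuffled = sentences[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * 0.7)
    n_val = int(n * 0.1)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# Example with 1000 dummy sentences: 700 / 100 / 200.
train, val, test = split_corpus([f"sent-{i}" for i in range(1000)])
```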

Note that this paper may be the "diacritizeV2" API using "msa" (modern standard Arabic).

In sum, three resources were explored for diacritics recovery, namely:
– For CA: Qur'anic text with 18k words.
– For DA: LDC CallHome with 160k words; the Moroccan and Tunisian Bibles with 166k and 157k words respectively.
– For MSA: LDC Arabic Treebank with 340k words (v3.2); and a proprietary corpus of 4.5m words (Darwish et al., 2017).

@ronaldtse
Contributor Author

The Qatari Arabic Corpus, built from Qatari news with vocalized Arabic, is a fully pointed dataset.
http://www.ifp.illinois.edu/speech/dialect/data.shtml

The QCRI people have also launched QASR, an Arabic speech dataset, but it is not yet available for download:
https://arabicspeech.org/qasr/

@ronaldtse
Contributor Author

Inspired by the Bible inclusions above, there are also Arabic Bibles that contain pointed text!
