
Improve training dataset #36

Open
ronaldtse opened this issue Aug 23, 2021 · 3 comments

Comments

@ronaldtse
Contributor

The Rababa models today are trained on the Tashkeela corpus.

In Tashkeela, 98% of the content comes from Shamela.

There are some additional datasets that are either already pointed or could be made into pointed datasets.

Pointed datasets:

AlJazeera Learning also offers an Arabic diacritizer, which we can test against.

The request goes to the following endpoint:

curl 'https://farasa-api.qcri.org/msa/webapi/diacritizeV2' \
-X 'POST' \
-H 'Accept: */*' \
-H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' \
-H 'Origin: https://quiz.aljazeera.net' \
-H 'Accept-Encoding: gzip, deflate, br' \
-H 'Host: farasa-api.qcri.org' \
-H 'Content-Length: 75' \
-H 'Accept-Language: en-us' \
-H 'Connection: keep-alive' \
--data 'text=%D8%B5%D9%81%D8%AD%D8%A9+%D8%A7%D9%84%D8%AA%D8%B4%D9%83%D9%8A%D9%84%0A'
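The curl request above can be reproduced in Python. This is a sketch, untested against the live service; the headers are copied from the curl command, and the sample text is the decoded form of the `--data` payload ("صفحة التشكيل" followed by a newline):

```python
# Sketch of the diacritizeV2 request from the curl example above.
# Only the standard library is used; the endpoint and headers come
# from the curl command, the rest is illustrative.
from urllib import parse, request

ENDPOINT = "https://farasa-api.qcri.org/msa/webapi/diacritizeV2"

def build_payload(text: str) -> bytes:
    """URL-encode the form body the same way the browser does (spaces as '+')."""
    return parse.urlencode({"text": text}).encode("ascii")

def diacritize(text: str) -> str:
    """POST the text to the diacritizeV2 endpoint and return the raw response."""
    req = request.Request(
        ENDPOINT,
        data=build_payload(text),
        headers={
            "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
            "Origin": "https://quiz.aljazeera.net",
        },
    )
    with request.urlopen(req) as resp:
        return resp.read().decode("utf-8")

# The sample text from the curl command round-trips to the same encoded body:
payload = build_payload("صفحة التشكيل\n")
```

Calling `diacritize(...)` would issue the actual POST; `build_payload` alone shows that the encoded body matches the curl `--data` string byte for byte.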

Apparently they have two diacritization modules that can be downloaded (Java, JAR) or used via the web:

Datasets that could potentially be pointed...:

@ronaldtse
Contributor Author

ronaldtse commented Aug 23, 2021

Apparently, Farasa uses the following datasets:

(From https://aclanthology.org/D19-3037.pdf)

For MSA, we used a diacritized corpus of 4.5m words for training (Darwish et al., 2017; Mubarak et al., 2019). This corpus covers different genres such as politics, economy, religion, sports, society, etc. And for testing, we used the freely available WikiNews corpus (Darwish et al., 2017) which contains 18.3k words and covers multiple genres.

This pointed corpus is not openly available.

For CA, we obtained a classical diacritized corpus of 65m words from a publisher. We used 5k random sentences (400k words) for testing, and we used the remaining words for training. We are making the test set available at: https://bit.ly/2KuOvkN.

Only the test set is available.

For DA, we used the corpora described in (Darwish et al., 2018), which is composed of two diacritized translations of the New Testament into Moroccan (DA-MA) and Tunisian (DA-TN). These corpora contain 166k and 157k words respectively. For each dialect, we split the diacritized corpora into 70/10/20 for training/validation/testing splits respectively. Our splits exactly match those of Abdelali et al. (2018). We used 5-fold cross validation.
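The 70/10/20 train/validation/test split described above can be sketched as follows. This is illustrative only: the sentence list and random seed are placeholders, not details from the paper.

```python
# Sketch of a 70/10/20 train/validation/test split, as described in the
# quoted paper. The input sentences and the seed are placeholders.
import random

def split_corpus(sentences, seed=0):
    """Shuffle and split into 70% train, 10% validation, 20% test."""
    rng = random.Random(seed)
    shuffled = sentences[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * 0.7)
    n_val = int(n * 0.1)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# Example with 1000 dummy sentences: 700 / 100 / 200.
train, val, test = split_corpus([f"sent-{i}" for i in range(1000)])
```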

Note that this paper may be the "diacritizeV2" API using "msa" (modern standard Arabic).

In sum, three resources were explored for diacritics recovery, namely:
– For CA: Qur'anic text with 18k words.
– For DA: LDC CallHome with 160k words; the Moroccan and Tunisian Bibles with 166k and 157k words respectively.
– For MSA: LDC Arabic Treebank with 340k words (v3.2); and a proprietary corpus of 4.5m words (Darwish et al., 2017).

@ronaldtse
Contributor Author

The Qatari Arabic Corpus, built from Qatari news with vocalized Arabic, is a fully pointed dataset.
http://www.ifp.illinois.edu/speech/dialect/data.shtml

The QCRI people have also launched QASR, an Arabic speech dataset, but it is not yet available for download:
https://arabicspeech.org/qasr/

@ronaldtse
Contributor Author

Inspired by the Bible inclusions above, there are also Arabic Bibles that contain pointed text!
