Leveraging Corpus Metadata to Detect Template-based Translation: An Exploratory Case Study of the Egyptian Arabic Wikipedia Edition
This repository shares our labeled datasets, our extracted corpora, and the code and scripts for the exploratory analysis, the multivariate machine learning classifiers and clusters, and the implementation and deployment of the best-performing classifier as a web-based detection system called "Egyptian Arabic Wikipedia Scanner". All of these are introduced in our paper, Leveraging Corpus Metadata to Detect Template-based Translation: An Exploratory Case Study of the Egyptian Arabic Wikipedia Edition, accepted at the 6th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT6), co-located with LREC-COLING 2024, 20-25 May 2024.
- Dataset Filtering, Labeling, and Cleaning
- Dataset Encoding Using Spark-NLP & CAMeLBERT
Web-based Detection System/Application:
-
Corpora and Datasets:
- Arabic Wikipedia Corpora:
- Egyptian Template-translated Articles:
Paper Citations:
Saied Alshahrani, Hesham Haroon, Ali Elfilali, Mariama Njie, and Jeanna Matthews. 2024. Leveraging Corpus Metadata to Detect Template-based Translation: An Exploratory Case Study of the Egyptian Arabic Wikipedia Edition. arXiv preprint arXiv:2404.00565.
Saied Alshahrani, Hesham Haroon, Ali Elfilali, Mariama Njie, and Jeanna Matthews. 2024. Leveraging Corpus Metadata to Detect Template-based Translation: An Exploratory Case Study of the Egyptian Arabic Wikipedia Edition. In Proceedings of the 6th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT) with Shared Tasks on Arabic LLMs Hallucination and Dialect to MSA Machine Translation @ LREC-COLING 2024, pages 31–45, Torino, Italia. ELRA and ICCL.