commit 3794561
Showing 10 changed files with 11,825 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,80 @@ | ||
# HornMT | ||
|
||
The `HornMT` repository contains data and the associated metadata for the project [Machine Translation Benchmark Dataset for Languages in the Horn of Africa](https://lesan.ai/benchmark). The goal is to create a multi-way parallel corpus that will serve as a benchmark to accelerate progress in machine translation research and production systems for languages in the Horn of Africa. | ||
|
||
|
||
## Supported Languages | ||
|
||
Language | ISO 639-3 code | ||
------------- | ------------- | ||
Afar | aar | ||
Amharic | amh | ||
English | eng | ||
Oromo | orm | ||
Somali | som | ||
Tigrinya | tir | ||
|
||
|
||
## Content | ||
|
||
`data/` contains one text file per language. The news snippets appear in the same order in every file, so the files are aligned with one another. | ||
|
||
``` | ||
data | ||
├── aar.txt | ||
├── amh.txt | ||
├── eng.txt | ||
├── orm.txt | ||
├── som.txt | ||
└── tir.txt | ||
``` | ||
|
||
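Because the snippets appear in the same order in every file, the six files can be read side by side. The Python sketch below is one way to do that. It assumes UTF-8 encoding and one snippet per line, neither of which is stated here, and `load_parallel` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path

# Language codes taken from the file names in the listing above.
LANGS = ["aar", "amh", "eng", "orm", "som", "tir"]


def load_parallel(data_dir, langs=LANGS):
    """Return one dict per snippet, mapping language code to text."""
    columns = {}
    for lang in langs:
        path = Path(data_dir) / f"{lang}.txt"
        # Assumption: UTF-8 text with one snippet per line.
        with path.open(encoding="utf-8") as f:
            columns[lang] = [line.rstrip("\n") for line in f]

    # All files must have the same number of lines to be aligned.
    lengths = {lang: len(lines) for lang, lines in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"files are not line-aligned: {lengths}")

    n = next(iter(lengths.values()), 0)
    return [{lang: columns[lang][i] for lang in langs} for i in range(n)]
```

Calling `load_parallel("data")` would then give a list in which element `i` holds all available translations of snippet `i`. The length check is a cheap safeguard against a truncated or re-ordered file.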
`metadata.tsv` contains tab-separated data describing each news snippet. It has the following fields: | ||
|
||
- **Scope** - describes whether the news is global or local. It takes two values: Global news and Local news. | ||
- **Category** - News category covering the following 12 topics: | ||
  - Art and Culture | ||
  - Business and Economy | ||
  - Conflicts and Attacks | ||
  - Disaster and Accidents | ||
  - Entertainment | ||
  - Environment | ||
  - Health | ||
  - International Relations | ||
  - Law and Crime | ||
  - Politics | ||
  - Science and Technology | ||
  - Sport | ||
- **Source** - List of one or more URLs from which the news content is extracted or on which it is based. | ||
- **Domain** - Domain name of the URL(s) in Source (for example, `www.independent.co.uk`). | ||
- **Date** - The publication date of the source article. The format is yyyy-mm-dd. | ||
|
||
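A minimal Python sketch for reading `metadata.tsv` with the standard library. It assumes that the file starts with a header row whose column names match the field names above (`Scope`, `Category`, `Source`, `Domain`, `Date`) and that fields are not quoted; neither is confirmed here. `read_metadata` is a hypothetical helper, and a different column layout can be passed through `fieldnames`.

```python
import csv
from datetime import date


def read_metadata(path, fieldnames=None, date_field="Date"):
    """Read the tab-separated metadata into a list of dicts.

    If `fieldnames` is None, the first row is treated as the header
    (an assumption about the file, not something the README states).
    """
    rows = []
    with open(path, encoding="utf-8", newline="") as f:
        reader = csv.DictReader(
            f,
            fieldnames=fieldnames,
            delimiter="\t",
            quoting=csv.QUOTE_NONE,  # assumption: plain, unquoted fields
        )
        for row in reader:
            # The README gives the date format as yyyy-mm-dd (ISO 8601).
            row[date_field] = date.fromisoformat(row[date_field])
            rows.append(row)
    return rows
```

Parsing the date up front makes malformed entries fail early with a `ValueError` instead of surfacing later during filtering or sorting.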
## Other formats | ||
|
||
The data and associated metadata are also available together in a single file, in the following formats. | ||
|
||
`HornMT.xlsx` - data and associated metadata in XLSX format. | ||
|
||
`HornMT.json` - data and associated metadata in JSON format. | ||
|
||
Below is an example row. | ||
|
||
```json | ||
{ | ||
"data":{ | ||
"eng":"The World Meteorological Organisation reports that the ozone layer is damaged to its worst extent ever in the Arctic.", | ||
"aaf":"Baad Metrolojih Eglali Areketekeh Addal Ozonih qelu faxe waktik lafetle calat biyakisem xayose.", | ||
"amh":"የአለማ የአየር ንብረት ድርጅት በአርክቲክ አካባቢ ያለው የኦዞን ምንጣፍ ከፍተኛ ጉዳት እንደደረሰበት አስታወቀ፡፡", | ||
"orm":"Dhaabbanni Meetiroolojii Addunyaa baqqaanni oozonii Arkiitik keessatti gara sadarkaa isa hamaa haga ammaatti akka miidhame gabaase.", | ||
"som":"Ururka Saadaasha Hawada Adduunka ayaa ku warramaya in lakabka ozoneka ee Ka koreeya dhulka baraflayda uu waxyeelladii abid ugu darnaa soo gaadhay.", | ||
"tir":"ውድብ ሜትሮሎጂ ዓለም ኣብ ኣርክቲክ ዝርከብ ናሕሲ ኦዞን ኣዝዩ ብዝኸፍአ ደረጃ ከምዝተጎድአ ሓቢሩ፡፡" | ||
}, | ||
"metadata":{ | ||
"scope":"Global", | ||
"category":"Science and Technology", | ||
"source":"https://www.independent.co.uk/environment/climate-change/ozone-layer-damaged-by-unusually-harsh-winter-2263653.html", | ||
"domain":"www.independent.co.uk", | ||
"date":"2011-04-05" | ||
} | ||
} | ||
``` |
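As a rough illustration of how a record shaped like the example above might be used, the Python sketch below enumerates translation directions and checks that `domain` matches the host of `source`. The record is trimmed to two languages for brevity, the helper names are hypothetical, and the top-level layout of `HornMT.json` (for instance, whether it is a list of such records) is not specified here, so the sketch only handles a single record.

```python
import json
from itertools import permutations
from urllib.parse import urlparse

# One record in the shape of the example above, trimmed to two languages.
record = json.loads("""
{
  "data": {
    "eng": "The World Meteorological Organisation reports that the ozone layer is damaged to its worst extent ever in the Arctic.",
    "som": "Ururka Saadaasha Hawada Adduunka ayaa ku warramaya in lakabka ozoneka ee Ka koreeya dhulka baraflayda uu waxyeelladii abid ugu darnaa soo gaadhay."
  },
  "metadata": {
    "scope": "Global",
    "category": "Science and Technology",
    "source": "https://www.independent.co.uk/environment/climate-change/ozone-layer-damaged-by-unusually-harsh-winter-2263653.html",
    "domain": "www.independent.co.uk",
    "date": "2011-04-05"
  }
}
""")


def translation_pairs(rec):
    """Yield (src_lang, tgt_lang, src_text, tgt_text) for every ordered pair."""
    data = rec["data"]
    for src, tgt in permutations(data, 2):
        yield src, tgt, data[src], data[tgt]


def domain_matches_source(rec):
    """Check that `domain` equals the host part of the `source` URL."""
    meta = rec["metadata"]
    return urlparse(meta["source"]).netloc == meta["domain"]


for src, tgt, _, _ in translation_pairs(record):
    print(f"{src} -> {tgt}")
print(domain_matches_source(record))
```

With all six languages present, a record yields 6 × 5 = 30 ordered translation directions. The domain check assumes a single URL in `source`; a record with several URLs would need to be split first, and the separator is not documented here.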