Multilingual datasets
Nandan Thakur edited this page Jun 30, 2022 · 2 revisions
Ever since the release of the BEIR benchmark, we have been expanding the set of publicly available datasets in different languages. We focus on monolingual evaluation, i.e. queries and documents are in the same language.
We convert these existing datasets into the BEIR format and host them publicly on our platform:
- Mr. TyDi is a multilingual benchmark dataset built on TyDi-QA, covering eleven typologically diverse languages.
- mMARCO is a multilingual version of the MS MARCO passage ranking dataset across 14 languages.
- GermanQuAD is a German dataset constructed similarly to the English SQuAD dataset, using the German Wikipedia.
- ViHealthQA is a Vietnamese dataset built from questions asked by users on health websites and answers written by highly qualified experts.
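
As a rough illustration of what "BEIR format" means in practice, the sketch below reads a dataset folder using only the Python standard library. It assumes the usual BEIR layout (`corpus.jsonl`, `queries.jsonl`, and `qrels/<split>.tsv` with a `query-id`, `corpus-id`, `score` header); the function name `load_beir_folder` is hypothetical and is not part of the BEIR package.

```python
import csv
import json
from pathlib import Path


def load_beir_folder(data_dir, split="test"):
    """Read corpus, queries, and qrels from a folder in the assumed BEIR layout."""
    data_dir = Path(data_dir)

    # corpus.jsonl: one JSON object per line with "_id", "title", and "text".
    corpus = {}
    with open(data_dir / "corpus.jsonl", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                doc = json.loads(line)
                corpus[doc["_id"]] = {
                    "title": doc.get("title", ""),
                    "text": doc["text"],
                }

    # queries.jsonl: one JSON object per line with "_id" and "text".
    queries = {}
    with open(data_dir / "queries.jsonl", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                query = json.loads(line)
                queries[query["_id"]] = query["text"]

    # qrels/<split>.tsv: tab-separated relevance judgements with a header row.
    qrels = {}
    qrels_path = data_dir / "qrels" / f"{split}.tsv"
    with open(qrels_path, encoding="utf-8", newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            qrels.setdefault(row["query-id"], {})[row["corpus-id"]] = int(row["score"])

    return corpus, queries, qrels
```

Which splits exist depends on the dataset: for example, the tables on this page list only `test` for GermanQuAD and `train` and `dev` for mMARCO, so the `split` argument has to match what was actually downloaded.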
Language | Dataset | Website | BEIR-Name | Type | Queries | Corpus | Rel D/Q | Download | md5 |
---|---|---|---|---|---|---|---|---|---|
Vietnamese | ViHealthQA | Homepage | vihealthqa | train, dev, test | 2,013 | 10K | 2.5 | Link | 7685a360a837934624a8018d826c383f |
German | GermanQuAD | Homepage | germanquad | test | 2,044 | 2.80M | 1.0 | Link | 95a581c3162d10915a418609bcce851b |
Arabic | Mr. TyDi | Homepage | mrtydi/arabic | train, dev, test | 1,081 | 2.1M | 1.2 | Link | 17072d0e1610bd8461d962b8ac560fc5 |
Bengali | Mr. TyDi | Homepage | mrtydi/bengali | train, dev, test | 111 | 304K | 1.2 | Link | 17072d0e1610bd8461d962b8ac560fc5 |
Finnish | Mr. TyDi | Homepage | mrtydi/finnish | train, dev, test | 1,254 | 1.9M | 1.2 | Link | 17072d0e1610bd8461d962b8ac560fc5 |
Indonesian | Mr. TyDi | Homepage | mrtydi/indonesian | train, dev, test | 829 | 1.47M | 1.2 | Link | 17072d0e1610bd8461d962b8ac560fc5 |
Japanese | Mr. TyDi | Homepage | mrtydi/japanese | train, dev, test | 720 | 7M | 1.3 | Link | 17072d0e1610bd8461d962b8ac560fc5 |
Korean | Mr. TyDi | Homepage | mrtydi/korean | train, dev, test | 421 | 1.5M | 1.2 | Link | 17072d0e1610bd8461d962b8ac560fc5 |
Russian | Mr. TyDi | Homepage | mrtydi/russian | train, dev, test | 995 | 9.6M | 1.2 | Link | 17072d0e1610bd8461d962b8ac560fc5 |
Swahili | Mr. TyDi | Homepage | mrtydi/swahili | train, dev, test | 670 | 136K | 1.1 | Link | 17072d0e1610bd8461d962b8ac560fc5 |
Telugu | Mr. TyDi | Homepage | mrtydi/telugu | train, dev, test | 646 | 548K | 1.0 | Link | 17072d0e1610bd8461d962b8ac560fc5 |
Thai | Mr. TyDi | Homepage | mrtydi/thai | train, dev, test | 1,190 | 568K | 1.1 | Link | 17072d0e1610bd8461d962b8ac560fc5 |
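
The md5 column can be used to check that a downloaded archive is intact. The following is a minimal sketch using only `hashlib`; the helper names `md5_of_file` and `verify_download` are hypothetical, and the local file name in the usage comment is an assumption about where the archive was saved.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Compare a file's md5 digest against the value listed in the table."""
    return md5_of_file(path) == expected_md5.strip().lower()


# Example usage (assumes the ViHealthQA archive was saved as "vihealthqa.zip"):
# verify_download("vihealthqa.zip", "7685a360a837934624a8018d826c383f")
```

A mismatch usually points to an interrupted or corrupted download, in which case fetching the archive again is the simplest remedy.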
Language | Dataset | Website | BEIR-Name | Type | Queries | Corpus | Rel D/Q | Download | md5 |
---|---|---|---|---|---|---|---|---|---|
Spanish | mMARCO | Homepage | mmarco/spanish | train, dev | 6,980 | 8.84M | 1.1 | Link | b727dbec65315a76bceaff56ad77d2c7 |
French | mMARCO | Homepage | mmarco/french | train, dev | 6,980 | 8.84M | 1.1 | Link | b727dbec65315a76bceaff56ad77d2c7 |
Portuguese | mMARCO | Homepage | mmarco/portuguese | train, dev | 6,980 | 8.84M | 1.1 | Link | b727dbec65315a76bceaff56ad77d2c7 |
Italian | mMARCO | Homepage | mmarco/italian | train, dev | 6,980 | 8.84M | 1.1 | Link | b727dbec65315a76bceaff56ad77d2c7 |
Indonesian | mMARCO | Homepage | mmarco/indonesian | train, dev | 6,980 | 8.84M | 1.1 | Link | b727dbec65315a76bceaff56ad77d2c7 |
German | mMARCO | Homepage | mmarco/german | train, dev | 6,980 | 8.84M | 1.1 | Link | b727dbec65315a76bceaff56ad77d2c7 |
Russian | mMARCO | Homepage | mmarco/russian | train, dev | 6,980 | 8.84M | 1.1 | Link | b727dbec65315a76bceaff56ad77d2c7 |
Chinese | mMARCO | Homepage | mmarco/chinese | train, dev | 6,980 | 8.84M | 1.1 | Link | b727dbec65315a76bceaff56ad77d2c7 |
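
The "Rel D/Q" column is not defined on this page; assuming it denotes the average number of relevant documents per query, it could be recomputed from a split's relevance judgements roughly as follows. The function name and the `qrels` structure (query id mapped to a dictionary of document id and relevance score) are assumptions made for illustration.

```python
def avg_relevant_docs_per_query(qrels):
    """Average count of documents with a positive relevance score per judged query."""
    if not qrels:
        return 0.0
    total_relevant = sum(
        sum(1 for score in docs.values() if score > 0)
        for docs in qrels.values()
    )
    return total_relevant / len(qrels)


# Toy judgements: q1 has two relevant documents, q2 has one.
toy_qrels = {"q1": {"d1": 1, "d2": 1}, "q2": {"d3": 1}}
print(avg_relevant_docs_per_query(toy_qrels))  # prints 1.5
```

Only queries that appear in the judgements are counted, so the result can differ from a figure computed over all queries in the dataset.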