Multilingual datasets

Nandan Thakur edited this page Jun 30, 2022 · 2 revisions

🍻 Multilingual BEIR datasets

Ever since the release of the BEIR benchmark, we have been expanding the set of publicly available datasets in different languages. We focus on monolingual evaluation, i.e. queries and documents are in the same language.

We convert these existing datasets into the BEIR format and host them publicly on our platform:

  • Mr. TyDi is a multilingual benchmark dataset built on TyDi-QA, covering eleven typologically diverse languages.
  • mMARCO is a multilingual version of the MS MARCO passage ranking dataset across 14 languages.
  • GermanQuAD is a German dataset constructed similarly to the English SQuAD dataset, using the German Wikipedia.
  • ViHealthQA is a Vietnamese dataset constructed from questions asked by health-interested users on health websites and answers from highly qualified experts.
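As a rough illustration of what "BEIR format" means for the datasets above, the sketch below writes a toy dataset into the usual BEIR layout: a `corpus.jsonl` file, a `queries.jsonl` file, and a tab-separated `qrels/<split>.tsv` file. The helper names `write_beir_dataset` and `read_qrels`, the toy German example, and the temporary output folder are assumptions made for illustration; they are not part of the BEIR code base.

```python
import csv
import json
import tempfile
from pathlib import Path


def write_beir_dataset(root, corpus, queries, qrels, split="test"):
    """Write corpus.jsonl, queries.jsonl and qrels/<split>.tsv under `root`."""
    root = Path(root)
    (root / "qrels").mkdir(parents=True, exist_ok=True)

    # One JSON object per line, with the keys "_id", "title" and "text".
    with open(root / "corpus.jsonl", "w", encoding="utf-8") as f:
        for doc_id, doc in corpus.items():
            row = {"_id": doc_id, "title": doc.get("title", ""), "text": doc["text"]}
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

    # One JSON object per line, with the keys "_id" and "text".
    with open(root / "queries.jsonl", "w", encoding="utf-8") as f:
        for query_id, text in queries.items():
            f.write(json.dumps({"_id": query_id, "text": text}, ensure_ascii=False) + "\n")

    # Tab-separated relevance judgements with a header row.
    with open(root / "qrels" / f"{split}.tsv", "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["query-id", "corpus-id", "score"])
        for query_id, judged in qrels.items():
            for doc_id, score in judged.items():
                writer.writerow([query_id, doc_id, score])
    return root


def read_qrels(path):
    """Read a qrels TSV back into {query_id: {doc_id: score}}."""
    qrels = {}
    with open(path, encoding="utf-8", newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            qrels.setdefault(row["query-id"], {})[row["corpus-id"]] = int(row["score"])
    return qrels


# Toy monolingual (German) example; the content is made up for illustration.
corpus = {"d1": {"title": "Berlin", "text": "Berlin ist die Hauptstadt von Deutschland."}}
queries = {"q1": "Was ist die Hauptstadt von Deutschland?"}
qrels = {"q1": {"d1": 1}}

out_dir = write_beir_dataset(tempfile.mkdtemp(), corpus, queries, qrels)
loaded_qrels = read_qrels(out_dir / "qrels" / "test.tsv")
```

If the `beir` package is installed, a folder with this layout can typically be read with `GenericDataLoader(data_folder=...).load(split="test")` from `beir.datasets.data_loader`.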

🍻 Multilingual Datasets

| Language | Dataset | Website | BEIR-Name | Type | Queries | Corpus | Rel D/Q | Download | md5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Vietnamese | ViHealthQA | Homepage | vihealthqa | train, dev, test | 2,013 | 10K | 2.5 | Link | 7685a360a837934624a8018d826c383f |
| German | GermanQuAD | Homepage | germanquad | test | 2,044 | 2.80M | 1.0 | Link | 95a581c3162d10915a418609bcce851b |
| Arabic | Mr. TyDi | Homepage | mrtydi/arabic | train, dev, test | 1,081 | 2.1M | 1.2 | Link | 17072d0e1610bd8461d962b8ac560fc5 |
| Bengali | Mr. TyDi | Homepage | mrtydi/bengali | train, dev, test | 111 | 304K | 1.2 | Link | 17072d0e1610bd8461d962b8ac560fc5 |
| Finnish | Mr. TyDi | Homepage | mrtydi/finnish | train, dev, test | 1,254 | 1.9M | 1.2 | Link | 17072d0e1610bd8461d962b8ac560fc5 |
| Indonesian | Mr. TyDi | Homepage | mrtydi/indonesian | train, dev, test | 829 | 1.47M | 1.2 | Link | 17072d0e1610bd8461d962b8ac560fc5 |
| Japanese | Mr. TyDi | Homepage | mrtydi/japanese | train, dev, test | 720 | 7M | 1.3 | Link | 17072d0e1610bd8461d962b8ac560fc5 |
| Korean | Mr. TyDi | Homepage | mrtydi/korean | train, dev, test | 421 | 1.5M | 1.2 | Link | 17072d0e1610bd8461d962b8ac560fc5 |
| Russian | Mr. TyDi | Homepage | mrtydi/russian | train, dev, test | 995 | 9.6M | 1.2 | Link | 17072d0e1610bd8461d962b8ac560fc5 |
| Swahili | Mr. TyDi | Homepage | mrtydi/swahili | train, dev, test | 670 | 136K | 1.1 | Link | 17072d0e1610bd8461d962b8ac560fc5 |
| Telugu | Mr. TyDi | Homepage | mrtydi/telugu | train, dev, test | 646 | 548K | 1.0 | Link | 17072d0e1610bd8461d962b8ac560fc5 |
| Thai | Mr. TyDi | Homepage | mrtydi/thai | train, dev, test | 1,190 | 568K | 1.1 | Link | 17072d0e1610bd8461d962b8ac560fc5 |
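
The md5 column can be used to check that a downloaded archive is intact. The following is a minimal sketch using only the Python standard library. The function names `md5_of_file` and `verify_md5` and the throwaway demo file are assumptions for illustration; in practice the path would point at the archive you downloaded and the expected value would come from the table.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path, expected):
    """Return True if the file's md5 matches the expected hex string."""
    return md5_of_file(path) == expected.strip().lower()


# Demo on a throwaway file (hypothetical stand-in for a downloaded archive).
demo_path = Path(tempfile.mkdtemp()) / "demo.zip"
demo_path.write_bytes(b"hello")

# Matches the checksum of the bytes we just wrote.
demo_ok = verify_md5(demo_path, hashlib.md5(b"hello").hexdigest())
# Does not match the ViHealthQA checksum from the table, as expected.
demo_bad = verify_md5(demo_path, "7685a360a837934624a8018d826c383f")
```

Reading in chunks keeps memory use flat, which matters for the larger archives listed here.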

🍻 Translated (Multilingual) Datasets

| Language | Dataset | Website | BEIR-Name | Type | Queries | Corpus | Rel D/Q | Download | md5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Spanish | mMARCO | Homepage | mmarco/spanish | train, dev | 6,980 | 8.84M | 1.1 | Link | b727dbec65315a76bceaff56ad77d2c7 |
| French | mMARCO | Homepage | mmarco/french | train, dev | 6,980 | 8.84M | 1.1 | Link | b727dbec65315a76bceaff56ad77d2c7 |
| Portuguese | mMARCO | Homepage | mmarco/portuguese | train, dev | 6,980 | 8.84M | 1.1 | Link | b727dbec65315a76bceaff56ad77d2c7 |
| Italian | mMARCO | Homepage | mmarco/italian | train, dev | 6,980 | 8.84M | 1.1 | Link | b727dbec65315a76bceaff56ad77d2c7 |
| Indonesian | mMARCO | Homepage | mmarco/indonesian | train, dev | 6,980 | 8.84M | 1.1 | Link | b727dbec65315a76bceaff56ad77d2c7 |
| German | mMARCO | Homepage | mmarco/german | train, dev | 6,980 | 8.84M | 1.1 | Link | b727dbec65315a76bceaff56ad77d2c7 |
| Russian | mMARCO | Homepage | mmarco/russian | train, dev | 6,980 | 8.84M | 1.1 | Link | b727dbec65315a76bceaff56ad77d2c7 |
| Chinese | mMARCO | Homepage | mmarco/chinese | train, dev | 6,980 | 8.84M | 1.1 | Link | b727dbec65315a76bceaff56ad77d2c7 |
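
The "Rel D/Q" column is read here as the average number of relevant documents per query. Under that assumption, the sketch below shows how such a figure could be derived from a qrels mapping of the form `{query_id: {doc_id: score}}`. The function name `relevant_docs_per_query` and the toy judgements are hypothetical and only illustrate the arithmetic.

```python
def relevant_docs_per_query(qrels):
    """Average count of documents with a positive score per query."""
    if not qrels:
        return 0.0
    total_relevant = sum(
        sum(1 for score in judged.values() if score > 0)
        for judged in qrels.values()
    )
    return total_relevant / len(qrels)


# Toy judgements: q1 has 2 relevant docs, q2 has 1, q3 has 1 (d5 is scored 0).
toy_qrels = {
    "q1": {"d1": 1, "d2": 1},
    "q2": {"d3": 1},
    "q3": {"d4": 1, "d5": 0},
}
avg_rel = relevant_docs_per_query(toy_qrels)  # 4 relevant docs over 3 queries
```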