Datasets available
The BEIR benchmark (originally) contains 18 retrieval datasets from diverse domains and tasks. The benchmark focuses on zero-shot evaluation (i.e., no training data available) of lexical and neural retrievers across these datasets. The dataset links and statistics are listed below:
- Four private datasets: Send me an email at nandant@gmail.com for direct access to these datasets via a private Google Drive link. Please make sure you hold the necessary licenses before you send the email.
To download one of the already preprocessed datasets into your current directory and load it, proceed as follows:
from beir import util
from beir.datasets.data_loader import GenericDataLoader
dataset = "scifact"
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip".format(dataset)
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")
This will download the scifact dataset under the datasets directory.
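The same URL pattern can presumably be reused for the other datasets that have a download link in the table below. A minimal sketch, assuming the archive name always matches the BEIR-Name column; `beir_url` and `NO_DIRECT_DOWNLOAD` are hypothetical helpers for illustration, not part of the `beir` package:

```python
# URL template copied from the loading example above.
BASE_URL = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip"

# BEIR-Names whose Download column reads "No" in the table (hypothetical helper set).
NO_DIRECT_DOWNLOAD = {"bioasq", "signal1m", "trec-news", "robust04"}


def beir_url(name: str) -> str:
    """Return the assumed archive URL for a BEIR-Name, or raise if no public link is listed."""
    if name in NO_DIRECT_DOWNLOAD:
        raise ValueError(f"{name!r} has no direct download; see 'How to Reproduce?'")
    return BASE_URL.format(name)


if __name__ == "__main__":
    for name in ["scifact", "nfcorpus", "fiqa"]:
        print(name, "->", beir_url(name))
```

The resulting string can be passed to `util.download_and_unzip` in place of the hand-formatted `url` in the example above.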
Command to generate the md5 hash of a downloaded archive in a terminal: md5sum filename.zip (on Linux; macOS provides md5 filename.zip instead).
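The md5 values in the table below can also be checked from Python. A minimal sketch using only the standard library; the function names `md5_of_file` and `matches_md5` are hypothetical, and the chunk size is an arbitrary choice:

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path: str, expected: str) -> bool:
    """Compare a file's MD5 digest with an expected hex string, ignoring case."""
    return md5_of_file(path) == expected.strip().lower()
```

For example, assuming the archive was saved as `datasets/scifact.zip`, `matches_md5("datasets/scifact.zip", "5f7d1de60b170fc8027bb7898e2efca1")` should return `True` if the download is intact.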
Dataset | Website | BEIR-Name | Type | Queries | Corpus | Rel D/Q | Download | md5 |
---|---|---|---|---|---|---|---|---|
MSMARCO | Homepage | msmarco | train dev test | 6,980 | 8.84M | 1.1 | Link | 444067daf65d982533ea17ebd59501e4 |
TREC-COVID | Homepage | trec-covid | test | 50 | 171K | 493.5 | Link | ce62140cb23feb9becf6270d0d1fe6d1 |
NFCorpus | Homepage | nfcorpus | train dev test | 323 | 3.6K | 38.2 | Link | a89dba18a62ef92f7d323ec890a0d38d |
BioASQ | Homepage | bioasq | train test | 500 | 14.91M | 8.05 | No | How to Reproduce? |
NQ | Homepage | nq | train test | 3,452 | 2.68M | 1.2 | Link | d4d3d2e48787a744b6f6e691ff534307 |
HotpotQA | Homepage | hotpotqa | train dev test | 7,405 | 5.23M | 2.0 | Link | f412724f78b0d91183a0e86805e16114 |
FiQA-2018 | Homepage | fiqa | train dev test | 648 | 57K | 2.6 | Link | 17918ed23cd04fb15047f73e6c3bd9d9 |
Signal-1M(RT) | Homepage | signal1m | test | 97 | 2.86M | 19.6 | No | How to Reproduce? |
TREC-NEWS | Homepage | trec-news | test | 57 | 595K | 19.6 | No | How to Reproduce? |
ArguAna | Homepage | arguana | test | 1,406 | 8.67K | 1.0 | Link | 8ad3e3c2a5867cdced806d6503f29b99 |
Touche-2020 | Homepage | webis-touche2020 | test | 49 | 382K | 19.0 | Link | 46f650ba5a527fc69e0a6521c5a23563 |
CQADupstack | Homepage | cqadupstack | test | 13,145 | 457K | 1.4 | Link | 4e41456d7df8ee7760a7f866133bda78 |
Quora | Homepage | quora | dev test | 10,000 | 523K | 1.6 | Link | 18fb154900ba42a600f84b839c173167 |
DBPedia | Homepage | dbpedia-entity | dev test | 400 | 4.63M | 38.2 | Link | c2a39eb420a3164af735795df012ac2c |
SCIDOCS | Homepage | scidocs | test | 1,000 | 25K | 4.9 | Link | 38121350fc3a4d2f48850f6aff52e4a9 |
FEVER | Homepage | fever | train dev test | 6,666 | 5.42M | 1.2 | Link | 5a818580227bfb4b35bb6fa46d9b6c03 |
Climate-FEVER | Homepage | climate-fever | test | 1,535 | 5.42M | 3.0 | Link | 8b66f0a9126c521bae2bde127b4dc99d |
SciFact | Homepage | scifact | train test | 300 | 5K | 1.1 | Link | 5f7d1de60b170fc8027bb7898e2efca1 |
Robust04 | Homepage | robust04 | test | 249 | 528K | 69.9 | No | How to Reproduce? |
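
The Rel D/Q column is read here as the average number of relevant documents per query. A sketch of how such a figure could be computed from a qrels mapping, assuming the nested `{query_id: {doc_id: score}}` layout and counting any score above zero as relevant; the toy data and the function name `rel_docs_per_query` are hypothetical:

```python
from typing import Dict


def rel_docs_per_query(qrels: Dict[str, Dict[str, int]]) -> float:
    """Average number of documents with a positive relevance score per query."""
    if not qrels:
        return 0.0
    relevant = sum(
        sum(1 for score in docs.values() if score > 0) for docs in qrels.values()
    )
    return relevant / len(qrels)


# Toy qrels, made up for illustration (not taken from any BEIR dataset).
toy_qrels = {
    "q1": {"d1": 1, "d2": 0},
    "q2": {"d3": 2, "d4": 1, "d5": 1},
}

print(rel_docs_per_query(toy_qrels))  # (1 + 3) / 2 = 2.0
```

Whether the published statistics treat zero-scored judgments the same way is an assumption here, so small differences from the table are possible.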
Similar to TensorFlow Datasets or Hugging Face's datasets library, we just downloaded and prepared public datasets. We only distribute these datasets in a specific format; we do not vouch for their quality or fairness, or claim that you have a license to use them. It remains your responsibility as a user to determine whether you have permission to use a dataset under its license and to cite the rightful owner of the dataset.
If you're a dataset owner and wish to update any part of it, or do not want your dataset to be included in this library, feel free to post an issue here or make a pull request!
If you're a dataset owner and wish to include your dataset or model in this library, feel free to post an issue here or make a pull request!