
MS MARCO v2.1 and v2.1 segmented for TREC 2024 RAG #267

Open
5 of 8 tasks
mam10eks opened this issue Jun 21, 2024 · 3 comments

Comments

mam10eks (Contributor) commented Jun 21, 2024

Dataset Information:

It would be awesome to have the document corpus (and its segmented counterpart) used in TREC RAG 2024 integrated into ir_datasets. Based on the description on the web page, adding it should be straightforward. Random access to documents should also be very efficient, since the file and byte offset are already encoded in the document identifiers.

The only question I have: since the document identifiers contain the offset at which a document starts (but not where it ends), is there already functionality that seeks to that start and reads the JSON entry until the closing bracket? If not, I could add this as well, with unit tests; it should be no problem.
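If no such helper exists yet, a minimal sketch of the idea might look like the following. It assumes each document is stored as one JSON object per line (JSONL), so reading from the byte offset up to the next newline yields the complete entry; the file and the recorded offsets here are made up purely for illustration:

```python
import json
import tempfile

def read_doc_at_offset(path, offset):
    """Seek to a known byte offset and read one JSON-per-line document."""
    with open(path, "rb") as f:
        f.seek(offset)
        line = f.readline()  # a JSONL record ends at the newline
    return json.loads(line)

# Tiny demo: write a throwaway JSONL file and record each record's offset.
docs = [{"docid": "d1", "body": "first"}, {"docid": "d2", "body": "second"}]
with tempfile.NamedTemporaryFile("wb", suffix=".jsonl", delete=False) as f:
    offsets = []
    for doc in docs:
        offsets.append(f.tell())
        f.write(json.dumps(doc).encode() + b"\n")

print(read_doc_at_offset(f.name, offsets[1])["docid"])  # d2
```

If the corpus files were instead compressed or not newline-delimited, the read step would need to parse until the matching closing bracket rather than until the newline.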

Links to Resources:

Dataset ID(s) & supported entities:

  • msmarco-document-v2.1: for the original documents
  • msmarco-document-v2.1/segmented: for the segmented documents

Checklist

Mark each task once completed. All should be checked prior to merging a new dataset.

  • Dataset definition (in ir_datasets/datasets/[topid].py)
  • Tests (in tests/integration/[topid].py)
  • Metadata generated (using ir_datasets generate_metadata command, should appear in ir_datasets/etc/metadata.json)
  • Documentation (in ir_datasets/etc/[topid].yaml)
  • Downloadable content (in ir_datasets/etc/downloads.json)
    • Download verification action (in .github/workflows/verify_downloads.yml). Only one needed per topid.
    • Any small public files from NIST (or other potentially troublesome files) mirrored in https://github.com/seanmacavaney/irds-mirror/. Mirrored status properly reflected in downloads.json.

Additional comments/concerns/ideas/etc.

mam10eks (Contributor, Author) commented:

Dear all, I would be open to making a first proposal for an implementation here.

mam10eks (Contributor, Author) commented Aug 5, 2024

Dear all, I started a draft pull request (only to indicate that there is some progress): #269

Mainly documentation todos are pending, but as the deadline is close, this might already be useful for others even though the documentation is not yet finalized.

That is, the main use case, iterating over documents, already works (e.g., as covered in the unit tests):

for doc in ir_datasets.load('msmarco-document-v2.1/segmented').docs_iter():
    print(doc)
    break
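For random access by identifier, ir_datasets also exposes a docs store (`dataset.docs_store().get(doc_id)`). The offsets that make such lookups efficient can be illustrated directly: assuming identifiers of the form `<prefix>_doc_<file>_<offset>` (an assumption based on the corpus description; the real layout, especially for segmented IDs, may differ), a hypothetical parse looks like:

```python
def parse_msmarco_v2_docid(docid):
    """Split a doc ID of the assumed form '<prefix>_doc_<file>_<offset>'
    into the corpus prefix, source file number, and byte offset."""
    prefix, _, tail = docid.rpartition("_doc_")
    file_no, _, offset = tail.partition("_")
    return prefix, int(file_no), int(offset)

print(parse_msmarco_v2_docid("msmarco_v2.1_doc_00_12345"))
# ('msmarco_v2.1', 0, 12345)
```

With the file number and byte offset in hand, a lookup can open exactly one file and seek straight to the document, without scanning or a separate index.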

seanmacavaney (Collaborator) commented:

Awesome, thanks! I'll take a look at it tomorrow and see if I can tick some of the other tasks :)
