
TREC CAST #6

Open
seanmacavaney opened this issue Nov 8, 2020 · 12 comments

@seanmacavaney
Collaborator

For conversational AI. http://www.treccast.ai/

Documents: Uses MS-MARCO, TREC CAR, and Washington Post collections.

Also includes lists of duplicate documents, it seems due to the combination of collections.

Queries/qrels:
Queries come in sequence (conversation turns): 30 training for Y1, 50 testing for Y1, 50 testing for Y2. The position in the sequence can be encoded as another field in the query object.

Query text includes #combine() syntax.
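
Since most retrieval toolkits don't parse the Indri query language, a loader (or a downstream user) may want to strip those operators. A minimal sketch in Python, assuming only non-nested #combine() wrappers (the function name and example string are illustrative):

import re

def strip_combine(query_text: str) -> str:
    # Unwrap Indri-style #combine( ... ) operators, keeping the inner terms.
    # Only handles non-nested operators.
    return re.sub(r"#combine\(\s*([^()]*?)\s*\)", r"\1", query_text).strip()

strip_combine("#combine(physician assistant career)")  # -> "physician assistant career"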

@seanmacavaney
Collaborator Author

Looks like it uses TREC CAR v2, so it depends on #5

@seanmacavaney
Collaborator Author

The task is happening again in 2021: https://trec.nist.gov/pubs/call2021.html

@seanmacavaney
Collaborator Author

WaPo collection added for #51

@seanmacavaney
Collaborator Author

Related to #80

@seanmacavaney
Collaborator Author

Proposed structure:

trec-cast # placeholder
trec-cast/2019 # corpus: MSMARCOv1 + CARv2 + WaPov2 (split by paragraph)  (MSMARCO & WaPo deduped per provided files)
trec-cast/2019/train # limited set of training topics provided
trec-cast/2019/train/judged # limited set of training topics provided, filtered down to only judged ones
trec-cast/2019/eval
trec-cast/2020 # corpus: MSMARCOv1 + CARv2  (any dedup??)
trec-cast/2021 # corpus: MSMARCOv1 + WaPo2020 + KILT  (dedup)

So, to get this going, we first need the underlying collections in place, then a component that merges and dedupes the corpus (per the dedup files), roughly like the sketch below.
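
A rough sketch of such a component, assuming each line of a duplicate file lists a canonical doc_id followed by its duplicates (the file format and names here are illustrative, not a final design):

import itertools

def load_duplicates(path):
    # Assumed line format (illustrative): "canonical_id:dup_id,dup_id,...".
    # The canonical document is kept; everything after the colon is dropped.
    dups = set()
    with open(path) as f:
        for line in f:
            _, _, rest = line.strip().partition(":")
            dups.update(d for d in rest.split(",") if d)
    return dups

def merged_docs_iter(collections, dup_paths):
    # Chain the underlying collections, skipping any doc flagged as a duplicate.
    dups = set().union(*(load_duplicates(p) for p in dup_paths))
    for doc in itertools.chain.from_iterable(collections):
        if doc.doc_id not in dups:
            yield doc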

After this, the topics and qrels should be easy.

@seanmacavaney
Collaborator Author

Progress made on this branch.

Noticed that WaPo v2 wasn't used for evaluation in 2019, so it should be removed. The tricky bit now is that 2019/train and 2019/train/judged do use WaPo v2, but 2019/eval does not. What to do... Give the different corpora v1, v2, ... names, as was done for PMC?

trec-cast # placeholder
trec-cast/v0 # corpus: MSMARCOv1 + CARv2 + WaPov2 (split by paragraph)  (MSMARCO & WaPo deduped per provided files)
trec-cast/v0/train # limited set of training topics provided
trec-cast/v0/train/judged # limited set of training topics provided, filtered down to only judged ones
trec-cast/v1 # corpus: MSMARCOv1 + CARv2 
trec-cast/v1/2019
trec-cast/v1/2020
trec-cast/v2 # corpus: MSMARCOv1 + WaPo2020 + KILT  (dedup)
trec-cast/v2/2021
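
If we go this way, downstream usage would look roughly like this (these dataset IDs are the proposed ones above, not registered yet):

import ir_datasets

# Training topics stay on the WaPo-inclusive corpus...
train = ir_datasets.load("trec-cast/v0/train/judged")

# ...while the 2019 and 2020 evaluation topics share the MSMARCOv1+CARv2 corpus.
cast_2019 = ir_datasets.load("trec-cast/v1/2019")
for query in cast_2019.queries_iter():
    print(query.query_id, query)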

@seanmacavaney
Collaborator Author

Getting closer to adding CAsT 2021 with the addition of KILT in #161

@bpiwowar
Contributor

Started working again on the integration of CAsT into ir_datasets.

In the current branch, the dependence on spaCy to reproduce the passage splitting done in CAsT is overly complex. I started working on an alternative solution that uses the official splits to match the original documents. This would require storing the offset files (along with hashes, to be on the safe side) on a server somewhere, but the advantage would be getting rid of the spaCy dependency.

The offset file looks like this:

...
{"id": "KILT_20189", "ranges": [[[0, 1338]], [[1341, 2437]], [[2440, 3682]], [[3685, 5023]], [[5026, 6439]], [[6442, 7670]], [[7672, 8444]], [[8447, 10270]], [[7672, 8444], [10273, 10794]], [[10796, 12094]], [[12096, 13437]], [[13440, 14750]], [[14752, 15808]], [[15810, 17226]], [[17228, 18461]], [[18465, 19862]], [[19865, 21125]], [[21127, 22422]], [[22424, 23794]], [[23796, 25072]], [[25074, 26118]], [[26120, 27370]], [[27372, 28645]], [[28647, 29884]], [[29886, 31213]], [[31215, 32461]], [[32463, 33458]], [[33460, 34733]], [[34735, 36013]], [[36015, 37100]], [[37102, 38434]], [[38437, 39293]], [[39296, 40513]], [[40516, 41636]], [[41639, 42761]], [[42764, 43634]]], "md5": "06058b7a8193d0cd9f1d5139abf36263"}
...

Each line specifies the character offsets of the passages composing a document (KILT_20189 in this example).

When processing with ir_datasets, using the offset file along with the original files makes it possible to recover the CAsT splitting (the whitespace introduced by spaCy is lost, but I don't think this is a big deal; if anything, it's an improvement).
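
To make the recovery step concrete, here's roughly how the offsets could be applied (the passage-ID scheme, and the assumption that md5 is computed over the UTF-8 bytes of the source text, are illustrative rather than taken from the branch):

import hashlib
import json

def iter_passages(offset_line, doc_text):
    record = json.loads(offset_line)
    # Sanity-check against the stored hash before trusting the offsets.
    if hashlib.md5(doc_text.encode("utf-8")).hexdigest() != record["md5"]:
        raise ValueError(f"hash mismatch for {record['id']}")
    for i, ranges in enumerate(record["ranges"]):
        # A passage may be stitched from several non-contiguous ranges,
        # e.g. [[7672, 8444], [10273, 10794]] in the example above.
        text = " ".join(doc_text[start:end] for start, end in ranges)
        yield f"{record['id']}-{i + 1}", text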

@seanmacavaney
Collaborator Author

Awesome, I like this approach a lot. It seems like a perfect compromise: the files can be downloaded from the original source (or obtained through the proper channels, in the case of WaPo) while avoiding the complexity of the spaCy dependency. Bravo!

@bpiwowar
Contributor

OK, so I will continue in this direction. Is there a storage location for ir_datasets-related files?

@seanmacavaney
Collaborator Author

There are several options. How big do you expect the offset files to be? They might be able to fit on mirror.ir-datasets.com (hosted via a GitHub site): https://github.com/seanmacavaney/irds-mirror/

If not, they could probably go up on Hugging Face.

@bpiwowar
Contributor

bpiwowar commented Jan 27, 2024

For the offset files, the total will be around 1.4GB (227M x 2 for KILT 2021 and 2022, 183M for MS MARCO v1, 550M for MS MARCO v2, 41M for WaPo).

What should I do with the Python script that generates them (could be useful for reference)?

I also started pull request #255; awaiting your comments.
