
TREC CAST #6

Open
seanmacavaney opened this issue Nov 8, 2020 · 12 comments

@seanmacavaney
Collaborator

For conversational AI. http://www.treccast.ai/

Documents: Uses MS-MARCO, TREC CAR, and Washington Post collections.

Also includes lists of duplicate documents, it seems due to the combination of collections.

Queries/qrels:
Queries come in sequence (conversation turns): 30 training for Y1, 50 testing for Y1, 50 testing for Y2. The position in the sequence can be encoded as another field in the query object.

Query text includes #combine() syntax.
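
Since most retrieval toolkits don't parse the Indri query language, a loader (or a downstream user) may want to strip those operators. A minimal sketch in Python, assuming only non-nested #combine() wrappers (the function name and example string are illustrative):

import re

def strip_combine(query_text: str) -> str:
    # Unwrap Indri-style #combine( ... ) operators, keeping the inner terms.
    # Only handles non-nested operators.
    return re.sub(r"#combine\(\s*([^()]*?)\s*\)", r"\1", query_text).strip()

strip_combine("#combine(physician assistant career)")  # -> "physician assistant career"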

@seanmacavaney
Collaborator Author

Looks like it uses TREC CAR v2, so it depends on #5

@seanmacavaney
Collaborator Author

The task is happening again in 2021: https://trec.nist.gov/pubs/call2021.html

@seanmacavaney
Collaborator Author

WaPo collection added for #51

@seanmacavaney
Collaborator Author

Related to #80

@seanmacavaney
Collaborator Author

Proposed structure:

trec-cast # placeholder
trec-cast/2019 # corpus: MSMARCOv1 + CARv2 + WaPov2 (split by paragraph)  (MSMARCO & WaPo deduped per provided files)
trec-cast/2019/train # limited set of training topics provided
trec-cast/2019/train/judged # limited set of training topics provided, filtered down to only judged ones
trec-cast/2019/eval
trec-cast/2020 # corpus: MSMARCOv1 + CARv2  (any dedup??)
trec-cast/2021 # corpus: MSMARCOv1 + WaPo2020 + KILT  (dedup)

So, to get this going, we first need the underlying collections in place, then a component that merges and dedupes the corpus (per the dedup files), roughly like the sketch below.
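
A rough sketch of such a component, assuming each line of a duplicate file lists a canonical doc_id followed by its duplicates (the file format and names here are illustrative, not a final design):

import itertools

def load_duplicates(path):
    # Assumed line format (illustrative): "canonical_id:dup_id,dup_id,...".
    # The canonical document is kept; everything after the colon is dropped.
    dups = set()
    with open(path) as f:
        for line in f:
            _, _, rest = line.strip().partition(":")
            dups.update(d for d in rest.split(",") if d)
    return dups

def merged_docs_iter(collections, dup_paths):
    # Chain the underlying collections, skipping any doc flagged as a duplicate.
    dups = set().union(*(load_duplicates(p) for p in dup_paths))
    for doc in itertools.chain.from_iterable(collections):
        if doc.doc_id not in dups:
            yield doc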

After this, the topics and qrels should be easy.

@seanmacavaney
Collaborator Author

Progress made on this branch.

Noticed that WaPo v2 wasn't used for evaluation in 2019, so it should be removed. The tricky bit now is that 2019/train and 2019/train/judged do use WaPo v2, but 2019/eval does not. What to do... Give the different corpora v1, v2, ... names, as was done for PMC?

trec-cast # placeholder
trec-cast/v0 # corpus: MSMARCOv1 + CARv2 + WaPov2 (split by paragraph)  (MSMARCO & WaPo deduped per provided files)
trec-cast/v0/train # limited set of training topics provided
trec-cast/v0/train/judged # limited set of training topics provided, filtered down to only judged ones
trec-cast/v1 # corpus: MSMARCOv1 + CARv2 
trec-cast/v1/2019
trec-cast/v1/2020
trec-cast/v2 # corpus: MSMARCOv1 + WaPo2020 + KILT  (dedup)
trec-cast/v2/2021
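
If we go this way, downstream usage would look roughly like this (these dataset IDs are the proposed ones above, not registered yet):

import ir_datasets

# Training topics stay on the WaPo-inclusive corpus...
train = ir_datasets.load("trec-cast/v0/train/judged")

# ...while the 2019 and 2020 evaluation topics share the MSMARCOv1+CARv2 corpus.
cast_2019 = ir_datasets.load("trec-cast/v1/2019")
for query in cast_2019.queries_iter():
    print(query.query_id, query)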

@seanmacavaney
Collaborator Author

Getting closer to adding CAsT 2021 with the addition of KILT in #161

@bpiwowar
Contributor

Started working again on the integration of CAsT into ir_datasets.

In the current branch, the dependence on spaCy to reproduce the passage splitting done in CAsT is overly complex. I started working on an alternative solution that uses the official splits to match the original documents. This would require storing the offset files (along with hashes, to be on the safe side) on a server somewhere, but the advantage would be getting rid of the spaCy dependency.

The offset file looks like this:

...
{"id": "KILT_20189", "ranges": [[[0, 1338]], [[1341, 2437]], [[2440, 3682]], [[3685, 5023]], [[5026, 6439]], [[6442, 7670]], [[7672, 8444]], [[8447, 10270]], [[7672, 8444], [10273, 10794]], [[10796, 12094]], [[12096, 13437]], [[13440, 14750]], [[14752, 15808]], [[15810, 17226]], [[17228, 18461]], [[18465, 19862]], [[19865, 21125]], [[21127, 22422]], [[22424, 23794]], [[23796, 25072]], [[25074, 26118]], [[26120, 27370]], [[27372, 28645]], [[28647, 29884]], [[29886, 31213]], [[31215, 32461]], [[32463, 33458]], [[33460, 34733]], [[34735, 36013]], [[36015, 37100]], [[37102, 38434]], [[38437, 39293]], [[39296, 40513]], [[40516, 41636]], [[41639, 42761]], [[42764, 43634]]], "md5": "06058b7a8193d0cd9f1d5139abf36263"}
...

Each line specifies the character offsets of the passages composing a document (KILT_20189 in this example).

When processing with ir_datasets, using the offset file along with the original files makes it possible to recover the CAsT splitting (the whitespace introduced by spaCy is lost, but I don't think this is a big deal; if anything, it's an improvement).
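
To make the recovery step concrete, here's roughly how the offsets could be applied (the passage-ID scheme, and the assumption that md5 is computed over the UTF-8 bytes of the source text, are illustrative rather than taken from the branch):

import hashlib
import json

def iter_passages(offset_line, doc_text):
    record = json.loads(offset_line)
    # Sanity-check against the stored hash before trusting the offsets.
    if hashlib.md5(doc_text.encode("utf-8")).hexdigest() != record["md5"]:
        raise ValueError(f"hash mismatch for {record['id']}")
    for i, ranges in enumerate(record["ranges"]):
        # A passage may be stitched from several non-contiguous ranges,
        # e.g. [[7672, 8444], [10273, 10794]] in the example above.
        text = " ".join(doc_text[start:end] for start, end in ranges)
        yield f"{record['id']}-{i + 1}", text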

@seanmacavaney
Collaborator Author

Awesome, I like this approach a lot. It seems like a perfect compromise: the files can be downloaded from the original source (or obtained through the proper channels, in the case of WaPo) while avoiding the complexity of the spaCy dependency. Bravo!

@bpiwowar
Contributor

OK, so I will continue in this direction. Is there a storage location for ir_datasets-related files?

@seanmacavaney
Collaborator Author

There are several options. How big do you expect the offset files to be? They might be able to fit on mirror.ir-datasets.com (hosted via a GitHub site): https://github.com/seanmacavaney/irds-mirror/

If not, they could probably go up on Hugging Face.

@bpiwowar
Contributor

bpiwowar commented Jan 27, 2024

For the offset files, the total will be around 1.4GB (227M x 2 for KILT 2021 and 2022, 183M for MS MARCO v1, 550M for MS MARCO v2, 41M for WaPo).

What should I do with the Python script that generates them (could be useful for reference)?

I also started pull request #255; awaiting your comments.
