TREC CAST #6
Looks like it uses TREC CAR v2, so depends on #5
The task is happening again in 2021: https://trec.nist.gov/pubs/call2021.html
Related to #80
Proposed structure:
So, to get this going, we need to:
Then have a component that merges and dedupes the corpus (per the dedup files). After this, the topics and qrels should be easy.
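A rough sketch of what that merge-and-dedupe component could look like. The duplicate-file format used here (one line per canonical id, followed by the ids to drop) and the function names are illustrative assumptions, not the actual CAsT file layout or ir_datasets API:

```python
def load_duplicates(dup_path):
    """Return the set of document ids to drop from the merged corpus.

    Assumes each line looks like 'canonical_id:dup_id1,dup_id2,...';
    only the duplicates (not the canonical id) are dropped.
    """
    to_drop = set()
    with open(dup_path, 'rt', encoding='utf8') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            _canonical, _, dups = line.partition(':')
            to_drop.update(d for d in dups.split(',') if d)
    return to_drop


def merged_docs(doc_iters, dup_path):
    """Chain the per-collection document iterators (MS MARCO, CAR, WaPo),
    skipping any document listed as a duplicate."""
    to_drop = load_duplicates(dup_path)
    for docs in doc_iters:
        for doc in docs:
            if doc.doc_id not in to_drop:
                yield doc
```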
Progress made on this branch. Noticed that WaPo v2 wasn't used for evaluation in 2019, so it should be removed. The tricky bit now is that the
Getting closer to adding CAsT 2021 with the addition of KILT in #161
Started to work again on the integration of CAsT into ir_datasets. The dependence on spaCy to reproduce the splitting done in CAsT makes the current branch overly complex. I started working on an alternative solution that uses the official splits to match the original documents. This would require storing the offset files (along with hashes, to be on the safe side) on some server, but the advantage would be getting rid of the spaCy dependence. The offset file looks like this ...
{"id": "KILT_20189", "ranges": [[[0, 1338]], [[1341, 2437]], [[2440, 3682]], [[3685, 5023]], [[5026, 6439]], [[6442, 7670]], [[7672, 8444]], [[8447, 10270]], [[7672, 8444], [10273, 10794]], [[10796, 12094]], [[12096, 13437]], [[13440, 14750]], [[14752, 15808]], [[15810, 17226]], [[17228, 18461]], [[18465, 19862]], [[19865, 21125]], [[21127, 22422]], [[22424, 23794]], [[23796, 25072]], [[25074, 26118]], [[26120, 27370]], [[27372, 28645]], [[28647, 29884]], [[29886, 31213]], [[31215, 32461]], [[32463, 33458]], [[33460, 34733]], [[34735, 36013]], [[36015, 37100]], [[37102, 38434]], [[38437, 39293]], [[39296, 40513]], [[40516, 41636]], [[41639, 42761]], [[42764, 43634]]], "md5": "06058b7a8193d0cd9f1d5139abf36263"}
... which specifies the offsets of the different passages composing the document. When processing with
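A minimal sketch of how such an offset record could be applied to rebuild the passages. The record layout follows the example above (one JSON object per line with "id", "ranges", and "md5"); the `get_doc_text` callable, the passage-id scheme, and the assumption that the md5 covers the full document text are mine:

```python
import hashlib
import json

def iter_passages(offsets_path, get_doc_text):
    """Yield (passage_id, passage_text) pairs by applying the stored
    character ranges to the full source document text."""
    with open(offsets_path, 'rt', encoding='utf8') as f:
        for line in f:
            record = json.loads(line)
            doc_text = get_doc_text(record['id'])  # e.g., look up the KILT document
            # Sanity check: the source text should match what the offsets were
            # computed against (assuming the md5 is over the full document text).
            assert hashlib.md5(doc_text.encode('utf8')).hexdigest() == record['md5']
            for idx, spans in enumerate(record['ranges'], start=1):
                # A passage may be composed of several (start, end) spans.
                passage = ' '.join(doc_text[start:end] for start, end in spans)
                yield f"{record['id']}-{idx}", passage
```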
Awesome, I like this approach a lot. It seems like a perfect compromise that allows the files to be downloaded from the original source (or obtained through the proper channels, in the case of WaPo) while also avoiding the complexity of the spaCy dependence. Bravo!
OK, so I will continue in this direction. Is there a storage location for
There are several options. How big do you expect the offset files to be? They might be able to fit on mirror.ir-datasets.com (hosted via a GitHub site: https://github.com/seanmacavaney/irds-mirror/). If not, they could probably go up on Hugging Face.
For the offset files, the total will be around 1.4 GB (227 MB x 2 for KILT 2021 and 2022, 183 MB for MS MARCO V1, 550 MB for MS MARCO V2, 41 MB for WaPo). What should I do with the Python script that generates them (it could be useful for reference)? I also started pull request #255 and am waiting for your comments.
For conversational AI. http://www.treccast.ai/
Documents: Uses MS-MARCO, TREC CAR, and Washington Post collections.
Also includes a list of duplicate documents, presumably due to the combination of collections.
Queries/qrels:
Queries in sequence: 30 training for Y1, 50 testing for Y1, 50 testing for Y2.
Queries are in sequence. The sequence can be encoded as another field in the
Query text includes #combine() syntax.
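A minimal sketch (hypothetical field names, not the actual ir_datasets query type) of how the sequence could be carried as extra query fields, and how the #combine() wrapper could be stripped from the raw text:

```python
import re
from typing import NamedTuple

class CastQuery(NamedTuple):
    query_id: str       # e.g., "31_1" for topic 31, turn 1 (id scheme assumed)
    raw_utterance: str  # text as provided; may contain #combine(...) syntax
    topic_number: int   # conversation (sequence) id
    turn_number: int    # position within the conversation

def strip_combine(text: str) -> str:
    """Remove an Indri-style #combine( ... ) wrapper, keeping the inner terms."""
    return re.sub(r'#combine\(\s*(.*?)\s*\)', r'\1', text)

# Illustrative query, not an actual CAsT topic:
q = CastQuery('31_1', '#combine(history of the electric guitar)', 31, 1)
print(strip_combine(q.raw_utterance))  # -> history of the electric guitar
```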