TREC CaST #255

Merged
merged 38 commits into from
May 24, 2024
38 commits
5a98598
cast wip
seanmacavaney Jun 1, 2022
b25087c
Merge remote-tracking branch 'origin/master' into cast
seanmacavaney Jun 8, 2022
40db27f
wip
seanmacavaney Jun 13, 2022
c24f669
Merge branch 'master' into cast
bpiwowar Dec 26, 2023
58e4839
Added queries_cls in filtered queries
bpiwowar Jan 29, 2024
cca69fe
WIP: CaST 2019-22
bpiwowar Jan 29, 2024
eb512f3
Merge remote-tracking branch 'origin' into cast
bpiwowar Jan 29, 2024
d1b0c38
Moved some generic classes with trec_cast
bpiwowar Jan 29, 2024
4f839ef
Fix paths and bugs
bpiwowar Jan 30, 2024
b3cff16
Re-organized trec-cast
bpiwowar Jan 30, 2024
b04bfe0
Removed support for Python 3.7
bpiwowar Jan 30, 2024
87e3f98
Test and fix for prefixed documents
bpiwowar Jan 30, 2024
880adc0
Fix docstore for multiple
bpiwowar Jan 30, 2024
1ad3209
Corrected hash for KILT offsets
bpiwowar Jan 30, 2024
bc7323c
Fixes and changes to prefixed documents
bpiwowar Jan 31, 2024
bea6c4c
Moved offsets to irds HF space
bpiwowar Jan 31, 2024
4d38c07
Updated test
bpiwowar Feb 1, 2024
21226e7
Added more tests on prefixed datasets
bpiwowar Feb 1, 2024
32d8c33
Test subset
bpiwowar Feb 1, 2024
93ddc55
More multiple test
bpiwowar Feb 1, 2024
5d85f70
Added data.py:
bpiwowar Feb 15, 2024
4602bd4
Materializes prefixed documents
bpiwowar Feb 22, 2024
cd6bd7e
update tests
seanmacavaney Feb 26, 2024
e7cb468
CaST: added passages sub-dataset when appropriate
bpiwowar Feb 26, 2024
5093e5e
Merge branch 'cast' of github.com:bpiwowar/ir_datasets into cast
bpiwowar Feb 26, 2024
51ca0b4
Fix wrong prefix in passages
bpiwowar Feb 26, 2024
ec7a7d0
Removed print
bpiwowar Feb 26, 2024
c6a3c9f
Small fixes
bpiwowar Mar 19, 2024
08584e5
submodule should be included in the distribution
bpiwowar Mar 19, 2024
82e2131
Fixed some bugs
bpiwowar May 11, 2024
cf05628
Merge remote-tracking branch 'origin' into cast
bpiwowar May 11, 2024
0785cbe
fix: bug in CaST 2022 queries
bpiwowar May 14, 2024
d9a718b
fix: bug in multiple docs
bpiwowar May 14, 2024
59f830f
fix: use enum value to get attribute name
bpiwowar May 17, 2024
cdc8fa2
Merge remote-tracking branch 'origin' into fix/count
bpiwowar May 17, 2024
0141f0f
Merge branch 'fix/count' into cast
bpiwowar May 17, 2024
e413eec
tests: fix the tests
bpiwowar May 21, 2024
3d74469
fixed TREC CaST tests
bpiwowar May 23, 2024
2 changes: 1 addition & 1 deletion .github/workflows/push.yml
@@ -14,7 +14,7 @@ jobs:
     strategy:
       fail-fast: false
       matrix:
-        python-version: [3.7, 3.8, 3.9, '3.10']
+        python-version: ['3.8', '3.9', '3.10']
         os: ['ubuntu-latest', 'windows-latest', 'macOs-latest']
         architecture: ['x64']

3 changes: 3 additions & 0 deletions ir_datasets/datasets/base.py
@@ -303,6 +303,9 @@ def queries_iter(self):
             if operator(query):
                 yield query
 
+    def queries_cls(self):
+        return self._queries_handler.queries_cls()
+
     def queries_handler(self):
         return self

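The base.py change makes a filtering wrapper report the same query type as the handler it wraps. A minimal sketch of that delegation pattern follows; `ToyQuery`, `ToyQueries`, and `FilteredToyQueries` are hypothetical stand-ins written for illustration and are not the classes in ir_datasets.

```python
from typing import Callable, Iterator, NamedTuple


class ToyQuery(NamedTuple):
    # Hypothetical query record standing in for a real queries_cls.
    query_id: str
    text: str


class ToyQueries:
    # Hypothetical underlying handler that knows its own query type.
    def __init__(self, queries):
        self._queries = list(queries)

    def queries_iter(self) -> Iterator[ToyQuery]:
        return iter(self._queries)

    def queries_cls(self):
        return ToyQuery


class FilteredToyQueries:
    # Hypothetical wrapper: filters queries, forwards type information.
    def __init__(self, queries_handler, operator: Callable[[ToyQuery], bool]):
        self._queries_handler = queries_handler
        self._operator = operator

    def queries_iter(self) -> Iterator[ToyQuery]:
        for query in self._queries_handler.queries_iter():
            if self._operator(query):
                yield query

    def queries_cls(self):
        # Delegate to the wrapped handler, mirroring the added method.
        return self._queries_handler.queries_cls()

    def queries_handler(self):
        return self


base = ToyQueries([ToyQuery("1", "what is cast"), ToyQuery("2", "kilt")])
filtered = FilteredToyQueries(base, lambda q: q.query_id != "2")
print(filtered.queries_cls().__name__)
print([q.query_id for q in filtered.queries_iter()])
```

Without the delegating method, a caller asking the wrapper for its query type would presumably have no way to learn the fields of the filtered queries; forwarding the call keeps the wrapper transparent in that respect.
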
47 changes: 25 additions & 22 deletions ir_datasets/datasets/kilt.py
@@ -66,28 +66,26 @@ def __init__(self, streamer, count_hint=None):
 
     @ir_datasets.util.use_docstore
     def docs_iter(self):
-        with self._streamer.stream() as stream:
-            for doc in stream:
-                doc = json.loads(doc)
-                yield KiltDoc(
-                    doc['wikipedia_id'],
-                    doc['wikipedia_title'],
-                    ''.join(strip_markup(t) for t in doc['text']),
-                    tuple(doc['text']),
-                    tuple(KiltDocAnchor(
-                        a['text'],
-                        a['href'],
-                        a['paragraph_id'],
-                        a['start'],
-                        a['end']) for a in doc['anchors']),
-                    tuple(doc['categories'].split(',')),
-                    doc.get('wikidata_info', {}).get('wikidata_id', ''),
-                    str(doc['history']['revid']),
-                    doc['history']['timestamp'],
-                    str(doc['history']['parentid']),
-                    str(doc['history']['pageid']),
-                    doc['history']['url'],
-                )
+        for doc in self.docs_kilt_raw_iter():
+            yield KiltDoc(
+                doc['wikipedia_id'],
+                doc['wikipedia_title'],
+                ''.join(strip_markup(t) for t in doc['text']),
+                tuple(doc['text']),
+                tuple(KiltDocAnchor(
+                    a['text'],
+                    a['href'],
+                    a['paragraph_id'],
+                    a['start'],
+                    a['end']) for a in doc['anchors']),
+                tuple(doc['categories'].split(',')),
+                doc.get('wikidata_info', {}).get('wikidata_id', ''),
+                str(doc['history']['revid']),
+                doc['history']['timestamp'],
+                str(doc['history']['parentid']),
+                str(doc['history']['pageid']),
+                doc['history']['url'],
+            )
 
     def docs_cls(self):
         return KiltDoc
@@ -112,6 +110,11 @@ def docs_namespace(self):
     def docs_lang(self):
         return 'en'
 
+    def docs_kilt_raw_iter(self):
+        with self._streamer.stream() as stream:
+            for doc in stream:
+                yield json.loads(doc)
+
 
 def _init():
     base_path = ir_datasets.util.home_path()/NAME
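
The kilt.py refactor splits JSON decoding (docs_kilt_raw_iter) from record construction (docs_iter), so other code can consume the decoded dictionaries without repeating the streaming logic. The sketch below illustrates that split on in-memory JSON lines. `ToyDoc`, `ToyKiltDocs`, `passages_iter`, and the `<doc_id>-<index>` passage ID format are assumptions made up for this example; only the field names `wikipedia_id`, `wikipedia_title`, and `text` come from the diff.

```python
import json
from typing import Iterator, NamedTuple, Tuple


class ToyDoc(NamedTuple):
    # Hypothetical, much smaller stand-in for KiltDoc.
    doc_id: str
    title: str
    text: str


class ToyKiltDocs:
    # Hypothetical reader over in-memory JSON lines instead of a streamer.
    def __init__(self, lines):
        self._lines = list(lines)

    def docs_kilt_raw_iter(self) -> Iterator[dict]:
        # Single place where the raw JSON is decoded.
        for line in self._lines:
            yield json.loads(line)

    def docs_iter(self) -> Iterator[ToyDoc]:
        # First consumer: builds whole-document records.
        for doc in self.docs_kilt_raw_iter():
            yield ToyDoc(
                doc['wikipedia_id'],
                doc['wikipedia_title'],
                ''.join(doc['text']),
            )

    def passages_iter(self) -> Iterator[Tuple[str, str]]:
        # Second, hypothetical consumer: one record per paragraph.
        for doc in self.docs_kilt_raw_iter():
            for idx, paragraph in enumerate(doc['text']):
                yield f"{doc['wikipedia_id']}-{idx}", paragraph


raw = json.dumps({
    'wikipedia_id': '12',
    'wikipedia_title': 'Anarchism',
    'text': ['Anarchism\n', 'Anarchism is a philosophy.\n'],
})
docs = ToyKiltDocs([raw])
print(list(docs.docs_iter()))
print(list(docs.passages_iter()))
```

Whether the PR uses the raw iterator in exactly this way is not visible in the hunks shown here; the point of the sketch is only that both consumers share one decoding path.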