Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TREC CaST #255

Merged
merged 38 commits into from
May 24, 2024
Merged

TREC CaST #255

merged 38 commits into from
May 24, 2024

Conversation

bpiwowar
Copy link
Contributor

@bpiwowar bpiwowar commented Jan 29, 2024

This pull request is for TREC CaST 2019 to 2022

Generic classes

For the moment, it contains generic handler classes that might be moved in other places (and need to be tested):

  • PrefixedDocs that allows to use document collections that are merged using a collection-specific prefix to the ID of each document
  • DocsSubset and Dupes: handle duplicates in any document collection

Cleanup before merge

I suggest the following steps before merging with master:

  • WAPO in v3 (2021) has no clearly associated duplicate file (as it is the case in 2022), so we have two nearly identical collections. Is this right?
  • Decide where to put the file that computes the offsets (in case it is needed later)
  • Test the generic classes
    • Test prefixed documents
    • Test subsets
  • Use URLs for offset files in downloads.json (so move them)
  • Set the real number of documents

@bpiwowar bpiwowar mentioned this pull request Jan 29, 2024
@seanmacavaney
Copy link
Collaborator

Great, thanks!

I won't have much time to contribute directly to this for the next week or so.

Regarding the tasks:

WAPO in v3 (2021) has no clearly associated duplicate file (as it is the case in 2022), so we have two nearly identical collections. Is this right?

This is my understanding, yes. We can prefix each collection by year.

Decide where to put the file that computes the offsets (in case it is needed later)

My vote is to put it on huggingface, and potentially include a backup location too. I can add it under https://huggingface.co/irds.

@bpiwowar
Copy link
Contributor Author

OK this is done (uploaded offset files on HF).

I see you support python 3.7, is it still ongoing or are you planning to switch to 3.8 ?

@seanmacavaney
Copy link
Collaborator

Given that 3.7's reached End of Service, I think it's reasonable to bump the minimum Python version to 3.8. Especially if there's features from the core library that you want to use from 3.8.

@bpiwowar
Copy link
Contributor Author

OK great, should I modify the CI myself or will you do it in the main branch

Remaining quick question (for now): should I move the generic classes outside of trec_cast, and if so, where (see first post in PR)?

I will test the PR them a bit thoroughly by testing my models on TREC CaST so no emergency

@seanmacavaney
Copy link
Collaborator

The generic classes can move outside the cast directory. I'm keen to apply them in other settings.

@seanmacavaney
Copy link
Collaborator

You can modify the CI in this branch, that's alright.

@bpiwowar
Copy link
Contributor Author

Can you give "write" access temporarily to https://huggingface.co/irds (so I can transfer the ownership) or otherwise copy the files from https://huggingface.co/datasets/bpiwowar/trec_cast_offsets/tree/main

@seanmacavaney
Copy link
Collaborator

Can you give "write" access temporarily

Done!

@seanmacavaney
Copy link
Collaborator

I'm having trouble running some of the tests locally:

/opt/miniconda3/lib/python3.10/importlib/__init__.py:126: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
test/util/docs/test_subset.py:2: in <module>
    from .data import FakeDocs
E   ModuleNotFoundError: No module named 'docs.data'

Is there a missing test/util/docs/data.py file?

@bpiwowar
Copy link
Contributor Author

I modified the code so the LZ4 store is built (I kept the "on-the-fly" code though since for simple cases, where only a prefix is added, this might still be useful)

@seanmacavaney
Copy link
Collaborator

Thanks! I'm building the tests and such now :)

@seanmacavaney
Copy link
Collaborator

I think it's mostly there!

I'm only having some problems with v2. First, when I try to do a lookup, it fails in DocsSubset.docs_count, I think since the docs_count() isn't known for a filtered dataset. Second, although I'm not super familiar with CAST, I think the v2 corpus return individual passages instead of entire documents? [specification here] Otherwise the qrels won't align with the doc_ids returned.

@bpiwowar
Copy link
Contributor Author

Great!

For v2 (if I understood correctly), assessments are at the document level - even though there is an official passage corpus. I added a v2/passages that concatenates all the passages from the three datasets.

Can you give me the command for the lookup you are testing with?

@andreaschari
Copy link
Contributor

Ran both integration tests and the tests in util/docs.
I have 2 out of 3 util/docs tests failing:

  1. FAILED test/util/docs/test_multiple.py::test_multiple_prefixes - NameError: name 'OtherDoc' is not defined

Self-explanatory, the OtherDoc being set as docs_cls in line 47 of test_multiple.py isn't defined anywhere in the test code.

  1. FAILED test/util/docs/test_multiple.py::test_multiple_prefixes_inlined - AttributeError: 'NoneType' object has no attribute '_docs_cls'

Error's coming from line 74: assert all_docs.docs_cls() == GenericDoc

As for the integration tests:

FAILED test/integration/trec_cast.py::TestTrecCast::test_queries - AssertionError: '' != None

@bpiwowar
Copy link
Contributor Author

OK, the integration tests should be fixed now

@andreaschari
Copy link
Contributor

The tests in test_multiple.py are passing now! But still not passing TestTrecCast::test_queries:

FAILED test/integration/trec_cast.py::TestTrecCast::test_queries - AssertionError: '' != None

This is the entire error trace:

test/integration/trec_cast.py:61: 
test/integration/base.py:61: in _test_queries
    self._assert_namedtuple(query, items[i])                                                                                                     
test/integration/base.py:266: in _assert_namedtuple  
    self.assertEqual(v_a, v_b) 
E   AssertionError: '' != None

@bpiwowar
Copy link
Contributor Author

The tests should run now

@seanmacavaney seanmacavaney merged commit 6a429ed into allenai:master May 24, 2024
1 check passed
@seanmacavaney
Copy link
Collaborator

Thanks @bpiwowar and @andreaschari !

@fschlatt fschlatt mentioned this pull request Jun 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants