
[experimental] Run crosshair in CI #4034

Draft · Zac-HD wants to merge 4 commits into master from crosshair-in-ci
Conversation

@Zac-HD (Member) commented Jul 7, 2024

See #3914

To reproduce this locally, you can run `make check-crosshair-cover/nocover/niche` for the same commands as in CI, but I'd recommend `pytest --hypothesis-profile=crosshair hypothesis-python/tests/{cover,nocover,datetime} -m xf_crosshair --runxfail` to select and run only the xfailed tests.
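For reference, a minimal sketch of what such a `crosshair` profile registration could look like (assuming only the settings mentioned in this thread; the actual profile in this branch may set more options):

```python
# conftest.py - a minimal sketch, assuming the profile only needs the backend
# plus the two health-check suppressions discussed below
from hypothesis import HealthCheck, settings

settings.register_profile(
    "crosshair",
    backend="crosshair",  # route generation through hypothesis-crosshair
    suppress_health_check=[HealthCheck.too_slow, HealthCheck.filter_too_much],
)
# then select it on the command line: pytest --hypothesis-profile=crosshair ...
```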

Hypothesis' problems

  • The vast majority of failures are `Flaky: Inconsistent results from replaying a failing test...` - mostly backend-specific failures; we've both
    • improved reporting in this case to show the crosshair-specific traceback
    • got most of the affected tests passing
  • Invalid internal boolean probability, e.g. `"hypothesis/internal/conjecture/data.py", line 2277, in draw_boolean` with `assert p > 2 ** (-64)`; fixed in 1f845e0 (#4049)
  • many of our test helpers involved nested use of `@given`; fixed in 3315be6
  • symbolic outside context
  • avoid uninstalling typing_extensions when crosshair depends on it
  • tests which are not really expected to pass on other backends. I'm slowly applying a backend-specific xfail decorator, `@xfail_on_crosshair(...)`, to them.
    • tests which expect to raise a healthcheck, and fail because our crosshair profile disables healthchecks. Disable only `.too_slow` and `.filter_too_much`, and skip the remaining affected tests under crosshair.
    • undo some over-broad skips, e.g. various xfail decorators, pytestmarks, `-k 'not decimal'`, once we're closer
  • provide a special exception type for when running the test or realizing values would hit a PathTimeout; see Rare PathTimeout errors in provider.realize(...) pschanely/hypothesis-crosshair#21 and Stable support for symbolic execution #3914 (comment)
    • and something to signal that we've exhausted Crosshair's ability to explore the test. If this is sound, we've verified the function and can stop! (and should record that in the stop_reason). If unsound, we can continue testing with Hypothesis' default backend - so it's important to distinguish the two (see the sketch after this list).
      Add BackendCannotProceed to improve integration #4092
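A sketch of how a backend could use `BackendCannotProceed` as added in #4092 (the provider attributes below are hypothetical, and the scope strings reflect my reading of that PR rather than a confirmed API):

```python
from hypothesis.errors import BackendCannotProceed

class SymbolicProvider:  # hypothetical third-party PrimitiveProvider
    def draw_integer(self, *args, **kwargs):
        if self.fully_explored:  # hypothetical: all paths covered, soundly
            # sound exhaustion: the engine can stop early and record the
            # test as verified in stop_reason
            raise BackendCannotProceed("verified")
        if self.gave_up:  # hypothetical: exploration stopped unsoundly
            # unsound exhaustion: hand the rest of the run back to
            # Hypothesis' default backend
            raise BackendCannotProceed("exhausted")
        ...

    def realize(self, value):
        if self.hit_path_timeout:  # hypothetical PathTimeout flag
            # discard this test case cleanly instead of surfacing a
            # confusing Flaky failure
            raise BackendCannotProceed("discard_test_case")
        ...
```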

Probably Crosshair's problems

Error in `operator.eq(Decimal('sNaN'), an_int)`

```
____ test_rewriting_does_not_compare_decimal_snan ____
  File "hypothesis/strategies/_internal/strategies.py", line 1017, in do_filtered_draw
    if self.condition(value):
TypeError: argument must be an integer
while generating 's' from integers(min_value=1, max_value=5).filter(functools.partial(eq, Decimal('sNaN')))
```
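For context, a rough reconstruction of what that test exercises (not the exact source; comparing a signaling NaN with `==` raises `decimal.InvalidOperation`, so filter rewriting must not evaluate the predicate eagerly):

```python
import functools
import operator
from decimal import Decimal, InvalidOperation

import pytest
from hypothesis import strategies as st

def test_rewriting_does_not_compare_decimal_snan():
    # constructing the filtered strategy must not call the predicate:
    # Decimal("sNaN") == <int> raises InvalidOperation rather than
    # returning False
    s = st.integers(min_value=1, max_value=5).filter(
        functools.partial(operator.eq, Decimal("sNaN"))
    )
    # the signal should only fire once a value is actually drawn
    with pytest.raises(InvalidOperation):
        s.example()
```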

Cases where crosshair doesn't find a failing example but Hypothesis does

This seems fine; there are plenty of cases in the other direction. Tracked with `@xfail_on_crosshair(Why.undiscovered)` in case we want to dig in later.
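The decorator lives in the test suite; a minimal sketch of how it could be implemented (the `Why` reasons and `xf_crosshair` mark are names from this PR, while the backend detection via an environment variable is my assumption):

```python
import enum
import os

import pytest

class Why(enum.Enum):
    undiscovered = "crosshair didn't find the failure within the test budget"
    other = "see the notes in this PR"

def xfail_on_crosshair(why, *, strict=True):
    def apply(test_fn):
        # tag the test so `-m xf_crosshair` selects it (see the tips above)
        test_fn = pytest.mark.xf_crosshair(test_fn)
        return pytest.mark.xfail(
            condition=os.environ.get("HYPOTHESIS_PROFILE") == "crosshair",
            reason=why.value,
            strict=strict,
        )(test_fn)
    return apply

@xfail_on_crosshair(Why.undiscovered)
def test_something():
    ...
```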

Nested use of the Hypothesis engine (e.g. given-inside-given)

This is just explicitly unsupported for now. Hypothesis should probably offer some way for backends to declare that they don't support this, and then raise a helpful error message if you try anyway.

@Zac-HD added labels tests/build/CI (about testing or deployment *of* Hypothesis) and interop (how to play nicely with other packages) on Jul 7, 2024

@Zac-HD force-pushed the crosshair-in-ci branch 3 times, most recently from 175b347 to 424943f on July 7, 2024
@pschanely (Contributor) commented:

@Zac-HD your triage above is SO great. I am investigating.

@pschanely (Contributor) commented Jul 8, 2024

Knocked out a few of these in 0.0.60.
I think that means current status on my end is:

  • `TypeError: conversion from SymbolicInt to Decimal is not supported`
  • `unsupported operand type(s) for -: 'float' and 'SymbolicFloat'` in `test_float_clamper`
  • `TypeError: descriptor 'keys' for 'dict' objects doesn't apply to a 'ShellMutableMap' object` (or `'values'` or `'items'`) - illustrated after this list
  • `TypeError: _int() got an unexpected keyword argument 'base'`
  • Symbolic not realized (in e.g. `test_suppressing_filtering_health_check`)
  • Error in `operator.eq(Decimal('sNaN'), an_int)`
  • Zac's cursed example below!
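The `dict` descriptor failure reproduces without any symbolics, since unbound `dict` methods reject anything that isn't a real dict subclass (the class below just stands in for crosshair's `ShellMutableMap`):

```python
class FakeMap:  # stand-in for ShellMutableMap: dict-like but not a dict
    def keys(self):
        return []

try:
    dict.keys(FakeMap())
except TypeError as e:
    # descriptor 'keys' for 'dict' objects doesn't apply to a 'FakeMap' object
    print(e)
```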

More soon.

@Zac-HD (Member, Author) commented Jul 12, 2024

Ah - the Flaky failures are of course because we had some failure under the Crosshair backend which did not reproduce under the Hypothesis backend. This presumably points to a range of integration bugs, but it's also something we'll want to explain clearly to users, because integration bugs are definitely going to happen in the future and users will need to respond (by e.g. using a different backend, or ignoring the problem).

  • improve the reporting around Flaky failures where the differing or missing errors are related to a change of backend while shrinking. See also Change Flaky to be an ExceptionGroup #4040, and the sketch after this list.
  • triage all the current failures so we can fix them
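An illustration of the reporting direction (a plain `ExceptionGroup`, Python 3.11+; the actual Flaky types in #4040 differ): carry both the backend-specific failure and the replay outcome, so the change of backend is visible to the user:

```python
# illustration only: what a backend-aware Flaky report could carry
crosshair_error = AssertionError("test failed under backend='crosshair'")
replay_note = RuntimeError(
    "failure did not reproduce under Hypothesis' default backend"
)
raise ExceptionGroup(
    "Inconsistent results from replaying a failing test case",
    [crosshair_error, replay_note],
)
```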


@tybug (Member) commented Jul 12, 2024

Most/all of the "expected x, got symbolic" errors are symptoms of an underlying error, in my experience (often an operation on a symbolic while not tracing). In this case, running with `export HYPOTHESIS_NO_TRACEBACK_TRIM=1` reveals that `limited_category_index_cache` in `cm.query` is at fault.
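(For anyone following along: the same switch can be flipped from Python, as long as it happens before Hypothesis builds the report; setting it in the shell before starting pytest is the safest route.)

```python
import os

# show untrimmed tracebacks, including Hypothesis-internal frames such as
# the caching helper identified above
os.environ["HYPOTHESIS_NO_TRACEBACK_TRIM"] = "1"
```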

@Zac-HD
Copy link
Member Author

Zac-HD commented Jul 12, 2024

Ah-ha - seems like we might want some #4029-style "don't cache on backends with `avoid_realize=True`" logic.
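A sketch of what that #4029-style logic could look like (all names here are hypothetical, modeled on the cache mentioned above):

```python
class CategoryMapper:  # hypothetical home of limited_category_index_cache
    def __init__(self, provider):
        self.provider = provider
        self._cache = {}

    def query(self, key):
        # symbolic backends ask us not to realize (and so not to hash or
        # compare) their values, so bypass the cache entirely for them
        if getattr(self.provider, "avoid_realize", False):
            return self._compute(key)
        if key not in self._cache:
            self._cache[key] = self._compute(key)
        return self._cache[key]

    def _compute(self, key):
        ...  # the uncached lookup
```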

@Zac-HD force-pushed the crosshair-in-ci branch 2 times, most recently from 1d2345d to 7bf8983 on July 12, 2024
@pschanely (Contributor) commented:

Still here and excited about this! I'm on a detour to do a real symbolic implementation of the decimal module - I should get that out this weekend.

@Zac-HD force-pushed the crosshair-in-ci branch 2 times, most recently from cc07927 to 018ccab on July 13, 2024
@Zac-HD (Member, Author) commented Jul 13, 2024

Triaging a pile of the Flaky errors: most were due to getting a RecursionError under crosshair and then passing under Hypothesis - and it looks like most of those were, in turn, because of all our nested-`@given()` test helpers.

So I've tried de-nesting those, which seems to work nicely and even makes things a bit faster by default; and when CI finishes we'll see how much it helps on crosshair 🤞
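For illustration, the de-nesting pattern looks roughly like this (simplified; not the exact helpers from this branch):

```python
from hypothesis import given, strategies as st

# before: a helper that runs a second engine inside the outer test,
# which is RecursionError-prone under crosshair
def check_roundtrip(strategy):
    @given(strategy)
    def inner(value):
        assert value == value
    inner()

# after: a single top-level @given, drawing inside the test instead
@given(st.data())
def test_roundtrip_denested(data):
    value = data.draw(st.integers())
    assert value == value
```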


@Zac-HD (Member, Author) commented Oct 10, 2024

@pschanely huge progress from recent updates! The `BackendCannotProceed` mechanism entirely fixed several classes of issues, the floats changes have been great (signed zero ftw!), `from_type()` generates instances more often, I'm no longer skipping categories of stuff, and overall we've dropped from about +350 to +250 lines of code in this PR 🎊

At this point my only real reason to avoid merging is that crosshair updates often cause a fair bit of churn, causing some tests to start failing and some to start xpassing - it's net-good, but would be toil in our CI. I feel like we've crossed from an alpha version which is a neat proof of concept, to a beta version which is still early but already both useful and clearly on a path to stability and wider adoption. Incredibly excited about this ✨


If you want to pull out Crosshair issues,

  • this PR is probably useful as a pre-release test, to check whether there are any regressions you didn't expect
  • there's a commit marking some things that look like Crosshair bugs to me, and many more where Crosshair just doesn't find a failure that Hypothesis does (within the test budget, and which might or might not be a problem)
  • there's a commit full of tests skipped because they were very slow, if you want to look at performance issues. I haven't audited it lately, but would guess at least a third are still slow and also Crosshair's problem.
  • the last big commit is pretty messy, probably best to ignore that for now

@pschanely (Contributor) commented:

> @pschanely huge progress from recent updates! The `BackendCannotProceed` mechanism entirely fixed several classes of issues, the floats changes have been great (signed zero ftw!), `from_type()` generates instances more often, I'm no longer skipping categories of stuff, and overall we've dropped from about +350 to +250 lines of code in this PR 🎊

So great.

> At this point my only real reason to avoid merging is that crosshair updates often cause a fair bit of churn, causing some tests to start failing and some to start xpassing - it's net-good, but would be toil in our CI.

Frankly, I'm not sure it makes sense to block hypothesis on a crosshair-related failure, even in a very distant, stable future. Would love your ideas for making the integration more "eventually" correct. Maybe a dedicated testing repo that pulls the hypothesis source and has these pytest markers externally applied? (or submodules? but those scare me)

> If you want to pull out Crosshair issues,

Always. Thanks for the commit breakdown. More updates soon!

@Zac-HD (Member, Author) commented Oct 12, 2024

> Frankly, I'm not sure it makes sense to block hypothesis on a crosshair-related failure, even in a very distant, stable future. Would love your ideas for making the integration more "eventually" correct. Maybe a dedicated testing repo that pulls the hypothesis source and has these pytest markers externally applied? (or submodules? but those scare me)

For clarity, "blocking" would mean 'when we update our pinned dependencies, if Crosshair has changed we'll update the xfail markers accordingly and report any issues upstream, or maybe add a != requirement for that version'. Similarly, if a Hypothesis PR doesn't work with Crosshair I'd prefer to learn that at the time so I can decide to either xfail the tests, or do some extra work to support it - and my guess is that the converse would be useful for you too.

In practice I expect I'll just keep updating this PR for now, and you can grab a local copy of the branch if you want to run the tests before a Crosshair release 😁 (and note the test-selection tips at the top of the PR!)

@pschanely (Contributor) commented:

> For clarity, "blocking" would mean 'when we update our pinned dependencies, if Crosshair has changed we'll update the xfail markers accordingly and report any issues upstream, or maybe add a != requirement for that version'. Similarly, if a Hypothesis PR doesn't work with Crosshair I'd prefer to learn that at the time so I can decide to either xfail the tests, or do some extra work to support it - and my guess is that the converse would be useful for you too.

Fair enough! I was concerned about how much churn in CrossHair pass/fails you'll see for unrelated hypothesis changes, but it's also true that I want to know about what you see. Current plan SGTM.

> In practice I expect I'll just keep updating this PR for now, and you can grab a local copy of the branch if you want to run the tests before a Crosshair release 😁 (and note the test-selection tips at the top of the PR!)

Yup! I've been doing this a little already; works for me.
