
Faster default runs #246

Merged
merged 10 commits into data-apis:master on Apr 3, 2024

Conversation

honno
Member

@honno honno commented Mar 22, 2024

Some changes that significantly speed up the run of the test suite:

  • Defaults max examples to 20
  • Only runs one example for the main special case tests
  • More tests appropriately marked with unvectorized

* Set max examples default to 20
  * For faster default runs, over more test quality
* Remove unreachable default-settings loading path
@honno honno requested a review from asmeurer March 22, 2024 11:32
@honno
Member Author

honno commented Mar 22, 2024

(Seems my special case changes broke some things; couldn't figure it out just now, so I'll have a look next week 😅)

@@ -15,7 +15,6 @@

from reporting import pytest_metadata, pytest_json_modifyreport, add_extra_json_metadata # noqa

settings.register_profile("xp_default", deadline=800)
Member

What is the deadline without this here?

Member Author

200. Good catch, I'll make it default to 800 again!

Member

We might need to think about deadlines a bit more (but in a separate issue). Deadline errors are very common, even with 800. It's not just with JAX, I see them with every library. I disabled the deadline in the array-api-compat and array-api-strict CI. It looks like numpy does too. I think there's just a fundamental incongruity with the hypothesis deadline and the arrays strategy and the test suite.

Member Author

I think there's just a fundamental incongruity with the hypothesis deadline and the arrays strategy and the test suite.

Yep, that's a sound observation. For now I've made it default to 800 like before, and we can mull it over.

@asmeurer
Member

Need to update the README for the new max-examples default.

@honno honno requested a review from asmeurer April 2, 2024 12:21
@honno
Member Author

honno commented Apr 3, 2024

Nothing should be controversial now, so I'll merge as it'd be nice to get this in.

@honno honno merged commit 3cf8ef6 into data-apis:master Apr 3, 2024
4 checks passed
@asmeurer
Member

asmeurer commented Apr 8, 2024

Hmm, the tests run suspiciously fast now (under a minute). Are we really still testing everything as rigorously as we used to? I'm all for faster runs, but we also want to make sure that actual failures are caught, ideally without having to rerun the tests multiple times.

@honno
Member Author

honno commented Apr 9, 2024

Hmm, the tests run suspiciously fast now (under a minute). Are we really still testing everything as rigorously as we used to? I'm all for faster runs, but we also want to make sure that actual failures are caught, ideally without having to rerun the tests multiple times.

I think we're striking a nice balance. The CI job is an extreme example as array_api_strict/NumPy is fairly Hypothesis-friendly and so the tests run quickly, plus we have a skips file for any failures. Something like jax.experimental.array_api still takes ~7mins to run locally, where array_api_strict takes 20 seconds... that is a rather big jump tbf, but admittedly JAX has more errors too which really contribute to the timings.

@asmeurer
Member

asmeurer commented Apr 9, 2024

Is this primarily coming from vectorization? If we're really running this fast, I don't see the need to reduce the default number of examples.

@honno
Member Author

honno commented Apr 10, 2024

Is this primarily coming from vectorization?

Vectorization definitely helps, it's also that we're now reducing the max examples for the tests which aren't vectorized, plus only using a single example for the special case tests.

If we're really running this fast, I don't see the need to reduce the default number of examples.

I think the slowest adopter1 should be our lowest common denominator for how we set default runs of the test suite, i.e. JAX is taking 7mins (which even then is longer than ideal for a default run IMO). Then we can leave it to users—and ourselves—to bump the max examples for CI jobs.


On a general note on examples: in my experience it's often only a few simple examples that cause failures, so the more exotic examples that come with a larger max aren't that rewarding for the significant time they add.

Footnotes

  1. Not to say JAX is slow, more so how it interplays with how the test suite works. The biggest factor is that error reporting is slow, because Hypothesis stresses the code it's testing to produce cleaner, reproducible examples.

@rgommers
Member

Agreed with @honno's comments above - default number of examples should be optimized for fast test suite execution time, and having a high number of examples really isn't very interesting anyway. I'd honestly set it to 1, not 20. 1 is regular CI "is everything implemented and does it all work". 20/200/inf is mostly repeating the same thing over and over, and for when it does actually add value - that is fuzz testing. 7min is still slow for regular CI usage.

@asmeurer
Member

asmeurer commented Apr 10, 2024

Very strong disagree. A very low number of examples does not actually "run the tests" in the way that people think it does, and is very misleading. If you look at the tests in the test suite, each test is testing 10 or 20 different things. It's very hard for a single, randomly selected example to be representative enough to check everything. Consider that different "examples" mean different combinations of things like shapes and dtypes. Testing one example, as Ralf suggests, would mean just checking an array with one of the supported dtypes, some shape, and some values.

I tested running test_add (a function with a relatively simple possible set of inputs) with one example, and the example arrays hypothesis generated were (array(0., dtype=uint8), array(0., dtype=uint8)). Furthermore, I ran the exact same command (ARRAY_API_TESTS_MODULE=numpy.array_api pytest --max-examples=1 -s -k test_add array_api_tests/test_operators_and_elementwise_functions.py --disable-warnings) multiple times, and it generated that exact same example every single time.

This is how hypothesis works. It generates interesting examples, where "interesting" means either corner cases (like this one) or general cases (like a "random" array). But it can only do this if you give it a chance to. There's a reason the hypothesis default is 100 examples per run.

If we set max-examples to 1, test_add would not test broadcasting, behavior on floating-point dtypes, complex number support, support for other integer dtypes, or even that add(x, y) is actually performing addition. It would be like having a single test

def test_add():
    x = array(0., dtype=uint8)
    y = array(0., dtype=uint8)
    assert_all(add(x, y), array(0, dtype=uint8))

This would obviously not be a very good test for the add function in the standard.

I think the slowest adopter1 should be our lowest common denominator for how we set default runs of the test suite

Strong disagree here too. We should optimize for the common case, not the worst case. JAX can manually set their example count lower in their CI if they want to. It's easy to do that with the flags we've provided. But the default behavior should be something reasonable for most libraries, who aren't going to set that flag (and probably don't even realize that it is there).

I agree that we need to make the CI times reasonable, but the other improvements like vectorization have done this enough that I don't see the need to lower the default to 20 any more. With the latest version of the test suite, NumPy 2.0 runs in 16 seconds with max-examples set to 100. If that's not fast enough for you, I really don't know what to say. Really, if anything, we should take this as an opportunity to increase the number of examples that are run by default and improve the default test coverage. Again, tests that don't get covered in the first (or tenth or hundredth) run of the tests are very common. Just look at how often I have to update the XFAILs file for array-api-compat because some rare error only pops up sometimes on CI: https://github.com/data-apis/array-api-compat/commits/main/numpy-1-21-xfails.txt

Note that I am only talking about the normal tests here. I am fine with keeping the special case tests at a lower number of examples. Although even there we should be careful that some special cases are not fully covered by one example, and we should be cautious of the apparent hypothesis behavior of giving zero variability when max-examples is set to 1.

@leofang

leofang commented Apr 11, 2024

I do not believe a rigorous test suite would take less than 1 min to run. This is simply impossible. Please do not over-optimize. I second what Aaron said.

honno added a commit to honno/array-api-tests that referenced this pull request Apr 12, 2024
@rgommers
Member

If you look at the tests in the test suite, each test is testing 10 or 20 different things.

Well okay, that is then not going to be possible with 1 example, I didn't know that (and it doesn't have to be that way, it's a choice made per test by the test author it looks like). This is a little problematic I think for a test suite design if you care about runtime, because it also means that 20 examples isn't enough because you have no guarantees that the first 20 examples are testing the 20 different things. I guess it becomes a stochastic thing - 25 is probably still not enough to reliably test everything; 30 or 40 has a high probability of being comprehensive, etc.

I do not believe a rigorous test suite would only take less than 1 min to run. This is simply impossible.

This is definitely not impossible. The test suite for numpy runs in O(100 sec), and tests (a) 10x more functions, (b) more computationally expensive functions on average, and (c) actually needs to compare numerical accuracy (unlike this test suite, where there's no expected answer that must be met). So if the test suite was written with similar techniques/efficiency as for numpy, then testing numpy's implementation shouldn't take more than 5-10 seconds.

Again, tests that don't get covered in the first (or tenth or hundredth) run of the tests are very common. Just look at how often I have to update the XFAILs file for array-api-compat because some rare error only pops up sometimes on CI

Yes indeed. This is a problem. This is not regular test suite usage. Instead, this is fuzz testing. Random failures like that should not happen for regular regression testing.

and we should be cautious of the apparent hypothesis behavior of giving zero variability when max-examples is set to 1.

Zero variability is what you want in regular usage. "flaky tests" are an issue for pretty much every large library. I know hypothesis doesn't make it easy to distinguish between these two fundamentally different flavors of testing (regression vs. fuzz), but both matter - and libraries typically want the former in their repos.

asmeurer added a commit to asmeurer/array-api-tests that referenced this pull request Apr 19, 2024
The tests run fast enough now that there's no need to lower this to 20.

See the discussion at data-apis#246.
@asmeurer
Member

Ralf, it sounds like you are more discussing the merits of property-based testing in general. I don't know if it's really worth my time to try to convince you of its value. The fact is, hypothesis tests are the type of tests we chose to use for the array-api-tests test suite. It's my personal experience that you get much more value out of hypothesis testing if you embrace its paradigm instead of trying to fight it or shoehorn it into a traditional testing paradigm.

I've made a PR resetting the default max-examples back to 100. #254. You can see at data-apis/array-api-strict#34 I set the max-examples to double that (200), and the whole CI runs in a minute (that includes installing dependencies). 100 is the default value that hypothesis developers themselves chose, so I think we shouldn't change that, especially to something lower, unless there's a good reason.

@asmeurer
Member

Regarding special-cases, is hypothesis used at all now? I ran

ARRAY_API_TESTS_MODULE=array_api_strict pytest "array_api_tests/test_special_cases.py::test_unary[asin(x_i > 1) -> NaN]" -s --hypothesis-verbosity=verbose --disable-warnings

(to pick a random example), and hypothesis didn't output anything. It might be worth continuing to use hypothesis at least for special cases that refer to multiple values, like this one. And also, as I said, ensuring that it actually varies the example across runs.

@honno
Member Author

honno commented Apr 22, 2024

Regarding special-cases, is hypothesis used at all now?

Yes, but not in a typical @given-decorated test case way; it's just used to declaratively define the special cases, and then only one example is pulled. Something I'll write an issue on to discuss.
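For readers unfamiliar with that pattern, here is a hedged sketch of pulling a single example from a strategy without a @given-decorated test, using hypothesis.find; the v > 1 predicate just mirrors the "asin(x_i > 1) -> NaN" special case quoted earlier, and the exact strategy is an assumption, not the suite's actual code:

```python
# Sketch: pull one concrete input from a Hypothesis strategy without @given.
# find() searches for a value satisfying the predicate, then shrinks it to a
# minimal example, so repeated runs yield the same small, reproducible value.
from hypothesis import find, strategies as st

x = find(st.floats(allow_nan=False, allow_infinity=False), lambda v: v > 1)
print(x)
```

This also illustrates the zero-variability behaviour discussed above: the shrunk example is deterministic by design, which is good for reproducibility but means a single pulled example never explores the input space.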
