
Faster default runs #246

Merged
merged 10 commits into data-apis:master on Apr 3, 2024

Conversation

honno
Member

@honno honno commented Mar 22, 2024

Some changes that significantly speed up the run of the test suite:

  • Defaults max examples to 20
  • Only runs one example for the main special case tests
  • More tests appropriately marked with unvectorized

* Set max examples default to 20
  * For faster default runs, over more test quality
* Remove unreachable default-settings loading path
@honno honno requested a review from asmeurer March 22, 2024 11:32
@honno
Member Author

honno commented Mar 22, 2024

(Seems my special case changes broke some things; couldn't figure it out just now, so I'll have a look next week 😅)

@@ -15,7 +15,6 @@

from reporting import pytest_metadata, pytest_json_modifyreport, add_extra_json_metadata # noqa

settings.register_profile("xp_default", deadline=800)
Member

What is the deadline without this here?

Member Author

200. Good catch, I'll make it default to 800 again!

Member

We might need to think about deadlines a bit more (but in a separate issue). Deadline errors are very common, even with 800. It's not just with JAX, I see them with every library. I disabled the deadline in the array-api-compat and array-api-strict CI. It looks like numpy does too. I think there's just a fundamental incongruity with the hypothesis deadline and the arrays strategy and the test suite.

Member Author

I think there's just a fundamental incongruity with the hypothesis deadline and the arrays strategy and the test suite.

Yep, that's a sound observation. For now I've made it default to 800 like before, and we can mull it over.

@asmeurer
Member

Need to update the README for the new max-examples default.

@honno honno requested a review from asmeurer April 2, 2024 12:21
@honno
Member Author

honno commented Apr 3, 2024

Nothing should be controversial now, so I'll merge as it'd be nice to get this in.

@honno honno merged commit 3cf8ef6 into data-apis:master Apr 3, 2024
4 checks passed
@asmeurer
Member

asmeurer commented Apr 8, 2024

Hmm, the tests run suspiciously fast now (under a minute). Are we really still testing everything as rigorously as we used to? I'm all for faster runs, but we also want to make sure that actual failures are caught, ideally without having to rerun the tests multiple times.

@honno
Member Author

honno commented Apr 9, 2024

Hmm, the tests run suspiciously fast now (under a minute). Are we really still testing everything as rigorously as we used to? I'm all for faster runs, but we also want to make sure that actual failures are caught, ideally without having to rerun the tests multiple times.

I think we're striking a nice balance. The CI job is an extreme example as array_api_strict/NumPy is fairly Hypothesis-friendly and so the tests run quickly, plus we have a skips file for any failures. Something like jax.experimental.array_api still takes ~7mins to run locally, where array_api_strict takes 20 seconds... that is a rather big jump tbf, but admittedly JAX has more errors too which really contribute to the timings.

@asmeurer
Member

asmeurer commented Apr 9, 2024

Is this primarily coming from vectorization? If we're really running this fast, I don't see the need to reduce the default number of examples.

@honno
Member Author

honno commented Apr 10, 2024

Is this primarily coming from vectorization?

Vectorization definitely helps, it's also that we're now reducing the max examples for the tests which aren't vectorized, plus only using a single example for the special case tests.

If we're really running this fast, I don't see the need to reduce the default number of examples.

I think the slowest adopter1 should be our lowest common denominator for how we set default runs of the test suite, i.e. JAX is taking 7mins (which even then is longer than ideal for a default run IMO). Then we can leave it to users—and ourselves—to bump the max examples for CI jobs.


On a general note on examples: in my experience it's often only a few simple examples that cause failures, so the more exotic examples that come with a larger max aren't that rewarding for the significant time they add.

Footnotes

  1. Not to say JAX is slow, more so how it interplays with how the test suite works. The biggest factor is that error reporting is slow, because Hypothesis stresses the code it's testing to produce cleaner, reproducible examples.

@rgommers
Member

Agreed with @honno's comments above - default number of examples should be optimized for fast test suite execution time, and having a high number of examples really isn't very interesting anyway. I'd honestly set it to 1, not 20. 1 is regular CI "is everything implemented and does it all work". 20/200/inf is mostly repeating the same thing over and over, and for when it does actually add value - that is fuzz testing. 7min is still slow for regular CI usage.

@asmeurer
Member

asmeurer commented Apr 10, 2024

Very strong disagree. A very low number of examples does not actually "run the tests" in the way that people think it does, and is very misleading. If you look at the tests in the test suite, each test is testing 10 or 20 different things. It's very hard for a single, randomly selected example to be representative enough to check everything. Consider that different "examples" mean different combinations of things like shapes and dtypes. Testing one example, as Ralf suggests, would mean just checking an array with one of the supported dtypes, some shape, and some values.

I tested running test_add (a function with a relatively simple possible set of inputs) with one example, and the example arrays hypothesis generated were (array(0., dtype=uint8), array(0., dtype=uint8)). Furthermore, I ran the exact same command (ARRAY_API_TESTS_MODULE=numpy.array_api pytest --max-examples=1 -s -k test_add array_api_tests/test_operators_and_elementwise_functions.py --disable-warnings) multiple times, and it generated that exact same example every single time.

This is how hypothesis works. It generates interesting examples, where "interesting" means either corner cases (like this one) or general cases (like a "random" array). But it can only do this if you give it a chance to. There's a reason the hypothesis default is 100 examples per run.

If we set max-examples to 1, test_add would not test broadcasting, behavior on floating-point dtypes, complex number support, support for other integer dtypes, or even that add(x, y) is actually performing addition. It would be like having a single test

def test_add():
    x = array(0., dtype=uint8)
    y = array(0., dtype=uint8)
    assert_all(add(x, y), array(0, dtype=uint8))

This would obviously not be a very good test for the add function in the standard.

I think the slowest adopter1 should be our lowest common denominator for how we set default runs of the test suite

Strong disagree here too. We should optimize for the common case, not the worst case. JAX can manually set their example count lower in their CI if they want to. It's easy to do that with the flags we've provided. But the default behavior should be something reasonable for most libraries, who aren't going to set that flag (and probably don't even realize that it is there).

I agree that we need to make the CI times reasonable, but the other improvements like vectorization have done this enough that I don't see the need to lower the default to 20 any more. With the latest version of the test suite, NumPy 2.0 runs in 16 seconds with max-examples set to 100. If that's not fast enough for you, I really don't know what to say. Really, if anything, we should take this as an opportunity to increase the number of examples that are run by default and improve the default test coverage. Again, tests that don't get covered in the first (or tenth or hundredth) run of the tests are very common. Just look at how often I have to update the XFAILs file for array-api-compat because some rare error only pops up sometimes on CI: https://github.com/data-apis/array-api-compat/commits/main/numpy-1-21-xfails.txt

Note that I am only talking about the normal tests here. I am fine with keeping the special case tests at a lower number of examples. Although even there we should be careful that some special cases are not fully covered by one example, and we should be cautious of the apparent hypothesis behavior of giving zero variability when max-examples is set to 1.

@leofang

leofang commented Apr 11, 2024

I do not believe a rigorous test suite would take less than 1 min to run. This is simply impossible. Please do not over-optimize. I second what Aaron said.

honno added a commit to honno/array-api-tests that referenced this pull request Apr 12, 2024
@rgommers
Member

If you look at the tests in the test suite, each test is testing 10 or 20 different things.

Well okay, that is then not going to be possible with 1 example, I didn't know that (and it doesn't have to be that way, it's a choice made per test by the test author it looks like). This is a little problematic I think for a test suite design if you care about runtime, because it also means that 20 examples isn't enough because you have no guarantees that the first 20 examples are testing the 20 different things. I guess it becomes a stochastic thing - 25 is probably still not enough to reliably test everything; 30 or 40 has a high probability of being comprehensive, etc.

I do not believe a rigorous test suite would only take less than 1 min to run. This is simply impossible.

This is definitely not impossible. The test suite for numpy runs in O(100 sec), and tests (a) 10x more functions, (b) more computationally expensive functions on average, and (c) actually needs to compare numerical accuracy (unlike this test suite, where there's no expected answer that must be met). So if the test suite was written with similar techniques/efficiency as for numpy, then testing numpy's implementation shouldn't take more than 5-10 seconds.

Again, tests that don't get covered in the first (or tenth or hundredth) run of the tests are very common. Just look at how often I have to update the XFAILs file for array-api-compat because some rare error only pops up sometimes on CI

Yes indeed. This is a problem. This is not regular test suite usage. Instead, this is fuzz testing. Random failures like that should not happen for regular regression testing.

and we should be cautious of the apparent hypothesis behavior of giving zero variability when max-examples is set to 1.

Zero variability is what you want in regular usage. "flaky tests" are an issue for pretty much every large library. I know hypothesis doesn't make it easy to distinguish between these two fundamentally different flavors of testing (regression vs. fuzz), but both matter - and libraries typically want the former in their repos.

asmeurer added a commit to asmeurer/array-api-tests that referenced this pull request Apr 19, 2024
The tests run fast enough now that there's no need to lower this to 20.

See the discussion at data-apis#246.
@asmeurer
Member

Ralf, it sounds like you are more discussing the merits of property-based testing in general. I don't know if it's really worth my time to try to convince you of its value. The fact is, hypothesis tests are the type of tests we chose to use for the array-api-tests test suite. It's my personal experience that you get much more value out of hypothesis testing if you embrace its paradigm instead of trying to fight it or shoehorn it into a traditional testing paradigm.

I've made a PR resetting the default max-examples back to 100. #254. You can see at data-apis/array-api-strict#34 I set the max-examples to double that (200), and the whole CI runs in a minute (that includes installing dependencies). 100 is the default value that hypothesis developers themselves chose, so I think we shouldn't change that, especially to something lower, unless there's a good reason.

@asmeurer
Member

Regarding special-cases, is hypothesis used at all now? I ran

ARRAY_API_TESTS_MODULE=array_api_strict pytest "array_api_tests/test_special_cases.py::test_unary[asin(x_i > 1) -> NaN]" -s --hypothesis-verbosity=verbose --disable-warnings

(to pick a random example), and hypothesis didn't output anything. It might be worth continuing to use hypothesis at least for special cases that refer to multiple values, like this one. And also, as I said, ensuring that it actually varies the example across runs.

@honno
Member Author

honno commented Apr 22, 2024

Regarding special-cases, is hypothesis used at all now?

Yes, but not in a typical @given-decorated test case way; it's just used to declaratively define the special cases, and then only one example is pulled. Something I'll write an issue on to discuss.
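For readers unfamiliar with that pattern, here is a hedged sketch of pulling a single example from a strategy without a @given-decorated test, using hypothesis.find; the v > 1 predicate just mirrors the "asin(x_i > 1) -> NaN" special case quoted earlier, and the exact strategy is an assumption, not the suite's actual code:

```python
# Sketch: pull one concrete input from a Hypothesis strategy without @given.
# find() searches for a value satisfying the predicate, then shrinks it to a
# minimal example, so repeated runs yield the same small, reproducible value.
from hypothesis import find, strategies as st

x = find(st.floats(allow_nan=False, allow_infinity=False), lambda v: v > 1)
print(x)
```

This also illustrates the zero-variability behaviour discussed above: the shrunk example is deterministic by design, which is good for reproducibility but means a single pulled example never explores the input space.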
