CI: disable fail-fast for all jobs #222
In case one job fails permanently, we would still like to know whether the other jobs pass.
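For readers unfamiliar with the setting, here is a minimal sketch of the kind of change this PR makes; the job name, matrix entries, and action versions are illustrative, not copied from the actual workflow:

```yaml
jobs:
  test:
    strategy:
      # GitHub Actions defaults to fail-fast: true, which cancels all
      # in-progress matrix jobs as soon as any one of them fails.
      # Setting it to false lets every matrix entry run to completion.
      fail-fast: false
      matrix:
        python: ["3.8", "3.11", "3.12"]
        platform: [ubuntu-latest, macos-latest, windows-latest]
    runs-on: ${{ matrix.platform }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python }}
      - run: python -m pytest
```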
I recall we enabled this setting in the past because Setuptools was consuming a ton of CI resources, and this was one of the recommended ways to limit that consumption. I recall other PyPA projects being starved for CI resources when Setuptools is under active development (which is a lot of the time). Additionally, failing fast by default appears to be a best practice (it's both the GitHub default and the default for skeleton-based projects), at least for tests. If this is accepted here, what is the implication for other projects? Under what conditions should other projects adopt the same tweak? Have you considered adding this setting only to the non-test jobs?
This option only makes a difference for resource usage if one of the jobs fails, and only once a job fails are the others cancelled. Since a failure tends to surface near the end of a run, the other jobs typically get cancelled right before they would have finished anyway. The lack of information about the other jobs can also force the contributor to do more debugging, resulting in more pipelines being created. Somewhat related: two months ago, GHA runners got double the CPU core count and ~50% faster CPUs.
That's not been my experience. In my experience, GHA queues up all of the jobs in parallel but in practice only allocates resources to 2-3 runs at a time, even prioritizing the runs for efficiency (a few Linux jobs first, then 1-2 macOS/Windows, then the rest). I presume someone at GitHub has worked hard to maximize signal and efficiency while saving resources. Very often, I see more than half of the runs never start, even when the bug exists only on Windows or macOS.
That all sounds like a justification for GitHub changing their default... and not just distutils. I'm fine to accept the changes to the distutils-specific jobs. I'm a little reluctant to accept it for the test jobs.
This PR was motivated by this CI run: https://github.com/pypa/distutils/actions/runs/7175042050/job/19537650427
As I think about it more, I'm not confident that this setting would have any effect on those jobs. The matrix for these jobs has only one job (compare that with the test job's matrix). I'm going to close this for now, but feel free to follow up if you can demonstrate that adding this setting to single-job jobs would have any effect.
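For context: fail-fast only governs cancellation between sibling matrix entries, so on a single-entry matrix it is a no-op. A hypothetical illustration (not the actual workflow):

```yaml
jobs:
  docs:
    strategy:
      # With only one matrix entry there are no sibling jobs to
      # cancel, so fail-fast: false changes nothing observable here.
      fail-fast: false
      matrix:
        python: ["3.12"]
    runs-on: ubuntu-latest
    steps:
      - run: echo "single-entry matrix for Python ${{ matrix.python }}"
```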
I see. In that case, the tests are failing on Python 3.12. You can see where many tests have run to completion... and a handful are cancelled (both those that would fail and those that would succeed). My instinct is that this result, while not giving full fidelity, does give a meaningful signal: it's easy to infer that there's a problem with Python 3.12 but not the other Pythons. I think the UI also gives the option to retry the cancelled jobs, which would make it easy to manually override a cancellation. I don't want to click it, as it may affect the analysis of this issue. It's possible (likely) that only project maintainers have access to that button, which reduces the viability of the workaround.
You can only restart jobs one at a time, which means in the above case you would need to re-run the pipeline 4 times, waiting for it to finish each time before you can restart the next one. I'd argue that the purpose of CI is that I don't have to depend on my instincts but can let a computer figure it out for me. Regarding the setting having no effect on single-job matrix jobs: I added the config to all jobs to avoid differences in behaviour between them under any potential future changes.