Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI: disable fail-fast for all jobs #222

Closed
wants to merge 1 commit into from

Conversation

lazka
Copy link
Contributor

@lazka lazka commented Dec 17, 2023

In case one job fails permanently we still would like to know if the other jobs pass.

In case one job fails permanently we still would still like to
know if the other jobs pass.
@jaraco
Copy link
Member

jaraco commented Jan 6, 2024

I recall we enabled this setting in the past because setuptools was consuming a ton of CI resources and this was one of the recommended ways to limit the consumption. I recall other pypa projects being starved for CI resources when Setuptools is under active development (which is a lot of the time).

Additionally, the "fail fast" by default appears to be a best practice (since it's a GitHub default and also the default for skeleton-based projects), at least for tests.

If this is accepted here, what is the implication for other projects? Under what conditions should other projects accept these same tweaks? Have you considered adding this setting only to the non-tests jobs?

@lazka
Copy link
Contributor Author

lazka commented Jan 6, 2024

This option only makes a difference regarding resources if one of the jobs fails, and only once a job fails will the others be cancelled, which likely is at the end of the job, so all other jobs get cancelled right before they are finished. And the lack of information about the other jobs might require the contributor to debug more, resulting in more pipelines being created.

And somewhat related, 2 months ago GHA runners got double the CPU core count and ~50% faster CPUs.

@jaraco
Copy link
Member

jaraco commented Jan 6, 2024

all other jobs get cancelled right before they are finished.

That's not been my experience. In my experience, GHA queues up all of the jobs in parallel, but in practice only allocates resources to 2-3 runs at a time, even prioritizing the runs based on efficiency (a few Linux jobs first, then 1-2 mac/Windows, then the rest). I presumed someone at GitHub has worked hard to maximize signal and efficiency while saving resources. Very often, I see more than half of the runs never start, even if the bug exists only in Windows or Mac.

This option only makes a difference regarding resources if one of the jobs fails, and only once a job fails will the others be cancelled, which likely is at the end of the job, so all other jobs get cancelled right before they are finished. And the lack of information about the other jobs might require the contributor to debug more, resulting in more pipelines being created.

And somewhat related, 2 months ago GHA runners got double the CPU core count and ~50% faster CPUs.

That all sounds like a justification for GitHub changing their default... and not just distutils.

I'm fine to accept the changes to the distutils-specific jobs. I'm a little reluctant to accept it for the tests here, as it creates a divergence without a rationale. I suggest filing a report with GitHub to request they make fail-fast: false the default, or at least update the docs to advise under what conditions a project should override this value.

@lazka
Copy link
Contributor Author

lazka commented Jan 6, 2024

This PR was motivated by this PR: https://github.com/pypa/distutils/actions/runs/7175042050/job/19537650427
All jobs either finish in time or get cancelled mid run.

@jaraco
Copy link
Member

jaraco commented Jan 6, 2024

As I think about it more, I'm not confident that fail-fast: false will have any effect on test_cygwin or ci_setuptools jobs, because the docs say:

jobs.<job_id>.strategy.fail-fast applies to the entire matrix.

But the matrix for these jobs has only one job (compare that with test or collateral that have a matrix of multiple jobs). That makes me think the fail fast setting would have no effect. If that's the case, this PR will have no benefit after removing it from the test jobs, and has a net negative detriment by adding tweaks for settings that have no effect.

I'm going to close this for now, but feel free to follow-up if you can demonstrate that adding this setting to single-job jobs would have any effect.

@jaraco jaraco closed this Jan 6, 2024
@jaraco
Copy link
Member

jaraco commented Jan 6, 2024

This PR was motivated by this PR: https://github.com/pypa/distutils/actions/runs/7175042050/job/19537650427 All jobs either finish in time or get cancelled mid run.

I see. In that case, the tests are failing on Python 3.12. You can see where many tests have run to completion... and a handful of tests are cancelled (both those that would fail and succeed). My instinct is this result, while not giving full fidelity, does give a meaningful signal. It's easy to infer from the result that there's a problem with Python 3.12 but not other Pythons.

I do think the UI does also give the option to retry the cancelled jobs, which would make it easy to manually override a cancelled job.

image

I don't want to click it, as it may affect the analysis of this issue. It's possible (likely) that only project maintainers have access to click that button, which reduces the viability of that workaround.

@lazka
Copy link
Contributor Author

lazka commented Jan 7, 2024

You can only restart jobs one at a time, which means in the above case you would need to recreate the pipeline 4 times and wait for it to finish each time until you can restart the next one.

I'd argue that the purpose of CI is that I don't have to depend on my instincts but let a computer figure it out for me.

Regarding them having no effect in single job matrix jobs, I added the config to avoid differences between the jobs behaviour for any potential future changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants