CI: disable fail-fast for all jobs #222
In case one job fails permanently, we would still like to know whether the other jobs pass.
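For readers unfamiliar with the setting, here is a minimal sketch of the kind of change this PR makes; the job name, matrix entries, and action versions are illustrative, not copied from the actual workflow:

```yaml
jobs:
  test:
    strategy:
      # GitHub Actions defaults to fail-fast: true, which cancels all
      # in-progress matrix jobs as soon as any one of them fails.
      # Setting it to false lets every matrix entry run to completion.
      fail-fast: false
      matrix:
        python: ["3.8", "3.11", "3.12"]
        platform: [ubuntu-latest, macos-latest, windows-latest]
    runs-on: ${{ matrix.platform }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python }}
      - run: python -m pytest
```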
I recall we enabled this setting in the past because Setuptools was consuming a ton of CI resources, and this was one of the recommended ways to limit that consumption. I recall other PyPA projects being starved for CI resources when Setuptools is under active development (which is a lot of the time). Additionally, failing fast by default appears to be a best practice (it's both the GitHub default and the default for skeleton-based projects), at least for tests. If this is accepted here, what is the implication for other projects? Under what conditions should other projects adopt the same tweak? Have you considered adding this setting only to the non-test jobs?
This option only makes a difference for resource usage if one of the jobs fails, and only once a job fails are the others cancelled. Since a failure tends to surface near the end of a run, the other jobs typically get cancelled right before they would have finished anyway. The lack of information about the other jobs can also force the contributor to do more debugging, resulting in more pipelines being created. Somewhat related: two months ago, GHA runners got double the CPU core count and ~50% faster CPUs.
That's not been my experience. In my experience, GHA queues up all of the jobs in parallel but in practice only allocates resources to 2-3 runs at a time, even prioritizing the runs for efficiency (a few Linux jobs first, then 1-2 macOS/Windows, then the rest). I presume someone at GitHub has worked hard to maximize signal and efficiency while saving resources. Very often, I see more than half of the runs never start, even when the bug exists only on Windows or macOS.
That all sounds like a justification for GitHub changing their default... and not just distutils. I'm fine to accept the changes to the distutils-specific jobs. I'm a little reluctant to accept it for the test jobs.
This PR was motivated by this CI run: https://github.com/pypa/distutils/actions/runs/7175042050/job/19537650427
As I think about it more, I'm not confident that this setting would have any effect on those jobs. The matrix for these jobs has only one job (compare that with the test job's matrix). I'm going to close this for now, but feel free to follow up if you can demonstrate that adding this setting to single-job jobs would have any effect.
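For context: fail-fast only governs cancellation between sibling matrix entries, so on a single-entry matrix it is a no-op. A hypothetical illustration (not the actual workflow):

```yaml
jobs:
  docs:
    strategy:
      # With only one matrix entry there are no sibling jobs to
      # cancel, so fail-fast: false changes nothing observable here.
      fail-fast: false
      matrix:
        python: ["3.12"]
    runs-on: ubuntu-latest
    steps:
      - run: echo "single-entry matrix for Python ${{ matrix.python }}"
```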
I see. In that case, the tests are failing on Python 3.12. You can see where many tests have run to completion... and a handful are cancelled (both those that would fail and those that would succeed). My instinct is that this result, while not giving full fidelity, does give a meaningful signal: it's easy to infer that there's a problem with Python 3.12 but not the other Pythons. I think the UI also gives the option to retry the cancelled jobs, which would make it easy to manually override a cancellation. I don't want to click it, as it may affect the analysis of this issue. It's possible (likely) that only project maintainers have access to that button, which reduces the viability of the workaround.
You can only restart jobs one at a time, which means in the above case you would need to re-run the pipeline 4 times, waiting for it to finish each time before you can restart the next one. I'd argue that the purpose of CI is that I don't have to depend on my instincts but can let a computer figure it out for me. Regarding the setting having no effect on single-job matrix jobs: I added the config to all jobs to avoid differences in behaviour between them under any potential future changes.