Bot missing recent LLVM releases #2531
Comments
Thanks John! Other feedstocks that build from the exact same tag & sources had varying degrees of success.
|
Not sure what changed, but the bot opened a bunch of PRs for 18.1.5 ~8 hours ago:
(only mlir still missing) |
I think the bot has become sentient and only opens PRs now after we make issues noting they are not there! 😱 J/k but indeed the behavior is puzzling. Typically the bot will try to make a version PR three times. If those three times fail, then the PR is put in the backlog. PRs in the backlog are tried at random after any newly found versions are tried. So it could be they were in the backlog and the bot finally cleared them. |
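(For concreteness, a rough Python sketch of the retry/backlog behavior described above. This is not the bot's actual code; the names are invented and `open_pr` stands in for whatever really renders and pushes the PR.)

```python
import random

MAX_ATTEMPTS = 3  # a new version gets three tries before it lands in the backlog

def run_version_pass(new_versions, backlog, attempts, open_pr):
    """One pass: try newly found versions first, then backlog items in random order.

    `attempts` maps (feedstock, version) -> number of failed tries so far;
    `open_pr(feedstock, version)` is assumed to raise on failure.
    """
    retries = random.sample(backlog, k=len(backlog))  # backlog is retried at random
    for feedstock, version in list(new_versions) + retries:
        key = (feedstock, version)
        try:
            open_pr(feedstock, version)
            if key in backlog:
                backlog.remove(key)  # finally cleared from the backlog
        except Exception:
            attempts[key] = attempts.get(key, 0) + 1
            if attempts[key] >= MAX_ATTEMPTS and key not in backlog:
                backlog.append(key)  # three failures: park it in the backlog
```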
In the case of 18.1.4, the PRs didn't get opened over a period of 2 weeks, which is pretty long. I think the issue is that all of the first three attempts fail. The tag with the sources is there for all feedstocks equally. I guess one possible explanation is that upstream created the tags, but left a longer-than-usual gap before uploading the release tarballs. Still, that doesn't really explain why almost all missing PRs now got opened after we were discussing it here - spooky! 😅 |
Lol 😂 Here are some other random guesses:
- Did the fix you made yesterday, Matt, potentially have an effect on LLVM and friends?
- Perhaps another possibility is some dependency changes over time.
- Another thing of interest might be memory pressure. The LLVM (and Arrow) recipes are a bit more complicated, so they may be using more resources than the usual resource-light CI jobs have. Recognize there have been improvements made in various places (including conda-build), though I don't know which of these fixes are out in releases. If we see more evidence of this, it might be worth profiling.
- Lastly, recall there were some issues in the bot ~2 weeks ago that got cleared out. IIRC the first of the missed LLVM version updates was around then. |
@ytausch Can you also look at this? This issue is close to the tooling you write. |
Curious how things are looking a month later. Ok if we don't know. Just wanted to check back in 🙂 |
I am currently pushing on decoupling some of the bot's code, which will make not only the version check but also the migration itself (which seems to be failing here) runnable locally with debugging enabled; that will provide a sustainable solution for problems like this one. For that reason, I have not prioritized looking into this manually so far. Let me know if you see this differently. |
LLVM 18.1.6 worked fine (bot opened all relevant PRs); LLVM 18.1.7 got tagged >24h ago, but the official release was only ~7h ago. Since we're generally relying on the upstream release tarballs rather than the tags, the bot again starts failing before the sources are actually available.
I guess this is somewhat unavoidable as long as upstream has a long enough gap between tagging and uploading the tarballs. The solutions I see are:
- giving the bot more retries (or a shorter recovery period after hitting max retries),
- not counting an attempt as a failure while the tag exists but the tarball hasn't been uploaded yet,
- switching to the GitHub-generated source archives, which are available as soon as the tag is.
I think the last approach might actually be the sanest one. |
Given every project needs to wait in 6hr increments for updates, this seems like reasonable behavior from the bot so far. If there are ways to check for version updates more frequently than 6hrs, that seems like the best path for improvement (and is not specific to LLVM). |
I don't see how that changes anything - the bot will just go into a "max failure" state faster. It's the recovery period after having hit max retries that seems to take 2-3 weeks (which presumably is the thing that would be effective to reduce). In any case, switching to github-generated sources should 100% fix this problem for LLVM (and we don't even have submodules to deal with, so no benefit to using the upstream tarballs). |
The difference would be we would not need to wait another 6hrs once the source is available. We might wait 1hr or perhaps less. It also depends on whether we can move to something event-driven (as opposed to scraping-based). However, the downside with GitHub-generated sources is that they are dynamically generated on demand, so their checksums can change between retrievals. |
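(As an aside, a checksum change like the one described would show up in a simple hash comparison over time; a minimal sketch, with the repo/tag arguments as placeholders:)

```python
import hashlib
import urllib.request

def github_archive_sha256(repo: str, tag: str) -> str:
    """Download the GitHub-generated source archive for `tag` and hash it.

    Running this at two different points in time and comparing the results
    is the kind of check that would reveal a silently regenerated tarball.
    """
    url = f"https://github.com/{repo}/archive/refs/tags/{tag}.tar.gz"
    with urllib.request.urlopen(url, timeout=60) as resp:
        return hashlib.sha256(resp.read()).hexdigest()

# e.g. github_archive_sha256("llvm/llvm-project", "llvmorg-18.1.5")
```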
The point is we cannot influence the delay between tag creation and when the tarballs are uploaded; this may well be 24h, so retrying more often in that time has no use whatsoever. The only option would be to distinguish somehow between tag and tarball availability, and not count it as a failure if the tag is there, but the tarball isn't. But that's just "more retries" by another name.
That basically never happens because everyone and their dog depends on them being stable. 😅 |
You may be able to restrict how the bot searches for versions so that it doesn't find the tag before the tarball is uploaded. I think you'd want to only have it look for URLs and not use github's RSS feed.
Also, I very much doubt it is feasible that the bot could respond to release events from projects in an event driven system. |
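(A minimal sketch of the URL-only check suggested above, assuming the usual llvm-project release-asset URL layout; this is illustrative, not the bot's actual version-checking code.)

```python
import urllib.error
import urllib.request

def release_tarball_available(version: str) -> bool:
    """HEAD-check the uploaded llvm-project release tarball for `version`.

    A tag without uploaded tarballs simply returns False here, so it would
    not be counted as a new (and then failing) version.
    """
    url = (
        "https://github.com/llvm/llvm-project/releases/download/"
        f"llvmorg-{version}/llvm-project-{version}.src.tar.xz"
    )
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            return resp.status == 200
    except urllib.error.HTTPError:
        return False
```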
On the contrary, this happens quite regularly. This affected us with the conda-build 24.5.0 release ( conda-forge/conda-build-feedstock#226 ) and conda before that ( conda-forge/conda-feedstock#228 (comment) ). There are well-documented cases elsewhere ( https://github.com/orgs/community/discussions/45830 ). In fact, when I have asked GitHub about stability with these in the past, they have noted that they generate artifacts dynamically and run some tests, but checksums can change (so no guarantees). This issue has been going on for quite some time.
The general movement (even from GitHub) is towards more validation around artifacts (not less). Here is a blog post from GitHub last month on setting up artifact attestations, which provide even more information around the published artifacts beyond them being stable (including associated provenance).
Think we should consider carefully how we get our compiler source code and put a preference towards stable artifacts (ideally with more provenance data if possible) |
We have long wanted an event-driven system ( #54 ), including for version updates ( #54 (comment) ). Agree this may very well be a more substantial undertaking. That said, don't think we should rule out that possibility or the associated discussion simply because of that. Speccing it out would be the first step in creating a shovel-ready project for when someone shows up with resources and interest in helping out. |
I'm aware of the cases you mention, and I still don't agree with "regularly". The first time GH changed the default compression level it broke the world (e.g. bazel recipes everywhere), and they reverted. We're relying on GitHub-generated tarballs in many hundreds of feedstocks, and I can count on one hand the unexplained hash changes that happened over the last couple of years of working across a similar number of feedstocks. But even if a spurious change does happen, it is by far a smaller encumbrance than the bot tripping over itself and not opening PRs at all. |
The frequency is not what is at issue; the unreliability is. For core infrastructure (like compilers), we should know reliably what it is produced from. |
We do know what it's produced from, i.e. the exact git tag. Whether the hash changes due to compression level or whatever else is completely irrelevant for provenance. Unless you are thinking about a scenario where github gets so compromised that someone can hijack the tarball generation, but that's not a realistic scenario to me (and we'd have much bigger problems then). If you can solve the problem of the bot not opening PRs, whether through an event-based solution or some workaround in the bot infra, I'll happily switch back to the "official" tarballs (which, BTW, also aren't audited or signed). But it's not an option to have the bot regularly fail to issue PRs for this interrelated stack of feedstocks that are already a handful to maintain even with bot support. |
This would work, yes. Making this configurable on a per-feedstock basis is probably not too complicated with an additional configuration option in the bot section of conda-forge.yml. |
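(A rough sketch of how such a per-feedstock switch could be read; the option name `version_updates_skip_tag_only` is invented here purely for illustration and is not an existing bot option.)

```python
import yaml  # PyYAML

def skip_tag_only_versions(conda_forge_yml_path: str) -> bool:
    """Read a *hypothetical* option from the `bot` section of conda-forge.yml.

    The option name below is made up for illustration; the real bot may
    expose this differently (or not at all).
    """
    with open(conda_forge_yml_path) as f:
        cfg = yaml.safe_load(f) or {}
    return bool(cfg.get("bot", {}).get("version_updates_skip_tag_only", False))

# Hypothetical conda-forge.yml content this would pick up:
#
# bot:
#   version_updates_skip_tag_only: true
```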
Here's an example today of a GitHub autogenerated tarball having its checksum change |
FWIW, since switching to GitHub tags across the LLVM feedstocks, PRs were opened without problems (and no hashing issues observed either) |
Hmm, it doesn't seem like this should happen, as GitHub did not announce a change like this. Currently, GitHub-generated source archives should be stable, and they intend to announce any changes to this with six months' notice.
Edit: Oops, you seem to be right. Probably it only happens very rarely that the hashes differ? |
I just found out this feature exists already: Will create PRs for the LLVM repos. |
this should prevent regro/cf-scripts#2531 from happening again
It appears the bot started missing recent LLVM releases
The last bot release PR was 18.1.3:
However, the last two needed to be handled manually:
That said, the bot does appear to have detected the releases:
So maybe there is an issue cropping up in the next step