
Issues deploying to staging #10077

Closed
chris48s opened this issue Apr 5, 2024 · 9 comments
Labels
operations Hosting, monitoring, and reliability for the production badge servers

Comments

@chris48s
Member

chris48s commented Apr 5, 2024

📋 Description

Basically, what happened here was:

  • I reviewed chore(deps): bump simple-icons from 11.10.0 to 11.11.0 #10068 using a review app - worked fine
  • I merged it
  • I tried to deploy `sha256:e77392c2f4cfa1a9da6a7754fd26911f560bc522e3fc0d56ee7386b910b0c5b1` to staging. Staging failed to serve the app with `could not find a good candidate within 90 attempts at load balancing` (the deploy command is sketched after this list)
  • I checked the app locally - worked fine
  • I tried deploying some previous versions to staging. They were all fine
  • I tried `sha256:e77392c2f4cfa1a9da6a7754fd26911f560bc522e3fc0d56ee7386b910b0c5b1` again. Back to `could not find a good candidate within 90 attempts at load balancing`
  • I reverted the upgrade in Revert simple-icons 11.11.0 #10075
  • Deployed fine to staging
  • Deployed to production
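
For reference, deploying a pinned digest to staging looks roughly like this. A minimal sketch: the `shieldsio/shields` Docker Hub repository name is an assumption, and the app name is taken from the staging URL mentioned further down the thread.

```bash
# Hypothetical deploy of a specific image digest to the staging app.
# The repository name is assumed; the digest is the one that failed above.
flyctl deploy --app shields-io-staging \
  --image shieldsio/shields@sha256:e77392c2f4cfa1a9da6a7754fd26911f560bc522e3fc0d56ee7386b910b0c5b1
```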

I don't really have time to look into it right now, but we've got staging set to run a 256MB VM. Prod uses 512MB VMs. Review apps use whatever the default is (probably more than 256MB).

My theory is that this update might now be using all the memory we have available on staging 😮? That would explain why it worked locally but not on staging. I'll circle back to this when I have a chance.
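
A quick way to test the memory theory when I get to it (a sketch, assuming the staging app name from the URL mentioned below):

```bash
# Inspect the current VM size and machine count for staging
flyctl scale show --app shields-io-staging

# Bump the VM from 256MB to 512MB to rule out memory pressure
flyctl scale memory 512 --app shields-io-staging

# Check actual memory usage from inside the running VM
flyctl ssh console --app shields-io-staging -C "free -m"
```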

@chris48s chris48s added the operations Hosting, monitoring, and reliability for the production badge servers label Apr 5, 2024
@chris48s chris48s added the dependencies Related to dependency updates label Apr 5, 2024
@chris48s
Member Author

chris48s commented Apr 7, 2024

Hmm. Not sure this is related to this package specifically.

Just merged caea759 and went to deploy it to staging: `could not find a good candidate within 90 attempts at load balancing` again. Tried bumping staging to 512MB RAM; still failing. Rolled back to the previous version and all is well.

Currently stumped :/
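
For the record, the rollback here is just an explicit redeploy of the previous image (a sketch; the app name is assumed and the image reference is a placeholder):

```bash
# List recent releases to find the last known-good image
flyctl releases --app shields-io-staging

# Redeploy that image explicitly
flyctl deploy --app shields-io-staging --image <last-known-good-image>
```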

@chris48s chris48s changed the title Can't upgrade to SimpleIcons 11.11.0 Issues deploying to staging Apr 7, 2024
@chris48s chris48s removed the dependencies Related to dependency updates label Apr 7, 2024
@chris48s
Member Author

chris48s commented Apr 7, 2024

Error is `could not find a good candidate within 90 attempts at load balancing`

  • Only some docker hashes seem to trigger this behaviour. Known bad:
    • `sha256:e77392c2f4cfa1a9da6a7754fd26911f560bc522e3fc0d56ee7386b910b0c5b1`
    • `sha256:42974bb5b7da023a8277fb7da86db1f884f5f8177d95f3ba8d14dd97184c9d35`
    I don't know of any other bad ones
  • SSH into an "unhealthy" instance, install curl. `curl http://127.0.0.1` and `curl http://0.0.0.0` work fine (commands sketched after this list)
  • Tried giving it MOAR memory, no difference
  • Tried fiddling around with the tcp_checks settings (matched staging to prod, matched staging to review apps). No difference. Massively increased the timeout and grace_period. No difference.
  • Crazy thing is, if I deploy one of the "bad" images (to staging) and `flyctl scale count 2`, the app suddenly becomes accessible. `flyctl scale count 1` and it is back to `could not find a good candidate within 90 attempts at load balancing`
  • Sometimes I can get a machine deployed from one of the "bad" images to start serving traffic by restarting it, but not consistently.
  • Deploying one of the "bad" images, running `flyctl scale count 2` and then deleting the first machine that was deployed (keeping the second) consistently results in one working machine serving traffic (also sketched after this list).
  • Disabling REQUIRE_CLOUDFLARE has no bearing on it. Disable that and hit https://shields-io-staging.fly.dev/ - same behaviour
  • This feels like fly.io being weird and flaky, but why do some specific images trigger this behaviour while others work perfectly?
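
Putting the SSH probe and the scale-count workaround into concrete commands (a sketch; the app name is assumed, and `<bad-image>` / `<first-machine-id>` are placeholders):

```bash
# Probe an "unhealthy" machine from the inside
flyctl ssh console --app shields-io-staging
# ...then, inside the VM:
apt-get update && apt-get install -y curl
curl http://127.0.0.1   # responds fine despite the load balancer error

# Workaround that consistently leaves one machine serving traffic
flyctl deploy --app shields-io-staging --image <bad-image>
flyctl scale count 2 --app shields-io-staging
flyctl machine list --app shields-io-staging       # note the ID of the original machine
flyctl machine destroy <first-machine-id> --app shields-io-staging --force
```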

The one thing I haven't tried is just yeeting one of the "bad" images to production. My instinct is they'd probably work because we're running multiple instances, but I'm reluctant to just try that without understanding wtf is going on first.

Utterly baffling.

@chris48s chris48s added the blocker PRs and epics which block other work label Apr 7, 2024
@chris48s
Member Author

chris48s commented Apr 7, 2024

Some things I haven't tried yet:

  • Deploy from GHCR or build a completely new image from the same commits and push it to the fly registry (to eliminate DockerHub)
  • Make a completely clean app on fly using the staging settings - does this reproduce? (both of these are sketched after this list)
  • Move the staging app to a different region
  • Ritual sacrifice 🐐
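
The first two of those options, roughly. Everything here is hypothetical: the image path, the repro app name, and the config file location.

```bash
# 1. Deploy the same commit from GHCR instead of Docker Hub
flyctl deploy --app shields-io-staging --image ghcr.io/<org>/<image>@<digest>

# 2. Stand up a completely clean app using the staging settings
flyctl apps create shields-io-staging-repro
flyctl deploy --app shields-io-staging-repro --config <path-to-staging-fly.toml>
```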

@calebcartwright
Member

> Ritual sacrifice 🐐

I think I can pitch in on this one 👍

@calebcartwright
Member

Or it could possibly be a platform-level issue: https://community.fly.io/t/could-not-find-a-good-candidate-at-load-balancing-outage/19172/16 ?

@chris48s
Member Author

Thanks. When I circle back to this I will check that out.

If I end up against a brick wall again, we do have email support with fly so raising a support ticket is another option once I have time to follow up.

@chris48s
Member Author

OK. So I just tried deploying `sha256:42974bb5b7da023a8277fb7da86db1f884f5f8177d95f3ba8d14dd97184c9d35` to staging again. Worked fine.

I think I am going to assume that there is nothing wrong with any of the images and we were triggering some kind of issue on fly's side.

If I don't run into this again on a later deploy, I'll consider it fixed and close this.

@chris48s
Member Author

I've now successfully run staging and production deploys.
Going to close this.
Glad I didn't spend more time banging my head against this brick wall when I first hit it.

@chris48s chris48s removed the blocker PRs and epics which block other work label Apr 17, 2024
@calebcartwright
Member

Indeed! Glad it's sorted
