Don't wait forever in cc-worker #256

sethboyles · 2022-08-01T22:52:13Z

Thanks for submitting an issue to capi-release. We are always trying to improve! To help us, please fill out the following template.

Issue

cc-worker can block bosh from cancelling a deployment, since the post start waits forever:

capi-release/jobs/cloud_controller_worker/templates/post-start.sh.erb

Line 16 in 5f061c8

while ! nc -z localhost <%= p("cc.readiness_port.cloud_controller_worker") %>

Steps to Reproduce

CC migrations fail, or pre-start script fails in another way

Expected result

the deployment fails, the task is cancelled, and the lock is deleted

Current result

the deployment fails, the task hangs when cancelled, and the lock is not deleted

Possible Fix

introduce some kind of timeout in the post-start script

`name of issue` screenshot

[if relevant, include a screenshot]

The text was updated successfully, but these errors were encountered:

sethboyles · 2024-02-15T00:48:55Z

I think this was originally added because of this issue: #165, though I don't quite understand how it addresses it. See cloudfoundry/cloud_controller_ng@85b7ac8

BackgroundJobEnvironment still waits for DB to be migrated, so it seems to me that original issue would still occur. A cc-worker would never progress to the readiness port. I suppose it limits the failure to a single cc-worker?

Samze · 2024-08-09T22:24:01Z

Just documenting my understanding of this.

The current behaviour when pre-start suceeds:

CC api starts and runs starts to run migrations and eventually succeed
CC worker starts
1. Creates BackgroundEnvironment
2. That then loads models which waits on migrations to complete. This has a timeout of 2 weeks.
3. Migrations run successful, we stop waiting
4. If we're not the first process,
  1. We start processing jobs
5. If we're the first worker process per vm,
  1. we listen on readiness port
  2. When we receive a connection, we unblock.
  3. We start processing jobs
CC worker post-start starts
1. Loops trying to connect on readiness port
2. When we establish a connection, post-start script succeeds

Now in the case when pre-start fails:

CC api starts and runs starts to run migration and fail
CC worker starts
1. Creates BackgroundEnvironment
2. That then loads models which waits on migrations to complete....we wait forever.
CC worker post-start starts
1. Loops trying to connect on readiness port
2. Waits forever

Thoughts/Questions:

The issue described in CC-Worker should fail/stop if a migration is not completed yet #165 states that CC worker started and started to use tables that did not exist/were modified because migrations hadn't happened. Looking at the code, it seems like all CC workers call waiting_on_migrations! and how is this possible. Though the original issue refers to check_migrations method which no longer exists.
This logic in the post-start script basically stops bosh reporting the vm as ready until the worker vm has confirmed that migrations have run. Maybe this is important as migrations could take a long time to run? However this is not blocking any sort of worker from picking up jobs or anything.
1. I think it makes sense to only mark the bosh job as successful if migrations have passed. Though if there is an issue you will notice this status on the API vm, since the API vm will only show pre-start failures. So its certainly an option to just remove this check from the post-start logic of the worker.
2. To simplify this logic, could we not just effectively call waiting_for_migrations in the post-start script instead of doing this port dance.
The property cc.readiness_port.cloud_controller_worker is described as being used for k8s deployments, but we use it here for regular bosh deployments. Is there something I'm missing for why this is important for k8s?
Question remains, it seems like the current contract for api -> worker for readiness is that migrations have completed. This may not be the case if migrations failed or anything else in the pre-start script failed. So how can the worker post-start script determine a terminal failure in the api vm pre-start script?

note Probably missing some context here, will update this comment as I poke around more.

johha · 2024-08-12T08:50:20Z

Side note: There are parameters like max_migration_duration_in_minutes and max_migration_statement_runtime_in_seconds (only for psql) which allow operators to limit the total duration of migrations.

Samze · 2024-08-27T21:19:54Z

Another interesting observation,

From this ticket https://www.pivotaltracker.com/story/show/171417894 it seems like the purpose of this was to allow non-http jobs to have their readiness checked for k8s, and this readiness property was added to worker/clock/deployment_updater. However it is disabled by default in everything but the worker, see https://github.com/cloudfoundry/capi-release/blob/develop/jobs/cc_deployment_updater/spec#L87C21-L87C39 and https://github.com/cloudfoundry/cloud_controller_ng/blob/main/lib/cloud_controller/background_job_environment.rb#L23.

This commit added it 56a7240 .

One option here would be to set the port back to default of -1 and update the post-start logic to only perform that check if the port is > 0. 🤔 The consequence of this means that the post-start will succeed and bosh will report success for the cc_worker job when the process is stuck waiting for migrations. However, this is exactly what we do for deployment_updater and clock. So not sure if there is something specific about the worker that would be an issue.

Another options is to be more aggressive and remove this entire check, however I'm not sure if there is still a k8s usecase for this.

TODO: work in progress Fixes #256

Samze · 2024-08-27T21:55:55Z

Another note, this is more important to fix now because this is much more likely due to the increased chances of failing cc api pre-start script due to the addition of the stack checker https://github.com/cloudfoundry/capi-release/blob/develop/jobs/cloud_controller_ng/templates/pre-start.sh.erb#L145 . If the stack-checker fails on a deprecated stack and there is a new migration in the upgrade, this bug will 100% occur.

tcdowney · 2024-08-27T23:18:08Z

Another options is to be more aggressive and remove this entire check, however I'm not sure if there is still a k8s usecase for this.

I don't think there is a k8s use-case for this anymore. Cf-for-k8s which included capi-k8s-release was deprecated in 2021 and cloudfoundry/cloud_controller_ng#3140 pretty much ensured that Cloud Controller's native k8s support was removed. I think on Kubernetes we couldn't control the ordering of how these jobs started as well as we could on BOSH so some of these weird things were put in place to make it more tolerant of that sort of rollout.

I don't remember much more than that unfortunately cause I was mostly involved with the networking team at the time.

cf-gitbot added the unscheduled label Aug 1, 2022

Samze added a commit that referenced this issue Aug 27, 2024

Do not use readiness port on worker by default

8c7fa10

TODO: work in progress Fixes #256

Samze linked a pull request Aug 27, 2024 that will close this issue

WIP: Do not use readiness port on worker by default #464

Draft

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Don't wait forever in cc-worker #256

Don't wait forever in cc-worker #256

sethboyles commented Aug 1, 2022 •

edited

Loading

sethboyles commented Feb 15, 2024

Samze commented Aug 9, 2024 •

edited

Loading

johha commented Aug 12, 2024

Samze commented Aug 27, 2024 •

edited

Loading

Samze commented Aug 27, 2024

tcdowney commented Aug 27, 2024

Don't wait forever in cc-worker #256

Don't wait forever in cc-worker #256

Comments

sethboyles commented Aug 1, 2022 • edited Loading

Issue

Steps to Reproduce

Expected result

Current result

Possible Fix

name of issue screenshot

sethboyles commented Feb 15, 2024

Samze commented Aug 9, 2024 • edited Loading

johha commented Aug 12, 2024

Samze commented Aug 27, 2024 • edited Loading

Samze commented Aug 27, 2024

tcdowney commented Aug 27, 2024

sethboyles commented Aug 1, 2022 •

edited

Loading

`name of issue` screenshot

Samze commented Aug 9, 2024 •

edited

Loading

Samze commented Aug 27, 2024 •

edited

Loading