chore: using domain-qualified finalizers #6023

Merged
merged 3 commits into from
Jan 4, 2025

Contributor

@trutx trutx commented Nov 18, 2024

Tracking issue

Closes #6019

Why are the changes needed?

This PR switches to domain-qualified finalizers. Kubernetes introduced a warning for finalizers that are not domain-qualified in kubernetes/kubernetes#119508. Using the old finalizers is still harmless today, but updating them keeps the Flyte admin logs clean and prepares for possible future enforcement of domain-qualified names.

What changes were proposed in this pull request?

  • Switch to a global domain-qualified finalizer: from flyte-finalizer to flyte.org/finalizer
  • Switch the k8s plugin finalizer from flyte/flytek8s to flyte.org/finalizer-k8s
  • Switch the array plugin finalizer from flyte/array to flyte.org/finalizer-array
  • Remove the finalizers.go and finalizers_test.go files and use the finalizer helpers in the upstream controllerutil package instead
  • Keep removing the old finalizer for backwards compatibility, although it is no longer added. This compatibility code should eventually be removed
  • Stop removing all finalizers and remove only Flyte's own. This allows users to add their own finalizers
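
As a rough illustration of the list above, the following stdlib-only Go sketch mimics the add/remove semantics on a plain string slice. It does not use controllerutil itself. The helper names (addFinalizer, removeFinalizer, clearFlyteFinalizers), the constant names, and the example.com/user-finalizer value are assumptions made for this example, not code from the PR.

```go
package main

import "fmt"

const (
	// New domain-qualified finalizer named in the list above.
	flyteFinalizer = "flyte.org/finalizer"
	// Legacy name: no longer added, only removed for backwards compatibility.
	legacyFinalizer = "flyte-finalizer"
)

// addFinalizer appends name if it is absent and reports whether the slice changed.
func addFinalizer(finalizers []string, name string) ([]string, bool) {
	for _, f := range finalizers {
		if f == name {
			return finalizers, false
		}
	}
	return append(finalizers, name), true
}

// removeFinalizer drops every occurrence of name and reports whether the slice changed.
func removeFinalizer(finalizers []string, name string) ([]string, bool) {
	kept := make([]string, 0, len(finalizers))
	for _, f := range finalizers {
		if f != name {
			kept = append(kept, f)
		}
	}
	return kept, len(kept) != len(finalizers)
}

// clearFlyteFinalizers removes only Flyte's finalizers (new and legacy)
// and leaves anything a user or another controller added.
func clearFlyteFinalizers(finalizers []string) []string {
	finalizers, _ = removeFinalizer(finalizers, flyteFinalizer)
	finalizers, _ = removeFinalizer(finalizers, legacyFinalizer)
	return finalizers
}

func main() {
	// An object created before the change, carrying a user-owned finalizer too.
	fs := []string{legacyFinalizer, "example.com/user-finalizer"}
	fs, _ = addFinalizer(fs, flyteFinalizer)
	fmt.Println(clearFlyteFinalizers(fs)) // only the user-owned finalizer remains
}
```

In the PR itself this role is played by the controllerutil helpers operating on the Kubernetes object, so the slice handling here only shows the intended behavior.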

How was this patch tested?

Unit tests were modified to check for the presence or absence of the new finalizer. Some new tests were added, and some existing tests were fixed so that the clearFinalizer() function actually runs.

Setup process

Screenshots

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

Docs link

Signed-off-by: Roger Torrentsgenerós <rogert@spotify.com>

codecov bot commented Nov 18, 2024

Codecov Report

Attention: Patch coverage is 53.65854% with 19 lines in your changes missing coverage. Please review.

Project coverage is 36.99%. Comparing base (d1a723e) to head (202cce9).
Report is 69 commits behind head on master.

Files with missing lines                                 Patch %   Lines
flyteplugins/go/tasks/plugins/array/k8s/subtask.go       0.00%     12 Missing and 1 partial ⚠️
...er/pkg/controller/nodes/task/k8s/plugin_manager.go    62.50%    4 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6023      +/-   ##
==========================================
- Coverage   37.03%   36.99%   -0.04%     
==========================================
  Files        1313     1317       +4     
  Lines      131622   132471     +849     
==========================================
+ Hits        48742    49006     +264     
- Misses      78652    79210     +558     
- Partials     4228     4255      +27     
Flag Coverage Δ
unittests-datacatalog 51.58% <ø> (ø)
unittests-flyteadmin 54.05% <100.00%> (-0.03%) ⬇️
unittests-flytecopilot 30.99% <ø> (+8.76%) ⬆️
unittests-flytectl 62.29% <ø> (-0.18%) ⬇️
unittests-flyteidl 7.24% <ø> (-0.02%) ⬇️
unittests-flyteplugins 53.84% <0.00%> (+0.16%) ⬆️
unittests-flytepropeller 42.64% <77.77%> (-0.47%) ⬇️
unittests-flytestdlib 55.18% <ø> (+0.01%) ⬆️

Signed-off-by: Roger Torrentsgenerós <rogert@spotify.com>
},
}

assert.NoError(t, fakeKubeClient.GetClient().Create(ctx, o))

p.OnBuildIdentityResource(ctx, tctx.TaskExecutionMetadata()).Return(o, nil)
- pluginManager := PluginManager{plugin: &p, kubeClient: fakeKubeClient}
+ pluginManager := PluginManager{plugin: &p, kubeClient: fakeKubeClient, updateBackoffRetries: 5}
Contributor Author

@trutx trutx Nov 20, 2024

This updateBackoffRetries parameter ends up being e.updateBackoffRetries in:

	retryBackoff := wait.Backoff{
		Duration: time.Duration(e.updateBaseBackoffDuration) * time.Millisecond,
		Factor:   2.0,
		Jitter:   0.1,
		Steps:    e.updateBackoffRetries,
	}

If this is unset (which it was in tests), then Steps is 0, which means the wait.ExponentialBackoff(retryBackoff, func() (bool, error) {...}) call inside Finalize() times out almost immediately. In effect, e.clearFinalizer() is never called and the finalizer is never removed. The default setting is 5, but I think it's too dangerous a) to allow users to tweak these backoff settings, or b) to use an exponential backoff at all.

Thoughts?
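
To make the Steps: 0 behavior described above concrete, here is a minimal stdlib-only Go sketch. The backoff type and exponentialBackoff function are simplified, hypothetical stand-ins modeled on the behavior described in this comment (jitter is omitted); they are not the real wait package code, and the counter merely stands in for e.clearFinalizer().

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// backoff is a simplified, hypothetical stand-in for wait.Backoff.
type backoff struct {
	Duration time.Duration
	Factor   float64
	Steps    int
}

var errTimeout = errors.New("timed out waiting for the condition")

// exponentialBackoff calls condition until it succeeds, fails, or Steps runs out.
// With Steps <= 0 the loop body never executes, so condition is never called.
func exponentialBackoff(b backoff, condition func() (bool, error)) error {
	delay := b.Duration
	for remaining := b.Steps; remaining > 0; remaining-- {
		if ok, err := condition(); err != nil || ok {
			return err
		}
		if remaining == 1 {
			break // no point sleeping after the last attempt
		}
		time.Sleep(delay)
		delay = time.Duration(float64(delay) * b.Factor)
	}
	return errTimeout
}

// attempts reports how many times the condition ran and the final error.
func attempts(steps int) (int, error) {
	calls := 0
	err := exponentialBackoff(
		backoff{Duration: time.Millisecond, Factor: 2.0, Steps: steps},
		func() (bool, error) {
			calls++ // stands in for e.clearFinalizer()
			return true, nil
		})
	return calls, err
}

func main() {
	fmt.Println(attempts(0)) // unset retries: the condition never runs, timeout error
	fmt.Println(attempts(5)) // default of 5: the condition runs once and succeeds
}
```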

Contributor

The original context of this change had to do with informers getting stale info in the case of array nodes.

cc: @pvditt

> to allow users to tweak these backoff settings

Do you mean we should validate that it's a strictly positive value, right?

> to use an exponential backoff at all.

Can you expand on that?

Contributor Author

trutx commented Dec 12, 2024

> Do you mean we should validate that it's a strictly positive value, right?

That, or completely remove the ability to set an arbitrary value there. I can picture situations where a backoff that is too short expires while the k8s client inside the backoff func is still trying to talk to a busy, lagging, or high-latency k8s apiserver.

> Can you expand on that?

What I meant is that we first have to decide whether it's ok to leave to-be-deleted objects in the cluster with a finalizer nobody is going to remove. If that's ok, we don't need a backoff: we just need to try once, and we have to make sure no ultrafast backoff gives up before that one try has even been performed.

But OTOH, if we don't want to leave any garbage behind maybe we should retry infinitely instead. I am up for exponentially spacing the retries not to hammer the apiserver, but I think such retries should keep happening until the removal finally succeeds.

However I think all of this is outside the scope of this PR :D
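
The retry-until-success idea discussed above could look roughly like the following stdlib-only Go sketch. It is an illustration under stated assumptions, not a proposal from the PR: retryUntilDone, flakyOp, and demo are hypothetical names, the delay cap is an added assumption to keep waits bounded, and the flaky operation stands in for the real finalizer-removal call.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// retryUntilDone keeps calling op until it succeeds or ctx is cancelled,
// doubling the wait after each failure up to maxDelay so the apiserver
// is not hammered. It returns the number of attempts made.
func retryUntilDone(ctx context.Context, base, maxDelay time.Duration, op func() error) (int, error) {
	delay := base
	for attempt := 1; ; attempt++ {
		if err := op(); err == nil {
			return attempt, nil
		}
		select {
		case <-ctx.Done():
			return attempt, ctx.Err() // give up only when the caller says so
		case <-time.After(delay):
		}
		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
}

// flakyOp returns an operation that fails the given number of times, then succeeds.
// It is a hypothetical stand-in for removing the finalizer via the k8s client.
func flakyOp(failures int) func() error {
	return func() error {
		if failures > 0 {
			failures--
			return errors.New("apiserver busy")
		}
		return nil
	}
}

// demo retries an operation that fails three times before succeeding.
func demo() (int, error) {
	return retryUntilDone(context.Background(), time.Millisecond, 8*time.Millisecond, flakyOp(3))
}

func main() {
	attempts, err := demo()
	fmt.Println(attempts, err) // succeeds on the fourth attempt
}
```

Tying termination to the context rather than to a step count means a short or misconfigured backoff can no longer skip the removal entirely, at the cost of retrying for as long as the caller allows.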

Signed-off-by: Roger Torrentsgenerós <rogert@spotify.com>
Member

pingsutw commented Jan 3, 2025

> But OTOH, if we don't want to leave any garbage behind maybe we should retry infinitely instead. I am up for exponentially spacing the retries not to hammer the apiserver, but I think such retries should keep happening until the removal finally succeeds.

+1 to retry with backoff. We’ve noticed some pods staying in the namespace indefinitely, and we’ve had to manually remove them, which is not great.

Member

@pingsutw pingsutw left a comment

LGTM overall. @pvditt, mind taking another look as well?

Contributor

pvditt commented Jan 3, 2025

I'm in favor of keeping the current retries w/ backoff as is for now. If that update fails after the currently hardcoded retries, it should bubble up as an error.

We did run into issues as @pingsutw mentioned, but that was w/ an external service re-adding finalizers after propeller cleared them.

Thanks for cleaning up a lot of the finalizers logic - I agree that propeller should only remove flyte specific finalizers. Changes look great.

@pvditt pvditt merged commit 27c9edd into flyteorg:master Jan 4, 2025
51 of 52 checks passed
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[Housekeeping] Use domain-qualified finalizers
4 participants