Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix using KNative with ArgoCD (don't set ownerReferences for webhooks) #15483

Open
thesuperzapper opened this issue Aug 24, 2024 · 13 comments · Fixed by knative/pkg#3095
Open
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@thesuperzapper
Copy link

Background

In 2021 KNative made it so that the ValidatingWebhookConfiguration and MutatingWebhookConfiguration resources were "owned" by the Namespace which KNative is installed into (kanative-serving by default):

This was intended to ensure that users did not leave the webhooks when uninstalling (via deleting the Namespace) and break KNative when they re-installed (because the webhooks are cluster resources, and their backend would not exist and so would fail to validate anything).

Kubernetes has the concept of ownerReferences to indicate the relationships between resources. If an ownerReference sets blockOwnerDeletion, kubernetes will clean up these "child" resources before/after the "parent" resources is deleted (before: foreground delete, after: background delete).

For example, a Pod owned by a ReplicaSet might have the following ownerReferences:

apiVersion: v1
kind: Pod
metadata:
  name: xxxxxx
  namespace: xxxxxx
  ownerReferences:
    - apiVersion: apps/v1
      blockOwnerDeletion: true
      controller: true
      kind: ReplicaSet
      name: xxxxxx-759d8cb89
      uid: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

Whats the problem?

This breaks ArgoCD, a very widely used GitOps system for Kubernetes.

Specifically, ArgoCD will never remove a resource that has ownerReferences set, so the issue we were trying to prevent actually happens 100% of the time when deploying KNative with ArgoCD.

Here are some related upstream issues:

What's the solution?

I propose we make two changes:

  1. Add a config which will disable the behavior of setting the ownerReferences:
    • For example, an environment variable like KNATIVE_DISABLE_WEBHOOK_OWNER which can be set on all controller pods.
  2. When we do set ownerReferences, we should NOT be setting controller=true:
    • This is much less likely to break downstream projects (obviously the Namespace is not the controlling resource). It also allows KNative to work with the upstream patch for ArgoCD that ignores non-controller ownerReferences.

Where is the relevant code?

The code which sets the ownerReferences lives in the knative-pkg libraries:

@thesuperzapper thesuperzapper added the kind/bug Categorizes issue or PR as related to a bug. label Aug 24, 2024
@thesuperzapper
Copy link
Author

@dprotaso @skonto @creydr thanks for you work on KNative! I wanted to highlight this important issue as it causes KNative to break when deployed with ArgoCD because the webhooks are not removed on uninstall.

@dprotaso
Copy link
Member

Hey @thesuperzapper I wanted to confirm something

Specifically, ArgoCD will never remove a resource that has ownerReferences set, so the issue we were trying to prevent actually happens 100% of the time when deploying KNative with ArgoCD.

Shouldn't ArgoCD delete the namespace (since it has no owner reference) and then this should propagate and then the webhooks should eventually be deleted?

@thesuperzapper
Copy link
Author

Hey @thesuperzapper I wanted to confirm something

Specifically, ArgoCD will never remove a resource that has ownerReferences set, so the issue we were trying to prevent actually happens 100% of the time when deploying KNative with ArgoCD.

Shouldn't ArgoCD delete the namespace (since it has no owner reference) and then this should propagate and then the webhooks should eventually be deleted?

@dprotaso there are a few issues with that:

  1. Lots of Argo CD users provision namespaces in a separate application (so deleting the kNative app wouldn't delete it, hence the deadlock)
  2. Even if the application did contain the namespace, I think Argo CD would try and delete non-namespace resources first, still leading to a deadlock.

@dprotaso
Copy link
Member

Unsure if pkg/3095 closes this issue - since there's no end-user knob to disable the owner references

@dprotaso dprotaso reopened this Sep 29, 2024
@dprotaso
Copy link
Member

Lots of Argo CD users provision namespaces in a separate application (so deleting the kNative app wouldn't delete it, hence the deadlock)

Interesting is this documented anywhere? or is there some guide that recommends this?

Even if the application did contain the namespace, I think Argo CD would try and delete non-namespace resources first, still leading to a deadlock.

So you said before argo wouldn't delete resources with owner references - so does it error out or skip the resource?

@thesuperzapper
Copy link
Author

Lots of Argo CD users provision namespaces in a separate application (so deleting the kNative app wouldn't delete it, hence the deadlock)

Interesting is this documented anywhere? or is there some guide that recommends this?

This is a common pattern, especially in app-of-apps deployments.

Either way, ArgoCD is the most widest used gitOps tool in the world, so if we want people to use kNative, we need to be flexible with it.

Even if the application did contain the namespace, I think Argo CD would try and delete non-namespace resources first, still leading to a deadlock.

So you said before argo wouldn't delete resources with owner references - so does it error out or skip the resource?

@dprotaso think you might be slightly misunderstanding.

What is happening is that ArgoCD will never delete the ValidatingWebhookConfiguration / MutatingWebhookConfiguration resources (because they have the owner reference), however, it will delete the other resources, including the back-end services for the webhooks.

This means that because the webhooks have a failurePolicy: Fail, when that the back end is deleted, any operations which match the webhook selectors are now blocked (because the backend is down).

The easiest way to reproduce this is simply create an ArgoCD app for KNative that has the resources-finalizer.argocd.argoproj.io cascading deletion finalizer, sync it, then try delete it.

This cascading delete will deadlock 100% of the time in the way described.

@thesuperzapper
Copy link
Author

@dprotaso I have made a follow-up PR in knative/pgk that adds an environment variable to disable the Namespace owner reference from being set in KNative Serving:

It adds an environment variable called WEBHOOK_DISABLE_NAMESPACE_OWNERSHIP which sets the DisableNamespaceOwnership config that was added by @jonathan-innis in:

Next Steps

I think we should also backport it to 1.14 and 1.15 because otherwise its nearly impossible to deploy KNative Serving with ArgoCD. We might also want to add a basic reference to this in the docs (or hope people stumble across this issue).

That means we will need to cherry-pick the two PRs into the release-1.14 and release-1.15 of knative/pgk, as we can't just update the versions used to master.

We need to update the version of knative/pkg used in the following places:

@dprotaso
Copy link
Member

Either way, ArgoCD is the most widest used gitOps tool in the world, so if we want people to use kNative, we need to be flexible with it.

I'm good with a toggle in knative.dev/pkg but I want to clarify my position that this is really seems like an ArgoCD bug. You can't expect all OSS projects to change their behaviour because of a quirk in a CD tool. I would continue pushing on that project to do the right thing.

eg. Long term it would be great if we could do the ownership references declaratively kubernetes/kubernetes#102810

I don't think this warrants a cherry pick and cutting new releases. Per our release schedule we'll have a new one out in a week. https://github.com/knative/community/blob/main/mechanics/RELEASE-SCHEDULE.md

@thesuperzapper
Copy link
Author

I'm good with a toggle in knative.dev/pkg but I want to clarify my position that this is really seems like an ArgoCD bug. You can't expect all OSS projects to change their behaviour because of a quirk in a CD tool. I would continue pushing on that project to do the right thing.

@dprotaso I think that both KNative and ArgoCD are doing something unexpected here.

While I agree that ArgoCD should have a workaround for apps that misuse the OwnerReferences, I think that we (KNative) are actually misusing OwnerRefernces by setting the controller: true flag. That is, the Namespace is not the "controlling" resource for the webhook configuration, so it's confusing ArgoCD which assumes that people never want to delete a resource that is controlled by another.

ArgoCD does this because deleting an owned resource is usually dangerous. For example, when using cert-manager, ArgoCD should never delete a secret that is managed by a Certificate resource (which is indicated by the Certificate "owning" the Secret).

eg. Long term it would be great if we could do the ownership references declaratively kubernetes/kubernetes#102810

I doubt that upstream Kubernetes would accept this because the point of OwnerRefernces is for controllers, not end users.

I don't think this warrants a cherry pick and cutting new releases. Per our release schedule we'll have a new one out in a week. https://github.com/knative/community/blob/main/mechanics/RELEASE-SCHEDULE.md

I really think this is a bug (from the KNative perspective) and since we are still officially supporting 1.14 it warrants a patch release.

@dprotaso
Copy link
Member

Thinking about this more I wonder if we really just want to set the controller to false but leave the owner reference.

Does that appease ArgoCD?

@dprotaso
Copy link
Member

Looks like there's already a PR out for that - argoproj/gitops-engine#503

@thesuperzapper
Copy link
Author

Thinking about this more I wonder if we really just want to set the controller to false but leave the owner reference.

Does that appease ArgoCD?

@dprotaso sadly no, but we should still do that change anyway, because it's more semantically correct. At some point ArgoCD might change to ignoring non-controller owners (but there is no consensus in the ArgoCD community yet on if this would introduce it's own problems).

So in the meantime, it would be great to get knative/pkg#3103 merged at at least made part of 1.15 so people can deploy kNative safely from ArgoCD.

@thesuperzapper
Copy link
Author

@dprotaso there was a slight issue with my first PR, but I have made a quick followup knative/pkg#3107 which definitely works in my testing.

That is, when the WEBHOOK_DISABLE_NAMESPACE_OWNERSHIP env-var is set to "true" on the "webhook" Deployments (running versions of Knative Serving with this patch), the ownerReferences are not set on the webhook configuration resources.

For example, you can apply the following Kustomize patch to do this:

patches:
  - patch: |-
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: REPLACED_DURING_PATCH
      spec:
        template:
          spec:
            containers:
              - name: webhook
                env:
                  - name: WEBHOOK_DISABLE_NAMESPACE_OWNERSHIP
                    value: "true"
    target:
      group: apps
      kind: Deployment
      name: ".*webhook.*"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug.
Projects
None yet
2 participants