
Facing "kubeadm:cluster-admins already exists" error while running "kubeadm init phase mark-control-plane" step in K8s 1.29.3 version #3081

Closed
dhruvapg opened this issue Jun 28, 2024 · 7 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/backlog Higher priority than priority/awaiting-more-evidence.

@dhruvapg

What keywords did you search in kubeadm issues before filing this one?

"kubeadm:cluster-admins" already exists
unable to create the kubeadm:cluster-admins ClusterRoleBinding by using super-admin.conf

Is this a BUG REPORT or FEATURE REQUEST?

BUG REPORT

Versions

kubeadm version (use kubeadm version):
v1.29.3

Environment:

  • Kubernetes version (use kubectl version): v1.29.3
  • Cloud provider or hardware configuration: ESX
  • OS (e.g. from /etc/os-release): Photon OS
  • Kernel (e.g. uname -a): Linux 422f37e8f2e2d83c5f4d6fd98e049586 5.10.216-1.ph4-esx
  • Container runtime (CRI) (e.g. containerd, cri-o): v1.7.11
  • Container networking plugin (CNI) (e.g. Calico, Cilium):
  • Others:

What happened?

I'm hitting a "clusterrolebindings.rbac.authorization.k8s.io kubeadm:cluster-admins already exists" error during kubeadm init phase mark-control-plane on Kubernetes v1.29.3. Even if I delete the ClusterRoleBinding manually, it gets recreated automatically by kubeadm in some sync loop, and then the kubeadm init phase mark-control-plane step fails again with the "clusterrolebinding already exists" error and stays stuck in this error state.
I0627 08:12:10.094815  538426 kubeconfig.go:606] ensuring that the ClusterRoleBinding for the kubeadm:cluster-admins Group exists
I0627 08:12:10.102894  538426 kubeconfig.go:682] creating the ClusterRoleBinding for the kubeadm:cluster-admins Group by using super-admin.conf
clusterrolebindings.rbac.authorization.k8s.io "kubeadm:cluster-admins" already exists
unable to create the kubeadm:cluster-admins ClusterRoleBinding by using super-admin.conf
k8s.io/kubernetes/cmd/kubeadm/app/phases/kubeconfig.EnsureAdminClusterRoleBindingImpl
        cmd/kubeadm/app/phases/kubeconfig/kubeconfig.go:708
k8s.io/kubernetes/cmd/kubeadm/app/phases/kubeconfig.EnsureAdminClusterRoleBinding
        cmd/kubeadm/app/phases/kubeconfig/kubeconfig.go:595
k8s.io/kubernetes/cmd/kubeadm/app/cmd.(*initData).Client
        cmd/kubeadm/app/cmd/init.go:526
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/init.runMarkControlPlane
        cmd/kubeadm/app/cmd/phases/init/markcontrolplane.go:60
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1
        cmd/kubeadm/app/cmd/phases/workflow/runner.go:259
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).visitAll
        cmd/kubeadm/app/cmd/phases/workflow/runner.go:446
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run
        cmd/kubeadm/app/cmd/phases/workflow/runner.go:232
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).BindToCommand.func1.1
        cmd/kubeadm/app/cmd/phases/workflow/runner.go:372
github.com/spf13/cobra.(*Command).execute
        vendor/github.com/spf13/cobra/command.go:940
github.com/spf13/cobra.(*Command).ExecuteC
        vendor/github.com/spf13/cobra/command.go:1068
github.com/spf13/cobra.(*Command).Execute
        vendor/github.com/spf13/cobra/command.go:992
k8s.io/kubernetes/cmd/kubeadm/app.Run
        cmd/kubeadm/app/kubeadm.go:50
main.main
        cmd/kubeadm/kubeadm.go:25
runtime.main
        /usr/local/go/src/runtime/proc.go:267
runtime.goexit
        /usr/local/go/src/runtime/asm_amd64.s:1650
could not bootstrap the admin user in file admin.conf
k8s.io/kubernetes/cmd/kubeadm/app/cmd.(*initData).Client
        cmd/kubeadm/app/cmd/init.go:528
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/init.runMarkControlPlane
        cmd/kubeadm/app/cmd/phases/init/markcontrolplane.go:60
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1
        cmd/kubeadm/app/cmd/phases/workflow/runner.go:259
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).visitAll
        cmd/kubeadm/app/cmd/phases/workflow/runner.go:446
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run
        cmd/kubeadm/app/cmd/phases/workflow/runner.go:232
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).BindToCommand.func1.1
        cmd/kubeadm/app/cmd/phases/workflow/runner.go:372
github.com/spf13/cobra.(*Command).execute
        vendor/github.com/spf13/cobra/command.go:940
github.com/spf13/cobra.(*Command).ExecuteC
        vendor/github.com/spf13/cobra/command.go:1068
github.com/spf13/cobra.(*Command).Execute
        vendor/github.com/spf13/cobra/command.go:992
k8s.io/kubernetes/cmd/kubeadm/app.Run
        cmd/kubeadm/app/kubeadm.go:50
main.main
        cmd/kubeadm/kubeadm.go:25
runtime.main

The kube-apiserver audit logs show that the kubeadm:cluster-admins ClusterRoleBinding is automatically created by kubeadm running on the new control plane node.
audit/kube-apiserver.log:{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"RequestResponse","auditID":"81d9e046-37ae-4814-9e4b-56e87cc05c56","stage":"ResponseComplete","requestURI":"/apis/rbac.authorization.k8s.io/v1/clusterrolebindings?timeout=10s","verb":"create","user":{"username":"kubernetes-super-admin","groups":["system:masters","system:authenticated"]},"userAgent":"kubeadm/v1.29.3+(linux/amd64) kubernetes/4ab1a82","objectRef":{"resource":"clusterrolebindings","name":"kubeadm:cluster-admins","apiGroup":"rbac.authorization.k8s.io","apiVersion":"v1"},"responseStatus":{"metadata":{},"code":201},"requestObject":{"kind":"ClusterRoleBinding","apiVersion":"rbac.authorization.k8s.io/v1","metadata":{"name":"kubeadm:cluster-admins","creationTimestamp":null},"subjects":[{"kind":"Group","apiGroup":"rbac.authorization.k8s.io","name":"kubeadm:cluster-admins"}],"roleRef":{"apiGroup":"rbac.authorization.k8s.io","kind":"ClusterRole","name":"cluster-admin"}},"responseObject":{"kind":"ClusterRoleBinding","apiVersion":"rbac.authorization.k8s.io/v1","metadata":{"name":"kubeadm:cluster-admins","uid":"629da920-2bd3-4a98-9348-86708ccf6e4e","resourceVersion":"65240","creationTimestamp":"2024-06-27T06:54:24Z","managedFields":[{"manager":"kubeadm","operation":"Update","apiVersion":"rbac.authorization.k8s.io/v1","time":"2024-06-27T06:54:24Z","fieldsType":"FieldsV1","fieldsV1":{"f:roleRef":{},"f:subjects":{}}}]},"subjects":[{"kind":"Group","apiGroup":"rbac.authorization.k8s.io","name":"kubeadm:cluster-admins"}],"roleRef":{"apiGroup":"rbac.authorization.k8s.io","kind":"ClusterRole","name":"cluster-admin"}},"requestReceivedTimestamp":"2024-06-27T06:54:24.611747Z","stageTimestamp":"2024-06-27T06:54:24.617174Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":""}}

What you expected to happen?

I expected the mark-control-plane phase to handle the "clusterrolebinding already exists" error gracefully and not return an error. This is already fixed in 1.30 but not backported to 1.29.

Is there a way to work around this error until kubernetes/kubernetes@ec1516b is backported to 1.29?

How to reproduce it (as minimally and precisely as possible)?

  1. Deployed a Kubernetes cluster with v1.29.3
  2. Took an etcd snapshot backup using etcdctl (see the sketch below)
  3. Deleted the control plane nodes
  4. Restored the etcd snapshot using etcdctl
  5. Invoked the kubeadm init phases
  6. The "kubeadm init phase mark-control-plane" step fails with the "clusterrolebinding already exists" error

Anything else we need to know?

When I faced the same error during "kubeadm init phase upload-config all", I added a "kubectl delete clusterrolebinding kubeadm:cluster-admins" command before that step, which resolved the error and let me move to the next step. However, the same workaround does not help during the mark-control-plane phase.
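
For reference, the workaround before the upload-config step looked roughly like this (the kubeconfig and config paths are assumptions for this setup, where the regular admin kubeconfig is denied by the apiserver):

  kubectl --kubeconfig=/etc/kubernetes/super-admin.conf delete clusterrolebinding kubeadm:cluster-admins
  kubeadm init phase upload-config all --config=/etc/k8s/kubeadm.yaml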

@neolit123
Member

neolit123 commented Jun 28, 2024

Deployed kubernetes cluster with 1.29.3 version
Took etcd snapshot backup using etcdctl
Deleted control plane nodes
Restored etcd snapshot using etcdctl
Invoked kubeadm init phase
"kubeadm init phase mark-control-plane" step fails with clusterrolebinding already exists error

calling kubeadm init or join on an existing etcd data dir from /var/lib/etcd is not really supported or tested.
so you might have to skip the mark-control-plane phase and manually apply what it does to work around the problem.
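
roughly, the manual equivalent of the mark-control-plane phase would look like this (the node name is a placeholder; double-check the exact labels/taints your kubeadm version applies by default):

  # label the node as a control plane node
  kubectl label node <node-name> node-role.kubernetes.io/control-plane= --overwrite
  kubectl label node <node-name> node.kubernetes.io/exclude-from-external-load-balancers= --overwrite
  # apply the default control plane taint (skip if your config sets no taints)
  kubectl taint node <node-name> node-role.kubernetes.io/control-plane:NoSchedule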

the correct way to do this type of restore is to:

  • delete one existing CP node, kubeadm join a new CP node
  • repeat until you have replaced all old CP nodes.

in terms of why it's failing, i'm a bit confused. in 1.29 we already check if the CRB exists:
https://github.com/kubernetes/kubernetes/blob/release-1.29/cmd/kubeadm/app/phases/kubeconfig/kubeconfig.go#L653C1-L657C13

and then we exit without an error:
https://github.com/kubernetes/kubernetes/blob/release-1.29/cmd/kubeadm/app/phases/kubeconfig/kubeconfig.go#L669-L672

can you show the output of kubeadm init phase mark-control-plane --v=10 when the error is happening?

use pastebin or github gists to share the full output.

@neolit123 neolit123 added kind/bug Categorizes issue or PR as related to a bug. priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. labels Jun 28, 2024
@neolit123 neolit123 added this to the v1.29 milestone Jun 28, 2024
@dhruvapg
Author

dhruvapg commented Jul 1, 2024

can you show the output of kubeadm init phase mark-control-plane --v=10 when the error is happening?

https://gist.github.com/dhruvapg/84d2d3b8cd0c81c114bf57db0b634281

Also attaching the output of the same command invoked with kubeadm config file:
root@422f37e8f2e2d83c5f4d6fd98e049586 [ ~ ]# kubeadm init phase mark-control-plane --config=/etc/k8s/kubeadm.yaml --rootfs=/ --v=10
https://gist.github.com/dhruvapg/cd403364523e0e300875450b3cfe6337

in terms of why it's failing, i'm a bit confused. in 1.29 we already check if the CRB exists:
https://github.com/kubernetes/kubernetes/blob/release-1.29/cmd/kubeadm/app/phases/kubeconfig/kubeconfig.go#L653C1-L657C13

and then we exit without an error:
https://github.com/kubernetes/kubernetes/blob/release-1.29/cmd/kubeadm/app/phases/kubeconfig/kubeconfig.go#L669-L672

In my case, during the etcd restore the apiserver is configured to deny all API requests except those from privileged users/groups. Since RBAC is disabled for cluster-admin.conf, kubeadm falls back to creating the ClusterRoleBinding with the super-admin.conf client, which fails here:
https://github.com/kubernetes/kubernetes/blob/v1.29.6/cmd/kubeadm/app/phases/kubeconfig/kubeconfig.go#L697
and the error is returned from here: https://github.com/kubernetes/kubernetes/blob/v1.29.6/cmd/kubeadm/app/phases/kubeconfig/kubeconfig.go#L708

The "CRB already exists" error for super-admin.conf is already fixed in 1.30 but not backported to 1.29.
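
For completeness, the state can be confirmed like this (kubeconfig paths are assumed to be the kubeadm defaults):

  # the binding already exists in the restored etcd data
  kubectl --kubeconfig=/etc/kubernetes/super-admin.conf get clusterrolebinding kubeadm:cluster-admins
  # the regular admin kubeconfig is denied during the restore window, which forces the super-admin.conf path
  kubectl --kubeconfig=/etc/kubernetes/admin.conf auth can-i create clusterrolebindings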

@neolit123
Member

thanks for the info.

In my case, during the etcd restore the apiserver is configured to deny all API requests except those from privileged users/groups. Since RBAC is disabled for cluster-admin.conf, kubeadm falls back to creating the ClusterRoleBinding with the super-admin.conf client, which fails.

that's not a supported or tested scenario, but i can imagine users are doing similar actions.
could you explain why you are restoring from backup in this way? is that an automated process supported in your stack, or are you doing this as a one-off?

The CRB exists error for super-admin.conf is already fixed in 1.30 but not crossported to 1.29

so if we backport that PR to 1.29 it would be a fix for you?

@neolit123
Member

here is the fix backport
kubernetes/kubernetes#125821

that can be available in the next 1.29.x if the release managers do not miss it.

@neolit123 neolit123 added priority/backlog Higher priority than priority/awaiting-more-evidence. and removed priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. labels Jul 1, 2024
@dhruvapg
Author

dhruvapg commented Jul 1, 2024

Thanks for backporting the fix to 1.29.x

could you explain why you are restoring from backup in this way? is that an automated process supported in your stack, or are you doing this as a one-off?

Restoring the control plane to a backed-up Kubernetes version is a feature we recently began supporting; we disable webhooks and RBAC for the apiserver until the restore operation is completed to avoid any undefined states. This worked fine up to 1.28, since the default admin.conf was bound to the system:masters group, which bypasses RBAC; it started breaking in 1.29 with the separation of cluster-admin.conf and super-admin.conf.

so if we backport that PR to 1.29 it would be a fix for you?

I haven't verified it in 1.30, but I think it would fix the issue.
Earlier, I ran into the same error during kubeadm init phase upload-config all and could work around it by deleting the clusterrolebinding, but the same hack didn't work for kubeadm init phase mark-control-plane.
So hopefully this backported fix will handle the apierrors.IsAlreadyExists error gracefully in these scenarios.

@neolit123
Member

neolit123 commented Jul 1, 2024

This worked fine up to 1.28, since the default admin.conf was bound to the system:masters group, which bypasses RBAC; it started breaking in 1.29 with the separation of cluster-admin.conf and super-admin.conf.

if the feature you support requires the system:masters group which bypasses RBAC, you might have to maintain an admin.conf that continues to bind to system:masters. if you populate an admin.conf yourself, kubeadm init/join will respect it, but kubeadm's cert rotation (e.g. on upgrade) will convert it back to the cluster-admin role.
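
a minimal sketch of generating such a kubeconfig with kubeadm itself (the kubeadm config path is taken from this thread; verify the flags against your kubeadm version):

  kubeadm kubeconfig user --client-name=kubernetes-admin --org=system:masters \
    --config=/etc/k8s/kubeadm.yaml > /etc/kubernetes/admin.conf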

@neolit123
Member

here is the fix backport kubernetes/kubernetes#125821

that can be available in the next 1.29.x if the release managers do not miss it.

fixed in 1.29.7
