Persistent memory leak of k3s control plane instance #10922

Open
craigcabrey opened this issue Sep 21, 2024 · 3 comments

craigcabrey commented Sep 21, 2024

Environmental Info:
K3s Version: v1.29.8

[root@venus-node-3 /]# k3s -v
k3s version v1.29.8+k3s1 (33fdc35d)
go version go1.22.5

Node(s) CPU architecture, OS, and Version:
Linux venus-node-3 6.10.6-200.fc40.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Aug 19 14:09:30 UTC 2024 x86_64 GNU/Linux

cmdline:

BOOT_IMAGE=(hd1,gpt3)/ostree/fedora-coreos-2e3c9b45afc73fe355d3ea8072021317aaa8909e49a642167e41e9abab5babbf/vmlinuz-6.10.6-200.fc40.x86_64 mitigations=auto,nosmt ignition.platform.id=metal ostree=/ostree/boot.0/fedora-coreos/2e3c9b45afc73fe355d3ea8072021317aaa8909e49a642167e41e9abab5babbf/0 root=UUID=bb46740c-4cdf-409f-b3e2-aa78cd0a3a36 rw rootflags=prjquota boot=UUID=a6f11c3a-d6b7-43ba-8314-9377ad17d709
[craigcabrey@tealboi ~]$ k get node -o wide
NAME           STATUS                     ROLES                       AGE    VERSION        INTERNAL-IP                            EXTERNAL-IP   OS-IMAGE                         KERNEL-VERSION           CONTAINER-RUNTIME
ms01-node-1    Ready                      <none>                      55d    v1.29.8+k3s1   fdb5:12c1:f8cb:0:7a2a:376a:8404:ded1   <none>        Fedora CoreOS 40.20240825.3.0    6.10.6-200.fc40.x86_64   containerd://1.7.20-k3s1
ms01-node-2    Ready                      <none>                      112d   v1.29.8+k3s1   fdb5:12c1:f8cb:0:7cdd:1855:7d3b:5ee2   <none>        Fedora CoreOS 40.20240825.3.0    6.10.6-200.fc40.x86_64   containerd://1.7.20-k3s1
ms01-node-3    Ready                      <none>                      112d   v1.29.8+k3s1   fdb5:12c1:f8cb:0:851e:d111:ad0e:473d   <none>        Fedora CoreOS 40.20240825.3.0    6.10.6-200.fc40.x86_64   containerd://1.7.20-k3s1
ms01-node-4    Ready                      <none>                      34d    v1.29.6+k3s2   fdb5:12c1:f8cb:0:218a:a6ca:5c7f:f5d4   <none>        Fedora CoreOS 40.20240825.3.0    6.10.6-200.fc40.x86_64   containerd://1.7.17-k3s1
ms01-node-5    Ready                      <none>                      54d    v1.29.8+k3s1   fdb5:12c1:f8cb:0:f67:d816:2aea:e002    <none>        Fedora CoreOS 40.20240825.3.0    6.10.6-200.fc40.x86_64   containerd://1.7.20-k3s1
ms01-node-6    Ready                      <none>                      54d    v1.29.6+k3s2   fdb5:12c1:f8cb:0:5fe8:9d4c:eb99:b755   <none>        Fedora CoreOS 40.20240825.3.0    6.10.6-200.fc40.x86_64   containerd://1.7.17-k3s1
ms01-node-7    Ready                      <none>                      39d    v1.29.6+k3s2   fdb5:12c1:f8cb:0:7f46:1377:653:e6f5    <none>        Fedora CoreOS 40.20240825.3.0    6.10.6-200.fc40.x86_64   containerd://1.7.17-k3s1
pi-node-1      Ready                      control-plane,etcd,master   34d    v1.29.8+k3s1   fdb5:12c1:f8cb:0:560b:4bca:422c:f49    <none>        Debian GNU/Linux 12 (bookworm)   6.6.39-v8-16k+           containerd://1.7.20-k3s1
pi-node-2      Ready                      control-plane,etcd,master   56d    v1.29.8+k3s1   fdb5:12c1:f8cb:0:f507:7bda:a045:ad62   <none>        Debian GNU/Linux 12 (bookworm)   6.6.39-v8-16k+           containerd://1.7.20-k3s1
pi-node-3      Ready                      control-plane,etcd,master   34d    v1.29.8+k3s1   fdb5:12c1:f8cb:0:857f:41e8:ff46:fe9f   <none>        Debian GNU/Linux 12 (bookworm)   6.6.40-v8-16k+           containerd://1.7.20-k3s1
venus-node-1   Ready                      <none>                      114d   v1.29.6+k3s2   fdb5:12c1:f8cb:0:63a8:555c:1d73:2b8b   <none>        Fedora CoreOS 40.20240825.3.0    6.10.6-200.fc40.x86_64   containerd://1.7.17-k3s1
venus-node-2   Ready                      control-plane,etcd,master   27d    v1.29.8+k3s1   fdb5:12c1:f8cb:0:49bd:e1a7:a96b:9501   <none>        Fedora CoreOS 40.20240825.3.0    6.10.6-200.fc40.x86_64   containerd://1.7.20-k3s1
venus-node-3   Ready,SchedulingDisabled   control-plane,etcd,master   28d    v1.29.8+k3s1   fdb5:12c1:f8cb:0:97ed:8915:ab57:61d4   <none>        Fedora CoreOS 40.20240825.3.0    6.10.6-200.fc40.x86_64   containerd://1.7.20-k3s1

Cluster Configuration:

$ cat /etc/rancher/k3s/config.yaml
server: https://[fdb5:12c1:f8cb:0:560b:4bca:422c:f49]:6443
embedded-registry: true
secrets-encryption: true
disable:
  - servicelb
  - traefik

cluster-cidr: fdb5:12c1:f8cb:dead:beaf::/96,10.42.0.0/16
service-cidr: fdb5:12c1:f8cb:dead:c0de::/108,10.43.0.0/16

flannel-backend: wireguard-native
flannel-ipv6-masq: true
flannel-iface: internal

node-ip: fdb5:12c1:f8cb:0:97ed:8915:ab57:61d4,192.168.7.13

kubelet-arg:
  - "node-ip=::"                                                                                                                                                                                                                                                                                                                  kube-controller-manager-arg:
  - node-cidr-mask-size-ipv6=108                                                                                                                                                                                                                                                                                                  # https://docs.k3s.io/cli/server#listeners
tls-san: k8s.internal.lan
write-kubeconfig-mode: "0644"
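
For reference, one quick way to confirm the dual-stack settings actually took effect on a node (venus-node-3 here is just the node this report focuses on; any node name from the list above works):

$ kubectl get node venus-node-3 -o jsonpath='{.spec.podCIDRs}{"\n"}{.status.addresses}{"\n"}'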

Describe the bug:

I have a control plane node that runs out of memory after 1-2 days. I've experimented a bit, and this happens even when the node is cordoned and minimal pods are running.
Steps To Reproduce:

  • Installed K3s: v1.29.8

I have a simple drop-in:

[root@venus-node-3 /]# cat /etc/systemd/system/k3s.service.d/99-custom.conf
[Service]
ExecStartPre=-/usr/sbin/ip link property add dev internal altname vlan-provider
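
(The leading - on the ExecStartPre line just tells systemd to ignore a non-zero exit from the ip command, so this drop-in only adds an interface altname and shouldn't interact with k3s itself.)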

Expected behavior:

Memory usage of k3s and containerd stays roughly flat over time, especially with the node cordoned and nearly idle.

Actual behavior:

Memory usage of k3s & containerd grows over a 1-2 day period until it consumes all memory on the host. This happened on v1.29.6 as well; upgrading to v1.29.8 produced no change.
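
If it helps with triage: the k3s server has an --enable-pprof flag that exposes Go's pprof endpoints on the supervisor port, so heap profiles of the growing process can be snapshotted and diffed. A rough sketch, assuming the default supervisor port 6443 and that the endpoint is reachable locally (ports and auth may differ per setup):

# enable pprof via the config file (or pass --enable-pprof to k3s server), then restart
# /etc/rancher/k3s/config.yaml: enable-pprof: true

# snapshot the heap a few hours apart; -k because the supervisor cert is self-signed
$ curl -sk https://127.0.0.1:6443/debug/pprof/heap -o heap-early.pprof
$ curl -sk https://127.0.0.1:6443/debug/pprof/heap -o heap-late.pprof

# diff the two snapshots (needs a Go toolchain)
$ go tool pprof -base heap-early.pprof heap-late.pprof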

Additional context / logs:

I also have the logs below (from a tool similar to atop, if you aren't familiar with it) showing the cgroup- and process-level stats over a 24h+ period. You can see RSS grow uncontrollably over that window:

~10 hours ago:

[Screenshot 2024-09-21 134317]

~15 minutes ago:

[Screenshot 2024-09-21 134434]

Grafana stats:

[Screenshot 2024-09-21 133643]

craigcabrey (Author) commented:

Process view right before I ran systemctl restart k3s:

[Screenshot 2024-09-21 134626]

craigcabrey (Author) commented:

Process view ~10 hours ago:

[Screenshot 2024-09-21 134744]

This shows that both k3s and containerd are growing. These views track the systemd cgroup slices, not the workloads, so workload behavior should not be contaminating these stats (separately, as noted above, I minimized the number of pods running on this node and checked the stats of those pods -- all were within reason).
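
For anyone who wants to reproduce these readings without a monitoring stack, the same numbers can be sampled straight from the cgroup tree. A minimal sketch, assuming cgroup v2 and the stock k3s.service unit (on a default install, k3s and the containerd it spawns live in the same service cgroup):

# total memory charged to the k3s service cgroup
[root@venus-node-3 /]# cat /sys/fs/cgroup/system.slice/k3s.service/memory.current

# per-process RSS, to see whether k3s or containerd is the one growing
[root@venus-node-3 /]# ps -o pid,rss,comm -C k3s,containerd

# or watch the slices interactively, ordered by memory
[root@venus-node-3 /]# systemd-cgtop --order=memory system.slice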

craigcabrey (Author) commented:

And process view after restarting:

[Screenshot 2024-09-21 135324]
