Deploying a fresh cluster via helm chart is stuck at bootstrap-yaml-envsubst init container
#1564
Comments
Currently working around this by using a pulumi chart transform to remove the bootstrap-yaml-envsubst init container. It strikes me as rather odd that this recently introduced container produces no logs, and you cannot exec into it to try and debug; you just get an OCI error because bash/zsh/dash/sh are all not found in PATH. This container probably needs to be battle-tested a little more if the plan is to keep it in the helm chart. |
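For readers wanting to replicate that workaround, the sketch below shows roughly what such a Pulumi chart transformation could look like in Go. It is not the commenter's actual code: the function signature is assumed to match Pulumi's yaml.Transformation (so it could be passed to the helm.v3 chart via its Transformations option), and the small main function only demonstrates the map manipulation against a fake StatefulSet object.

```go
package main

import (
	"encoding/json"
	"fmt"

	"github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)

// removeEnvsubstInitContainer drops the bootstrap-yaml-envsubst init
// container from any StatefulSet in the rendered chart. Its signature matches
// Pulumi's yaml.Transformation so that, in a real program, it could be passed
// to the helm.v3 chart via the Transformations option.
func removeEnvsubstInitContainer(state map[string]interface{}, opts ...pulumi.ResourceOption) {
	if state["kind"] != "StatefulSet" {
		return
	}
	spec, ok := state["spec"].(map[string]interface{})
	if !ok {
		return
	}
	template, ok := spec["template"].(map[string]interface{})
	if !ok {
		return
	}
	podSpec, ok := template["spec"].(map[string]interface{})
	if !ok {
		return
	}
	initContainers, ok := podSpec["initContainers"].([]interface{})
	if !ok {
		return
	}
	kept := make([]interface{}, 0, len(initContainers))
	for _, c := range initContainers {
		if m, ok := c.(map[string]interface{}); ok && m["name"] == "bootstrap-yaml-envsubst" {
			continue // skip the container being worked around
		}
		kept = append(kept, c)
	}
	podSpec["initContainers"] = kept
}

func main() {
	// Stand-alone demonstration against a minimal fake StatefulSet object.
	state := map[string]interface{}{
		"kind": "StatefulSet",
		"spec": map[string]interface{}{
			"template": map[string]interface{}{
				"spec": map[string]interface{}{
					"initContainers": []interface{}{
						map[string]interface{}{"name": "tuning"},
						map[string]interface{}{"name": "bootstrap-yaml-envsubst"},
					},
				},
			},
		},
	}
	removeEnvsubstInitContainer(state)
	out, _ := json.MarshalIndent(state, "", "  ")
	fmt.Println(string(out))
}
```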
The container in question uses distroless as a base, so the inability to exec is expected. The lack of logs is expected as well, actually, because the container is only reading and then writing a file (it's effectively just envsubst). We've not seen any issues with hanging thus far. Could you share any relevant output from the stuck pod? |
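For context, the container's job is conceptually little more than the following Go sketch (this is not the actual redpanda-operator code, which is linked later in the thread, and plain ${VAR} expansion is assumed): read the templated input, expand environment-variable references, and write the result out.

```go
package main

import (
	"log"
	"os"
)

// A minimal sketch of what an envsubst-style init step does. The real
// redpanda-operator binary has its own flag parsing and substitution rules;
// this only illustrates why the container normally produces no output and
// exits almost immediately. The paths match the container command shown in
// the pod description below.
func main() {
	in := "/tmp/base-config/bootstrap.yaml"
	out := "/tmp/config/.bootstrap.yaml"

	data, err := os.ReadFile(in)
	if err != nil {
		log.Fatalf("reading %s: %v", in, err)
	}

	// Replace ${VAR} / $VAR references with values from the environment.
	expanded := os.ExpandEnv(string(data))

	if err := os.WriteFile(out, []byte(expanded), 0o644); err != nil {
		log.Fatalf("writing %s: %v", out, err)
	}
}
```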
This should only be affecting the more recent chart versions. If possible, would you mind trying out 5.9.5 to see if anything similarly strange happens? |
Ah, |
Hi Chris, thanks for the reply. I've been trying all morning to find as much info as I can. Here's what I have so far
Here is a log of pod statuses when deploying successfully via the helm CLI. There was only one time when the helm CLI did not produce the configuration job/container, which caused the main redpanda pod to get stuck at "Init:2/3"... however, even in this log below of a successful rollout, isn't it strange that the 3rd init container is taking more than 10 seconds? I've only been able to successfully roll out twice with pulumi-kubernetes. I'm not sure why the success rate for pulumi is lower; I suspect they may be using an older version of helm internally?
Here is the values.yaml I'm using:
I ran the commands you've provided above and did a diff between a successful rollout and a failed one, but there were no significant differences. Not sure where to go from here; I'm a little confused as to which tool is causing the error, as it seems strange to me that I've seen both successful and failed rollouts from both the helm CLI and the pulumi CLI. |
code available here: |
I've opened a bug with pulumi-kubernetes: pulumi/pulumi-kubernetes#3265. Is it possible to get some logging from the bootstrap-yaml-envsubst container? |
The deployment failure is not occurring 100% of the time, suggesting that this is a toolchain issue - not an issue with the deployment values |
Thanks for continuing to dig into this! Here's the source of the envsubst container: https://github.com/redpanda-data/redpanda-operator/blob/main/operator/cmd/envsubst/envsubst.go. Disk IO is really the only thing it's doing that could be hanging. Could you run kubectl describe on the stuck pod and share the output? |
This is likely a red herring. The configuration job will only run after the StatefulSet becomes Ready which is prevented if the init container hangs. |
Name: redpanda-ceda518c-0
Namespace: default
Priority: 0
Service Account: default
Node: orbstack/198.19.249.2
Start Time: Thu, 17 Oct 2024 03:02:11 +1000
Labels: app.kubernetes.io/component=redpanda-statefulset
app.kubernetes.io/instance=redpanda-ceda518c
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=redpanda
apps.kubernetes.io/pod-index=0
controller-revision-hash=redpanda-ceda518c-74f6484f4b
helm.sh/chart=redpanda-5.9.7
redpanda.com/poddisruptionbudget=redpanda-ceda518c
statefulset.kubernetes.io/pod-name=redpanda-ceda518c-0
Annotations: config.redpanda.com/checksum: f05f9a3c004ec98ce95cc606dac824014cc5bd64cb1c44fe8f6968b659c6d979
Status: Pending
IP: 192.168.194.42
IPs:
IP: 192.168.194.42
IP: fd07:b51a:cc66:a::2a
Controlled By: StatefulSet/redpanda-ceda518c
Init Containers:
tuning:
Container ID: docker://e1c9a6bbcfcc27f6f363817b5f8fecaa407469578ea8f45f90d8aad6e277d77a
Image: docker.redpanda.com/redpandadata/redpanda:v24.2.7
Image ID: docker-pullable://docker.redpanda.com/redpandadata/redpanda@sha256:82a69763bef8d8b55ea5a520fa1b38f993908ef68946819ca1aed43541824c48
Port: <none>
Host Port: <none>
Command:
/bin/bash
-c
rpk redpanda tune all
State: Terminated
Reason: Completed
Exit Code: 0
Started: Thu, 17 Oct 2024 03:02:12 +1000
Finished: Thu, 17 Oct 2024 03:02:12 +1000
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/etc/redpanda from base-config (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-dtgbc (ro)
redpanda-configurator:
Container ID: docker://1de50536c1b7eac009558a7bc1fd0c0b61b2bbf0ce6eecf44e01c34e4f0ee612
Image: docker.redpanda.com/redpandadata/redpanda:v24.2.7
Image ID: docker-pullable://docker.redpanda.com/redpandadata/redpanda@sha256:82a69763bef8d8b55ea5a520fa1b38f993908ef68946819ca1aed43541824c48
Port: <none>
Host Port: <none>
Command:
/bin/bash
-c
trap "exit 0" TERM; exec $CONFIGURATOR_SCRIPT "${SERVICE_NAME}" "${KUBERNETES_NODE_NAME}" & wait $!
State: Terminated
Reason: Completed
Exit Code: 0
Started: Thu, 17 Oct 2024 03:02:13 +1000
Finished: Thu, 17 Oct 2024 03:02:13 +1000
Ready: True
Restart Count: 0
Environment:
CONFIGURATOR_SCRIPT: /etc/secrets/configurator/scripts/configurator.sh
SERVICE_NAME: redpanda-ceda518c-0 (v1:metadata.name)
KUBERNETES_NODE_NAME: (v1:spec.nodeName)
HOST_IP_ADDRESS: (v1:status.hostIP)
Mounts:
/etc/redpanda from config (rw)
/etc/secrets/configurator/scripts/ from redpanda-ceda518c-configurator (rw)
/tmp/base-config from base-config (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-dtgbc (ro)
bootstrap-yaml-envsubst:
Container ID: docker://37eb987a144d998148954cb49183953b5c5f1d70197736abeb19f3dd71eba8e4
Image: docker.redpanda.com/redpandadata/redpanda-operator:v2.2.4-24.2.5
Image ID: docker-pullable://docker.redpanda.com/redpandadata/redpanda-operator@sha256:17979d5443f420a1791edb067149d841bb8251c534e1c289a8fbc11392a7aca2
Port: <none>
Host Port: <none>
Command:
/redpanda-operator
envsubst
/tmp/base-config/bootstrap.yaml
--output
/tmp/config/.bootstrap.yaml
State: Running
Started: Thu, 17 Oct 2024 03:02:14 +1000
Ready: False
Restart Count: 0
Limits:
cpu: 100m
memory: 25Mi
Requests:
cpu: 100m
memory: 25Mi
Environment: <none>
Mounts:
/tmp/base-config/ from base-config (rw)
/tmp/config/ from config (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-dtgbc (ro)
Containers:
redpanda:
Container ID:
Image: docker.redpanda.com/redpandadata/redpanda:v24.2.7
Image ID:
Ports: 9644/TCP, 8082/TCP, 9093/TCP, 9094/TCP, 33145/TCP, 8081/TCP
Host Ports: 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP
Command:
rpk
redpanda
start
--advertise-rpc-addr=$(SERVICE_NAME).redpanda-ceda518c.default.svc.cluster.local.:33145
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Limits:
cpu: 1
memory: 2560Mi
Requests:
cpu: 1
memory: 2560Mi
Liveness: exec [/bin/sh -c curl --silent --fail -k -m 5 "http://${SERVICE_NAME}.redpanda-ceda518c.default.svc.cluster.local.:9644/v1/status/ready"] delay=10s timeout=1s period=10s #success=1 #failure=3
Readiness: exec [/bin/sh -c set -x
RESULT=$(rpk cluster health)
echo $RESULT
echo $RESULT | grep 'Healthy:.*true'
] delay=1s timeout=1s period=10s #success=1 #failure=3
Startup: exec [/bin/sh -c set -e
RESULT=$(curl --silent --fail -k -m 5 "http://${SERVICE_NAME}.redpanda-ceda518c.default.svc.cluster.local.:9644/v1/status/ready")
echo $RESULT
echo $RESULT | grep ready
] delay=1s timeout=1s period=10s #success=1 #failure=120
Environment:
SERVICE_NAME: redpanda-ceda518c-0 (v1:metadata.name)
POD_IP: (v1:status.podIP)
HOST_IP: (v1:status.hostIP)
Mounts:
/etc/redpanda from config (rw)
/tmp/base-config from base-config (rw)
/var/lib/redpanda/data from datadir (rw)
/var/lifecycle from lifecycle-scripts (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-dtgbc (ro)
config-watcher:
Container ID:
Image: docker.redpanda.com/redpandadata/redpanda:v24.2.7
Image ID:
Port: <none>
Host Port: <none>
Command:
/bin/sh
Args:
-c
trap "exit 0" TERM; exec /etc/secrets/config-watcher/scripts/sasl-user.sh & wait $!
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/etc/redpanda from config (rw)
/etc/secrets/config-watcher/scripts from redpanda-ceda518c-config-watcher (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-dtgbc (ro)
Conditions:
Type Status
PodReadyToStartContainers True
Initialized False
Ready False
ContainersReady False
PodScheduled True
Volumes:
datadir:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: datadir-redpanda-ceda518c-0
ReadOnly: false
lifecycle-scripts:
Type: Secret (a volume populated by a Secret)
SecretName: redpanda-ceda518c-sts-lifecycle
Optional: false
base-config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: redpanda-ceda518c
Optional: false
config:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
redpanda-ceda518c-configurator:
Type: Secret (a volume populated by a Secret)
SecretName: redpanda-ceda518c-configurator
Optional: false
redpanda-ceda518c-config-watcher:
Type: Secret (a volume populated by a Secret)
SecretName: redpanda-ceda518c-config-watcher
Optional: false
kube-api-access-dtgbc:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Topology Spread Constraints: topology.kubernetes.io/zone:ScheduleAnyway when max skew 1 is exceeded for selector app.kubernetes.io/component=redpanda-statefulset,app.kubernetes.io/instance=redpanda-ceda518c,app.kubernetes.io/name=redpanda
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 71s default-scheduler Successfully assigned default/redpanda-ceda518c-0 to orbstack
Normal Pulled 71s kubelet Container image "docker.redpanda.com/redpandadata/redpanda:v24.2.7" already present on machine
Normal Created 71s kubelet Created container tuning
Normal Started 71s kubelet Started container tuning
Normal Pulled 70s kubelet Container image "docker.redpanda.com/redpandadata/redpanda:v24.2.7" already present on machine
Normal Created 70s kubelet Created container redpanda-configurator
Normal Started 70s kubelet Started container redpanda-configurator
Normal Pulled 69s kubelet Container image "docker.redpanda.com/redpandadata/redpanda-operator:v2.2.4-24.2.5" already present on machine
Normal Created 69s kubelet Created container bootstrap-yaml-envsubst
Normal Started 69s kubelet Started container bootstrap-yaml-envsubst |
I just noticed these very small limits on the init container that is stuck:
Limits:
cpu: 100m
memory: 25Mi
Requests:
cpu: 100m
memory: 25Mi
These come from helm-charts/charts/redpanda/post_install_upgrade_job.go, lines 55 to 64 at commit 56e8cc7.
Given the flaky behaviour, I'm leaning towards this being an issue where the limit is reached - and perhaps this is an Orbstack issue where it is not providing any event or failure handling? |
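For reference, a reconstruction of the kind of hardcoded requirements that would produce those values is sketched below; it is inferred from the limits shown in the pod description above, not copied from the chart source.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// Reconstruction of ResourceRequirements that would yield the 100m / 25Mi
// requests and limits seen on the bootstrap-yaml-envsubst init container.
func bootstrapEnvsubstResources() corev1.ResourceRequirements {
	return corev1.ResourceRequirements{
		Requests: corev1.ResourceList{
			corev1.ResourceCPU:    resource.MustParse("100m"),
			corev1.ResourceMemory: resource.MustParse("25Mi"),
		},
		Limits: corev1.ResourceList{
			corev1.ResourceCPU:    resource.MustParse("100m"),
			corev1.ResourceMemory: resource.MustParse("25Mi"),
		},
	}
}

func main() {
	fmt.Printf("%+v\n", bootstrapEnvsubstResources())
}
```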
I wouldn't expect that container to use up anything more than those resources. It's intentionally very lightweight (aside from the hang you're seeing, that is). Finally tracked down the option to run VZ + rosetta VMs in the docker distro I use (we develop on macOS) and haven't run into any issues there. I'll try to carve out some time to spin up a cluster on orbstack before the end of the week. Oh, are you using orbstack's builtin Kubernetes distro? I generally run K8s via kind or k3d; that might have an effect as well. |
So I can reliably reproduce this once in orbstack. There's an initial hang on redpanda-0 and the configuration job (which also runs this container). After the initial hang, everything starts to work just fine. I can restart the pod without any additional delay and, oddly, after the delay the container runs fine as well. Stranger still, we actually see logs from the configuration pod (which is a bug actually):
I can't reproduce this pause via docker run, however.
Figured it out 🤦 though I have many, many questions about how and why we're seeing this exact behavior. I misremembered the binary size and was taken aback when double-checking that it was indeed the aarch64 binary causing issues. The binary is ~80MB, much larger than the memory limit. You can reproduce this on orbstack's docker by running:

# Long stall
docker run -m 25MB --entrypoint=/redpanda-operator localhost/redpanda-operator:dev envsubst --help

# Immediately runs
docker run -m 100MB --entrypoint=/redpanda-operator localhost/redpanda-operator:dev envsubst --help

I'll bump the limit to 100MB in the next release and drop something into the backlog to split the binaries apart so the overall footprint is lower but still within a single container. Thank you again for your patience and my apologies for initially dismissing the constrained resources. |
That's interesting: the binary is 80MB and the limit is only 25MB, yet it worked some of the time via pulumi and almost always via helm. I wonder if there are differences in how both those CLIs deploy a helm chart, and whether the limit check is done via polling metrics. If polling, this would explain why sometimes the deployment would work. I assume the container completed between polls? |
Prior to this commit the memory limits of the `bootstrap-yaml-envsubst` container were set to 25Mi. This value was accidentally lower than the total size of the binary itself, 80Mi, which seemingly surfaced as unexplained hangs when _initially_ run on aarch64 (specifically using orbstack). This commit bumps the limit to 125Mi to ensure adequate headroom, which seems to mitigate such hangs. The exact mechanisms at play here are not well known. Fixes #1564
Hi @chrisseto - thanks for the fix! 🙌 Any thoughts on why this only occurs in orbstack? I don't seem to be running into this problem in other k8s runtimes. I'm also still unsure why it's so easily reproducible with pulumi, and less so when using the helm CLI directly. |
I honestly have no idea. We're all pretty stumped on this one. It could very well be a quirk of arm64 vs amd64 or orbstack itself. Unfortunately, I can't really justify the time to dig deeper as we have a workable solution and the check to make sure this doesn't happen again seems to be trivial (bin size <= mem limit). Keep us in the loop if you make any discoveries! |
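A minimal sketch of the "bin size <= mem limit" check mentioned above might look like the following; the binary path and the 125Mi figure are assumptions (the latter taken from the fix commit), not the project's actual CI.

```go
package main

import (
	"fmt"
	"log"
	"os"

	"k8s.io/apimachinery/pkg/api/resource"
)

// Fails if the operator binary is larger than the memory limit given to the
// bootstrap-yaml-envsubst init container, which is the condition behind the
// hang reported in this issue.
func main() {
	binaryPath := "./redpanda-operator"        // assumed path to the built binary
	memoryLimit := resource.MustParse("125Mi") // limit from the fix commit

	info, err := os.Stat(binaryPath)
	if err != nil {
		log.Fatalf("stat %s: %v", binaryPath, err)
	}

	if info.Size() > memoryLimit.Value() {
		log.Fatalf("binary is %d bytes, which exceeds the %s memory limit", info.Size(), memoryLimit.String())
	}
	fmt.Printf("ok: binary %d bytes fits within %s\n", info.Size(), memoryLimit.String())
}
```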
What happened?
Title
What did you expect to happen?
Works
How can we reproduce it (as minimally and precisely as possible)? Please include values file.
Anything else we need to know?
No response
Which are the affected charts?
Redpanda
Chart Version(s)
Cloud provider
JIRA Link: K8S-388