
Deploying a fresh cluster via helm chart is stuck at bootstrap-yaml-envsubst init container #1564

Closed
mattfysh opened this issue Oct 15, 2024 · 17 comments · Fixed by #1570

@mattfysh commented Oct 15, 2024

What happened?

See the title: a fresh deployment via the helm chart gets stuck at the bootstrap-yaml-envsubst init container.

What did you expect to happen?

The deployment completes successfully.

How can we reproduce it (as minimally and precisely as possible)? Please include values file.

$ helm get values <redpanda-release-name> -n <redpanda-release-namespace> --all
affinity: {}
auditLogging:
  clientMaxBufferSize: 16777216
  enabled: false
  enabledEventTypes: null
  excludedPrincipals: null
  excludedTopics: null
  listener: internal
  partitions: 12
  queueDrainIntervalMs: 500
  queueMaxBufferSizePerShard: 1048576
  replicationFactor: null
auth:
  sasl:
    bootstrapUser:
      mechanism: SCRAM-SHA-256
    enabled: true
    mechanism: SCRAM-SHA-512
    secretRef: sasl-b5af2352
    users: []
clusterDomain: cluster.local
commonLabels: {}
config:
  cluster: {}
  node:
    crash_loop_limit: 5
  pandaproxy_client: {}
  rpk: {}
  schema_registry_client: {}
  tunable:
    compacted_log_segment_size: 67108864
    kafka_connection_rate_limit: 1000
    log_segment_size_max: 268435456
    log_segment_size_min: 16777216
    max_compacted_log_segment_size: 536870912
connectors:
  deployment:
    create: false
  enabled: false
  test:
    create: false
console:
  affinity: {}
  annotations: {}
  automountServiceAccountToken: true
  autoscaling:
    enabled: false
    maxReplicas: 100
    minReplicas: 1
    targetCPUUtilizationPercentage: 80
  commonLabels: {}
  config: {}
  configmap:
    create: false
  console:
    config: {}
  deployment:
    create: false
  enabled: true
  enterprise:
    licenseSecretRef:
      key: ""
      name: ""
  extraContainers: []
  extraEnv: []
  extraEnvFrom: []
  extraVolumeMounts: []
  extraVolumes: []
  fullnameOverride: ""
  global: {}
  image:
    pullPolicy: IfNotPresent
    registry: docker.redpanda.com
    repository: redpandadata/console
    tag: ""
  imagePullSecrets: []
  ingress:
    annotations: {}
    className: null
    enabled: false
    hosts:
    - host: chart-example.local
      paths:
      - path: /
        pathType: ImplementationSpecific
    tls: []
  initContainers:
    extraInitContainers: ""
  livenessProbe:
    failureThreshold: 3
    periodSeconds: 10
    successThreshold: 1
    timeoutSeconds: 1
  nameOverride: ""
  nodeSelector: {}
  podAnnotations: {}
  podLabels: {}
  podSecurityContext:
    fsGroup: 99
    runAsUser: 99
  priorityClassName: ""
  readinessProbe:
    failureThreshold: 3
    initialDelaySeconds: 10
    periodSeconds: 10
    successThreshold: 1
    timeoutSeconds: 1
  replicaCount: 1
  resources: {}
  secret:
    create: false
    enterprise: {}
    kafka: {}
    login:
      github: {}
      google: {}
      jwtSecret: ""
      oidc: {}
      okta: {}
    redpanda:
      adminApi: {}
  secretMounts: []
  securityContext:
    runAsNonRoot: true
  service:
    annotations: {}
    port: 8080
    targetPort: null
    type: ClusterIP
  serviceAccount:
    annotations: {}
    automountServiceAccountToken: true
    create: true
    name: ""
  strategy: {}
  tests:
    enabled: true
  tolerations: []
  topologySpreadConstraints: []
enterprise:
  license: ""
  licenseSecretRef: {}
external:
  addresses:
  - localhost
  enabled: true
  service:
    enabled: true
  type: NodePort
fullnameOverride: ""
image:
  pullPolicy: IfNotPresent
  repository: docker.redpanda.com/redpandadata/redpanda
  tag: ""
imagePullSecrets: []
license_key: ""
license_secret_ref: {}
listeners:
  admin:
    external:
      default:
        advertisedPorts:
        - 31644
        enabled: false
        port: 9645
        tls:
          cert: external
    port: 9644
    tls:
      cert: default
      requireClientAuth: false
  http:
    authenticationMethod: null
    enabled: true
    external:
      default:
        advertisedPorts:
        - 30912
        authenticationMethod: null
        port: 8083
        tls:
          cert: external
          requireClientAuth: false
    kafkaEndpoint: default
    port: 8082
    tls:
      cert: default
      requireClientAuth: false
  kafka:
    authenticationMethod: null
    external:
      default:
        advertisedPorts:
        - 30910
        authenticationMethod: null
        port: 9094
        tls:
          cert: external
    port: 9093
    tls:
      cert: default
      requireClientAuth: false
  rpc:
    port: 33145
    tls:
      cert: default
      requireClientAuth: false
  schemaRegistry:
    authenticationMethod: null
    enabled: false
    external:
      default:
        advertisedPorts:
        - 30081
        authenticationMethod: null
        port: 8084
        tls:
          cert: external
          requireClientAuth: false
    kafkaEndpoint: default
    port: 8081
    tls:
      cert: default
      requireClientAuth: false
logging:
  logLevel: info
  usageStats:
    enabled: true
monitoring:
  enabled: false
  labels: {}
  scrapeInterval: 30s
nameOverride: ""
nodeSelector: {}
post_install_job:
  affinity: {}
  enabled: true
  podTemplate:
    annotations: {}
    labels: {}
    spec:
      containers:
      - env: []
        name: post-install
        securityContext: {}
      securityContext: {}
rackAwareness:
  enabled: false
  nodeAnnotation: topology.kubernetes.io/zone
rbac:
  annotations: {}
  enabled: false
resources:
  cpu:
    cores: 1
  memory:
    container:
      max: 2.5Gi
serviceAccount:
  annotations: {}
  create: false
  name: ""
statefulset:
  additionalRedpandaCmdFlags: []
  additionalSelectorLabels: {}
  annotations: {}
  budget:
    maxUnavailable: 1
  extraVolumeMounts: ""
  extraVolumes: ""
  initContainerImage:
    repository: busybox
    tag: latest
  initContainers:
    configurator:
      extraVolumeMounts: ""
      resources: {}
    extraInitContainers: ""
    fsValidator:
      enabled: false
      expectedFS: xfs
      extraVolumeMounts: ""
      resources: {}
    setDataDirOwnership:
      enabled: false
      extraVolumeMounts: ""
      resources: {}
    setTieredStorageCacheDirOwnership:
      extraVolumeMounts: ""
      resources: {}
    tuning:
      extraVolumeMounts: ""
      resources: {}
  livenessProbe:
    failureThreshold: 3
    initialDelaySeconds: 10
    periodSeconds: 10
  nodeSelector: {}
  podAffinity: {}
  podAntiAffinity:
    custom: {}
    topologyKey: kubernetes.io/hostname
    type: hard
    weight: 100
  podTemplate:
    annotations: {}
    labels: {}
    spec:
      containers:
      - env: []
        name: redpanda
        securityContext: {}
      securityContext: {}
  priorityClassName: ""
  readinessProbe:
    failureThreshold: 3
    initialDelaySeconds: 1
    periodSeconds: 10
    successThreshold: 1
  replicas: 1
  securityContext:
    fsGroup: 101
    fsGroupChangePolicy: OnRootMismatch
    runAsUser: 101
  sideCars:
    configWatcher:
      enabled: true
      extraVolumeMounts: ""
      resources: {}
      securityContext: {}
    controllers:
      createRBAC: true
      enabled: false
      healthProbeAddress: :8085
      image:
        repository: docker.redpanda.com/redpandadata/redpanda-operator
        tag: v2.2.4-24.2.5
      metricsAddress: :9082
      resources: {}
      run:
      - all
      securityContext: {}
  startupProbe:
    failureThreshold: 120
    initialDelaySeconds: 1
    periodSeconds: 10
  terminationGracePeriodSeconds: 90
  tolerations: []
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
  updateStrategy:
    type: RollingUpdate
storage:
  hostPath: /dev/kafka
  persistentVolume:
    annotations: {}
    enabled: false
    labels: {}
    nameOverwrite: ""
    size: 20Gi
    storageClass: ""
  tiered:
    config:
      cloud_storage_cache_size: 5368709120
      cloud_storage_enable_remote_read: true
      cloud_storage_enable_remote_write: true
      cloud_storage_enabled: false
    credentialsSecretRef:
      accessKey:
        configurationKey: cloud_storage_access_key
      secretKey:
        configurationKey: cloud_storage_secret_key
    hostPath: ""
    mountType: emptyDir
    persistentVolume:
      annotations: {}
      labels: {}
      storageClass: ""
tests:
  enabled: true
tls:
  certs:
    default:
      caEnabled: true
    external:
      caEnabled: true
  enabled: false
tolerations: []
tuning:
  tune_aio_events: true

Anything else we need to know?

No response

Which are the affected charts?

Redpanda

Chart Version(s)

$ helm -n <redpanda-release-namespace> list 
redpanda-ba4e1924  	test	1       	2024-10-15 13:58:45.893116 +1000 AEST	failed  	redpanda-5.9.7   	v24.2.7 

Cloud provider

OrbStack on a Mac Pro (M2)

JIRA Link: K8S-388

@mattfysh (Author)

Currently working around this by using a pulumi chart transform to remove the bootstrap-yaml-envsubst init container before deployment (see the sketch below).

It strikes me as rather odd that this recently introduced container produces no logs, and that you cannot exec into it to debug; you get an OCI error because bash/zsh/dash/sh are all missing from PATH. This container probably needs to be battle-tested a little more if the plan is to keep it in the helm chart.
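A transform of that kind might look like the following sketch using the pulumi-kubernetes Go SDK. This is illustrative only: the resource name, repo URL, and untyped map traversal are assumptions, and the actual workaround in the repro repo linked further down may differ.

package main

import (
	helmv3 "github.com/pulumi/pulumi-kubernetes/sdk/v4/go/kubernetes/helm/v3"
	"github.com/pulumi/pulumi-kubernetes/sdk/v4/go/kubernetes/yaml"
	"github.com/pulumi/pulumi/sdk/v3/go/pulumi"
)

func main() {
	pulumi.Run(func(ctx *pulumi.Context) error {
		_, err := helmv3.NewChart(ctx, "redpanda", helmv3.ChartArgs{
			Chart: pulumi.String("redpanda"),
			FetchArgs: helmv3.FetchArgs{
				Repo: pulumi.String("https://charts.redpanda.com"),
			},
			Transformations: []yaml.Transformation{
				func(state map[string]interface{}, opts ...pulumi.ResourceOption) {
					if state["kind"] != "StatefulSet" {
						return
					}
					// Walk down to spec.template.spec.initContainers; reads
					// from a nil map are safe in Go, so a single guard suffices.
					spec, _ := state["spec"].(map[string]interface{})
					tmpl, _ := spec["template"].(map[string]interface{})
					podSpec, _ := tmpl["spec"].(map[string]interface{})
					if podSpec == nil {
						return
					}
					ics, _ := podSpec["initContainers"].([]interface{})
					kept := make([]interface{}, 0, len(ics))
					for _, ic := range ics {
						// Drop the hanging init container, keep the rest.
						if m, ok := ic.(map[string]interface{}); ok && m["name"] == "bootstrap-yaml-envsubst" {
							continue
						}
						kept = append(kept, ic)
					}
					podSpec["initContainers"] = kept
				},
			},
		})
		return err
	})
}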

@chrisseto (Contributor)

The container in question uses distroless as a base, so the inability to exec is expected. The lack of logs is expected as well, actually, because the container is only reading and then writing a file (it's effectively just envsubst).

We've not seen any issues with hanging thus far. Could you share any relevant output from kubectl get events and the output of kubectl get pod <any-redpanda-pod> -o yaml? Is your orbstack configured to emulate x86 or is it running arm containers?
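For reference, the requested commands (angle-bracket names are placeholders):

kubectl get events -n <redpanda-release-namespace> --sort-by=.lastTimestamp
kubectl get pod <any-redpanda-pod> -n <redpanda-release-namespace> -o yaml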

@chrisseto (Contributor)

This should only be affecting >= redpanda-5.9.6 (https://github.com/redpanda-data/helm-charts/releases/tag/redpanda-5.9.6).

If possible, would you mind trying out 5.9.5 to see if anything similarly strange happens?
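Pinning the chart version with the Helm CLI would look like this, assuming the chart comes from the public Redpanda chart repo registered as redpanda:

helm repo add redpanda https://charts.redpanda.com && helm repo update
helm upgrade --install <redpanda-release-name> redpanda/redpanda \
  -n <redpanda-release-namespace> --version 5.9.5 -f values.yaml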

@chrisseto (Contributor)

Ah, kubectl describe pod <release-name>-0 would actually be the most helpful. Deploying your values with a few modifications to the secret names seems to start up just fine for me, though it certainly looks like the Pod is stuck in init while kubelet complains about a missing secret or configmap.

@mattfysh (Author) commented Oct 16, 2024

Hi Chris, thanks for the reply. I've been trying all morning to find as much info as I can. Here's what I have so far:

  • I'm using orbstack with arm containers
  • I'm also deploying via pulumi-kubernetes
  • I'm able to reproduce almost always when using pulumi
  • I was only able to reproduce once when using helm directly
  • the bug seems to occur when the helm chart does not produce the configuration job

Here is a log of pod statuses when deploying successfully via the helm CLI. There was only one time when the helm CLI did not produce the configuration job/container, which caused the main redpanda pod to get stuck at "Init:2/3"... however, even in this log below of a successful rollout, isn't it strange that the third init container takes more than 10 seconds? Does this bootstrap-yaml-envsubst container have any external network dependencies, or similar?

I've only been able to successfully roll out twice with pulumi-kubernetes. I'm not sure why the success rate for pulumi is lower; I suspect they may be using an older version of helm internally?

redpanda-console-866dbb889c-x9fcf   0/1     Running    0             1s
redpanda-configuration-4lx44        0/1     Init:0/1   0             1s
redpanda-0                          0/2     Init:1/3   0             1s


redpanda-console-866dbb889c-x9fcf   0/1     Running    0             1s
redpanda-configuration-4lx44        0/1     Init:0/1   0             1s
redpanda-0                          0/2     Init:2/3   0             1s


redpanda-console-866dbb889c-x9fcf   0/1     Running           0             13s
redpanda-configuration-4lx44        0/1     Init:0/1          0             13s
redpanda-0                          0/2     PodInitializing   0             13s


redpanda-console-866dbb889c-x9fcf   0/1     Running           0             14s
redpanda-configuration-4lx44        0/1     PodInitializing   0             14s
redpanda-0                          0/2     PodInitializing   0             14s


redpanda-console-866dbb889c-x9fcf   0/1     Running           0             15s
redpanda-configuration-4lx44        1/1     Running           0             15s
redpanda-0                          0/2     PodInitializing   0             15s


redpanda-console-866dbb889c-x9fcf   0/1     Running     0             16s
redpanda-configuration-4lx44        0/1     Completed   0             16s
redpanda-0                          1/2     Running     0             16s

Here is the values.yaml I'm using:

statefulset:
  replicas: 1
tls:
  enabled: false
listeners:
  kafka:
    external:
      default:
        advertisedPorts:
        - 31093
  admin:
    enabled: false
    external:
      default:
        enabled: false
  schemaRegistry:
    enabled: false
    external:
      default:
        enabled: false
  http:
    enabled: false
    external:
      default:
        enabled: false

I ran the commands you've provided above and did a diff between a successful rollout and a failed one, but there were no significant differences.

Not sure where to go from here. I'm a little confused as to which tool is causing the error, as it seems strange that I've seen both successful and failed rollouts from both the helm CLI and the pulumi CLI.
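For anyone reproducing this, the capture-and-diff workflow described above might look like the following (file names are arbitrary):

kubectl describe pod redpanda-0 > describe-ok.txt     # after a successful rollout
# tear down, redeploy until it hangs, then:
kubectl describe pod redpanda-0 > describe-stuck.txt
diff describe-ok.txt describe-stuck.txt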

@mattfysh (Author)

code available here:

https://github.com/mattfysh/redpanda-deploy-bug

@mattfysh (Author)

I've opened a bug with pulumi-kubernetes: pulumi/pulumi-kubernetes#3265, but I'm still not convinced that's where the root cause lies.

Is it possible to get some logging from the bootstrap-yaml-envsubst container? orbstack claims the container is running... so it must be doing -- something? Could it log every step along the way so we can see where its init process is hanging?

@mattfysh (Author)

The deployment failure is not occurring 100% of the time, which suggests a toolchain issue rather than an issue with the deployment values.

@chrisseto (Contributor)

Thanks for continuing to dig into this!

Here's the source of the envsubst container: https://github.com/redpanda-data/redpanda-operator/blob/main/operator/cmd/envsubst/envsubst.go. Disk IO is really the only thing it's doing that could be hanging.
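For readers following along, the container's work amounts to a read-substitute-write pass along these lines. This is a minimal sketch, not the linked implementation, which handles CLI flags, escaping, and errors differently:

// Sketch of an envsubst-style init step: read a template, expand
// environment references, write the result. Usage: prog <in> <out>.
package main

import (
	"fmt"
	"os"
)

func main() {
	in, out := os.Args[1], os.Args[2]

	data, err := os.ReadFile(in) // disk read
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	expanded := os.ExpandEnv(string(data)) // replace $VAR / ${VAR} from the environment

	if err := os.WriteFile(out, []byte(expanded), 0o644); err != nil { // disk write
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}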

Could you run kubectl describe pod redpanda-0 on one of the failing deployments and share the output? I'd like to see the full container status and more importantly any events associated with the Pod.

@chrisseto (Contributor)

> the bug seems to occur when the helm chart does not produce the configuration job

This is likely a red herring. The configuration job will only run after the StatefulSet becomes Ready which is prevented if the init container hangs.

@mattfysh (Author)

Name:             redpanda-ceda518c-0
Namespace:        default
Priority:         0
Service Account:  default
Node:             orbstack/198.19.249.2
Start Time:       Thu, 17 Oct 2024 03:02:11 +1000
Labels:           app.kubernetes.io/component=redpanda-statefulset
                  app.kubernetes.io/instance=redpanda-ceda518c
                  app.kubernetes.io/managed-by=Helm
                  app.kubernetes.io/name=redpanda
                  apps.kubernetes.io/pod-index=0
                  controller-revision-hash=redpanda-ceda518c-74f6484f4b
                  helm.sh/chart=redpanda-5.9.7
                  redpanda.com/poddisruptionbudget=redpanda-ceda518c
                  statefulset.kubernetes.io/pod-name=redpanda-ceda518c-0
Annotations:      config.redpanda.com/checksum: f05f9a3c004ec98ce95cc606dac824014cc5bd64cb1c44fe8f6968b659c6d979
Status:           Pending
IP:               192.168.194.42
IPs:
  IP:           192.168.194.42
  IP:           fd07:b51a:cc66:a::2a
Controlled By:  StatefulSet/redpanda-ceda518c
Init Containers:
  tuning:
    Container ID:  docker://e1c9a6bbcfcc27f6f363817b5f8fecaa407469578ea8f45f90d8aad6e277d77a
    Image:         docker.redpanda.com/redpandadata/redpanda:v24.2.7
    Image ID:      docker-pullable://docker.redpanda.com/redpandadata/redpanda@sha256:82a69763bef8d8b55ea5a520fa1b38f993908ef68946819ca1aed43541824c48
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
      -c
      rpk redpanda tune all
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 17 Oct 2024 03:02:12 +1000
      Finished:     Thu, 17 Oct 2024 03:02:12 +1000
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /etc/redpanda from base-config (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-dtgbc (ro)
  redpanda-configurator:
    Container ID:  docker://1de50536c1b7eac009558a7bc1fd0c0b61b2bbf0ce6eecf44e01c34e4f0ee612
    Image:         docker.redpanda.com/redpandadata/redpanda:v24.2.7
    Image ID:      docker-pullable://docker.redpanda.com/redpandadata/redpanda@sha256:82a69763bef8d8b55ea5a520fa1b38f993908ef68946819ca1aed43541824c48
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/bash
      -c
      trap "exit 0" TERM; exec $CONFIGURATOR_SCRIPT "${SERVICE_NAME}" "${KUBERNETES_NODE_NAME}" & wait $!
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 17 Oct 2024 03:02:13 +1000
      Finished:     Thu, 17 Oct 2024 03:02:13 +1000
    Ready:          True
    Restart Count:  0
    Environment:
      CONFIGURATOR_SCRIPT:   /etc/secrets/configurator/scripts/configurator.sh
      SERVICE_NAME:          redpanda-ceda518c-0 (v1:metadata.name)
      KUBERNETES_NODE_NAME:   (v1:spec.nodeName)
      HOST_IP_ADDRESS:        (v1:status.hostIP)
    Mounts:
      /etc/redpanda from config (rw)
      /etc/secrets/configurator/scripts/ from redpanda-ceda518c-configurator (rw)
      /tmp/base-config from base-config (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-dtgbc (ro)
  bootstrap-yaml-envsubst:
    Container ID:  docker://37eb987a144d998148954cb49183953b5c5f1d70197736abeb19f3dd71eba8e4
    Image:         docker.redpanda.com/redpandadata/redpanda-operator:v2.2.4-24.2.5
    Image ID:      docker-pullable://docker.redpanda.com/redpandadata/redpanda-operator@sha256:17979d5443f420a1791edb067149d841bb8251c534e1c289a8fbc11392a7aca2
    Port:          <none>
    Host Port:     <none>
    Command:
      /redpanda-operator
      envsubst
      /tmp/base-config/bootstrap.yaml
      --output
      /tmp/config/.bootstrap.yaml
    State:          Running
      Started:      Thu, 17 Oct 2024 03:02:14 +1000
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     100m
      memory:  25Mi
    Requests:
      cpu:        100m
      memory:     25Mi
    Environment:  <none>
    Mounts:
      /tmp/base-config/ from base-config (rw)
      /tmp/config/ from config (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-dtgbc (ro)
Containers:
  redpanda:
    Container ID:  
    Image:         docker.redpanda.com/redpandadata/redpanda:v24.2.7
    Image ID:      
    Ports:         9644/TCP, 8082/TCP, 9093/TCP, 9094/TCP, 33145/TCP, 8081/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP
    Command:
      rpk
      redpanda
      start
      --advertise-rpc-addr=$(SERVICE_NAME).redpanda-ceda518c.default.svc.cluster.local.:33145
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  2560Mi
    Requests:
      cpu:      1
      memory:   2560Mi
    Liveness:   exec [/bin/sh -c curl --silent --fail -k -m 5  "http://${SERVICE_NAME}.redpanda-ceda518c.default.svc.cluster.local.:9644/v1/status/ready"] delay=10s timeout=1s period=10s #success=1 #failure=3
    Readiness:  exec [/bin/sh -c set -x
RESULT=$(rpk cluster health)
echo $RESULT
echo $RESULT | grep 'Healthy:.*true'
] delay=1s timeout=1s period=10s #success=1 #failure=3
    Startup:  exec [/bin/sh -c set -e
RESULT=$(curl --silent --fail -k -m 5  "http://${SERVICE_NAME}.redpanda-ceda518c.default.svc.cluster.local.:9644/v1/status/ready")
echo $RESULT
echo $RESULT | grep ready
] delay=1s timeout=1s period=10s #success=1 #failure=120
    Environment:
      SERVICE_NAME:  redpanda-ceda518c-0 (v1:metadata.name)
      POD_IP:         (v1:status.podIP)
      HOST_IP:        (v1:status.hostIP)
    Mounts:
      /etc/redpanda from config (rw)
      /tmp/base-config from base-config (rw)
      /var/lib/redpanda/data from datadir (rw)
      /var/lifecycle from lifecycle-scripts (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-dtgbc (ro)
  config-watcher:
    Container ID:  
    Image:         docker.redpanda.com/redpandadata/redpanda:v24.2.7
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
    Args:
      -c
      trap "exit 0" TERM; exec /etc/secrets/config-watcher/scripts/sasl-user.sh & wait $!
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /etc/redpanda from config (rw)
      /etc/secrets/config-watcher/scripts from redpanda-ceda518c-config-watcher (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-dtgbc (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 False 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  datadir:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  datadir-redpanda-ceda518c-0
    ReadOnly:   false
  lifecycle-scripts:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  redpanda-ceda518c-sts-lifecycle
    Optional:    false
  base-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      redpanda-ceda518c
    Optional:  false
  config:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  redpanda-ceda518c-configurator:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  redpanda-ceda518c-configurator
    Optional:    false
  redpanda-ceda518c-config-watcher:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  redpanda-ceda518c-config-watcher
    Optional:    false
  kube-api-access-dtgbc:
    Type:                     Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:   3607
    ConfigMapName:            kube-root-ca.crt
    ConfigMapOptional:        <nil>
    DownwardAPI:              true
QoS Class:                    Burstable
Node-Selectors:               <none>
Tolerations:                  node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                              node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Topology Spread Constraints:  topology.kubernetes.io/zone:ScheduleAnyway when max skew 1 is exceeded for selector app.kubernetes.io/component=redpanda-statefulset,app.kubernetes.io/instance=redpanda-ceda518c,app.kubernetes.io/name=redpanda
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  71s   default-scheduler  Successfully assigned default/redpanda-ceda518c-0 to orbstack
  Normal  Pulled     71s   kubelet            Container image "docker.redpanda.com/redpandadata/redpanda:v24.2.7" already present on machine
  Normal  Created    71s   kubelet            Created container tuning
  Normal  Started    71s   kubelet            Started container tuning
  Normal  Pulled     70s   kubelet            Container image "docker.redpanda.com/redpandadata/redpanda:v24.2.7" already present on machine
  Normal  Created    70s   kubelet            Created container redpanda-configurator
  Normal  Started    70s   kubelet            Started container redpanda-configurator
  Normal  Pulled     69s   kubelet            Container image "docker.redpanda.com/redpandadata/redpanda-operator:v2.2.4-24.2.5" already present on machine
  Normal  Created    69s   kubelet            Created container bootstrap-yaml-envsubst
  Normal  Started    69s   kubelet            Started container bootstrap-yaml-envsubst

@mattfysh (Author)

I just noticed these very small limits on the stuck init container:

    Limits:
      cpu:     100m
      memory:  25Mi
    Requests:
      cpu:        100m
      memory:     25Mi

coming from

Resources: corev1.ResourceRequirements{
	Limits: corev1.ResourceList{
		corev1.ResourceCPU:    resource.MustParse("100m"),
		corev1.ResourceMemory: resource.MustParse("25Mi"),
	},
	Requests: corev1.ResourceList{
		corev1.ResourceCPU:    resource.MustParse("100m"),
		corev1.ResourceMemory: resource.MustParse("25Mi"),
	},
},

Given the flaky behaviour, I'm leaning towards this being an issue where the limit is reached - and perhaps an Orbstack issue where it does not provide any event or failure handling?
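You can confirm the rendered limits straight from the live object; the pod name here matches the describe output above, and the jsonpath filter selects the init container by name:

kubectl get pod redpanda-ceda518c-0 -o jsonpath='{.spec.initContainers[?(@.name=="bootstrap-yaml-envsubst")].resources}'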

@chrisseto (Contributor)

I wouldn't expect that container to use anything more than those resources. It's intentionally very lightweight (aside from the hang you're seeing, that is).

I finally tracked down the option to run VZ + Rosetta VMs in the docker distro I use (we develop on macOS) and haven't run into any issues there.

I'll try to carve out some time to spin up a cluster on orbstack before the end of the week.

Oh, are you using orbstack's builtin Kubernetes distro? I generally run K8s via kind or k3d; that might have an effect as well.

@chrisseto (Contributor)

So I can reliably reproduce this in orbstack, but only once: there's an initial hang on redpanda-0 and the configuration job (which also runs this container). After the initial hang, everything starts to work just fine. I can restart the pod without any additional delay and, oddly, after the delay the container runs fine as well.

Stranger still, we actually see logs from the configuration pod (which is itself a bug):

bootstrap-yaml-envsubst [controller-runtime] log.SetLogger(...) was never called; logs will not be displayed.
bootstrap-yaml-envsubst Detected at:
bootstrap-yaml-envsubst     >  goroutine 1 [running, locked to thread]:
bootstrap-yaml-envsubst     >  runtime/debug.Stack()
bootstrap-yaml-envsubst     >      /nix/store/wblrs8j0qvx046nqmlqqpca9lvdb9pr4-go-1.22.7/share/go/src/runtime/debug/stack.go:24 +0x64
bootstrap-yaml-envsubst     >  sigs.k8s.io/controller-runtime/pkg/log.eventuallyFulfillRoot()
bootstrap-yaml-envsubst     >      /root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.5/pkg/log/log.go:60 +0xf4
bootstrap-yaml-envsubst     >  sigs.k8s.io/controller-runtime/pkg/log.(*delegatingLogSink).WithName(0x400016ad80, {0x2e66107, 0x6})
bootstrap-yaml-envsubst     >      /root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.5/pkg/log/deleg.go:147 +0x34
bootstrap-yaml-envsubst     >  github.com/go-logr/logr.Logger.WithName(...)
bootstrap-yaml-envsubst     >      /root/go/pkg/mod/github.com/go-logr/logr@v1.4.2/logr.go:345
bootstrap-yaml-envsubst     >  sigs.k8s.io/controller-runtime/pkg/client/config.init()
bootstrap-yaml-envsubst     >      /root/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.18.5/pkg/client/config/config.go:37 +0x54

I can't reproduce this pause via docker run either, but you can observe the same stall if you docker exec into a stalled container and run /redpanda-operator:

❯ docker ps
CONTAINER ID   IMAGE                        COMMAND                  CREATED          STATUS          PORTS     NAMES
02d7d42fbfe1   ab6ea2bd32de                 "/redpanda-operator …"   57 seconds ago   Up 56 seconds             k8s_bootstrap-yaml-envsubst_redpanda-1729202509-0_redpanda_8dba84a9-b570-4e5c-ac65-6fa379f70c4a_0
9e7e3ee7595d   rancher/mirrored-pause:3.6   "/pause"                 58 seconds ago   Up 58 seconds             k8s_POD_redpanda-1729202509-0_redpanda_8dba84a9-b570-4e5c-ac65-6fa379f70c4a_0
d53e7c3aaeb9   5b2b6304b8e5                 "./console --config.…"   4 minutes ago    Up 4 minutes              k8s_console_redpanda-1729202509-console-5f54f45844-66n4f_redpanda_e6fea4d1-286e-4f26-9058-124806eda15a_6
c3c009eb42da   rancher/mirrored-pause:3.6   "/pause"                 9 minutes ago    Up 9 minutes              k8s_POD_redpanda-1729202509-console-5f54f45844-66n4f_redpanda_e6fea4d1-286e-4f26-9058-124806eda15a_0
a217263b6209   9c783c3f8f50                 "/app/cmd/webhook/we…"   10 minutes ago   Up 10 minutes             k8s_cert-manager-webhook_cert-manager-webhook-bd76f6cf9-dgl7x_cert-manager_d350e298-e445-4bb2-8202-26df705939a7_0
31ce1d21d29d   fbc85505a489                 "/app/cmd/controller…"   10 minutes ago   Up 10 minutes             k8s_cert-manager-controller_cert-manager-86844d9d7-wxsmk_cert-manager_f49c9377-3451-4786-9d36-5e19a8829c95_0
bf602e4abb9d   f23e287a379b                 "/app/cmd/cainjector…"   10 minutes ago   Up 10 minutes             k8s_cert-manager-cainjector_cert-manager-cainjector-6d5f558c69-vcgdg_cert-manager_99dc27da-5b8b-4a2f-9fcd-f7af783da76e_0
bda89308d5d1   rancher/mirrored-pause:3.6   "/pause"                 10 minutes ago   Up 10 minutes             k8s_POD_cert-manager-webhook-bd76f6cf9-dgl7x_cert-manager_d350e298-e445-4bb2-8202-26df705939a7_0
fd72617ae1e3   rancher/mirrored-pause:3.6   "/pause"                 10 minutes ago   Up 10 minutes             k8s_POD_cert-manager-86844d9d7-wxsmk_cert-manager_f49c9377-3451-4786-9d36-5e19a8829c95_0
7dc44a7b12b5   rancher/mirrored-pause:3.6   "/pause"                 10 minutes ago   Up 10 minutes             k8s_POD_cert-manager-cainjector-6d5f558c69-vcgdg_cert-manager_99dc27da-5b8b-4a2f-9fcd-f7af783da76e_0
548d69da55fe   a7f913520a4a                 "local-path-provisio…"   10 minutes ago   Up 10 minutes             k8s_local-path-provisioner_local-path-provisioner-6c86858495-hgz4z_kube-system_5ab8c498-5ab6-42c0-b60c-0459cf1bb9d6_0
6c2d235fe64c   97e04611ad43                 "/coredns -conf /etc…"   10 minutes ago   Up 10 minutes             k8s_coredns_coredns-66cc6945cb-4m7nh_kube-system_87e789d8-4bfb-48f7-8e87-48de4a34cde4_0
4a6a77d2c33a   rancher/mirrored-pause:3.6   "/pause"                 10 minutes ago   Up 10 minutes             k8s_POD_local-path-provisioner-6c86858495-hgz4z_kube-system_5ab8c498-5ab6-42c0-b60c-0459cf1bb9d6_0
3f37ab5c392d   rancher/mirrored-pause:3.6   "/pause"                 10 minutes ago   Up 10 minutes             k8s_POD_coredns-66cc6945cb-4m7nh_kube-system_87e789d8-4bfb-48f7-8e87-48de4a34cde4_0
❯ docker exec -it 02d7d42fbfe1 /redpanda-operator

Figured it out 🤦 though I have many, many questions about how and why we're seeing this exact behavior.

I misremembered the binary size and was taken aback when double-checking that it was indeed the aarch64 binary causing issues. The binary is ~80MB, much larger than the memory limit.

You can reproduce this on orbstack's docker by running:

# Long stall
docker run -m 25MB --entrypoint=/redpanda-operator localhost/redpanda-operator:dev envsubst --help
# Immediately runs
docker run -m 100MB --entrypoint=/redpanda-operator localhost/redpanda-operator:dev envsubst --help

I'll bump the limit to 100MB in the next release and drop something into the backlog to split the binaries apart so the overall footprint is lower but still within a single container. Thank you again for your patience, and my apologies for initially dismissing the constrained resources.

@mattfysh (Author)

That's interesting: the binary is 80MB and the limit is only 25MB, yet it worked some of the time via pulumi and almost always via helm. I wonder if there are differences in how those two CLIs deploy a helm chart, and whether the limit check is done by polling metrics. If it is polling, that would explain why the deployment sometimes worked - I assume the container completed between polls?

chrisseto added a commit that referenced this issue Oct 18, 2024
Prior to this commit the memory limits of the `bootstrap-yaml-envsubst`
container were set to 25Mi. This value was accidentally lower than the total
size of the binary itself, 80Mi, which seemingly surfaced as unexplained hangs
when _initially_ run on aarch64 (specifically using orbstack).

This commit bumps the limit to 125Mi to ensure adequate headroom, which seems to
mitigate such hangs. The exact mechanisms at play here are not well known.

Fixes #1564
@mattfysh (Author)

Hi @chrisseto - thanks for the fix! 🙌 Any thoughts on why this only occurs in orbstack? I don't seem to be running into this problem in other k8s runtimes. I'm also still unsure why it's so easily reproducible with pulumi, and less so when using the helm CLI directly.

@chrisseto (Contributor)

I honestly have no idea. We're all pretty stumped on this one. It could very well be a quirk of arm64 vs amd64, or of orbstack itself. Unfortunately, I can't really justify the time to dig deeper, as we have a workable solution and the check to make sure this doesn't happen again seems trivial (bin size <= mem limit; see the sketch below).
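That guard could be a one-line CI step. A sketch, assuming GNU stat and the 125Mi limit from the fix:

# Fail the build if the operator binary outgrows the init container's memory limit.
BIN_SIZE=$(stat -c %s redpanda-operator)   # GNU stat; use `stat -f %z` on macOS/BSD
LIMIT=$((125 * 1024 * 1024))               # 125Mi, matching the bumped limit
if [ "$BIN_SIZE" -gt "$LIMIT" ]; then
  echo "redpanda-operator binary (${BIN_SIZE} bytes) exceeds ${LIMIT} byte memory limit" >&2
  exit 1
fi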

Keep us in the loop if you make any discoveries!
