
Ingress-nginx Pod Prematurely Marked Ready, Causing HTTP 404 Errors from default backend routing #12206

Open
Izzette opened this issue Oct 16, 2024 · 11 comments
Labels: needs-kind, needs-priority, needs-triage

Comments

@Izzette

Izzette commented Oct 16, 2024

What happened:

During the ingress-nginx pod boot-up sequence on Kubernetes, our clients receive HTTP 404 responses from nginx itself for HTTP paths that are declared in some of our Ingresses. This only happens while a pod is booting up, not when a hot-reload sequence is initiated.

Although the pod is marked as Ready in Kubernetes, we suspect that the nginx configuration is not yet fully loaded and some requests are forwarded to the upstream-default-backend upstream (see the screenshots below and the pod logs in CSV).

For reference, we define a large number of Ingresses in our cluster with many different paths; the resulting nginx configuration is heavy to load, at approximately 67 MB.

[Graph] Requests served by the default backend by each pod just after it starts up

[Graph] Count of pods in “ready” state

You can see in the above two graphs that after 3 out of the 4 pods in the ingress-nginx-external-controller-7c8576cd ReplicaSet become Ready (the ingress-nginx-controller /healthz endpoint returns 200), several thousand requests are served by the default backend over the course of ~30s. This occurs even after the 10s initial delay for the readiness and liveness probes has passed.
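
To make this window easier to observe, here is a minimal Go probe sketch (illustrative only, not part of our tooling; the pod IP and hostname below are placeholders) that polls a controller pod's /healthz on the metrics port while sending requests with a Host header declared by one of the Ingresses, and prints whenever the pod reports healthy but the request is still answered with a 404 by the default backend.

package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	podIP := "10.88.91.14"      // placeholder: IP of one ingress-nginx-controller pod
	host := "hello.example.com" // placeholder: a host declared by one of the Ingresses

	client := &http.Client{Timeout: time.Second}
	for {
		// Same endpoint the readiness/liveness probes use (port 10254, /healthz).
		hc, err := client.Get("http://" + podIP + ":10254/healthz")
		healthy := err == nil && hc.StatusCode == http.StatusOK
		if hc != nil {
			hc.Body.Close()
		}

		// A request that should be routed by an Ingress, not the default backend.
		req, _ := http.NewRequest(http.MethodGet, "http://"+podIP+"/", nil)
		req.Host = host
		resp, err := client.Do(req)
		if err == nil {
			if healthy && resp.StatusCode == http.StatusNotFound {
				fmt.Printf("%s pod reports healthy but served 404 for %s\n",
					time.Now().Format(time.RFC3339), host)
			}
			resp.Body.Close()
		}

		time.Sleep(100 * time.Millisecond)
	}
}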

ingress-nginx-controller pod logs after startup. Notice the multiple reloads and the change of backend after the last reload.
Date,Pod Name,Message
2024-10-15T13:37:44.477Z,"""ingress-nginx-external-controller-56bbcdd967-qg7p7""","I1015 13:37:44.477208 8 main.go:205] ""Creating API client"" host=""https://100.76.0.1:443"""
2024-10-15T13:37:44.483Z,"""ingress-nginx-external-controller-56bbcdd967-qg7p7""","I1015 13:37:44.483986 8 main.go:248] ""Running in Kubernetes cluster"" major=""1"" minor=""30"" git=""v1.30.5-gke.1014001"" state=""clean"" commit=""c9d757f7eeb6b159f3a64f6cb3bf7007d65c1f19"" platform=""linux/amd64"""
2024-10-15T13:37:44.570Z,"""ingress-nginx-external-controller-56bbcdd967-qg7p7""","I1015 13:37:44.570002 8 main.go:101] ""SSL fake certificate created"" file=""/etc/ingress-controller/ssl/default-fake-certificate.pem"""
2024-10-15T13:37:44.627Z,"""ingress-nginx-external-controller-56bbcdd967-qg7p7""","I1015 13:37:44.627215 8 nginx.go:271] ""Starting NGINX Ingress controller"""
2024-10-15T13:37:44.630Z,"""ingress-nginx-external-controller-56bbcdd967-qg7p7""","I1015 13:37:44.630735 8 store.go:535] ""ignoring ingressclass as the spec.controller is not the same of this ingress"" ingressclass=""nginx-internal"""
2024-10-15T13:37:44.845Z,"""ingress-nginx-external-controller-56bbcdd967-qg7p7""","I1015 13:37:44.845077 8 event.go:377] Event(v1.ObjectReference{Kind:""ConfigMap"", Namespace:""network"", Name:""ingress-nginx-external-controller"", UID:""8953e6b2-9833-4fb7-8339-a362a015f525"", APIVersion:""v1"", ResourceVersion:""1073189370"", FieldPath:""""}): type: 'Normal' reason: 'CREATE' ConfigMap network/ingress-nginx-external-controller"
2024-10-15T13:37:46.028Z,"""ingress-nginx-external-controller-56bbcdd967-qg7p7""","I1015 13:37:46.028658 8 nginx.go:317] ""Starting NGINX process"""
2024-10-15T13:37:46.029Z,"""ingress-nginx-external-controller-56bbcdd967-qg7p7""",I1015 13:37:46.029013 8 leaderelection.go:254] attempting to acquire leader lease network/ingress-nginx-external-leader...
2024-10-15T13:37:46.038Z,"""ingress-nginx-external-controller-56bbcdd967-qg7p7""","I1015 13:37:46.038088 8 status.go:85] ""New leader elected"" identity=""ingress-nginx-external-controller-865c5d89b5-ksxl5"""
2024-10-15T13:37:46.835Z,"""ingress-nginx-external-controller-56bbcdd967-qg7p7""","I1015 13:37:46.835434 8 controller.go:213] ""Backend successfully reloaded"""
2024-10-15T13:37:46.835Z,"""ingress-nginx-external-controller-56bbcdd967-qg7p7""","I1015 13:37:46.835571 8 controller.go:224] ""Initial sync, sleeping for 1 second"""
2024-10-15T13:37:46.835Z,"""ingress-nginx-external-controller-56bbcdd967-qg7p7""","I1015 13:37:46.835760 8 event.go:377] Event(v1.ObjectReference{Kind:""Pod"", Namespace:""network"", Name:""ingress-nginx-external-controller-56bbcdd967-qg7p7"", UID:""fbe312af-999f-48ab-952a-8dc437a3d4bc"", APIVersion:""v1"", ResourceVersion:""1079569510"", FieldPath:""""}): type: 'Normal' reason: 'RELOAD' NGINX reload triggered due to a change in configuration"
2024-10-15T13:37:49.728Z,"""ingress-nginx-external-controller-56bbcdd967-qg7p7""","I1015 13:37:49.728678 8 controller.go:193] ""Configuration changes detected, backend reload required"""
2024-10-15T13:37:51Z,"""ingress-nginx-external-controller-56bbcdd967-qg7p7""","{""Attributes"":{""service"":{""name"":""nginx-ingress-controller""},""http"":{""status_code"":404,""method"":""GET"",""proxyUpstreamName"":""upstream-default-backend""}"
2024-10-15T13:37:58.178Z,"""ingress-nginx-external-controller-56bbcdd967-qg7p7""","I1015 13:37:58.178206 8 controller.go:213] ""Backend successfully reloaded"""
2024-10-15T13:37:58.178Z,"""ingress-nginx-external-controller-56bbcdd967-qg7p7""","I1015 13:37:58.178586 8 event.go:377] Event(v1.ObjectReference{Kind:""Pod"", Namespace:""network"", Name:""ingress-nginx-external-controller-56bbcdd967-qg7p7"", UID:""fbe312af-999f-48ab-952a-8dc437a3d4bc"", APIVersion:""v1"", ResourceVersion:""1079569510"", FieldPath:""""}): type: 'Normal' reason: 'RELOAD' NGINX reload triggered due to a change in configuration"
2024-10-15T13:37:59Z,"""ingress-nginx-external-controller-56bbcdd967-qg7p7""","{""Attributes"":{""service"":{""name"":""nginx-ingress-controller""},""http"":{""status_code"":404,""method"":""GET"",""proxyUpstreamName"":""upstream-default-backend""}"
2024-10-15T13:37:59Z,"""ingress-nginx-external-controller-56bbcdd967-qg7p7""","{""Attributes"":{""service"":{""name"":""nginx-ingress-controller""},""http"":{""status_code"":404,""method"":""GET"",""proxyUpstreamName"":""upstream-default-backend""}"
2024-10-15T13:37:59Z,"""ingress-nginx-external-controller-56bbcdd967-qg7p7""","{""Attributes"":{""service"":{""name"":""nginx-ingress-controller""},""http"":{""status_code"":404,""method"":""GET"",""proxyUpstreamName"":""upstream-default-backend""}"
2024-10-15T13:37:59Z,"""ingress-nginx-external-controller-56bbcdd967-qg7p7""","{""Attributes"":{""service"":{""name"":""nginx-ingress-controller""},""http"":{""status_code"":404,""method"":""GET"",""proxyUpstreamName"":""upstream-default-backend""}"
2024-10-15T13:37:59Z,"""ingress-nginx-external-controller-56bbcdd967-qg7p7""","{""Attributes"":{""service"":{""name"":""nginx-ingress-controller""},""http"":{""status_code"":404,""method"":""GET"",""proxyUpstreamName"":""upstream-default-backend""}"
2024-10-15T13:37:59Z,"""ingress-nginx-external-controller-56bbcdd967-qg7p7""","{""Attributes"":{""service"":{""name"":""nginx-ingress-controller""},""http"":{""status_code"":404,""method"":""GET"",""proxyUpstreamName"":""upstream-default-backend""}"
2024-10-15T13:37:59Z,"""ingress-nginx-external-controller-56bbcdd967-qg7p7""","{""Attributes"":{""service"":{""name"":""nginx-ingress-controller""},""http"":{""status_code"":404,""method"":""GET"",""proxyUpstreamName"":""upstream-default-backend""}"
2024-10-15T13:37:59Z,"""ingress-nginx-external-controller-56bbcdd967-qg7p7""","{""Attributes"":{""service"":{""name"":""nginx-ingress-controller""},""http"":{""status_code"":404,""method"":""GET"",""proxyUpstreamName"":""upstream-default-backend""}"
2024-10-15T13:37:59Z,"""ingress-nginx-external-controller-56bbcdd967-qg7p7""","{""Attributes"":{""service"":{""name"":""nginx-ingress-controller""},""http"":{""status_code"":404,""method"":""GET"",""proxyUpstreamName"":""upstream-default-backend""}"
2024-10-15T13:38:59Z,"""ingress-nginx-external-controller-56bbcdd967-qg7p7""","{""Attributes"":{""service"":{""name"":""nginx-ingress-controller""},""http"":{""status_code"":404,""method"":""GET"",""proxyUpstreamName"":""upstream-default-backend""}"
2024-10-15T13:38:59Z,"""ingress-nginx-external-controller-56bbcdd967-qg7p7""","{""Attributes"":{""service"":{""name"":""nginx-ingress-controller""},""http"":{""status_code"":404,""method"":""GET"",""proxyUpstreamName"":""upstream-default-backend""}"
2024-10-15T13:38:59Z,"""ingress-nginx-external-controller-56bbcdd967-qg7p7""","{""Attributes"":{""service"":{""name"":""nginx-ingress-controller""},""http"":{""status_code"":404,""method"":""GET"",""proxyUpstreamName"":""upstream-default-backend""}"
2024-10-15T13:38:59Z,"""ingress-nginx-external-controller-56bbcdd967-qg7p7""","{""Attributes"":{""service"":{""name"":""nginx-ingress-controller""},""http"":{""status_code"":404,""method"":""GET"",""proxyUpstreamName"":""upstream-default-backend""}"
2024-10-15T13:38:59Z,"""ingress-nginx-external-controller-56bbcdd967-qg7p7""","{""Attributes"":{""service"":{""name"":""nginx-ingress-controller""},""http"":{""status_code"":404,""method"":""GET"",""proxyUpstreamName"":""upstream-default-backend""}"
2024-10-15T13:38:25.571Z,"""ingress-nginx-external-controller-56bbcdd967-qg7p7""","I1015 13:38:25.571800 8 status.go:85] ""New leader elected"" identity=""ingress-nginx-external-controller-865c5d89b5-cfhfc"""
2024-10-15T13:39:20.376Z,"""ingress-nginx-external-controller-56bbcdd967-qg7p7""",-------------------------------------------------------------------------------
2024-10-15T13:39:20.376Z,"""ingress-nginx-external-controller-56bbcdd967-qg7p7""",NGINX Ingress controller
2024-10-15T13:39:20.376Z,"""ingress-nginx-external-controller-56bbcdd967-qg7p7""",Release: v1.11.3
2024-10-15T13:39:20.376Z,"""ingress-nginx-external-controller-56bbcdd967-qg7p7""",Build: 0106de65cfccb74405a6dfa7d9daffc6f0a6ef1a
2024-10-15T13:39:20.376Z,"""ingress-nginx-external-controller-56bbcdd967-qg7p7""",Repository: https://github.com/kubernetes/ingress-nginx
2024-10-15T13:39:20.376Z,"""ingress-nginx-external-controller-56bbcdd967-qg7p7""",nginx version: nginx/1.25.5
2024-10-15T13:39:20.376Z,"""ingress-nginx-external-controller-56bbcdd967-qg7p7""",-------------------------------------------------------------------------------
2024-10-15T13:39:44.157Z,"""ingress-nginx-external-controller-56bbcdd967-qg7p7""",I1015 13:39:44.157221 8 leaderelection.go:268] successfully acquired lease network/ingress-nginx-external-leader
2024-10-15T13:39:44.157Z,"""ingress-nginx-external-controller-56bbcdd967-qg7p7""","I1015 13:39:44.157521 8 status.go:85] ""New leader elected"" identity=""ingress-nginx-external-controller-56bbcdd967-qg7p7"""
2024-10-15T13:59:07.025Z,"""ingress-nginx-external-controller-56bbcdd967-qg7p7""","I1015 13:59:07.025134 8 controller.go:193] ""Configuration changes detected, backend reload required"""
2024-10-15T13:59:15.602Z,"""ingress-nginx-external-controller-56bbcdd967-qg7p7""","I1015 13:59:15.602900 8 controller.go:213] ""Backend successfully reloaded"""
2024-10-15T13:59:15.603Z,"""ingress-nginx-external-controller-56bbcdd967-qg7p7""","I1015 13:59:15.603247 8 event.go:377] Event(v1.ObjectReference{Kind:""Pod"", Namespace:""network"", Name:""ingress-nginx-external-controller-56bbcdd967-qg7p7"", UID:""fbe312af-999f-48ab-952a-8dc437a3d4bc"", APIVersion:""v1"", ResourceVersion:""1079569510"", FieldPath:""""}): type: 'Normal' reason: 'RELOAD' NGINX reload triggered due to a change in configuration"
2024-10-15T13:59:19Z,"""ingress-nginx-external-controller-56bbcdd967-qg7p7""","{""Attributes"":{""service"":{""name"":""nginx-ingress-controller""},""http"":{""status_code"":200,""method"":""GET"",""proxyUpstreamName"":""api-gateway""}"
2024-10-15T13:59:19Z,"""ingress-nginx-external-controller-56bbcdd967-qg7p7""","{""Attributes"":{""service"":{""name"":""nginx-ingress-controller""},""http"":{""status_code"":200,""method"":""GET"",""proxyUpstreamName"":""api-gateway""}"
2024-10-15T13:59:19Z,"""ingress-nginx-external-controller-56bbcdd967-qg7p7""","{""Attributes"":{""service"":{""name"":""nginx-ingress-controller""},""http"":{""status_code"":200,""method"":""GET"",""proxyUpstreamName"":""api-gateway""}"

What you expected to happen:

The ingress-nginx pod should not be marked as Ready while it is still loading its configuration, and we should not receive HTTP 404 responses from nginx itself.

NGINX Ingress controller version

-------------------------------------------------------------------------------
NGINX Ingress controller
  Release:       v1.11.3
  Build:         0106de65cfccb74405a6dfa7d9daffc6f0a6ef1a
  Repository:    https://github.com/kubernetes/ingress-nginx
  nginx version: nginx/1.25.5

-------------------------------------------------------------------------------

Kubernetes version (use kubectl version): Server Version: v1.30.5-gke.1014001

Environment:

  • Cloud provider or hardware configuration: Google Cloud Platform / GKE / GCE

  • OS (e.g. from /etc/os-release): https://cloud.google.com/container-optimized-os/docs/release-notes/m113#cos-113-18244-151-14_

  • Kernel (e.g. uname -a): https://cos.googlesource.com/third_party/kernel/+/f2b7676b27982b8ce21e62319fceb9a0fd4131c5

  • Install tools: GKE

  • Basic cluster related info:

    • kubectl version : Server Version: v1.30.5-gke.1014001
  • How was the ingress-nginx-controller installed:

    • ingress-nginx-controller is installed with ArgoCD using helm templating.
      controller:
        admissionWebhooks:
          enabled: false
          timeoutSeconds: 15
        # required because Ingresses leverage custom annotation
        allowSnippetAnnotations: true
        autoscaling:
          enabled: true
          maxReplicas: 12
          minReplicas: 3
          targetCPUUtilizationPercentage: 80
          targetMemoryUtilizationPercentage: null
        config:
          brotli-level: "6"
          brotli-min-length: "50"
          brotli-types: '*'
          client-body-buffer-size: 10m # REDACTED
          enable-brotli: "true"
          enable-opentelemetry: "true"
          gzip-level: "6"
          gzip-types: '*'
          # nginx status page is required for the datadog nginx-ingress integration
          http-snippet: |
            server {
              listen 18080;
      
              location /nginx_status {
                allow 10.0.0.0/8;
                stub_status on;
                access_log off;
              }
      
              location / {
                return 404;
              }
            }
          limit-req-status-code: "429"
          # opentelemetry configuration
          location-snippet: |
            opentelemetry_attribute "resource.name" "$uri";
          log-format-escaping: json
          opentelemetry-operation-name: HTTP $request_method $service_name $uri
          otel-sampler: TraceIdRatioBased
          otel-sampler-ratio: "0.05"
          otel-service-name: nginx-ingress-controller
          otlp-collector-host: datadog-agent.monitoring.svc.cluster.local
          proxy-buffer-size: 128k
          # wait 50s before closing the connection. GCP LBs are configured with a 60s timeout
          proxy-read-timeout: "50"
          service-upstream: "true"
          use-forwarded-headers: "true"
          use-gzip: "true"
          worker-processes: "4"
          worker-shutdown-timeout: 65s
        # Configures the controller container name
        # note: this is required for DataDog agent annotation parsing because the default container name is `controller`
        #       which is not allowed by Datadog
        containerName: ingress-nginx-controller
        extraEnvs:
          - name: HOST_IP # Used by datadog-tracer to send trace to datadog-agent
            valueFrom:
              fieldRef:
                fieldPath: status.hostIP
          - name: DD_SERVICE # Used by the tracer to add context to traces
            valueFrom:
              fieldRef:
                fieldPath: metadata.labels['tags.datadoghq.com/service']
          - name: DD_VERSION # Used by the tracer to add context to traces
            valueFrom:
              fieldRef:
                fieldPath: metadata.labels['tags.datadoghq.com/version']
        labels:
          REDACTED
        metrics:
          enabled: true
        nodeSelector:
          cloud.google.com/compute-class: core
          # can't have compute-class + selector on os
          kubernetes.io/os: null
        podAnnotations:
          ad.datadoghq.com/ingress-nginx-controller.check_names: '["nginx","nginx_ingress_controller"]'
          ad.datadoghq.com/ingress-nginx-controller.init_configs: '[{},{}]'
          ad.datadoghq.com/ingress-nginx-controller.instances: '[{"nginx_status_url": "http://%%host%%:18080/nginx_status"},{"prometheus_url": "http://%%host%%:10254/metrics", "extra_metrics": ["^nginx_ingress_controller_admission.*"]}]'
          ad.datadoghq.com/ingress-nginx-controller.logs: |-
            REDACTED
        priorityClassName: critical-priority
        replicaCount: 3
        resources:
          limits:
            memory: 5Gi
          requests:
            cpu: 2
            memory: 4.5Gi
        service:
          type: ClusterIP
        topologySpreadConstraints:
          - labelSelector:
              matchLabels:
                app.kubernetes.io/instance: ingress-nginx-external
            maxSkew: 1
            topologyKey: topology.kubernetes.io/zone
            whenUnsatisfiable: DoNotSchedule
          - labelSelector:
              matchLabels:
                app.kubernetes.io/instance: ingress-nginx-external
            maxSkew: 1
            topologyKey: kubernetes.io/hostname
            whenUnsatisfiable: ScheduleAnyway
        updateStrategy:
          rollingUpdate:
            maxUnavailable: 20%
          type: RollingUpdate
  • Current State of the controller:

    • kubectl describe ingressclasses

      NAME             CONTROLLER                      PARAMETERS   AGE
      nginx            k8s.io/ingress-nginx            <none>       236d
      nginx-internal   k8s.io/ingress-nginx-internal   <none>       236d
      

      kubectl -n <ingresscontrollernamespace> get all -A -o wide

      NAME                                                     READY   STATUS    RESTARTS   AGE    IP            NODE                                                  NOMINATED NODE   READINESS GATES
      pod/ingress-nginx-external-controller-6b6c646789-6ttl7   1/1     Running   0          3h3m   10.88.91.14   gke-gke-prod-main-eu-nap-n2-standard--0363057d-4wmn   <none>           1/1
      pod/ingress-nginx-external-controller-6b6c646789-7t4nj   1/1     Running   0          3h2m   10.88.52.28   gke-gke-prod-main-eu-nap-n2-standard--e19b8177-f0cv   <none>           1/1
      pod/ingress-nginx-external-controller-6b6c646789-gt4dv   1/1     Running   0          3h5m   10.88.134.3   gke-gke-prod-main-eu-nap-n2-standard--700eb490-djgg   <none>           1/1
      pod/ingress-nginx-internal-controller-dbbc96cf8-68xd6    1/1     Running   0          3h8m   10.88.52.27   gke-gke-prod-main-eu-nap-n2-standard--e19b8177-f0cv   <none>           1/1
      pod/ingress-nginx-internal-controller-dbbc96cf8-9mlb9    1/1     Running   0          178m   10.88.134.8   gke-gke-prod-main-eu-nap-n2-standard--700eb490-djgg   <none>           1/1
      pod/ingress-nginx-internal-controller-dbbc96cf8-dkzq9    1/1     Running   0          3h7m   10.88.91.13   gke-gke-prod-main-eu-nap-n2-standard--0363057d-4wmn   <none>           1/1
      
      NAME                                                TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)          AGE    SELECTOR
      service/ingress-nginx-external-controller           ClusterIP   100.76.208.116   <none>        80/TCP,443/TCP   236d   app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx-external,app.kubernetes.io/name=ingress-nginx
      service/ingress-nginx-external-controller-metrics   ClusterIP   100.76.151.204   <none>        10254/TCP        5d2h   app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx-external,app.kubernetes.io/name=ingress-nginx
      service/ingress-nginx-internal-controller           ClusterIP   100.76.147.44    <none>        80/TCP,443/TCP   236d   app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx-internal,app.kubernetes.io/name=ingress-nginx
      service/ingress-nginx-internal-controller-metrics   ClusterIP   100.76.36.65     <none>        10254/TCP        5d2h   app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx-internal,app.kubernetes.io/name=ingress-nginx
      
      NAME                                                READY   UP-TO-DATE   AVAILABLE   AGE    CONTAINERS                 IMAGES                                                                                                                     SELECTOR
      deployment.apps/ingress-nginx-external-controller   3/3     3            3           236d   ingress-nginx-controller   registry.k8s.io/ingress-nginx/controller:v1.11.3@sha256:d56f135b6462cfc476447cfe564b83a45e8bb7da2774963b00d12161112270b7   app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx-external,app.kubernetes.io/name=ingress-nginx
      deployment.apps/ingress-nginx-internal-controller   3/3     3            3           236d   ingress-nginx-controller   registry.k8s.io/ingress-nginx/controller:v1.11.3@sha256:d56f135b6462cfc476447cfe564b83a45e8bb7da2774963b00d12161112270b7   app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx-internal,app.kubernetes.io/name=ingress-nginx
      
      NAME                                                           DESIRED   CURRENT   READY   AGE     CONTAINERS                 IMAGES                                                                                                                     SELECTOR
      replicaset.apps/ingress-nginx-external-controller-5568bc46c    0         0         0       4d21h   ingress-nginx-controller   registry.k8s.io/ingress-nginx/controller:v1.11.3@sha256:d56f135b6462cfc476447cfe564b83a45e8bb7da2774963b00d12161112270b7   app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx-external,app.kubernetes.io/name=ingress-nginx,pod-template-hash=5568bc46c
      replicaset.apps/ingress-nginx-external-controller-5659859d99   0         0         0       2d4h    ingress-nginx-controller   registry.k8s.io/ingress-nginx/controller:v1.11.3@sha256:d56f135b6462cfc476447cfe564b83a45e8bb7da2774963b00d12161112270b7   app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx-external,app.kubernetes.io/name=ingress-nginx,pod-template-hash=5659859d99
      replicaset.apps/ingress-nginx-external-controller-56bbcdd967   0         0         0       22h     ingress-nginx-controller   registry.k8s.io/ingress-nginx/controller:v1.11.3@sha256:d56f135b6462cfc476447cfe564b83a45e8bb7da2774963b00d12161112270b7   app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx-external,app.kubernetes.io/name=ingress-nginx,pod-template-hash=56bbcdd967
      replicaset.apps/ingress-nginx-external-controller-5d87b777bd   0         0         0       28h     ingress-nginx-controller   registry.k8s.io/ingress-nginx/controller:v1.11.3@sha256:d56f135b6462cfc476447cfe564b83a45e8bb7da2774963b00d12161112270b7   app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx-external,app.kubernetes.io/name=ingress-nginx,pod-template-hash=5d87b777bd
      replicaset.apps/ingress-nginx-external-controller-5f4d667bc5   0         0         0       22h     ingress-nginx-controller   registry.k8s.io/ingress-nginx/controller:v1.11.3@sha256:d56f135b6462cfc476447cfe564b83a45e8bb7da2774963b00d12161112270b7   app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx-external,app.kubernetes.io/name=ingress-nginx,pod-template-hash=5f4d667bc5
      replicaset.apps/ingress-nginx-external-controller-679bb99f57   0         0         0       45h     ingress-nginx-controller   registry.k8s.io/ingress-nginx/controller:v1.11.3@sha256:d56f135b6462cfc476447cfe564b83a45e8bb7da2774963b00d12161112270b7   app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx-external,app.kubernetes.io/name=ingress-nginx,pod-template-hash=679bb99f57
      replicaset.apps/ingress-nginx-external-controller-6b6c646789   3         3         3       3h5m    ingress-nginx-controller   registry.k8s.io/ingress-nginx/controller:v1.11.3@sha256:d56f135b6462cfc476447cfe564b83a45e8bb7da2774963b00d12161112270b7   app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx-external,app.kubernetes.io/name=ingress-nginx,pod-template-hash=6b6c646789
      replicaset.apps/ingress-nginx-external-controller-7f97776494   0         0         0       44h     ingress-nginx-controller   registry.k8s.io/ingress-nginx/controller:v1.11.3@sha256:d56f135b6462cfc476447cfe564b83a45e8bb7da2774963b00d12161112270b7   app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx-external,app.kubernetes.io/name=ingress-nginx,pod-template-hash=7f97776494
      replicaset.apps/ingress-nginx-external-controller-848fd9fcd5   0         0         0       28h     ingress-nginx-controller   registry.k8s.io/ingress-nginx/controller:v1.11.3@sha256:d56f135b6462cfc476447cfe564b83a45e8bb7da2774963b00d12161112270b7   app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx-external,app.kubernetes.io/name=ingress-nginx,pod-template-hash=848fd9fcd5
      replicaset.apps/ingress-nginx-external-controller-865c5d89b5   0         0         0       23h     ingress-nginx-controller   registry.k8s.io/ingress-nginx/controller:v1.11.3@sha256:d56f135b6462cfc476447cfe564b83a45e8bb7da2774963b00d12161112270b7   app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx-external,app.kubernetes.io/name=ingress-nginx,pod-template-hash=865c5d89b5
      replicaset.apps/ingress-nginx-external-controller-f955d6bd7    0         0         0       4h41m   ingress-nginx-controller   registry.k8s.io/ingress-nginx/controller:v1.11.3@sha256:d56f135b6462cfc476447cfe564b83a45e8bb7da2774963b00d12161112270b7   app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx-external,app.kubernetes.io/name=ingress-nginx,pod-template-hash=f955d6bd7
      replicaset.apps/ingress-nginx-internal-controller-545c66cb99   0         0         0       9d      ingress-nginx-controller   registry.k8s.io/ingress-nginx/controller:v1.9.5@sha256:b3aba22b1da80e7acfc52b115cae1d4c687172cbf2b742d5b502419c25ff340e    app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx-internal,app.kubernetes.io/name=ingress-nginx,pod-template-hash=545c66cb99
      replicaset.apps/ingress-nginx-internal-controller-567c444c68   0         0         0       4h41m   ingress-nginx-controller   registry.k8s.io/ingress-nginx/controller:v1.11.3@sha256:d56f135b6462cfc476447cfe564b83a45e8bb7da2774963b00d12161112270b7   app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx-internal,app.kubernetes.io/name=ingress-nginx,pod-template-hash=567c444c68
      replicaset.apps/ingress-nginx-internal-controller-5d4765b75b   0         0         0       5d2h    ingress-nginx-controller   registry.k8s.io/ingress-nginx/controller:v1.9.5@sha256:b3aba22b1da80e7acfc52b115cae1d4c687172cbf2b742d5b502419c25ff340e    app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx-internal,app.kubernetes.io/name=ingress-nginx,pod-template-hash=5d4765b75b
      replicaset.apps/ingress-nginx-internal-controller-5df7b548c9   0         0         0       68d     ingress-nginx-controller   registry.k8s.io/ingress-nginx/controller:v1.9.5@sha256:b3aba22b1da80e7acfc52b115cae1d4c687172cbf2b742d5b502419c25ff340e    app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx-internal,app.kubernetes.io/name=ingress-nginx,pod-template-hash=5df7b548c9
      replicaset.apps/ingress-nginx-internal-controller-6bf887fb8    0         0         0       2d4h    ingress-nginx-controller   registry.k8s.io/ingress-nginx/controller:v1.9.5@sha256:b3aba22b1da80e7acfc52b115cae1d4c687172cbf2b742d5b502419c25ff340e    app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx-internal,app.kubernetes.io/name=ingress-nginx,pod-template-hash=6bf887fb8
      replicaset.apps/ingress-nginx-internal-controller-756849765d   0         0         0       68d     ingress-nginx-controller   registry.k8s.io/ingress-nginx/controller:v1.9.5@sha256:b3aba22b1da80e7acfc52b115cae1d4c687172cbf2b742d5b502419c25ff340e    app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx-internal,app.kubernetes.io/name=ingress-nginx,pod-template-hash=756849765d
      replicaset.apps/ingress-nginx-internal-controller-75859b97bd   0         0         0       65d     ingress-nginx-controller   registry.k8s.io/ingress-nginx/controller:v1.9.5@sha256:b3aba22b1da80e7acfc52b115cae1d4c687172cbf2b742d5b502419c25ff340e    app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx-internal,app.kubernetes.io/name=ingress-nginx,pod-template-hash=75859b97bd
      replicaset.apps/ingress-nginx-internal-controller-7b7cdc67b    0         0         0       23h     ingress-nginx-controller   registry.k8s.io/ingress-nginx/controller:v1.11.3@sha256:d56f135b6462cfc476447cfe564b83a45e8bb7da2774963b00d12161112270b7   app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx-internal,app.kubernetes.io/name=ingress-nginx,pod-template-hash=7b7cdc67b
      replicaset.apps/ingress-nginx-internal-controller-7bfdf65fcb   0         0         0       2d3h    ingress-nginx-controller   registry.k8s.io/ingress-nginx/controller:v1.11.3@sha256:d56f135b6462cfc476447cfe564b83a45e8bb7da2774963b00d12161112270b7   app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx-internal,app.kubernetes.io/name=ingress-nginx,pod-template-hash=7bfdf65fcb
      replicaset.apps/ingress-nginx-internal-controller-9d5d8874b    0         0         0       22h     ingress-nginx-controller   registry.k8s.io/ingress-nginx/controller:v1.11.3@sha256:d56f135b6462cfc476447cfe564b83a45e8bb7da2774963b00d12161112270b7   app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx-internal,app.kubernetes.io/name=ingress-nginx,pod-template-hash=9d5d8874b
      replicaset.apps/ingress-nginx-internal-controller-dbbc96cf8    3         3         3       3h8m    ingress-nginx-controller   registry.k8s.io/ingress-nginx/controller:v1.11.3@sha256:d56f135b6462cfc476447cfe564b83a45e8bb7da2774963b00d12161112270b7   app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx-internal,app.kubernetes.io/name=ingress-nginx,pod-template-hash=dbbc96cf8
      
      NAME                                                                    REFERENCE                                      TARGETS        MINPODS   MAXPODS   REPLICAS   AGE
      horizontalpodautoscaler.autoscaling/ingress-nginx-external-controller   Deployment/ingress-nginx-external-controller   cpu: 52%/80%   3         12        3          64d
      
    • kubectl -n <ingresscontrollernamespace> describe po <ingresscontrollerpodname>

      Name:                 ingress-nginx-external-controller-6b6c646789-6ttl7
      Namespace:            network
      Priority:             100000000
      Priority Class Name:  critical-priority
      Service Account:      ingress-nginx-external
      Node:                 gke-gke-prod-main-eu-nap-n2-standard--0363057d-4wmn/10.160.0.182
      Start Time:           Wed, 16 Oct 2024 11:27:23 +0200
      Labels:               app.kubernetes.io/component=controller
                            app.kubernetes.io/instance=ingress-nginx-external
                            app.kubernetes.io/managed-by=Helm
                            app.kubernetes.io/name=ingress-nginx
                            app.kubernetes.io/part-of=ingress-nginx
                            app.kubernetes.io/version=1.11.3
                            bm_domain=infra
                            helm.sh/chart=ingress-nginx-4.11.3
                            performance=critical
                            pod-template-hash=6b6c646789
                            role=public
                            service=nginx-ingress-controller
                            tags.datadoghq.com/service=nginx-ingress-controller
                            tags.datadoghq.com/version=4.11.3
      Annotations:          ad.datadoghq.com/ingress-nginx-controller.check_names: ["nginx","nginx_ingress_controller"]
                            kubectl.kubernetes.io/restartedAt: 2024-10-16T09:48:51+02:00
      Status:               Running
      IP:                   10.88.91.14
      IPs:
        IP:           10.88.91.14
      Controlled By:  ReplicaSet/ingress-nginx-external-controller-6b6c646789
      Containers:
        ingress-nginx-controller:
          Container ID:    containerd://b30b910fbcbb270c29dbe8ecea7be2fca255f6f73d83057b0f201c1635a96b50
          Image:           europe-docker.pkg.dev/registry-prod-zq2k/k8s-image-swapper/registry.k8s.io/ingress-nginx/controller@sha256:d56f135b6462cfc476447cfe564b83a45e8bb7da2774963b00d12161112270b7
          Image ID:        europe-docker.pkg.dev/registry-prod-zq2k/k8s-image-swapper/registry.k8s.io/ingress-nginx/controller@sha256:d56f135b6462cfc476447cfe564b83a45e8bb7da2774963b00d12161112270b7
          Ports:           80/TCP, 443/TCP, 10254/TCP
          Host Ports:      0/TCP, 0/TCP, 0/TCP
          SeccompProfile:  RuntimeDefault
          Args:
            /nginx-ingress-controller
            --publish-service=$(POD_NAMESPACE)/ingress-nginx-external-controller
            --election-id=ingress-nginx-external-leader
            --controller-class=k8s.io/ingress-nginx
            --ingress-class=nginx
            --configmap=$(POD_NAMESPACE)/ingress-nginx-external-controller
          State:          Running
            Started:      Wed, 16 Oct 2024 11:27:24 +0200
          Ready:          True
          Restart Count:  0
          Limits:
            memory:  5Gi
          Requests:
            cpu:      2
            memory:   4608Mi
          Liveness:   http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=5
          Readiness:  http-get http://:10254/healthz delay=30s timeout=1s period=10s #success=1 #failure=3
          Environment:
            POD_NAME:       ingress-nginx-external-controller-6b6c646789-6ttl7 (v1:metadata.name)
            POD_NAMESPACE:  network (v1:metadata.namespace)
            LD_PRELOAD:     /usr/local/lib/libmimalloc.so
            HOST_IP:         (v1:status.hostIP)
            DD_SERVICE:      (v1:metadata.labels['tags.datadoghq.com/service'])
            DD_VERSION:      (v1:metadata.labels['tags.datadoghq.com/version'])
          Mounts:
            /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4fdcm (ro)
      Readiness Gates:
        Type                                       Status
        cloud.google.com/load-balancer-neg-ready   True
      Conditions:
        Type                                       Status
        cloud.google.com/load-balancer-neg-ready   True
        PodReadyToStartContainers                  True
        Initialized                                True
        Ready                                      True
        ContainersReady                            True
        PodScheduled                               True
      Volumes:
        kube-api-access-4fdcm:
          Type:                     Projected (a volume that contains injected data from multiple sources)
          TokenExpirationSeconds:   3607
          ConfigMapName:            kube-root-ca.crt
          ConfigMapOptional:        <nil>
          DownwardAPI:              true
      QoS Class:                    Burstable
      Node-Selectors:               cloud.google.com/compute-class=core
      Tolerations:                  cloud.google.com/compute-class=core:NoSchedule
                                    node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                                    node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
      Topology Spread Constraints:  kubernetes.io/hostname:ScheduleAnyway when max skew 1 is exceeded for selector app.kubernetes.io/instance=ingress-nginx-external
                                    topology.kubernetes.io/zone:DoNotSchedule when max skew 1 is exceeded for selector app.kubernetes.io/instance=ingress-nginx-external
      Events:
        Type    Reason  Age                 From                      Message
        ----    ------  ----                ----                      -------
        Normal  RELOAD  26m (x3 over 3h4m)  nginx-ingress-controller  NGINX reload triggered due to a change in configuration
      
    • kubectl -n <ingresscontrollernamespace> describe svc <ingresscontrollerservicename>

      Name:                     ingress-nginx-external-controller
      Namespace:                network
      Labels:                   app.kubernetes.io/component=controller
                                app.kubernetes.io/instance=ingress-nginx-external
                                app.kubernetes.io/managed-by=Helm
                                app.kubernetes.io/name=ingress-nginx
                                app.kubernetes.io/part-of=ingress-nginx
                                app.kubernetes.io/version=1.11.3
                                helm.sh/chart=ingress-nginx-4.11.3
      Annotations:
                                cloud.google.com/neg: {"exposed_ports":{"80":{}}}
                                cloud.google.com/neg-status:
                                  {"network_endpoint_groups":{"80":"k8s1-b3f18f8b-network-ingress-nginx-external-control-8-0af3d9fc"},"zones":["europe-west1-b","europe-west...
      Selector:                 app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx-external,app.kubernetes.io/name=ingress-nginx
      Type:                     ClusterIP
      IP Family Policy:         SingleStack
      IP Families:              IPv4
      IP:                       100.76.208.116
      IPs:                      100.76.208.116
      Port:                     http  80/TCP
      TargetPort:               http/TCP
      Endpoints:                10.88.134.3:80,10.88.91.14:80,10.88.52.28:80
      Port:                     https  443/TCP
      TargetPort:               https/TCP
      Endpoints:                10.88.134.3:443,10.88.91.14:443,10.88.52.28:443
      Session Affinity:         None
      Internal Traffic Policy:  Cluster
      Events:
        Type    Reason  Age                  From                   Message
        ----    ------  ----                 ----                   -------
        Normal  SYNC    52m (x7 over 60m)    sc-gateway-controller  SYNC on network/ingress-nginx-external-controller was a success
        Normal  ADD     50m                  sc-gateway-controller  network/ingress-nginx-external-controller
        Normal  SYNC    25m (x15 over 50m)   sc-gateway-controller  SYNC on network/ingress-nginx-external-controller was a success
        Normal  Attach  21m (x14 over 2d4h)  neg-controller         Attach 1 network endpoint(s) (NEG "k8s1-b3f18f8b-network-ingress-nginx-external-control-8-0af3d9fc" in zone "europe-west1-d")
        Normal  ADD     20m                  sc-gateway-controller  network/ingress-nginx-external-controller
        Normal  Attach  16m (x12 over 2d4h)  neg-controller         Attach 1 network endpoint(s) (NEG "k8s1-b3f18f8b-network-ingress-nginx-external-control-8-0af3d9fc" in zone "europe-west1-c")
        Normal  Detach  15m (x15 over 2d4h)  neg-controller         Detach 1 network endpoint(s) (NEG "k8s1-b3f18f8b-network-ingress-nginx-external-control-8-0af3d9fc" in zone "europe-west1-d")
        Normal  Detach  14m (x12 over 2d4h)  neg-controller         Detach 1 network endpoint(s) (NEG "k8s1-b3f18f8b-network-ingress-nginx-external-control-8-0af3d9fc" in zone "europe-west1-c")
        Normal  SYNC    13m (x9 over 20m)    sc-gateway-controller  SYNC on network/ingress-nginx-external-controller was a success
        Normal  ADD     10m                  sc-gateway-controller  network/ingress-nginx-external-controller
        Normal  SYNC    2m35s (x7 over 10m)  sc-gateway-controller  SYNC on network/ingress-nginx-external-controller was a success
        Normal  ADD     57s                  sc-gateway-controller  network/ingress-nginx-external-controller
        Normal  SYNC    29s (x4 over 57s)    sc-gateway-controller  SYNC on network/ingress-nginx-external-controller was a success
      

How to reproduce this issue:

Install a kind cluster

kind create cluster

Install the ingress controller

Install the ingress controller with modified liveness/readiness timings to improve the reproducibility.
Admission webhooks are disabled here to avoid swamping ingress-nginx when creating the large number of ingresses required to reproduce this bug.

helm upgrade --install ingress-nginx ingress-nginx \
  --repo https://kubernetes.github.io/ingress-nginx \
  --version 4.11.3 \
  --set controller.admissionWebhooks.enabled=false \
  --set controller.livenessProbe.initialDelaySeconds=0,controller.livenessProbe.periodSeconds=1,controller.livenessProbe.timeoutSeconds=10,controller.livenessProbe.failureThreshold=600 \
  --set controller.readinessProbe.initialDelaySeconds=0,controller.readinessProbe.periodSeconds=1,controller.readinessProbe.timeoutSeconds=10,controller.readinessProbe.failureThreshold=600 \
  --namespace ingress-nginx --create-namespace

Create a simple service in the default namespace

Here we're creating a simple service using nc that will always return 200. It's not protocol aware; it just returns a static body (the mounted server.sh script cats a canned HTTP/1.1 200 response with a "Hello World!" body).

Apply the below manifests with:

kubectl --namespace default apply --filename /path/to/manifests.yaml
Manifests
---
apiVersion: v1
data:
  server.sh: "#!/bin/sh\ncat <<- EOS\n\tHTTP/1.1 200 OK\r\n\tContent-Length: 14\r\n\r\n\tHello World!\r\nEOS"
kind: ConfigMap
metadata:
  name: hello
  namespace: default
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello
  namespace: default
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: hello
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: hello
    spec:
      containers:
        - command:
            - nc
            - -lkvp
            - "8080"
            - -e
            - serve
          image: docker.io/alpine:3.14
          imagePullPolicy: IfNotPresent
          name: hello
          ports:
            - containerPort: 8080
              name: http
              protocol: TCP
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
            - mountPath: /usr/local/bin/serve
              name: programs
              subPath: server.sh
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      terminationGracePeriodSeconds: 30
      volumes:
        - configMap:
            defaultMode: 511
            name: hello
          name: programs
---
apiVersion: v1
kind: Service
metadata:
  name: hello
  namespace: default
spec:
  ipFamilies:
    - IPv4
  ipFamilyPolicy: SingleStack
  ports:
    - name: http
      port: 80
      targetPort: http
  selector:
    app: hello

Create 1k ingresses pointing to this service

Run the Python script below to create 1000 ingresses (hello-[0-999].example.com) with our simple service as the backend.

from subprocess import Popen, PIPE, STDOUT
import json

ingress = {
  "apiVersion": "networking.k8s.io/v1",
  "kind": "Ingress",
  "metadata": {
    "name": "hello",
  },
  "spec": {
    "ingressClassName": "nginx",
    "rules": [
      {
        "http": {
          "paths": [
            {
              "path": "/",
              "pathType": "Prefix",
              "backend": {
                "service": {
                  "name": "hello",
                  "port": {
                    "number": 80
                  }
                }
              }
            }
          ]
        }
      }
    ]
  }
}

for i in range(1000):
    ingress["metadata"]["name"] = "hello-" + str(i)
    ingress["spec"]["rules"][0]["host"] = "hello-" + str(i) + ".example.com"
    p = Popen(['kubectl', '--namespace', 'default', 'apply', '--filename', '-'], stdout=PIPE, stdin=PIPE, stderr=PIPE, text=True)
    print(p.communicate(input=json.dumps(ingress)))

You will have to wait some time for ingress-nginx to update its config with all these changes.

Create a test pod to confirm the service and ingress are alive.

kubectl run --namespace default --context kind-kind --tty --stdin --restart=Never --command --image nicolaka/netshoot:latest test -- bash

In the console on this pod, run the following to confirm we have a stable environment:

curl --verbose hello.default.svc.cluster.local.
curl --verbose --header 'Host: hello-999.example.com' ingress-nginx-controller.ingress-nginx.svc.cluster.local.

You should see a successful response for each of them.

Generate load on the kubernetes API server / etcd

In order to reproduce this bug (at least reliably), some load needs to be added to Kubernetes itself.

I use kube-burner here to create the load.
You will need to wait until PUT/PATCH/DELETE requests are being issued against existing Kubernetes objects in order to reproduce.

Below is my configuration.
During the first job, objects are created, and this doesn't seem to be enough to reproduce the bug.
Wait until the api-intensive-patch job has started before continuing.

kube-burner init -c ./api-intensive.yml
Configuration

./api-intensive.yml:

---
jobs:
  - name: api-intensive
    jobIterations: 50
    qps: 4
    burst: 4
    namespacedIterations: true
    namespace: api-intensive
    podWait: false
    cleanup: true
    waitWhenFinished: true
    objects:
      - objectTemplate: templates/deployment.yaml
        replicas: 1
      - objectTemplate: templates/configmap.yaml
        replicas: 1
      - objectTemplate: templates/secret.yaml
        replicas: 1
      - objectTemplate: templates/service.yaml
        replicas: 1

  - name: api-intensive-patch
    jobType: patch
    jobIterations: 10
    qps: 2
    burst: 2
    objects:
      - kind: Deployment
        objectTemplate: templates/deployment_patch_add_label.json
        labelSelector: {kube-burner-job: api-intensive}
        patchType: "application/json-patch+json"
        apiVersion: apps/v1
      - kind: Deployment
        objectTemplate: templates/deployment_patch_add_pod_2.yaml
        labelSelector: {kube-burner-job: api-intensive}
        patchType: "application/apply-patch+yaml"
        apiVersion: apps/v1
      - kind: Deployment
        objectTemplate: templates/deployment_patch_add_label.yaml
        labelSelector: {kube-burner-job: api-intensive}
        patchType: "application/strategic-merge-patch+json"
        apiVersion: apps/v1

  - name: api-intensive-remove
    qps: 2
    burst: 2
    jobType: delete
    waitForDeletion: true
    objects:
      - kind: Deployment
        labelSelector: {kube-burner-job: api-intensive}
        apiVersion: apps/v1

  - name: ensure-pods-removal
    qps: 10
    burst: 10
    jobType: delete
    waitForDeletion: true
    objects:
      - kind: Pod
        labelSelector: {kube-burner-job: api-intensive}

  - name: remove-services
    qps: 2
    burst: 2
    jobType: delete
    waitForDeletion: true
    objects:
      - kind: Service
        labelSelector: {kube-burner-job: api-intensive}

  - name: remove-configmaps-secrets
    qps: 2
    burst: 2
    jobType: delete
    objects:
      - kind: ConfigMap
        labelSelector: {kube-burner-job: api-intensive}
      - kind: Secret
        labelSelector: {kube-burner-job: api-intensive}

./templates/deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-intensive-{{.Replica}}
  labels:
    group: load
    svc: api-intensive-{{.Replica}}
spec:
  replicas: 1
  selector:
    matchLabels:
      name: api-intensive-{{.Replica}}
  template:
    metadata:
      labels:
        group: load
        name: api-intensive-{{.Replica}}
    spec:
      containers:
      - image: registry.k8s.io/pause:3.1
        name: api-intensive-{{.Replica}}
        resources:
          requests:
            cpu: 10m
            memory: 10M
        volumeMounts:
          - name: configmap
            mountPath: /var/configmap
          - name: secret
            mountPath: /var/secret
      dnsPolicy: Default
      terminationGracePeriodSeconds: 1
      # Add not-ready/unreachable tolerations for 15 minutes so that node
      # failure doesn't trigger pod deletion.
      tolerations:
      - key: "node.kubernetes.io/not-ready"
        operator: "Exists"
        effect: "NoExecute"
        tolerationSeconds: 900
      - key: "node.kubernetes.io/unreachable"
        operator: "Exists"
        effect: "NoExecute"
        tolerationSeconds: 900
      volumes:
        - name: configmap
          configMap:
            name: configmap-{{.Replica}}
        - name: secret
          secret:
            secretName: secret-{{.Replica}}

./templates/deployment_patch_add_pod_2.yaml:

kind: Deployment
apiVersion: apps/v1
spec:
  template:
    spec:
      containers:
      - image: registry.k8s.io/pause:3.1
        name: api-intensive-2
        resources:
          requests:
            cpu: 10m
            memory: 10M

./templates/service.yaml:

apiVersion: v1
kind: Service
metadata:
  name: service-{{.Replica}}
spec:
  selector:
    name: api-intensive-{{.Replica}}
  ports:
  - port: 80
    targetPort: 80

./templates/deployment_patch_add_label.yaml:

kind: Deployment
apiVersion: apps/v1
metadata:
  labels:
    new_key_{{.Iteration}}: new_value_{{.Iteration}}

./templates/deployment_patch_add_label.json:

[
	{
		"op": "add",
		"path": "/metadata/labels/new_key",
		"value": "new_value"
	}
]

./templates/configmap.yaml:

apiVersion: v1
kind: ConfigMap
metadata:
  name: configmap-{{.Replica}}
data:
  data.yaml: |-
    a: 1
    b: 2
    c: 3

./templates/secret.yaml:

apiVersion: v1
kind: Secret
metadata:
  name: secret-{{.Replica}}
type: Opaque
data:
  password: Zm9vb29vb29vb29vb29vbwo=

Constantly GET the endpoint

In the test pod we created earlier, run the following commands, which curl the hello-700.example.com ingress repeatedly until a 404 is returned.

get_url() {
  curl \
    --show-error \
    --verbose \
    --silent \
    --output /tmp/curl-body.dat \
    --write-out '%{http_code}\n' \
    --header 'Host: hello-700.example.com' \
    http://ingress-nginx-controller.ingress-nginx.svc.cluster.local. \
    2> /tmp/curl-error.log
}

# Curl hello-700.example.com ingress until the http status is 404
while [ "$(get_url)" != 404 ]; do
  # Nothing at all, as quickly as possible
  :
done

# Print the last request error log and body.
cat /tmp/curl-error.log /tmp/curl-body.dat

While this is running, in a different shell, perform a rollout restart of the ingress-nginx controller.

kubectl --namespace ingress-nginx rollout restart deployment ingress-nginx-controller

It may take a couple of attempts of rolling out the deployment, but eventually you should see the loop in the test pod break and something similar to the following stderr and body printed:

* Host ingress-nginx-controller.ingress-nginx.svc.cluster.local.:80 was resolved.
* IPv6: (none)
* IPv4: 10.96.199.65
*   Trying 10.96.199.65:80...
* Connected to ingress-nginx-controller.ingress-nginx.svc.cluster.local. (10.96.199.65) port 80
> GET / HTTP/1.1
> Host: hello-700.example.com
> User-Agent: curl/8.7.1
> Accept: */*
>
* Request completely sent off
< HTTP/1.1 404 Not Found
< Date: Mon, 14 Oct 2024 15:50:56 GMT
< Content-Type: text/html
< Content-Length: 146
< Connection: keep-alive
<
{ [146 bytes data]
* Connection #0 to host ingress-nginx-controller.ingress-nginx.svc.cluster.local. left intact
<html>
<head><title>404 Not Found</title></head>
<body>
<center><h1>404 Not Found</h1></center>
<hr><center>nginx</center>
</body>
</html>

Interestingly, the 404 does not appear in the logs of either the new or the old nginx pod in this reproduction.
This differs from what we see in our production cluster, where the 404s are present in the logs of the newly created ingress-nginx pod.

Anything else we need to know:

Here's a breakdown of what I think are the details around the root cause of this bug in the code:

The health check here basically just checks that nginx is running (which it will be very early on) and that the /is-dynamic-lb-initialized path returns a 2xx:

statusCode, _, err := nginx.NewGetStatusRequest("/is-dynamic-lb-initialized")
if err != nil {
	return fmt.Errorf("checking if the dynamic load balancer started: %w", err)
}

The is-dynamic-lb-initialized location is handled by this Lua module, which just checks whether any backends are configured.

local backend_data = configuration.get_backends_data()
if not backend_data then
  ngx.exit(ngx.HTTP_INTERNAL_SERVER_ERROR)
  return
end

This will essentially always be true after the first reload, because as soon as there is at least one ingress in the cache, the controller will detect a difference and trigger a reload with the new backends configuration:

if !utilingress.IsDynamicConfigurationEnough(pcfg, n.runningConfig) {
	klog.InfoS("Configuration changes detected, backend reload required")
	hash, err := hashstructure.Hash(pcfg, hashstructure.FormatV1, &hashstructure.HashOptions{
		TagName: "json",
	})
	if err != nil {
		klog.Errorf("unexpected error hashing configuration: %v", err)
	}
	pcfg.ConfigurationChecksum = fmt.Sprintf("%v", hash)
	err = n.OnUpdate(*pcfg)
	if err != nil {
		n.metricCollector.IncReloadErrorCount()
		n.metricCollector.ConfigSuccess(hash, false)
		klog.Errorf("Unexpected failure reloading the backend:\n%v", err)
		n.recorder.Eventf(k8s.IngressPodDetails, apiv1.EventTypeWarning, "RELOAD", fmt.Sprintf("Error reloading NGINX: %v", err))
		return err
	}
	klog.InfoS("Backend successfully reloaded")
	n.metricCollector.ConfigSuccess(hash, true)
	n.metricCollector.IncReloadCount()
	n.recorder.Eventf(k8s.IngressPodDetails, apiv1.EventTypeNormal, "RELOAD", "NGINX reload triggered due to a change in configuration")
}

OnUpdate in turn executes the reload here:
o, err := n.command.ExecCommand("-s", "reload").CombinedOutput()

My suspicion is that the cached ingresses in k8sStore do not represent the full state initially, but rather include ingresses returned from the API server in the first paginated response(s):

func (s *k8sStore) ListIngresses() []*ingress.Ingress {

This can be validated by inspecting the number of ingresses returned by this function during startup when a large number of ingresses are present.
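
As a rough way to see the pagination behaviour from outside the controller, here is a minimal client-go sketch (my own illustration, not controller code; it assumes a reachable cluster via ~/.kube/config) that requests only the first page of Ingresses. If the controller built its first nginx.conf from a similarly partial list, the remaining Ingresses would be missing until a later reload.

package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Request only the first page of at most 100 Ingresses across all namespaces.
	// A non-empty Continue token means the full state was not returned in one response.
	page, err := client.NetworkingV1().Ingresses(metav1.NamespaceAll).List(
		context.Background(), metav1.ListOptions{Limit: 100})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("first page holds %d Ingresses; more pages pending: %v\n",
		len(page.Items), page.Continue != "")
}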

A potential solution would be to reject a reload of Nginx until we're sure that the cache is fully populated on the initial sync.
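
Below is a minimal sketch of that idea using client-go shared informers (this is not the actual ingress-nginx implementation; readyToReload is a hypothetical gate): the flag that allows the first reload, and that the readiness endpoint would consult, is only flipped once the Ingress informer cache reports it has fully synced.

package main

import (
	"context"
	"log"
	"sync/atomic"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	ingInformer := factory.Networking().V1().Ingresses().Informer()

	// Hypothetical gate: a real controller would check this before its first
	// NGINX reload and from its readiness handler.
	var readyToReload atomic.Bool

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	factory.Start(ctx.Done())

	// Block until the Ingress cache holds the full list from the API server,
	// not just the first paginated chunk, before allowing a reload / reporting ready.
	if !cache.WaitForCacheSync(ctx.Done(), ingInformer.HasSynced) {
		log.Fatal("timed out waiting for the Ingress cache to sync")
	}
	readyToReload.Store(true)

	log.Printf("Ingress cache synced with %d objects; initial reload may proceed (ready=%v)",
		len(ingInformer.GetStore().List()), readyToReload.Load())
}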

Izzette added the kind/bug label on Oct 16, 2024
k8s-ci-robot added the needs-triage label on Oct 16, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@longwuyuan
Contributor

/remove-kind bug

The bug label can be re-applied after a developer accepts the triage as a bug.

There are multiple pieces of information to look at:

  • Even an expensive commercial version of the Nginx webserver is not free from the problem of impacting traffic when a reload of nginx.conf occurs. In the case of an above-average number of ingresses and rules, we already have issues and information on the traffic impact, and there is nothing we can do about it today. That is conclusive. If the number of changes in the nginx.conf being reloaded is also large, then the disruption is even more unavoidable until the config reconciles.

  • There are other users of the controller who do find the optimum config to get reliability, but none of them reported the ingress-nginx controller service --type as ClusterIP. You are showing service --type ClusterIP and appear to be generating changes and load from inside the cluster, over ClusterIP. We acknowledge that this will break traffic; regardless of what you think the bugs/problems and their solutions are, CPU/memory/I/O and their speeds are directly correlated with the various race conditions anyone can cook up. What the project is working on is splitting the control plane from the data plane so that both performance and security are improved. Look at the current open PRs related to this.

@k8s-ci-robot k8s-ci-robot added needs-kind Indicates a PR lacks a `kind/foo` label and requires one. and removed kind/bug Categorizes issue or PR as related to a bug. labels Oct 16, 2024
@longwuyuan
Contributor

Also, other users who reported a similar expectation have attempted to play with the various timeouts. But since that is an extremely specific config for each environment and each use case of a given environment, I think that is one area to explore.

@Izzette
Author

Izzette commented Oct 16, 2024

... ingress-nginx controller service --type as ClusterIP.

Actually, this issue impacts us in our specific case where we're not using the service cluster IP; the external load balancer hits the pod IP directly through the GCP LB NEG. It calls the same health check as Kubernetes, and the ingress-nginx-controller once again erroneously reports that it is ready when it has not yet loaded the ingress config. The problem is that the actual sync of the Kubernetes state for ingresses isn't completed until after the controller reports that the initial sync is complete and the dynamic load balancer is initialized.

The k8s Service implementation for the ingress or backends is not relevant in this case. The 404 is returned by nginx because the appropriate ingress is populated neither in the Lua shared dictionary nor in the servers/locations templated into the nginx configuration.

@Izzette
Author

Izzette commented Oct 16, 2024

when a reload of nginx.conf occurs

Actually, we don't face any problems when a reload of nginx occurs. Rather, the multiple reloads on startup of ingress-nginx pods are merely a symptom of the specific implementation in ingress-nginx-controller and its interaction with the Kubernetes API (the Go side) that results in the bug.

@longwuyuan
Contributor

@Izzette thank you for the comments. Helps.

This is a vast topic, so discussions are only possible with specifics; I am looking at your reproduction steps, and we are root-causing the 404.

  • I am a little lost on the reproduction steps. Where are you running the curl from?
  • Why do you have to customize initialDelaySeconds and the other probe timeouts? Can the test be run with defaults?
  • While there was high load on the apiserver, what were the CPU/memory/bandwidth/conntrack/inode and I/O statistics?
  • What was the output of k describe ing for hello-700 or k get ep at the time of the 404? Very likely this showed no endpoints.

The above comments from me are directed at the tests, but even with any other tests, if there is starvation of resources like the CPU/memory etc. that I listed earlier, there will be events leading to the EndpointSlice becoming empty. That is expected.

Even though the results will be the same, I would do these tests by first installing MetalLB in the kind cluster and specifying the Docker container IP address as both the start and end of the IP address pool. Then I would add an /etc/hosts entry for hello-700 on the host running kind and send the curl request from the host shell to hello-700.example.com. That simulates your use case more closely (not that the results would be any different, though).

Lastly, to repeat: if you starve the cluster of CPU/memory/bandwidth/conntrack/inodes and I/O, also generate load on the API server, and top it off with a rollout, the /healthz endpoint of the controller may respond OK and thus move the pod to the Ready state. I am not surprised.

And the only choice we have at this time is to play with the timeouts, specifically increasing initialDelaySeconds and all the other configurables related to probe behaviour.

@longwuyuan
Contributor

And just for sanity's sake, the tests will be the same if you use kubectl create deploy httpd --image httpd:alpine --port 80 && kubectl expose deploy httpd && kubectl create ing httpd --class nginx --rule httpd.example.com/"*"=httpd:80, just so that the ingress is for HTTP/HTTPS and is once again a closer simulation (not that it will matter much). Thanks

@longwuyuan
Contributor

Ah, forgot to mention: I am also interested in using 5 replicas and setting minAvailable to 3, then doing the load and rollout as per your design.

@Izzette
Author

Izzette commented Oct 17, 2024

I am able to reproduce with httpd.example.com using the backend image docker.io/library/httpd:alpine just fine. I do, of course, need other ingresses so that they are loaded before httpd.example.com, producing the partial config.

 (kind-kind/default) 0 ✓ izzi@Isabelles-MacBook-Pro.local ~ $ kubectl run --namespace default --context kind-kind --tty --stdin --restart=Never --command --image nicolaka/netshoot:latest test -- bash
test:~# get_url() {
  curl \
    --show-error \
    --verbose \
    --silent \
    --output /tmp/curl-body.dat \
    --write-out '%{http_code}\n' \
    --header 'Host: httpd.example.com' \
    http://ingress-nginx-controller.ingress-nginx.svc.cluster.local. \
    2> /tmp/curl-error.log
}

# Curl httpd.example.com ingress until the http status is 404
while [ "$(get_url)" != 404 ]; do
  # Nothing at all, as quickly as possible
  :
done

# Print the last request error log and body.
cat /tmp/curl-error.log /tmp/curl-body.dat
* Host ingress-nginx-controller.ingress-nginx.svc.cluster.local.:80 was resolved.
* IPv6: (none)
* IPv4: 10.96.199.65
*   Trying 10.96.199.65:80...
* Connected to ingress-nginx-controller.ingress-nginx.svc.cluster.local. (10.96.199.65) port 80
> GET / HTTP/1.1
> Host: httpd.example.com
> User-Agent: curl/8.7.1
> Accept: */*
>
* Request completely sent off
< HTTP/1.1 404 Not Found
< Date: Thu, 17 Oct 2024 07:19:55 GMT
< Content-Type: text/html
< Content-Length: 146
< Connection: keep-alive
<
{ [146 bytes data]
* Connection #0 to host ingress-nginx-controller.ingress-nginx.svc.cluster.local. left intact
<html>
<head><title>404 Not Found</title></head>
<body>
<center><h1>404 Not Found</h1></center>
<hr><center>nginx</center>
</body>
</html>
test:~#

while I run the following in a different shell:

 (kind-kind/default) 0 ✓ izzi@Isabelles-MBP.barityo.lan ~ $ kubectl  --namespace ingress-nginx  --context kind-kind rollout restart deployment ingress-nginx-controller
deployment.apps/ingress-nginx-controller restarted

Before the rollout restart, this works fine of course, as with the other backend.

If I redeploy ingress-nginx with 5 replicas and maxUnavailable 2, I can also reproduce the issue:

 (kind-kind/default) 0 ✓ izzi@Isabelles-MBP.barityo.lan ~ $ helm upgrade --install ingress-nginx ingress-nginx \
  --repo https://kubernetes.github.io/ingress-nginx \
  --version 4.11.3 \
  --set controller.admissionWebhooks.enabled=false \
  --set controller.replicaCount=5,controller.autoscaling.minAvailable=3 \
  --set controller.livenessProbe.initialDelaySeconds=0,controller.livenessProbe.periodSeconds=1,controller.livenessProbe.timeoutSeconds=10,controller.livenessProbe.failureThreshold=600  \
  --set controller.readinessProbe.initialDelaySeconds=0,controller.readinessProbe.periodSeconds=1,controller.readinessProbe.timeoutSeconds=10,controller.readinessProbe.failureThreshold=600 \
  --namespace ingress-nginx --create-namespace
Release "ingress-nginx" has been upgraded. Happy Helming!
NAME: ingress-nginx
LAST DEPLOYED: Thu Oct 17 09:27:37 2024
NAMESPACE: ingress-nginx
STATUS: deployed
REVISION: 13
TEST SUITE: None
NOTES:
The ingress-nginx controller has been installed.
It may take a few minutes for the load balancer IP to be available.
You can watch the status by running 'kubectl get service --namespace ingress-nginx ingress-nginx-controller --output wide --watch'

An example Ingress that makes use of the controller:
  apiVersion: networking.k8s.io/v1
  kind: Ingress
  metadata:
    name: example
    namespace: foo
  spec:
    ingressClassName: nginx
    rules:
      - host: www.example.com
        http:
          paths:
            - pathType: Prefix
              backend:
                service:
                  name: exampleService
                  port:
                    number: 80
              path: /
    # This section is only required if TLS is to be enabled for the Ingress
    tls:
      - hosts:
        - www.example.com
        secretName: example-tls

If TLS is enabled for the Ingress, a Secret containing the certificate and key must also be provided:

  apiVersion: v1
  kind: Secret
  metadata:
    name: example-tls
    namespace: foo
  data:
    tls.crt: <base64 encoded cert>
    tls.key: <base64 encoded key>
  type: kubernetes.io/tls
 (kind-kind/default) 0 ✓ izzi@Isabelles-MBP.barityo.lan ~ $ kubectl --namespace ingress-nginx get pods
NAME                                       READY   STATUS    RESTARTS   AGE
ingress-nginx-controller-9df47b74c-htjs6   1/1     Running   0          6s
ingress-nginx-controller-9df47b74c-nncsr   1/1     Running   0          6s
ingress-nginx-controller-9df47b74c-rbfxb   1/1     Running   0          7m53s
ingress-nginx-controller-9df47b74c-tblkt   1/1     Running   0          6s
ingress-nginx-controller-9df47b74c-zcd99   1/1     Running   0          6s
 (kind-kind/default) 0 ✓ izzi@Isabelles-MBP.barityo.lan ~ $ kubectl --namespace ingress-nginx --context kind-kind rollout restart deployment/ingress-nginx-controller
deployment.apps/ingress-nginx-controller restarted

While in my test pod:

test:~# get_url() {
  curl \
    --show-error \
    --verbose \
    --silent \
    --output /tmp/curl-body.dat \
    --write-out '%{http_code}\n' \
    --header 'Host: httpd.example.com' \
    http://ingress-nginx-controller.ingress-nginx.svc.cluster.local. \
    2> /tmp/curl-error.log
}

# Curl httpd.example.com ingress until the http status is 404
while [ "$(get_url)" != 404 ]; do
  # Nothing at all, as quickly as possible
  :
done

# Print the last request error log and body.
cat /tmp/curl-error.log /tmp/curl-body.dat
* Host ingress-nginx-controller.ingress-nginx.svc.cluster.local.:80 was resolved.
* IPv6: (none)
* IPv4: 10.96.199.65
*   Trying 10.96.199.65:80...
* Connected to ingress-nginx-controller.ingress-nginx.svc.cluster.local. (10.96.199.65) port 80
> GET / HTTP/1.1
> Host: httpd.example.com
> User-Agent: curl/8.7.1
> Accept: */*
>
* Request completely sent off
< HTTP/1.1 404 Not Found
< Date: Thu, 17 Oct 2024 07:29:21 GMT
< Content-Type: text/html
< Content-Length: 146
< Connection: keep-alive
<
{ [146 bytes data]
* Connection #0 to host ingress-nginx-controller.ingress-nginx.svc.cluster.local. left intact
<html>
<head><title>404 Not Found</title></head>
<body>
<center><h1>404 Not Found</h1></center>
<hr><center>nginx</center>
</body>
</html>
test:~#

@longwuyuan
Contributor

Hi, it's very helpful when the data is this abundant and precise.

I have a ton of things to communicate here after this data. But I was wondering if we can get on a screenshare to gather the precise data that is more relatable from the perspective of creating action items for the developers of this project.

Any chance you can meet on meet.jit.si?

@longwuyuan
Contributor

I am also on Slack, if that works for you. The advantage is that a realtime conversation is possible, if that adds value.
