Envoy cannot connect to the XDS Server #36150

Open
ShivanshiDhawan opened this issue Sep 15, 2024 · 2 comments

@ShivanshiDhawan

The XDS server was restarted and Envoy got disconnected, but Envoy was unable to reconnect to the XDS server for around 1.5 hours. Envoy was then restarted and was able to reconnect to the XDS server.

The Envoy logs only show the following warning message being logged over and over:
[2024-09-11 09:43:24.904][1][warning][config] [./source/extensions/config_subscription/grpc/grpc_stream.h:190] DeltaAggregatedResources gRPC config stream to [] closed since 49223s ago: 14, upstream connect error or disconnect/reset before headers. reset reason: remote connection failure, transport failure reason: delayed connect error: 113

I understand that Envoy uses a backoff strategy for retries and that this warning message is logged when the error still persists after a backoff cycle.

I have a few questions about this:

  • Why wasn't Envoy able to reconnect for 1.5 hours?
  • Will Envoy keep retrying with its backoff strategy until it reconnects, or is there a maximum number of attempts?
  • The XDS server cluster is configured with LOGICAL_DNS service discovery. Does Envoy select a host for every retry, or does it stay on the same host as the last attempt after some number of retries?
  • The circuit breaker is configured with max_retries of 50, but Envoy retried more than that number of times.
    The metric upstream_cx_connect_fail has the value 3034 and upstream_cx_overflow is 0.

The XDS server cluster config (the XDS server runs as a headless service in Kubernetes):

 "static_resources":
    "clusters":
    - "circuit_breakers":
        "thresholds":
        - "max_connections": 100000
          "max_pending_requests": 100000
          "max_requests": 60000000
          "max_retries": 50
          "priority": "HIGH"
        - "max_connections": 100000
          "max_pending_requests": 100000
          "max_requests": 60000000
          "max_retries": 50
          "priority": "DEFAULT"
      "connect_timeout": "1s"
      "dns_lookup_family": "V4_ONLY"
      "typed_extension_protocol_options":
        "envoy.extensions.upstreams.http.v3.HttpProtocolOptions":
          "@type": "type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions"
          "explicit_http_config":
            "http2_protocol_options":
              "connection_keepalive":
                "interval": "30s"
                "timeout": "20s"
      "lb_policy": "RANDOM"
      "load_assignment":
        "cluster_name": ""
        "endpoints":
        - "lb_endpoints":
          - "endpoint":
              "address":
                "socket_address":
                  "address": {{ "xx.namespace.svc.cluster.local." }}
                  "port_value": {{ $xds_server.port }}
      "name": ""
      "type": "LOGICAL_DNS"
      "upstream_connection_options":
        "tcp_keepalive":
          "keepalive_interval": 10
          "keepalive_probes": 3
          "keepalive_time": 30
ShivanshiDhawan added the triage (Issue requires triage) label on Sep 15, 2024
@zuercher
Member

Why wasn't Envoy able to reconnect for 1.5 hours?

Difficult to say, but perhaps the first entry in the DNS result of xx.namespace.svc.cluster.local. continued to be a bad host? (See the discussion of LOGICAL_DNS below.)

Will Envoy keep retrying with its backoff strategy until it reconnects, or is there a maximum number of attempts?

It will continue to try to connect, but will eventually give up waiting for configuration and proceed with whatever static configuration may be available. See the initial_fetch_timeout configuration on ConfigSource.
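
For reference, a minimal sketch of where that field sits in the bootstrap, assuming resources are fetched over ADS (the 15s value is only an illustration, not a recommendation):

 "dynamic_resources":
    "cds_config":
      "ads": {}
      "initial_fetch_timeout": "15s"   # how long Envoy waits for the first response before giving up on the initial fetch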

The XDS server cluster is configured with LOGICAL_DNS service discovery. Does Envoy select a host for every retry, or does it stay on the same host as the last attempt after some number of retries?

https://www.envoyproxy.io/docs/envoy/v1.31.1/intro/arch_overview/upstream/service_discovery#logical-dns

I think each retry will be a new connection attempt. Whether it chooses the same host depends on the cluster's endpoints and the load-balancing policy. Here it goes back to the DNS result and will always choose the first host in the most recent DNS response (this is the definition of LOGICAL_DNS). You might consider whether STRICT_DNS is a better choice here. That will cause Envoy to apply its load-balancing policy, so if there are multiple hosts, each will eventually be attempted. (See the sketch below.)
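
As a sketch of that alternative, only the discovery type on the xDS cluster would change; everything else stays as in the config above (the empty cluster name is left elided as in the original):

 "static_resources":
    "clusters":
    - "name": ""
      "type": "STRICT_DNS"   # was LOGICAL_DNS; STRICT_DNS resolves all returned addresses and load-balances across them
      "lb_policy": "RANDOM"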

The circuit breaker is configured with max_retries of 50, but Envoy retried more than that number of times.
The metric upstream_cx_connect_fail has the value 3034 and upstream_cx_overflow is 0.

That circuit breaker doesn't apply here. Partly this is because the max_retries circuit breaker is the maximum number of concurrent retries on a cluster (e.g. retries as configured in an HttpConnectionManager retry_policy), and partly because I don't think we do the circuit breaker accounting in the xDS client code. In any event, xDS only ever has a single gRPC request open to an XDS server at a given time, so even if the circuit breaker accounting were taking place, it would never hit the limits.
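
For illustration, the kind of retries that max_retries bounds are route-level retries on an HttpConnectionManager, e.g. a sketch like this (the virtual host and cluster names here are placeholders, not from your config):

 "route_config":
    "virtual_hosts":
    - "name": "example_vhost"           # placeholder
      "domains": ["*"]
      "routes":
      - "match": {"prefix": "/"}
        "route":
          "cluster": "example_cluster"  # placeholder
          "retry_policy":
            "retry_on": "connect-failure"
            "num_retries": 3

The max_retries circuit breaker caps how many of those retries can be outstanding against the cluster at the same time; it doesn't limit how many times the xDS client re-establishes its stream.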

zuercher added the question (Questions that are neither investigations, bugs, nor enhancements) and area/xds labels and removed the triage label on Sep 16, 2024
@ShivanshiDhawan
Author

Hey @zuercher,
Thanks for the response. With LOGICAL_DNS we would have the default 5000ms dns_refresh_rate, so eventually the first IP address in the DNS result for xx.namespace.svc.cluster.local. should have pointed to a healthy host. But Envoy still wasn't able to reconnect to the XDS server for 1.5 hours. Another point to note: there were 3 Envoy pods and only 1 of them faced this issue; the remaining Envoy pods reconnected to the XDS server.
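
For completeness, this is the knob I mean; a sketch on the xDS cluster with the default written out explicitly (respect_dns_ttl is an optional alternative we haven't tried):

      "type": "LOGICAL_DNS"
      "dns_refresh_rate": "5s"      # the default when unset; shown only to make the assumption explicit
      "respect_dns_ttl": true       # optional: honor the DNS record TTL instead of a fixed refresh rate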
