Linkerd-destination: unable to connect to validator #11597

Closed
matthiasdeblock opened this issue Nov 9, 2023 · 19 comments · Fixed by linkerd/website#1794

@matthiasdeblock

What is the issue?

Hi

After installing linkerd-cni, the Linkerd pods are unable to start due to the following error:

Unable to connect to validator. Please ensure iptables rules are rewriting traffic as expected error=Host is unreachable (os error 113)

How can it be reproduced?

Install linkerd-cni and Linkerd on a Flatcar Kubernetes 1.28.3 cluster with Cilium as the CNI.

Logs, error output, etc

2023-11-09T11:42:46.686000Z  INFO linkerd_network_validator: Listening for connections on 0.0.0.0:4140
2023-11-09T11:42:46.686030Z DEBUG linkerd_network_validator: token="KXyajGp2VZRdLXMQEEAqBJoJUeNIUUUhajU7NmAqDTmCn9fcj9GyrFcDdlGURTo\n"
2023-11-09T11:42:46.686037Z  INFO linkerd_network_validator: Connecting to 1.1.1.1:20001
2023-11-09T11:42:47.586457Z ERROR linkerd_network_validator: Unable to connect to validator. Please ensure iptables rules are rewriting traffic as expected error=Host is unreachable (os error 113)
2023-11-09T11:42:47.586481Z ERROR linkerd_network_validator: error=Host is unreachable (os error 113)

output of linkerd check -o short

linkerd-existence
-----------------
- No running pods for "linkerd-destination" ^C

Environment

  • Kubernetes-version: 1.28.3
  • Cilium version: 1.14.3
  • Linkerd-cni-version: stable-2.14.3
  • Linkerd-version: stable-2.14.3
  • OS: Flatcar Openstack 3510.2.8

Possible solution

No response

Additional context

No response

Would you like to work on fixing this bug?

maybe

@mateiidavid
Member

@matthiasdeblock hi, it sounds like the validator is detecting an erroneous configuration in your network stack. The validator attempts to connect to a server it creates itself in order to test that iptables destination rewriting works as expected. I see that you're using Cilium. We have a cluster configuration section in our docs aimed at getting Linkerd to work with Cilium. Its socket-level load balancing capability can sometimes mess up routing for other services. Can you check if that's affecting you here?
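
The check described above can be sketched roughly as follows. This is a minimal Python illustration, not the validator's actual code, whose internals are not shown in this thread; `validate_redirect` and its parameters are hypothetical names. It listens on a local port, dials an arbitrary address, and passes only if the connection ends up at its own listener and the random token comes back, which is what should happen when iptables redirects outbound traffic.

```python
import secrets
import socket


def validate_redirect(listen_port, connect_addr, timeout=10.0):
    """Return True if a connection to connect_addr lands on our own listener.

    Hypothetical sketch: the function name and arguments are assumptions.
    """
    token = (secrets.token_hex(16) + "\n").encode()
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        server.bind(("0.0.0.0", listen_port))
        server.listen(1)
        server.settimeout(timeout)
        try:
            with socket.create_connection(connect_addr, timeout=timeout) as client:
                # If the connection was redirected, it is now waiting in our
                # accept queue; otherwise accept() times out.
                conn, _ = server.accept()
                with conn:
                    conn.sendall(token)
                # Read until the full token arrives or the peer closes.
                received = b""
                while len(received) < len(token):
                    chunk = client.recv(len(token) - len(received))
                    if not chunk:
                        break
                    received += chunk
        except OSError:
            # Covers unreachable hosts, refused connections, and timeouts.
            return False
    return received == token
```

With the values from the logs above, the call would be `validate_redirect(4140, ("1.1.1.1", 20001))`, which can only succeed where a redirect rule sends that connection to the local listener.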

@matthiasdeblock
Author

Hi @mateiidavid
I did set 'bpf-lb-sock-hostns-only: "true"' but that did not fix the issue here. Without linkerd-cni everything is working fine.

@mateiidavid
Member

If you think linkerd-cni is the culprit, I'd suggest having a look at some logs. Specifically:

  • Does the installer (linkerd-cni daemonset pod) report anything?
  • Can you get access to kubelet logs to verify whether plugin runs have been unsuccessful?
  • Does your CNI host configuration file contain linkerd-cni's configuration?

I'd perhaps start with the last one if it's easy. It might be that the configuration wasn't appended properly for some reason.

@matthiasdeblock
Author

I'll give it a retry next week. I did check all these but I'll give it another look:

  • The cni pods did not report any issues
  • The plugin installs correctly and is up and running within a couple of seconds
  • The CNI config file mentioned the location of the Cilium CNI plugin conf.

I'll verify this by the beginning of next week.

Regards

@kflynn
Member

kflynn commented Dec 18, 2023

@matthiasdeblock Any joy retrying this?

@kflynn
Member

kflynn commented Jan 4, 2024

@matthiasdeblock Happy new year! Still curious if you got a chance to retry things? 🙂

@matthiasdeblock
Author

Hi
Sorry for the delay, we will be testing again in the upcoming days.
Regards
Matthias

@Driesvanherpe

Hi,

As a colleague of @matthiasdeblock, I'd like to give some extra info about this issue.
Logs of the cni pod:

[2024-04-03 09:12:34] Wrote linkerd CNI binaries to /host/opt/cni/bin
[2024-04-03 09:12:34] Installing CNI configuration for /host/etc/cni/net.d/05-cilium.conflist
[2024-04-03 09:12:34] Using CNI config template from CNI_NETWORK_CONFIG environment variable.
      "k8s_api_root": "https://__KUBERNETES_SERVICE_HOST__:__KUBERNETES_SERVICE_PORT__",
      "k8s_api_root": "https://10.12.0.1:__KUBERNETES_SERVICE_PORT__",
[2024-04-03 09:12:34] CNI config: {
  "name": "linkerd-cni",
  "type": "linkerd-cni",
  "log_level": "info",
  "policy": {
      "type": "k8s",
      "k8s_api_root": "https://10.12.0.1:443",
      "k8s_auth_token": "__SERVICEACCOUNT_TOKEN__"
  },
  "kubernetes": {
      "kubeconfig": "/etc/cni/net.d/ZZZ-linkerd-cni-kubeconfig"
  },
  "linkerd": {
    "incoming-proxy-port": 4143,
    "outgoing-proxy-port": 4140,
    "proxy-uid": 2102,
    "ports-to-redirect": [],
    "inbound-ports-to-ignore": ["4191","4190"],
    "simulate": false,
    "use-wait-flag": false
  }
}
[2024-04-03 09:12:34] Created CNI config /host/etc/cni/net.d/05-cilium.conflist
Setting up watches.
Watches established.

It looks like the config doesn't get written to the file. Contents of /etc/cni/net.d/05-cilium.conflist:

{
  "cniVersion": "0.3.1",
  "name": "cilium",
  "plugins": [
    {
       "type": "cilium-cni",
       "enable-debug": false,
       "log-file": "/var/run/cilium/cilium-cni.log"
    }
  ]
}
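
Whether a CNI configuration file actually chains linkerd-cni can be checked mechanically. The following is a small sketch, assuming the file is valid JSON in either the `.conflist` layout shown above (a `plugins` array) or the single-plugin `.conf` layout (a top-level `type`); the function names are hypothetical.

```python
import json


def chained_plugin_types(conflist_text):
    """Return the plugin types listed in a CNI config, in order."""
    config = json.loads(conflist_text)
    # A .conflist carries a "plugins" array; a plain .conf has one "type".
    if "plugins" in config:
        return [plugin.get("type") for plugin in config["plugins"]]
    return [config.get("type")]


def has_linkerd_cni(conflist_text):
    """True if a plugin of type "linkerd-cni" appears in the config."""
    return "linkerd-cni" in chained_plugin_types(conflist_text)
```

For the file contents shown above, `chained_plugin_types` returns `["cilium-cni"]` and `has_linkerd_cni` returns `False`, which matches the observation that the linkerd-cni entry is missing.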

alpeb added a commit to linkerd/linkerd2-proxy-init that referenced this issue Apr 4, 2024
Fixes linkerd/linkerd2#11597

When the cni plugin is triggered, it validates that the proxy has been
injected into the pod before setting up the iptables rules. It does so
by looking for the "linkerd-proxy" container. However, when the proxy is
injected as a native sidecar, it gets added as an _init_ container, so
it was being disregarded here.

We don't have integration tests for validating native sidecars when
using linkerd-cni because [Calico doesn't work in k3s since k8s
1.27](k3d-io/k3d#1375), and we require k8s
1.29 for using native sidecars.
I did nevertheless successfully test this fix in an AKS cluster.
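
The idea behind the fix in this commit message can be illustrated with a short sketch. It is written in Python against a pod manifest loaded as a dictionary and is not the plugin's actual code; `has_proxy` is a hypothetical helper. The point is that both `spec.containers` and `spec.initContainers` have to be searched, because a native sidecar is declared under `initContainers`.

```python
def has_proxy(pod, name="linkerd-proxy"):
    """True if a container called `name` is a regular or init container."""
    spec = pod.get("spec", {})
    containers = spec.get("containers") or []
    # Native sidecars are declared as init containers, so look there too.
    init_containers = spec.get("initContainers") or []
    return any(c.get("name") == name for c in containers + init_containers)
```

A check that only walks `spec.containers` would return `False` for a pod whose proxy was injected as a native sidecar, which is the case the commit describes.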
@matthiasdeblock
Author

Hi

As our cluster is air-gapped, I noticed that 1.1.1.1 isn't a usable connection address. I've changed this in our Helm chart, and we are now getting a bit further but still running into an error:

flatcar-k8stest-master-01 ~ # kubectl -n linkerd logs linkerd-destination-56f777c8b6-8sw9c -c linkerd-network-validator

2024-04-05T05:33:28.979251Z  INFO linkerd_network_validator: Listening for connections on 0.0.0.0:4140
2024-04-05T05:33:28.979293Z DEBUG linkerd_network_validator: token="y3SgDWabwG6jtxhXFrYYBB4cSHHiSKjbSsaDV29f89tkwrWjmJXtvMz9lmyWb5p\n"
2024-04-05T05:33:28.979308Z  INFO linkerd_network_validator: Connecting to <kubernetes_api>:6443
2024-04-05T05:33:28.981087Z DEBUG connect: linkerd_network_validator: Connected client.addr=10.12.70.197:57332
2024-04-05T05:33:38.980507Z ERROR linkerd_network_validator: Failed to validate networking configuration. Please ensure iptables rules are rewriting traffic as expected. timeout=10s

(using the Kubernetes API IP as the address to connect to)

So it now connects but is still throwing an error.

Regards
Matthias

@alpeb
Member

alpeb commented Apr 10, 2024

Thanks @matthiasdeblock for circling back. With version 2.14.9 we have added a cni-repair-controller component that should detect race conditions between the cluster's CNI and linkerd-cni. You can enable it via the linkerd2-cni chart value repairController.enabled=true.
If that doesn't do the trick, there's another fix in linkerd/linkerd2-proxy-init#360 that might work for you, so please let me know and I can provide an image to test that out.

@matthiasdeblock
Author

Hi

The cni-repair-controller just keeps restarting the linkerd control plane. This isn't fixing the issue.

You have linked linkerd/linkerd2-proxy-init#362 as well. Could this be the issue we are running into?

Regards
Matthias

@alpeb
Member

alpeb commented Apr 11, 2024

I linked linkerd/linkerd2-proxy-init#362 by mistake. That should be unrelated unless you're using native sidecars too.
I've published the image ghcr.io/alpeb/cni-plugin:modify with the change from linkerd/linkerd2-proxy-init#360. It would be great if you could give that a try.

@matthiasdeblock
Author

Hi
I have tested the image you provided, but it still throws the same error:

flatcar-k8stest-master-01 ~ # kubectl -n linkerd logs linkerd-destination-6c49f479d8-946ww -c linkerd-network-validator
2024-04-19T09:51:52.016640Z  INFO linkerd_network_validator: Listening for connections on 0.0.0.0:4140
2024-04-19T09:51:52.016683Z DEBUG linkerd_network_validator: token="CMiU50KsdnCBztqVH5xUXcVHqbfhqE960BJEpwoj5GTJLiftg9qQJ3JmT6KLssx\n"
2024-04-19T09:51:52.016689Z  INFO linkerd_network_validator: Connecting to 172.24.214.93:6443
2024-04-19T09:51:52.018141Z DEBUG connect: linkerd_network_validator: Connected client.addr=10.12.71.143:44382
2024-04-19T09:52:02.017854Z ERROR linkerd_network_validator: Failed to validate networking configuration. Please ensure iptables rules are rewriting traffic as expected. timeout=10s

@matthiasdeblock
Author

@mateiidavid , any news on this one?

@mateiidavid
Member

@matthiasdeblock sorry, I think this was closed automatically when I hit the merge button on the PR above. Since it did not fix your issue, I'm going to re-open this.

@matthiasdeblock
Author

Hi @mateiidavid
Any news on this one?

I have changed the timeout from 10s to 60s and now I am getting a different error:

flatcar-k8stest-master-01 ~ # kubectl -n linkerd logs linkerd-destination-f7b89b9db-qjxb7 -c linkerd-network-validator -f
2024-06-06T09:52:05.455607Z  INFO linkerd_network_validator: Listening for connections on 0.0.0.0:4140
2024-06-06T09:52:05.455666Z DEBUG linkerd_network_validator: token="8NAaWTB0bQ7E5FcrUPpyWs8OOdpq1xnlMJElrWZ9RrN3ssRWdPSvVVBDwnykGOQ\n"
2024-06-06T09:52:05.455762Z  INFO linkerd_network_validator: Connecting to 172.24.214.93:6443
2024-06-06T09:52:05.456775Z DEBUG connect: linkerd_network_validator: Connected client.addr=10.12.69.107:47754
2024-06-06T09:52:37.458317Z DEBUG connect: linkerd_network_validator: Read message from server bytes=0
2024-06-06T09:52:37.458513Z DEBUG linkerd_network_validator: data="" size=0
2024-06-06T09:52:37.458543Z ERROR linkerd_network_validator: error=expected client to receive "8NAaWTB0bQ7E5FcrUPpyWs8OOdpq1xnlMJElrWZ9RrN3ssRWdPSvVVBDwnykGOQ\n"; got "" instead

So it is still connecting to the same address, 172.24.214.93:6443, which is our Kubernetes API, but it is now throwing a different error.

Thank you!
Regards
Matthias
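
Three distinct validator failures appear in the logs in this thread: "Host is unreachable", a validation timeout after a successful connect, and a token mismatch. A small sketch for telling them apart in captured log text follows. The function name and the category labels are hypothetical, and the matching relies only on the message strings quoted in this thread.

```python
import re


def classify_validator_failure(log_text):
    """Map linkerd-network-validator log text to a coarse failure label."""
    if "Host is unreachable" in log_text:
        # The connect address could not be reached at all.
        return "unreachable"
    if re.search(r"Failed to validate networking configuration.*timeout=", log_text):
        # The connection was established but no token arrived in time.
        return "timeout"
    if "expected client to receive" in log_text:
        # Something answered, but not with the validator's own token.
        return "token-mismatch"
    return "unknown"
```

In the two later cases the client did connect to the target address, which suggests the traffic reached the real destination rather than the validator's own listener. That reading is an interpretation, not something the log lines state directly.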

@matthiasdeblock
Author

Hi

I have switched Linkerd to the latest edge-24.5.5 and the CNI to 1.5.0, and I have also set the timeout to 30s. Still the same issue:

flatcar-k8stest-master-01 ~ # kubectl -n linkerd logs linkerd-destination-749d567f64-rnmhl -c linkerd-network-validator -f
2024-06-06T12:02:02.055672Z  INFO linkerd_network_validator: Listening for connections on 0.0.0.0:4140
2024-06-06T12:02:02.055715Z DEBUG linkerd_network_validator: token="FxnawK939yIxs5SAvEnQ9ii4QLecvKoWZRgGRMgOcrzwwRaWCyIbaxzorU79K5G\n"
2024-06-06T12:02:02.055729Z  INFO linkerd_network_validator: Connecting to 172.24.214.93:6443
2024-06-06T12:02:02.057521Z DEBUG connect: linkerd_network_validator: Connected client.addr=10.12.69.211:47982
2024-06-06T12:02:32.057580Z ERROR linkerd_network_validator: Failed to validate networking configuration. Please ensure iptables rules are rewriting traffic as expected. timeout=30s

@matthiasdeblock
Author

matthiasdeblock commented Jun 7, 2024

Hi

I've looked into this more closely myself and found the issue. It seems Cilium needed this config:

cni.exclusive=false

cni-exclusive: "false"

What the option does when enabled: Cilium takes ownership of the /etc/cni/net.d directory on the node, renaming all non-Cilium CNI configurations to *.cilium_bak. This ensures no Pods can be scheduled using other CNI plugins during Cilium agent downtime. Setting it to false turns this behavior off.

Source: https://docs.cilium.io/en/stable/helm-reference/
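
One way to spot this on a node is to look for the renamed files. The sketch below assumes only the `*.cilium_bak` suffix and the /etc/cni/net.d directory named in the quoted description; `renamed_cni_configs` is a hypothetical helper and must run somewhere the node's CNI directory is readable.

```python
from pathlib import Path


def renamed_cni_configs(conf_dir="/etc/cni/net.d"):
    """List CNI config files carrying the *.cilium_bak suffix."""
    # A missing directory simply yields an empty list.
    return sorted(path.name for path in Path(conf_dir).glob("*.cilium_bak"))
```

A non-empty result suggests that another plugin's configuration was renamed on that node.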

@alpeb
Member

alpeb commented Jun 25, 2024

Thanks for the feedback @matthiasdeblock ! I've confirmed the fix and pushed some updates to our docs.

alpeb added a commit to linkerd/website that referenced this issue Jul 8, 2024
* Add notes about Cilium's exclusive mode

Closes linkerd/linkerd2#11597

Co-authored-by: Flynn <kflynn@users.noreply.github.com>
Co-authored-by: William Morgan <william@buoyant.io>
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 8, 2024