SecondaryNetwork OVS ports cannot be deleted correctly after an Agent restart #6578

Open
antoninbas opened this issue Jul 31, 2024 · 1 comment
Labels
area/secondary-network Issues or PRs related to support for secondary networks in Antrea kind/bug Categorizes issue or PR as related to a bug.

Describe the bug
After creating a Pod with an Antrea SecondaryNetwork interface and restarting the Antrea Agent (e.g. a planned restart as part of an upgrade), deleting the Pod does not remove the corresponding OVS port on the secondary bridge.

To Reproduce

First, prepare a Kind cluster with Antrea installed and SecondaryNetwork configured.
The easiest way to do that is to modify the ci/kind/test-secondary-network-kind.sh script as follows:

diff --git a/ci/kind/test-secondary-network-kind.sh b/ci/kind/test-secondary-network-kind.sh
index 6d437ab21..bc411de14 100755
--- a/ci/kind/test-secondary-network-kind.sh
+++ b/ci/kind/test-secondary-network-kind.sh
@@ -116,7 +116,8 @@ function run_test {
   sleep 5
   kubectl apply -f $ATTACHMENT_DEFINITION_YAML
   kubectl apply -f $SECONDARY_NETWORKS_YAML
-
+  echo "READY"
+  sleep 3600
   go test -v -timeout=$TIMEOUT antrea.io/antrea/test/e2e-secondary-network -run=TestVLANNetwork -provider=kind $TEST_OPTIONS
 }

With this diff applied, the script will configure a Kind testbed with Antrea and all the necessary resources, but it will not run e2e tests and it will wait for one hour before cleaning up the testbed.

When READY is displayed, the testbed is ready, and you can follow these steps:

  1. Create a test Pod
apiVersion: v1
kind: Pod
metadata:
 name: sample-pod
 labels:
   app: antrea-secondary-network-demo
 annotations:
   k8s.v1.cni.cncf.io/networks: vlan-net1@eth100
spec:
 containers:
 - name: toolbox
   image: antrea/toolbox:latest
  2. Check that the eth100 interface is created correctly for the Pod
$ kubectl exec -ti sample-pod -- ip addr show eth100
3: eth100@if11: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1200 qdisc noqueue state UP group default
    link/ether 9a:1a:8f:58:2d:03 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 148.14.24.2/24 brd 148.14.24.255 scope global eth100
       valid_lft forever preferred_lft forever
  3. You can also check the OVS configuration on the Node
$ kubectl -n kube-system exec -ti <agent name> -c antrea-ovs -- ovs-vsctl show
d7830773-4e40-423e-aee3-7907213c0ee6
    Bridge br-secondary
        datapath_type: system
        Port "eth1~"
            Interface "eth1~"
        Port sample-p-162ce1
            tag: 100
            Interface sample-p-162ce1
        Port eth1
            Interface eth1
                type: internal
    Bridge br-int
        datapath_type: system
        Port sample-p-acccb0
            Interface sample-p-acccb0
        Port local-pa-0e1213
            Interface local-pa-0e1213
        Port antrea-tun0
            Interface antrea-tun0
                type: geneve
                options: {key=flow, remote_ip=flow}
        Port antrea-gw0
            Interface antrea-gw0
                type: internal
    ovs_version: "2.17.7"
  4. Delete the antrea-agent Pod on the Node where the Pod is running, to cause a restart
  5. Delete the test Pod (sample-pod)
  6. Check the OVS configuration again.
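
The last three steps (restart the agent, delete the test Pod, check OVS again) can be scripted. The following is a minimal sketch, not part of the original report. It assumes a default Antrea install where agent Pods carry the labels app=antrea,component=antrea-agent and belong to the antrea-agent DaemonSet in kube-system. The function names agent_on_node and reproduce are hypothetical, and the rollout wait may need adjusting if the DaemonSet status lags behind the Pod deletion.

```shell
#!/usr/bin/env bash
# Sketch of the restart / delete / check steps from the list above.
# Assumptions (not from the issue): the agent label selector and the
# DaemonSet name below match a default Antrea deployment.
set -eu

POD_NAME="${POD_NAME:-sample-pod}"
AGENT_SELECTOR="${AGENT_SELECTOR:-app=antrea,component=antrea-agent}"

# Print the name of the antrea-agent Pod running on the given Node.
agent_on_node() {
  kubectl -n kube-system get pods -l "$AGENT_SELECTOR" \
    --field-selector "spec.nodeName=$1" \
    -o jsonpath='{.items[0].metadata.name}'
}

reproduce() {
  local node agent
  # Find the Node hosting the test Pod, then the agent Pod on that Node.
  node="$(kubectl get pod "$POD_NAME" -o jsonpath='{.spec.nodeName}')"
  agent="$(agent_on_node "$node")"

  # Delete the agent Pod to force a restart, then wait for the replacement.
  kubectl -n kube-system delete pod "$agent"
  kubectl -n kube-system rollout status daemonset/antrea-agent

  # Delete the test Pod.
  kubectl delete pod "$POD_NAME"

  # Dump the OVS configuration from the new agent Pod.
  agent="$(agent_on_node "$node")"
  kubectl -n kube-system exec "$agent" -c antrea-ovs -- ovs-vsctl show
}
```

Calling reproduce once the test Pod is running and eth100 is up should print the OVS configuration in which the stale port can be looked for.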

Expected
The OVS port (sample-p-<X>) should have been deleted.

Actual behavior
The OVS port is still present and in an error state:

$ kubectl -n kube-system exec -ti <agent name> -c antrea-ovs -- ovs-vsctl show
d7830773-4e40-423e-aee3-7907213c0ee6
    Bridge br-secondary
        datapath_type: system
        Port "eth1~"
            Interface "eth1~"
        Port sample-p-162ce1
            tag: 100
            Interface sample-p-162ce1
                error: "could not open network device sample-p-162ce1 (No such device)"
        Port eth1
            Interface eth1
                type: internal
    Bridge br-int
        datapath_type: system
        Port local-pa-0e1213
            Interface local-pa-0e1213
        Port antrea-tun0
            Interface antrea-tun0
                type: geneve
                options: {key=flow, remote_ip=flow}
        Port antrea-gw0
            Interface antrea-gw0
                type: internal
    ovs_version: "2.17.7"

The Pod netns has been deleted, and the veth interface was deleted along with it. As a result, the OVS port is still present but refers to an interface that no longer exists.
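
To spot such leftover ports without reading the full output, the ovs-vsctl show text can be filtered. This is a minimal sketch, not part of the original report: the function name stale_ovs_ports is hypothetical, and it assumes the output layout shown above, where each "Port" line precedes the "error:" line of its interface.

```shell
# Print the name of every OVS port whose interface reports
# "No such device" in `ovs-vsctl show` output read from stdin.
stale_ovs_ports() {
  awk '
    # Remember the most recent port name, stripping optional quotes.
    $1 == "Port" { port = $2; gsub(/"/, "", port) }
    # An error line belongs to the interface of the last port seen.
    $1 == "error:" && /No such device/ { print port }
  '
}
```

For example, `kubectl -n kube-system exec <agent name> -c antrea-ovs -- ovs-vsctl show | stale_ovs_ports` would print sample-p-162ce1 for the output above. As a possible manual workaround, which this issue has not verified, a listed port could be removed with `ovs-vsctl --if-exists del-port br-secondary <port>` run in the antrea-ovs container.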

Versions:
Antrea v2.1.0, as well as the current top-of-tree.

Additional context
This is part of a wider issue where the state of the SecondaryNetwork Controller and InterfaceStore is not restored properly when the Agent restarts.

cc @jianjuns

@antoninbas antoninbas added kind/bug Categorizes issue or PR as related to a bug. area/secondary-network Issues or PRs related to support for secondary networks in Antrea labels Jul 31, 2024

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment, or this will be closed in 90 days

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 30, 2024
@antoninbas antoninbas removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 30, 2024