
Fix race condition in strict mode #306

Merged: 3 commits into aws:main on Sep 25, 2024

Conversation

Pavani-Panakanti
Contributor

Issue #, if available:
Fixing a race condition where shared ebpf maps are being deleted in strict mode

Description of changes:
Added a new map that stores the progFD-to-pods mapping, which will help us check whether prog FDs are being shared between pods.
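
As a rough sketch of the idea (names and types here are illustrative, not the exact fields this PR adds): keep a reverse mapping from prog FD to the pods using it, and consult it before any delete.

```go
package tracker

import "sync"

// Illustrative sketch only; the real agent keeps similar state inside the
// PolicyEndpointsReconciler rather than in a standalone type.
type progTracker struct {
	mu       sync.Mutex
	fdToPods map[int]map[string]struct{} // prog FD -> set of pod identifiers
}

func newProgTracker() *progTracker {
	return &progTracker{fdToPods: map[int]map[string]struct{}{}}
}

// track records that a pod is reusing the given prog FD.
func (t *progTracker) track(fd int, pod string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.fdToPods[fd] == nil {
		t.fdToPods[fd] = map[string]struct{}{}
	}
	t.fdToPods[fd][pod] = struct{}{}
}

// isShared reports whether any pod other than the one being cleaned up still
// references the prog FD; if so, the eBPF program and its maps must not be deleted.
func (t *progTracker) isShared(fd int, cleanupPod string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	for pod := range t.fdToPods[fd] {
		if pod != cleanupPod {
			return true
		}
	}
	return false
}
```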

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@Pavani-Panakanti requested a review from a team as a code owner on September 20, 2024 01:48
@achevuru
Contributor

Generic Q: How is the BPF prog delete going through when the said prog is still (loaded &) attached to another active pod interface on the same node?

@@ -197,6 +197,33 @@ func (r *PolicyEndpointsReconciler) cleanUpPolicyEndpoint(ctx context.Context, r
return nil
}

func (r *PolicyEndpointsReconciler) isProgFdShared(ctx context.Context, targetPodName string,
Contributor

I think we can optimize this, as a pod will always have both ingress and egress probes attached; there will never be a scenario where only one of the probes is attached for any particular pod. We could probably collapse the maps into one unless there are plans to use them for something else.

Contributor Author

@achevuru Yeah, using just one map to store the reverse mapping from prog FDs to the pod list makes sense. I will keep just the ingress one and we can remove the new egress map.

Contributor Author

We will do this optimization in a separate PR, as this change is blocking customers.
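
For reference, a rough sketch of what that deferred optimization could look like (hypothetical names; the only change from the sketch in the description is that the reverse map is keyed solely by the ingress prog FD, since both probes are always attached together):

```go
package tracker

import "sync"

// Hypothetical follow-up: a single reverse map, keyed by the ingress prog FD,
// replaces the separate ingress and egress maps, because every pod always has
// both probes attached and they share the same lifetime.
type collapsedTracker struct {
	mu              sync.Mutex
	ingressFdToPods map[int]map[string]struct{} // ingress prog FD -> pods sharing it
}
```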

@Pavani-Panakanti
Contributor Author

Generic Q: How is the BPF prog delete going through when the said prog is still (loaded &) attached to another active pod interface on the same node?

We assume it is not being shared by any other active pod by checking the policy endpoint list. But in strict mode, when a new pod is created and not yet reconciled, we attach the probes and update the maps even though the pod is not yet present in the policy endpoint list: https://github.com/aws/aws-network-policy-agent/blob/main/pkg/rpc/rpc_handler.go#L74

{"level":"info","ts":"2024-09-17T16:14:51.158Z","caller":"ebpf/bpf_client.go:724","msg":"This can be deleted, not needed anymore..."}
{"level":"info","ts":"2024-09-17T16:14:51.158Z","caller":"maps/loader.go:505","msg":"Delete map entry done with fd : 0 and err errno 0"}

@jayanthvn
Contributor

Nice to hear from you @achevuru :)

To summarize: in strict mode, when we get pod 2 of the same replica set while pod 1 is already on the node, we see that pod 1 exists, reuse the FDs, and update the local data structures, i.e., the pod -> FDs mapping. The issue happens when there is a pod 1 delete at the same time and the PE reconcile is out of order, i.e., we get the pod 1 delete in the first reconciliation loop and the pod 2 add in the second. In the first loop we don't do the additional check (shared replicas) and mark pod 1 and the bpf file to be deleted. When the second reconciliation arrives we just check the local data structure (the mapping) and skip loading the bpf prog/map, so pod 2's ENI ends up with a probe pointing at an invalid FD.
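
In other words, the fix is to consult that reverse mapping during cleanup, so an out-of-order delete cannot remove state another pod still needs. A simplified sketch (not the actual cleanUpPolicyEndpoint code; helper names are illustrative):

```go
package tracker

// cleanupPod is a simplified sketch of the guard this PR adds: the
// progFD -> pods mapping is checked before the pinned program and maps are
// removed, so pod 2 never ends up holding an invalid FD.
func cleanupPod(pod string, progFd int, fdToPods map[int]map[string]struct{}) {
	delete(fdToPods[progFd], pod)
	if len(fdToPods[progFd]) > 0 {
		// Another pod (e.g. pod 2 of the same replica set) still shares this
		// prog FD: keep the BPF program, maps, and pin path in place.
		return
	}
	// No other pod references this prog FD, so it is safe to detach the
	// probes and delete the pinned program and maps.
	deleteProgAndMaps(progFd)
}

// deleteProgAndMaps stands in for the agent's actual detach/unpin/delete calls.
func deleteProgAndMaps(progFd int) {}
```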

@@ -211,11 +238,10 @@ func (r *PolicyEndpointsReconciler) updatePolicyEnforcementStatusForPods(ctx con
deletePinPath := true
podIdentifier := utils.GetPodIdentifier(targetPod.Name, targetPod.Namespace, r.log)
Contributor

Do we still need podIdentifier? Seems like only used in the below log?

Contributor Author

Yeah, we can remove that one. I will add the optimization and clean-up of this line in the next PR.

Contributor

@jayanthvn left a comment

We can do the cleanup post-merge. We should also have a way to dump all these internal structures, maybe via a CLI, to help debug any issues.

@Pavani-Panakanti
Contributor Author

We can do the cleanup post-merge. We should also have a way to dump all these internal structures, maybe via a CLI, to help debug any issues.

Agreed. Logging these structs will help us with debugging. I will add a task and follow up on this.
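
One possible shape for that follow-up (purely illustrative; the agent has no such command today) would be to marshal the internal mappings to JSON and expose them through a debug CLI or endpoint:

```go
package tracker

import "encoding/json"

// dumpInternalState renders the pod -> prog FDs and prog FD -> pods mappings
// as indented JSON so they can be inspected while debugging issues like this
// one. Illustrative only; field names and the overall shape are assumptions.
func dumpInternalState(podToProgFds map[string][]int, progFdToPods map[int][]string) (string, error) {
	state := struct {
		PodToProgFds map[string][]int `json:"podToProgFds"`
		ProgFdToPods map[int][]string `json:"progFdToPods"`
	}{podToProgFds, progFdToPods}

	out, err := json.MarshalIndent(state, "", "  ")
	if err != nil {
		return "", err
	}
	return string(out), nil
}
```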

@Pavani-Panakanti merged commit fb4dd87 into aws:main on Sep 25, 2024
4 checks passed