karmada controller reconcile performs performance optimization #5790

Open
CharlesQQ opened this issue Nov 6, 2024 · 5 comments

@CharlesQQ
Contributor

CharlesQQ commented Nov 6, 2024

What would you like to be added:

func (d *ResourceDetector) LookForMatchedPolicy(object *unstructured.Unstructured, objectKey keys.ClusterWideKey) (*policyv1alpha1.PropagationPolicy, error) {
	if len(objectKey.Namespace) == 0 {
		return nil, nil
	}
	klog.V(2).Infof("Attempts to match policy for resource(%s)", objectKey)
	policyObjects, err := d.propagationPolicyLister.ByNamespace(objectKey.Namespace).List(labels.Everything())
	if err != nil {
		klog.Errorf("Failed to list propagation policy: %v", err)
		return nil, err
	}
	if len(policyObjects) == 0 {
		klog.V(2).Infof("No propagationpolicy find in namespace(%s).", objectKey.Namespace)
		return nil, nil
	}
	policyList := make([]*policyv1alpha1.PropagationPolicy, 0)
	for index := range policyObjects {
		policy := &policyv1alpha1.PropagationPolicy{}
		if err = helper.ConvertToTypedObject(policyObjects[index], policy); err != nil {
			klog.Errorf("Failed to convert PropagationPolicy from unstructured object: %v", err)
			return nil, err
		}
		if !policy.DeletionTimestamp.IsZero() {
			klog.V(4).Infof("Propagation policy(%s/%s) cannot match any resource template because it's being deleted.", policy.Namespace, policy.Name)
			continue
		}
		policyList = append(policyList, policy)
	}
	return getHighestPriorityPropagationPolicy(policyList, object, objectKey), nil
}

  • To find the matching PropagationPolicy, every PropagationPolicy in the namespace is listed and a for loop checks them one by one. With many policies, this loop may take too long.
    In our environment there are about 7000 pp and Deployment resources.
  • In ConvertToTypedObject, the call to runtime.DefaultUnstructuredConverter.FromUnstructured takes a long time to perform the type conversion. Can this function call be removed?

Is there a better way to address the problems above?
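One possible direction, sketched here with the standard library only: read the few fields needed for a cheap check straight from the unstructured map, and pay for the full conversion only on policies that survive it. Everything in this sketch is hypothetical (the `policy` type, `convert`, `isDeleting`, `selectsKind`, `candidates`, `newObj`), the JSON round trip merely stands in for the expensive conversion, and real matching in karmada considers more than the resource kind.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// policy is a hypothetical, trimmed-down stand-in for PropagationPolicy.
type policy struct {
	Metadata struct {
		Name string `json:"name"`
	} `json:"metadata"`
	Spec struct {
		ResourceSelectors []struct {
			Kind string `json:"kind"`
		} `json:"resourceSelectors"`
	} `json:"spec"`
}

// convert stands in for the expensive typed conversion (a JSON round trip here).
func convert(obj map[string]interface{}) (*policy, error) {
	raw, err := json.Marshal(obj)
	if err != nil {
		return nil, err
	}
	p := &policy{}
	if err := json.Unmarshal(raw, p); err != nil {
		return nil, err
	}
	return p, nil
}

// isDeleting reports whether metadata.deletionTimestamp is set, without converting.
func isDeleting(obj map[string]interface{}) bool {
	meta, _ := obj["metadata"].(map[string]interface{})
	return meta["deletionTimestamp"] != nil
}

// selectsKind reads spec.resourceSelectors[*].kind straight from the map.
func selectsKind(obj map[string]interface{}, kind string) bool {
	spec, _ := obj["spec"].(map[string]interface{})
	selectors, _ := spec["resourceSelectors"].([]interface{})
	for _, s := range selectors {
		if m, ok := s.(map[string]interface{}); ok && m["kind"] == kind {
			return true
		}
	}
	return false
}

// candidates converts only the policies that pass the cheap checks.
func candidates(objs []map[string]interface{}, kind string) ([]*policy, error) {
	var out []*policy
	for _, obj := range objs {
		if isDeleting(obj) || !selectsKind(obj, kind) {
			continue
		}
		p, err := convert(obj)
		if err != nil {
			return nil, err
		}
		out = append(out, p)
	}
	return out, nil
}

// newObj builds a toy unstructured policy for the demo.
func newObj(name, kind string, deleting bool) map[string]interface{} {
	meta := map[string]interface{}{"name": name}
	if deleting {
		meta["deletionTimestamp"] = "2024-11-06T00:00:00Z"
	}
	return map[string]interface{}{
		"metadata": meta,
		"spec": map[string]interface{}{
			"resourceSelectors": []interface{}{map[string]interface{}{"kind": kind}},
		},
	}
}

func main() {
	objs := []map[string]interface{}{
		newObj("pp-a", "Deployment", false),
		newObj("pp-b", "ConfigMap", false),
		newObj("pp-c", "Deployment", true),
	}
	got, err := candidates(objs, "Deployment")
	if err != nil {
		panic(err)
	}
	fmt.Printf("converted %d of %d policies\n", len(got), len(objs))
}
```

Whether a pre-filter like this is acceptable depends on how much of the matching logic can be expressed on raw fields; that is not something this thread settles.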

Why is this needed:
As shown below, the resource_match_policy_duration_seconds_bucket metric shows that matching took more than 0.5 seconds, and sometimes more than 0.9 seconds. This can make each reconcile of the resource detector controller too slow and build up a long backlog in the workqueue.
image
image

By looking at the pprof cpu profile, we found that the function ConvertToTypedObject takes a long time to execute.
image

@CharlesQQ CharlesQQ added the kind/feature label (Categorizes issue or PR as related to a new feature.) Nov 6, 2024
@CharlesQQ CharlesQQ changed the title from "resource_match_policy takes too long time" to "Resource Detector controller reconcile performs performance optimization" Nov 8, 2024
@CharlesQQ
Contributor Author

CharlesQQ commented Nov 11, 2024

After removing the outermost for loop and the call to ConvertToTypedObject, testing shows that the performance of the resource detector controller improves significantly.

Test setup: a 4-core machine as the master node, creating 1000 Deployments and pp.

Before: 5 minutes 45 seconds

After: 2 minutes 30 seconds
image

image

@RainbowMango
Member

/assign @CharlesQQ
In favor of #5802

@CharlesQQ
Contributor Author

CharlesQQ commented Nov 12, 2024

After raising --concurrent-resourcebinding-syncs from 5 to 30, the resourcebinding controller's queue backlog barely improves; it still takes almost 5 minutes.

The problem is in the ensureWork function.
Below is the time spent in each function, taken from my timing logs:

  • ApplyOverridePolicies: 1.025481204 + 1.181936987 = 2.207418191s, about 79% of the cost
    • Client.List Policies: 0.980144234 + 0.811084427 = 1.791228661s, about 64% of the cost
  • CreateOrUpdateWork: 0.426976309 + 0.048349991 = 0.4753263s, about 17% of the cost
I1113 14:36:11.161745      18 overridemanager.go:181] applyNamespacedOverrides, list configmap-beta-moa-demo-prod-681,  cluster member-cluster1, cost: 0.980144234
I1113 14:36:11.198787      18 overridemanager.go:217] getOverridersFromOverridePolicies  cluster: member-cluster1  name: configmap-beta-moa-demo-prod-681    cost: 300.506µs
I1113 14:36:11.206905      18 overridemanager.go:98] ApplyOverridePolicies configmap-beta-moa-demo-prod-681, cluster member-cluster1, cost: 1.025481204
I1113 14:36:11.392423      18 work.go:101] RetryOnConflict CreateOrUpdate workload configmap-beta-moa-demo-prod-681, namespace karmada-es-member-cluster1, cost 0.18521976
I1113 14:36:11.633971      18 common.go:151] CreateOrUpdateWork configmap-beta-moa-demo-prod-681,   cost: 0.426976309
I1113 14:36:12.445315      18 overridemanager.go:181] applyNamespacedOverrides, list configmap-beta-moa-demo-prod-681,  cluster member-cluster2, cost: 0.811084427
I1113 14:36:12.532025      18 overridemanager.go:217] getOverridersFromOverridePolicies  cluster: member-cluster2  name: configmap-beta-moa-demo-prod-681    cost: 4.289999ms
I1113 14:36:12.815987      18 overridemanager.go:98] ApplyOverridePolicies configmap-beta-moa-demo-prod-681, cluster member-cluster2, cost: 1.181936987
I1113 14:36:12.864306      18 work.go:101] RetryOnConflict CreateOrUpdate workload configmap-beta-moa-demo-prod-681, namespace karmada-es-member-cluster2, cost 0.048121342
I1113 14:36:12.864415      18 common.go:151] CreateOrUpdateWork configmap-beta-moa-demo-prod-681,   cost: 0.048349991
I1113 14:36:12.864438      18 common.go:153] End for range configmap-beta-moa-demo-prod-681-configmap, cost: 2.683047699
I1113 14:36:12.864466      18 binding_controller.go:133] Ensure work configmap-beta-moa-demo-prod-681-configmap, cost: 2.6830823759999998
I1113 14:36:12.864529      18 binding_controller.go:74] ResourceBinding reconcile  default/configmap-beta-moa-demo-prod-681-configmap cost: 2.790846222
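The percentages quoted above can be reproduced from the log lines. The sketch below assumes the denominator is the full "ResourceBinding reconcile" cost of 2.790846222s; that choice is an inference from the numbers, not something stated in the thread. `sharePercent` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"math"
)

// sharePercent returns part/total as a whole-number percentage.
func sharePercent(part, total float64) int {
	return int(math.Round(100 * part / total))
}

func main() {
	// Assumed denominator: the "ResourceBinding reconcile ... cost" log line.
	const reconcile = 2.790846222

	// Sums over both member clusters, copied from the log.
	applyOverride := 1.025481204 + 1.181936987
	listPolicies := 0.980144234 + 0.811084427
	createOrUpdate := 0.426976309 + 0.048349991

	fmt.Printf("ApplyOverridePolicies: %.3fs (%d%%)\n", applyOverride, sharePercent(applyOverride, reconcile))
	fmt.Printf("Client.List Policies:  %.3fs (%d%%)\n", listPolicies, sharePercent(listPolicies, reconcile))
	fmt.Printf("CreateOrUpdateWork:    %.3fs (%d%%)\n", createOrUpdate, sharePercent(createOrUpdate, reconcile))
}
```

With that denominator the shares come out at 79%, 64% and 17%, matching the figures in the list.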

It seems that deepcopy costs the most time:
image

image image image

After turning off list deepcopy, the list time is reduced to 0.1s or even lower.

I1113 15:43:39.212238      17 overridemanager.go:182] applyNamespacedOverrides, list beta-moa-demo-prod-283,  cluster member-cluster1, cost: 0.040210384
I1113 15:43:39.251189      17 overridemanager.go:182] applyNamespacedOverrides, list beta-moa-demo-prod-97,  cluster member-cluster2, cost: 0.018000362
I1113 15:43:39.314731      17 overridemanager.go:182] applyNamespacedOverrides, list beta-moa-demo-prod-283,  cluster member-cluster2, cost: 0.023638913
image
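The thread does not show how deep copy was turned off. As a rough illustration of the trade-off, here is a standard-library sketch of a toy cache whose list call can either copy every object or hand out the cached pointers directly. All names (`overridePolicy`, `cache`, `list`) are hypothetical. If the list goes through a controller-runtime cache-backed client, that library has an option named UnsafeDisableDeepCopy for cache reads; whether that is what was used here is an assumption.

```go
package main

import "fmt"

// overridePolicy is a hypothetical stand-in for a cached policy object.
type overridePolicy struct {
	Name       string
	Overriders []string
}

// deepCopy returns a copy that shares no memory with the receiver.
func (p *overridePolicy) deepCopy() *overridePolicy {
	out := &overridePolicy{Name: p.Name}
	out.Overriders = append([]string(nil), p.Overriders...)
	return out
}

// cache is a toy informer-style store.
type cache struct{ items []*overridePolicy }

// list returns the cached objects. With disableDeepCopy set it returns the
// cached pointers themselves: cheaper, but callers must treat them as read-only.
func (c *cache) list(disableDeepCopy bool) []*overridePolicy {
	out := make([]*overridePolicy, 0, len(c.items))
	for _, p := range c.items {
		if disableDeepCopy {
			out = append(out, p)
		} else {
			out = append(out, p.deepCopy())
		}
	}
	return out
}

func main() {
	c := &cache{items: []*overridePolicy{{Name: "op-1", Overriders: []string{"image"}}}}

	copied := c.list(false)
	copied[0].Overriders[0] = "changed"
	fmt.Println(c.items[0].Overriders[0]) // cache is untouched by the write

	shared := c.list(true)
	shared[0].Overriders[0] = "changed"
	fmt.Println(c.items[0].Overriders[0]) // cache was modified through the shared pointer
}
```

The saving comes from skipping one copy per listed object on every call; the cost is that any code path that mutates a listed object now corrupts the shared cache.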

@CharlesQQ CharlesQQ changed the title from "Resource Detector controller reconcile performs performance optimization" to "karmada controller reconcile performs performance optimization" Nov 13, 2024
@CharlesQQ
Contributor Author

For the resource detector controller, setting --concurrent-resource-template-syncs=60 reduces the queue backlog from 16 minutes to less than 1 minute.

@CharlesQQ
Contributor Author

CharlesQQ commented Nov 22, 2024

With --kube-api-qps=200 --kube-api-burst=300, the binding controller's queue backlog is reduced to 1 minute and 15 seconds.
However, the client-side rate limit cannot be eliminated completely.

image
