feat: More optimal IterateHierarchyV2 and iterateChildrenV2 [#600] #601

andrii-korotkov-verkada · 2024-07-04T20:29:39Z

Closes #600

The existing (effectively v1) implementations are suboptimal since they don't construct a graph before the iteration. They search for children by looking at all namespace resources and checking isParentOf, which can give O(tree_size * namespace_resources_count) time complexity. The v2 algorithms construct the graph and have O(namespace_resources_count) time complexity. See more details in the linked issues.
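
The difference between the two approaches can be sketched in Go roughly as follows. This is a simplified illustration, not the code from this PR: the `node` type and the `childrenV1`, `buildChildIndex`, `walk`, and `sampleNodes` names are hypothetical, and parent matching is reduced to comparing owner UIDs rather than calling `isParentOf`.

```go
package main

import "fmt"

// node is a hypothetical, simplified stand-in for a cached resource.
type node struct {
	uid       string
	ownerUIDs []string // UIDs taken from the resource's owner references
}

// childrenV1 mimics the v1 idea: for one parent, scan every resource in the
// namespace. Called once per tree node, this costs
// O(tree_size * namespace_resources_count) overall.
func childrenV1(parent *node, nsNodes map[string]*node) []*node {
	var out []*node
	for _, n := range nsNodes {
		for _, o := range n.ownerUIDs {
			if o == parent.uid {
				out = append(out, n)
				break
			}
		}
	}
	return out
}

// buildChildIndex mimics the v2 idea: a single pass over the namespace
// builds a parent UID -> children index.
func buildChildIndex(nsNodes map[string]*node) map[string][]*node {
	idx := make(map[string][]*node)
	for _, n := range nsNodes {
		for _, o := range n.ownerUIDs {
			idx[o] = append(idx[o], n)
		}
	}
	return idx
}

// walk visits every node reachable from roots exactly once, looking up
// children in the prebuilt index instead of rescanning the namespace.
func walk(roots []*node, idx map[string][]*node, visit func(*node)) {
	seen := make(map[string]bool)
	var rec func(n *node)
	rec = func(n *node) {
		if seen[n.uid] {
			return
		}
		seen[n.uid] = true
		visit(n)
		for _, c := range idx[n.uid] {
			rec(c)
		}
	}
	for _, r := range roots {
		rec(r)
	}
}

// sampleNodes returns a tiny namespace: deploy -> rs -> pod1, pod2, plus an
// unrelated config map.
func sampleNodes() map[string]*node {
	return map[string]*node{
		"deploy": {uid: "u-deploy"},
		"rs":     {uid: "u-rs", ownerUIDs: []string{"u-deploy"}},
		"pod1":   {uid: "u-pod1", ownerUIDs: []string{"u-rs"}},
		"pod2":   {uid: "u-pod2", ownerUIDs: []string{"u-rs"}},
		"cm":     {uid: "u-cm"},
	}
}

func main() {
	nodes := sampleNodes()
	idx := buildChildIndex(nodes)
	visited := 0
	walk([]*node{nodes["deploy"]}, idx, func(*node) { visited++ })
	fmt.Println("visited:", visited)
}
```

In this sketch the index is built once per namespace and each edge is followed once, which is where the linear bound comes from; the cost is the extra memory for the index.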

andrii-korotkov-verkada · 2024-07-07T20:51:55Z

Testing with ArgoCD argoproj/argo-cd#18972

…j#600] Closes argoproj#600 The existing (effectively v1) implementations are suboptimal since they don't construct a graph before the iteration. They search for children by looking at all namespace resources and checking `isParentOf`, which can give `O(tree_size * namespace_resources_count)` time complexity. The v2 algorithms construct the graph and have `O(namespace_resources_count)` time complexity. See more details in the linked issues. Signed-off-by: Andrii Korotkov <andrii.korotkov@verkada.com>

andrii-korotkov-verkada · 2024-07-07T23:31:21Z

Looks really good on a live cluster. ~300ms instead of almost 4m for the same application!

andrii-korotkov-verkada · 2024-07-09T22:51:19Z

Here are some perf views of the system collected following argoproj/argo-cd#13534 (comment).

The build is from master on 2024/07/07 including argoproj/argo-cd#18972, argoproj/argo-cd#18694, #601, #603.

codecov · 2024-07-16T23:40:34Z

Codecov Report

Attention: Patch coverage is 82.79570% with 16 lines in your changes missing coverage. Please review.

Project coverage is 58.38%. Comparing base (fa0e8d6) to head (905c87e).
Report is 3 commits behind head on master.

Files	Patch %	Lines
pkg/cache/cluster.go	86.11%	6 Missing and 4 partials ⚠️
pkg/cache/resource.go	71.42%	3 Missing and 3 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #601      +/-   ##
==========================================
+ Coverage   55.91%   58.38%   +2.46%     
==========================================
  Files          42       42              
  Lines        4900     5008     +108     
==========================================
+ Hits         2740     2924     +184     
+ Misses       1937     1840      -97     
- Partials      223      244      +21

crenshaw-dev · 2024-07-16T23:50:20Z

Could you take a look at this PR? andrii-korotkov-verkada#1

I'll rebase, the force-push messed up the diff.

crenshaw-dev · 2024-07-16T23:53:34Z

Pushed.

improvements to graph building

andrii-korotkov-verkada · 2024-07-17T00:14:55Z

@crenshaw-dev, looks good! Thanks for improving the code. I've merged the changes.

crenshaw-dev · 2024-07-17T00:22:59Z

@andrii-korotkov-verkada thanks! I may have a few more as I continue digging through the code. Bear with me. I plan to stick with it this week until we get it merged, as long as you have time to keep working on it!

crenshaw-dev · 2024-07-17T00:27:17Z

More fun: andrii-korotkov-verkada#2

discard unneeded copies of child resources as we go

andrii-korotkov-verkada · 2024-07-17T00:38:25Z

Sounds good! Thanks for the help. I'll be pretty active on this.

crenshaw-dev · 2024-07-17T01:35:32Z

pkg/cache/cluster.go

+func buildGraph(nsNodes map[kube.ResourceKey]*Resource) (map[kube.ResourceKey][]kube.ResourceKey, map[kube.ResourceKey]map[types.UID]*Resource) {
+	// Prepare to construct a graph
+	nodesByUID := make(map[types.UID][]*Resource, len(nsNodes))
+	nodeByGraphKey := make(map[graphKey]*Resource, len(nsNodes))


Do we really need nodeByGraphKey, or could we just use nsNodes, since all the resources should be in the same namespace?

Using nsNodes passes gitops-engine unit tests and saves a lot of memory, but I'm skeptical. I can put up a PoC tomorrow to analyze.

nodeByGraphKey is for efficient node lookup during uid backfill. I don't know if we can avoid it.

It's slightly different compared to the kube resource key, e.g. it uses api version instead of group.

I don't know the original reasoning for this distinction, but I left it for backwards compatibility.

Yeah, of all the memory optimizations I looked at, this one would worry me the most. But it's 25MB -> 17MB, which gets it pretty close to the memory footprint of IterateHierarchy v1.

I'll put up the PoC to look at and run Argo CD unit tests, but let's assume we're sticking with nodeByGraphKey unless we're super satisfied with the idea of switching to nsNodes.
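
To make the trade-off in this thread concrete, here is a rough Go sketch of how an owner reference (which carries an api version) could be turned into a group-based key suitable for looking up the parent in a map like nsNodes. It is an illustration under assumptions, not the code from this PR: `resourceKey`, `ownerRef`, `groupFromAPIVersion`, and `ownerKey` are hypothetical stand-ins, and the sketch assumes the owner lives in the same namespace as the child, which is exactly the assumption that could miss parent relationships if the namespace field is wrong.

```go
package main

import (
	"fmt"
	"strings"
)

// resourceKey is a simplified, hypothetical stand-in for a group-based key
// such as kube.ResourceKey.
type resourceKey struct {
	Group, Kind, Namespace, Name string
}

// ownerRef is a simplified, hypothetical stand-in for an owner reference,
// which identifies the parent by api version rather than by group.
type ownerRef struct {
	APIVersion, Kind, Name string
}

// groupFromAPIVersion extracts the group from "group/version" or "version".
// It reports false for values it cannot interpret, so callers can skip them.
func groupFromAPIVersion(apiVersion string) (string, bool) {
	parts := strings.Split(apiVersion, "/")
	switch len(parts) {
	case 1:
		if parts[0] == "" {
			return "", false
		}
		return "", true // core group, e.g. "v1"
	case 2:
		if parts[0] == "" || parts[1] == "" {
			return "", false
		}
		return parts[0], true
	default:
		return "", false
	}
}

// ownerKey builds the group-based key for a child's owner, assuming the
// owner is in the child's namespace.
func ownerKey(ref ownerRef, childNamespace string) (resourceKey, bool) {
	group, ok := groupFromAPIVersion(ref.APIVersion)
	if !ok {
		return resourceKey{}, false
	}
	return resourceKey{Group: group, Kind: ref.Kind, Namespace: childNamespace, Name: ref.Name}, true
}

func main() {
	key, ok := ownerKey(ownerRef{APIVersion: "apps/v1", Kind: "ReplicaSet", Name: "web-abc"}, "default")
	fmt.Println(key, ok)
}
```

With a conversion like this, a second map keyed by api version is not needed for parent lookup, which is where the memory saving would come from; the price is dropping the version from the key and trusting the namespace.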

crenshaw-dev · 2024-07-17T01:54:23Z

Probably the last graph building optimization: andrii-korotkov-verkada#3

make childrenByUID sparse

crenshaw-dev · 2024-07-18T16:44:00Z

@andrii-korotkov-verkada up for considering this one? andrii-korotkov-verkada#4

I think we do risk missing parent relationships due to Resources lacking the correct namespace field. But so far unit tests, Argo CD e2e tests, and running in an Intuit system all look okay.

I think it would be worth the risk for cutting the memory use in half and saving ~30% execution time.

andrii-korotkov-verkada · 2024-07-18T17:01:43Z

In general it'd make sense to me since all other places use group. I'll merge it.

use nsNodes instead of dupe map

crenshaw-dev · 2024-07-18T17:52:41Z

Test results from Intuit:

Before, IterateHierarchy was taking about 10% of the application controller's time while refreshing apps. Now it takes around 1%. So roughly 6s before, now less than 1s out of a 60s profile.

Heap use has gone up by about 2x, but it wasn't memory-heavy before, and it hasn't really increased in a problematic way.

Steady-state CPU and memory are basically the same as before.

I used log metrics to measure 95th percentile reconciliation, getting the resource tree, and setting managed resources.

Reconciliation time is about the same, maaaaybe 25% faster. Getting the resource tree is about the same amount of time. Setting the managed resources now takes ~one fifth the time it did before, down to 200ms from 1000ms.

I think the vast majority of that improvement doesn't actually come from the graph pre-build optimization, but instead comes from avoiding iterating over all the managed resources (an optimization which doesn't actually depend on pre-building the graph).

In summary: my test showed no performance regressions, no functionality regressions, and maybe a slight performance improvement related to the new algorithm. I suspect the relatively small improvement is because we have relatively few resources per-namespace. Bigger namespaces will have a bigger CPU win and (probably) a higher memory cost.

andrii-korotkov-verkada · 2024-07-18T17:56:54Z

Thanks for all the testing and support! Yeah, I guess we just have a quite large default namespace (which isn't best practice, but here we are), hence I saw larger wins. The longest-running processor in the largest cluster went down from 30-60min to 1-2min.

…#600] (argoproj#601)

* chore: More optimal IterateHierarchyV2 and iterateChildrenV2 [argoproj#600]

  Closes argoproj#600

  The existing (effectively v1) implementations are suboptimal since they don't construct a graph before the iteration. They search for children by looking at all namespace resources and checking `isParentOf`, which can give `O(tree_size * namespace_resources_count)` time complexity. The v2 algorithms construct the graph and have `O(namespace_resources_count)` time complexity. See more details in the linked issues.

  Signed-off-by: Andrii Korotkov <andrii.korotkov@verkada.com>

* improvements to graph building

  Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

* use old name

  Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

* chore: More optimal IterateHierarchyV2 and iterateChildrenV2 [argoproj#600]

  Closes argoproj#600

  The existing (effectively v1) implementations are suboptimal since they don't construct a graph before the iteration. They search for children by looking at all namespace resources and checking `isParentOf`, which can give `O(tree_size * namespace_resources_count)` time complexity. The v2 algorithms construct the graph and have `O(namespace_resources_count)` time complexity. See more details in the linked issues.

  Signed-off-by: Andrii Korotkov <andrii.korotkov@verkada.com>

* finish merge

  Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

* chore: More optimal IterateHierarchyV2 and iterateChildrenV2 [argoproj#600]

  Closes argoproj#600

  The existing (effectively v1) implementations are suboptimal since they don't construct a graph before the iteration. They search for children by looking at all namespace resources and checking `isParentOf`, which can give `O(tree_size * namespace_resources_count)` time complexity. The v2 algorithms construct the graph and have `O(namespace_resources_count)` time complexity. See more details in the linked issues.

  Signed-off-by: Andrii Korotkov <andrii.korotkov@verkada.com>

* discard unneeded copies of child resources as we go

  Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

* remove unnecessary comment

  Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

* make childrenByUID sparse

  Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

* eliminate duplicate map

  Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

* fix comment

  Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

* add useful comment back

  Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

* use nsNodes instead of dupe map

  Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

* remove unused struct

  Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

* skip invalid APIVersion

  Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

---------

Signed-off-by: Andrii Korotkov <andrii.korotkov@verkada.com>
Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>
Co-authored-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

andrii-korotkov-verkada mentioned this pull request Jul 4, 2024

Optimize processing managed resources in getResourceTree, ideally from quadratic to linear argoproj/argo-cd#18929

Closed

andrii-korotkov-verkada force-pushed the 600-more-optimal-iterate-hierarchy-and-iterate-children branch 2 times, most recently from b0f2923 to 6279c76 · July 7, 2024 00:48

andrii-korotkov-verkada force-pushed the 600-more-optimal-iterate-hierarchy-and-iterate-children branch from 6279c76 to d777c9a · July 7, 2024 21:37

This was referenced Jul 9, 2024

chore: Use more optimal iterate hierarchy v2 (#18929) argoproj/argo-cd#18972

Merged

fix: Put app to the operation queue after refresh queue processing to avoid race condition [#18500] argoproj/argo-cd#18694

Merged

This was referenced Jul 9, 2024

chore: Optimize usage of locking in the cluster [#602] #603

Closed

App of apps regularly gets out of sync when some apps have explictly specified empty sync policy argoproj/argo-cd#19043

Open

crenshaw-dev reviewed Jul 16, 2024

crenshaw-dev and others added 3 commits July 16, 2024 17:16

improvements to graph building

39bba43

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

use old name

120afb4

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

andrii-korotkov-verkada force-pushed the 600-more-optimal-iterate-hierarchy-and-iterate-children branch from d777c9a to 335ff88 · July 16, 2024 21:24

crenshaw-dev added 2 commits July 16, 2024 17:31

Merge remote-tracking branch 'andrii/600-more-optimal-iterate-hierarc…

1efd9be

…hy-and-iterate-children' into iterate-improvements Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

finish merge

0fb5064

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

crenshaw-dev reviewed Jul 16, 2024

andrii-korotkov-verkada force-pushed the 600-more-optimal-iterate-hierarchy-and-iterate-children branch from 335ff88 to 905c87e · July 16, 2024 23:39

crenshaw-dev reviewed Jul 16, 2024

andrii-korotkov-verkada force-pushed the 600-more-optimal-iterate-hierarchy-and-iterate-children branch from 905c87e to af08910 · July 16, 2024 23:47

Merge remote-tracking branch 'andrii/600-more-optimal-iterate-hierarc…

1d56552

…hy-and-iterate-children' into iterate-improvements Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

Merge pull request #1 from crenshaw-dev/iterate-improvements

837b536

improvements to graph building

discard unneeded copies of child resources as we go

703a60d

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

crenshaw-dev and others added 2 commits July 16, 2024 20:29

remove unnecessary comment

19aa0bf

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

Merge pull request #2 from crenshaw-dev/iterate-improvements

0f77c57

discard unneeded copies of child resources as we go

crenshaw-dev added 4 commits July 16, 2024 20:41

make childrenByUID sparse

38701d0

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

eliminate duplicate map

8284fb0

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

fix comment

5c23ab5

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

add useful comment back

8dbcf05

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

crenshaw-dev reviewed Jul 17, 2024

use nsNodes instead of dupe map

3ef3651

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

andrii-korotkov-verkada and others added 3 commits July 16, 2024 18:58

Merge pull request #3 from crenshaw-dev/iterate-improvements

9a98c83

make childrenByUID sparse

remove unused struct

e2fb782

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

skip invalid APIVersion

0b6e366

Signed-off-by: Michael Crenshaw <350466+crenshaw-dev@users.noreply.github.com>

Merge pull request #4 from crenshaw-dev/reuse-nsnodes

d162159

use nsNodes instead of dupe map

crenshaw-dev changed the title ~~chore: More optimal IterateHierarchyV2 and iterateChildrenV2 [#600]~~ feat: More optimal IterateHierarchyV2 and iterateChildrenV2 [#600] Jul 18, 2024

crenshaw-dev approved these changes Jul 18, 2024

crenshaw-dev merged commit 6b2984e into argoproj:master Jul 18, 2024
2 checks passed
