-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Put app to the operation queue after refresh queue processing to avoid race condition [#18500] #18694
fix: Put app to the operation queue after refresh queue processing to avoid race condition [#18500] #18694
Conversation
❌ Preview Environment deleted from BunnyshellAvailable commands (reply to this comment):
|
eac6532
to
e147e63
Compare
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #18694 +/- ##
==========================================
- Coverage 50.71% 50.71% -0.01%
==========================================
Files 316 316
Lines 43495 43495
==========================================
- Hits 22059 22057 -2
- Misses 18922 18927 +5
+ Partials 2514 2511 -3 ☔ View full report in Codecov by Sentry. |
I've deployed changes in a live environment and found that config map is not being waited at all - it's created and sync is progressing fast to next steps! Note, there's been some network errors, like |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change looks good to me. It makes sense to have the "operation" updated after the refresh. On a performance level, I dont think it will change much the quantity of refresh/operation items processed.
I think the current code relied on app refreshes being triggered very often and with optimizations in the latest version, these kinds timing problems reveal themselves!
@@ -2125,7 +2126,6 @@ func (ctrl *ApplicationController) newApplicationInformerAndLister() (cache.Shar | |||
key, err := cache.MetaNamespaceKeyFunc(obj) | |||
if err == nil { | |||
ctrl.appRefreshQueue.AddRateLimited(key) | |||
ctrl.appOperationQueue.AddRateLimited(key) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think on Application Add/Update/Delete, we should add to the operation queue as soon as possible, to have the sync operation up to date. Refresh/Sync may take a few seconds, so it is nice to have the value reflected as early as possible.
It does make sense to add to the operation queue when a refresh is completed IMO since it needs the latest status of the refresh (and sometimes the refresh is the one that starts the operation with auto-sync).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that with refresh and operation scheduled at the same time it could always make sense to wait for refresh to complete, since otherwise operation can happen on a stale data. In one case on update when refresh and operation were scheduled with a delay and another immediate operation scheduling was present, I’ve kept the later one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm putting it back, would update the PR shortly.
The current code seems to need both app refresh and app operation to update a lot, at least the later one. I think in the scenario I’ve encountered the config map is often the only new thing created/updated in the corresponding sync wave, so there might be not much going on to trigger another app operation. One more note is that we ignore resource updates on status field change, which might contribute, though config map itself doesn’t have a status. P.S. I’m writing from a personal account, currently left my work laptop at home. |
I've still encountered one case when it waited for ~5 minutes on a healthy state of a config map. Though in many cases it waits at most ~1 minute. |
This might be happening due to deduplication by queues if multiple refreshes are triggered in a quick succession. Maybe I can fix a part of it by adding a delay between refresh and app operation enqueuing, but not sure what to do with multiple refreshes. I'm not sure if the deduplication would happen if task done wasn't called yet but the processing has already started. |
I've noticed that app refresh requests come in batches spread out by ~5-10ms or so. I'm thinking of adding a delay before adding to refresh queue to avoid a situation when first refresh picks up partial changes and next refreshes are getting deduplicated while the first one is still processing. Not 100% that's how dedup works, but assuming the worst case. I'll test with a delay. |
I don't see an entry in the logs for |
UPDATE: the below could be wrong. |
I actually was testing adding this annotation before this fix and it didn't work by itself. I don't recall if my test with a fix was on a branch with annotation or not. |
There's been another config map update labeled for a different application and ~30 sec delay before app refresh was requested by the original config map. I'll do more testing. |
In another test it took ~5 minutes - app refresh based on the config map of interest arrived late for unclear reason. |
Interesting that an update above happened a couple of seconds after a watch for ConfigMap got restarted. |
I did another experiment, this time it took ~3 min to trigger app refresh by config map update and no watch restart for ConfigMap happened during that time. I'm curious if amount of config maps and using lock during event processing can contribute to slowness. |
This is not directly related to this PR, but I want to mention that we observe similar behavior on large clusters where changes to ConfigMaps (or other objects) are significantly delayed. The delay between when Argo CD receives watch events and when it processes them can be as long as 7 minutes or more. We've found that this issue is particularly pronounced in large clusters with frequently modified objects. Unfortunately, vertically scaling the application controller or tuning its parameters does not help. Splitting one cluster into multiple shards is also not feasible. The event processing logic appears to be a bottleneck. Here are some metrics showing how Argo CD choked and stopped processing events: I have a reproducible step for this issue, but I haven’t filed the issue yet. |
I wonder if some delays in processing are due to the same issue this PR tries to address - race condition and processing app operation on a stale data. If you can and want to, feel free to test this PR in your environment and see if it changes anything. From my observations so far, waiting time for config maps are usually within 1-3 minutes, though sometimes a bit longer like 5-7 minutes. But I don't see 10+ or 20+ minutes which was happening before. |
One more idea I have is to reduce the timeout of watches from 10min to 5min or so. Not entirely sure it'd help as much, but one case of update after watch just restarted hints that it might. |
With watch resync of 5 min most observed waiting finished within 1-3 min, rarely ~5 min. I think it's a good change. |
e147e63
to
73b7607
Compare
73b7607
to
d52bd53
Compare
I've updated the PR to revert a part of changes and rebased on latest master. |
@agaudreault, can you take another look at the updated PR when you have time, please? |
From a code review perspective, the code change seems to make sense. However, since this changes something in the main reconciliation/sync loop, I think it would be important to be able to visualize the effect that this change has. Apart from manual observations, are there any logs/metrics that would allow us to see the improvement and monitor the effect of the change? From the comments above, it would seem like there is a race condition, but this does not fix it, it only reduces its occurence. How can we find when the race condition happens? are there any observable data/logs that allows to detect it and how many times it happens? Without objective data, it is hard to judge if this change improves the sync behavior, without side effects. |
FWIW, the code has been running for us for a few weeks now and I haven't gotten new complaints about applications getting stuck waiting for a config map healthy state. |
f63d829
to
bac368c
Compare
I've added the e2e test for the scenario and also tested without my change in #19251. The test passes, I guess I'll do a few more tries without my change trying to make race condition to trigger, and probably add a loop. |
I've tried #19251 with 10 iterations and still didn't have it timeout. Seems like getting stuck is hard to reproduce. But at least the e2e test covers complete breakage of the flow. |
bac368c
to
4f79c58
Compare
I've added small optimization for a common case where delay on app update is 0 (default value) - removing extra putting in the app operation queue. I've also re-instanced additional putting to app queue on app addition with the same logic - first refresh and then do sync on fresher data. |
4f79c58
to
3b2dd37
Compare
I've also swapped the order of adding to app operation queue and marking the refresh task done. |
Can't repro the issue on master with a simple e2e test even when running 100 times. |
d425dd9
to
71a4913
Compare
… avoid race condition - Fixes [argoproj#18500] Adding app to the operation queue and refresh queue could cause waiting for resource for minutes to tens of minutes. Sync state operates on resources gathered from reconciliation, so if the app operation event is processed before the refresh one (when triggered on resource update/creation), the refresh doesn’t help sync to progress and it essentially needs to wait for another app refresh. The fix seems to be to schedule app operation event after refresh event is finished processing. There’s one place where operation event is scheduled without refresh event (which can be kept there), and one place where refresh even is scheduled without the operation one during the app deletion handling https://github.com/argoproj/argo-cd/blob/3e2cfb138795e24df98c9745d813e4b7f1720dbd/controller/appcontroller.go#L2177. It’s probably safe to schedule operation even after that, since it has some code to check that app was deleted. If not, an update can be made to have refresh queue storing a tuple with app key and bool whether to enqueue app operation. If there are issues: try keeping both old places to add to app operation queue and new addition after refresh. Note on cherry pick: add to as many releases as you can. This can be a significant performance boost. Signed-off-by: Andrii Korotkov <andrii.korotkov@verkada.com>
71a4913
to
5664b2f
Compare
A summary after recent updates:
@alexmt, please take a look when you have time and see if concerns were addressed well. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
Notes from our discussion. Adding app to both queues at the same time for a mistake, since definitely don't want to reconcile sync status and operation at the same time.
I'm not concerned about performance as well. The requestAppRefresh
is executed very frequently e.g. when child resource like Pod or ReplicaSet changes status. So it will reduce number of app operation reconciliations.
… avoid race condition - Fixes [argoproj#18500] (argoproj#18694) Adding app to the operation queue and refresh queue could cause waiting for resource for minutes to tens of minutes. Sync state operates on resources gathered from reconciliation, so if the app operation event is processed before the refresh one (when triggered on resource update/creation), the refresh doesn’t help sync to progress and it essentially needs to wait for another app refresh. The fix seems to be to schedule app operation event after refresh event is finished processing. There’s one place where operation event is scheduled without refresh event (which can be kept there), and one place where refresh even is scheduled without the operation one during the app deletion handling https://github.com/argoproj/argo-cd/blob/3e2cfb138795e24df98c9745d813e4b7f1720dbd/controller/appcontroller.go#L2177. It’s probably safe to schedule operation even after that, since it has some code to check that app was deleted. If not, an update can be made to have refresh queue storing a tuple with app key and bool whether to enqueue app operation. If there are issues: try keeping both old places to add to app operation queue and new addition after refresh. Note on cherry pick: add to as many releases as you can. This can be a significant performance boost. Signed-off-by: Andrii Korotkov <andrii.korotkov@verkada.com>
… avoid race condition - Fixes [argoproj#18500] (argoproj#18694) Adding app to the operation queue and refresh queue could cause waiting for resource for minutes to tens of minutes. Sync state operates on resources gathered from reconciliation, so if the app operation event is processed before the refresh one (when triggered on resource update/creation), the refresh doesn’t help sync to progress and it essentially needs to wait for another app refresh. The fix seems to be to schedule app operation event after refresh event is finished processing. There’s one place where operation event is scheduled without refresh event (which can be kept there), and one place where refresh even is scheduled without the operation one during the app deletion handling https://github.com/argoproj/argo-cd/blob/3e2cfb138795e24df98c9745d813e4b7f1720dbd/controller/appcontroller.go#L2177. It’s probably safe to schedule operation even after that, since it has some code to check that app was deleted. If not, an update can be made to have refresh queue storing a tuple with app key and bool whether to enqueue app operation. If there are issues: try keeping both old places to add to app operation queue and new addition after refresh. Note on cherry pick: add to as many releases as you can. This can be a significant performance boost. Signed-off-by: Andrii Korotkov <andrii.korotkov@verkada.com> Signed-off-by: Rhys Williams <rhys.williams@electrum.co.za>
… avoid race condition - Fixes [argoproj#18500] (argoproj#18694) Adding app to the operation queue and refresh queue could cause waiting for resource for minutes to tens of minutes. Sync state operates on resources gathered from reconciliation, so if the app operation event is processed before the refresh one (when triggered on resource update/creation), the refresh doesn’t help sync to progress and it essentially needs to wait for another app refresh. The fix seems to be to schedule app operation event after refresh event is finished processing. There’s one place where operation event is scheduled without refresh event (which can be kept there), and one place where refresh even is scheduled without the operation one during the app deletion handling https://github.com/argoproj/argo-cd/blob/3e2cfb138795e24df98c9745d813e4b7f1720dbd/controller/appcontroller.go#L2177. It’s probably safe to schedule operation even after that, since it has some code to check that app was deleted. If not, an update can be made to have refresh queue storing a tuple with app key and bool whether to enqueue app operation. If there are issues: try keeping both old places to add to app operation queue and new addition after refresh. Note on cherry pick: add to as many releases as you can. This can be a significant performance boost. Signed-off-by: Andrii Korotkov <andrii.korotkov@verkada.com>
Helps with #18500
Adding app to the operation queue and refresh queue could cause waiting for resource for minutes to tens of minutes. Sync state operates on resources gathered from reconciliation, so if the app operation event is processed before the refresh one (when triggered on resource update/creation), the refresh doesn’t help sync to progress and it essentially needs to wait for another app refresh.
The fix seems to be to schedule app operation event after refresh event is finished processing. There’s one place where operation event is scheduled without refresh event (which can be kept there), and one place where refresh even is scheduled without the operation one during the app deletion handling
argo-cd/controller/appcontroller.go
Line 2177 in 3e2cfb1
If there are issues: try keeping both old places to add to app operation queue and new addition after refresh.
Note on cherry pick: add to as many releases as you can. This can be a significant performance boost.
Checklist: