
fix: Put app to the operation queue after refresh queue processing to avoid race condition [#18500] #18694

Conversation

@andrii-korotkov-verkada (Contributor) commented Jun 16, 2024

Helps with #18500

Adding the app to the operation queue and the refresh queue at the same time could cause waits of minutes to tens of minutes for a resource. Sync state operates on resources gathered during reconciliation, so if the app operation event is processed before the refresh event (when triggered by a resource update/creation), the refresh doesn’t help the sync progress, and the sync essentially has to wait for another app refresh.

The fix is to schedule the app operation event after the refresh event has finished processing. There’s one place where the operation event is scheduled without a refresh event (which can be kept as-is), and one place where the refresh event is scheduled without the operation event, during app deletion handling (ctrl.appRefreshQueue.Add(key)). It’s probably safe to schedule the operation event after that one too, since the handler has code to check whether the app was deleted. If not, the refresh queue could instead store a tuple of the app key and a bool indicating whether to enqueue an app operation.
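
A minimal sketch of the reordering in Go (illustrative only, not the exact appcontroller.go code; names loosely follow the real controller):

```go
package sketch

import "k8s.io/client-go/util/workqueue"

type controller struct {
	appRefreshQueue   workqueue.RateLimitingInterface
	appOperationQueue workqueue.RateLimitingInterface
}

// Old behavior: a resource event fed both queues at once, so the operation
// worker could compare using state captured before the refresh ran.
func (c *controller) onResourceEventOld(key string) {
	c.appRefreshQueue.AddRateLimited(key)
	c.appOperationQueue.AddRateLimited(key) // races with the refresh
}

// New behavior: only the refresh queue is fed on a resource event...
func (c *controller) onResourceEvent(key string) {
	c.appRefreshQueue.AddRateLimited(key)
}

// ...and the refresh worker enqueues the operation once fresh comparison
// state exists, so auto-sync acts on up-to-date data.
func (c *controller) processAppRefreshQueueItem() {
	key, shutdown := c.appRefreshQueue.Get()
	if shutdown {
		return
	}
	// ... refresh/reconcile the app, updating its comparison state ...
	c.appOperationQueue.AddRateLimited(key)
	c.appRefreshQueue.Done(key)
}
```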

If there are issues: try keeping both of the old places that add to the app operation queue, in addition to the new enqueue after refresh.

Note on cherry pick: add to as many releases as you can. This can be a significant performance boost.

Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this does not need to be in the release notes.
  • The title of the PR states what changed and the related issues number (used for the release note).
  • The title of the PR conforms to the Toolchain Guide
  • I've included "Closes [ISSUE #]" or "Fixes [ISSUE #]" in the description to automatically close the associated issue.
  • I've updated both the CLI and UI to expose my feature, or I plan to submit a second PR with them.
  • Does this PR require documentation updates?
  • I've updated documentation as required by this PR.
  • I have signed off all my commits as required by DCO
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • My build is green (troubleshooting builds).
  • My new feature complies with the feature status guidelines.
  • I have added a brief description of why this PR is necessary and/or what this PR solves.
  • Optional. My organization is added to USERS.md.
  • Optional. For bug fixes, I've indicated what older releases this fix should be cherry-picked into (this may or may not happen depending on risk/complexity).

@andrii-korotkov-verkada requested a review from a team as a code owner on June 16, 2024 07:32

codecov bot commented Jun 16, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 50.71%. Comparing base (929e855) to head (5664b2f).
Report is 528 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #18694      +/-   ##
==========================================
- Coverage   50.71%   50.71%   -0.01%     
==========================================
  Files         316      316              
  Lines       43495    43495              
==========================================
- Hits        22059    22057       -2     
- Misses      18922    18927       +5     
+ Partials     2514     2511       -3     


@andrii-korotkov-verkada (Contributor Author) commented Jun 21, 2024

I've deployed the changes in a live environment and found that the config map is no longer waited on at all - it's created and the sync progresses quickly to the next steps!

Note, there have been some network errors, like lookup argocd-repo-server on <ip>:<port>: no such host, but those happen even on v2.11.3 and are probably related to network issues in our cluster.

@agaudreault (Member) left a comment

Change looks good to me. It makes sense to have the "operation" updated after the refresh. On a performance level, I don't think it will change the quantity of refresh/operation items processed much.

I think the current code relied on app refreshes being triggered very often, and with the optimizations in the latest version, these kinds of timing problems reveal themselves!

@@ -2125,7 +2126,6 @@ func (ctrl *ApplicationController) newApplicationInformerAndLister() (cache.Shar
key, err := cache.MetaNamespaceKeyFunc(obj)
if err == nil {
ctrl.appRefreshQueue.AddRateLimited(key)
ctrl.appOperationQueue.AddRateLimited(key)
@agaudreault (Member)

I think on Application Add/Update/Delete, we should add to the operation queue as soon as possible, to have the sync operation up to date. Refresh/Sync may take a few seconds, so it is nice to have the value reflected as early as possible.

It does make sense to add to the operation queue when a refresh is completed IMO since it needs the latest status of the refresh (and sometimes the refresh is the one that starts the operation with auto-sync).

@andrii-korotkov-verkada (Contributor Author)

My understanding is that when a refresh and an operation are scheduled at the same time, it always makes sense to wait for the refresh to complete, since otherwise the operation can run on stale data. In one update path where the refresh and operation were scheduled with a delay and a separate immediate operation enqueue was also present, I've kept the latter.

@andrii-korotkov-verkada (Contributor Author)

I'm putting it back and will update the PR shortly.

@andrii-korotkov

> Change looks good to me. It makes sense to have the "operation" updated after the refresh. On a performance level, I don't think it will change the quantity of refresh/operation items processed much.
>
> I think the current code relied on app refreshes being triggered very often, and with the optimizations in the latest version, these kinds of timing problems reveal themselves!

The current code seems to need both the app refresh and the app operation to be triggered a lot, at least the latter. In the scenario I've encountered, the config map is often the only new thing created/updated in the corresponding sync wave, so there may not be much going on to trigger another app operation. One more note: we ignore resource updates that only change the status field, which might contribute, though a config map itself doesn't have a status.

P.S. I’m writing from a personal account, currently left my work laptop at home.

@andrii-korotkov-verkada (Contributor Author)

I've still encountered one case where the sync waited ~5 minutes for a config map to become healthy, though in many cases it waits at most ~1 minute.

@andrii-korotkov-verkada (Contributor Author)

This might be happening due to queue deduplication when multiple refreshes are triggered in quick succession. Maybe I can fix part of it by adding a delay between the refresh and app operation enqueues, but I'm not sure what to do about multiple refreshes. I'm also not sure whether deduplication applies when Done hasn't been called yet but processing has already started.
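
For reference, client-go's workqueue defers rather than drops adds that arrive while an item is being processed - a minimal runnable sketch of that behavior:

```go
package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	q := workqueue.New()
	q.Add("app-key")

	key, _ := q.Get() // processing starts; the key moves to the processing set

	// Adds that arrive while the key is in flight are parked in the dirty
	// set: deduplicated with each other, then re-queued when Done is called.
	q.Add("app-key")
	q.Add("app-key")

	q.Done(key)
	fmt.Println(q.Len()) // 1 - exactly one follow-up round, not zero, not two
}
```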

@andrii-korotkov-verkada (Contributor Author)

I've noticed that app refresh requests come in batches spread out by ~5-10 ms or so. I'm thinking of adding a delay before adding to the refresh queue, to avoid a situation where the first refresh picks up partial changes and subsequent refreshes get deduplicated while the first one is still processing. I'm not 100% sure that's how dedup works, but I'm assuming the worst case. I'll test with a delay.
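
A delayed enqueue like that could use the delaying queue's AddAfter; a small sketch (the 50 ms value is hypothetical):

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// AddAfter parks the key for the given duration before it becomes
	// visible to Get, so a burst of events spread over ~5-10ms can collapse
	// into a single refresh instead of one partial refresh plus deduped ones.
	dq := workqueue.NewDelayingQueue()
	dq.AddAfter("app-key", 50*time.Millisecond)
	dq.AddAfter("app-key", 50*time.Millisecond) // merged with the pending add

	key, _ := dq.Get() // blocks until the delay elapses
	fmt.Println(key)   // app-key, delivered once
	dq.Done(key)
}
```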

@andrii-korotkov-verkada (Contributor Author)

I don't see a Requesting app refresh caused by object update log entry when the ConfigMap is created, though - sometimes it only appears several minutes later, which is concerning.

@andrii-korotkov-verkada (Contributor Author) commented Jun 25, 2024

UPDATE: the below could be wrong.
It looks like the missing resource tracking annotation app.kubernetes.io/instance was the reason. I just ran a test adding it to the ConfigMap together with this fix, and a refresh was triggered by the config map shortly after it was created. It took less than 30 seconds of waiting for the config map to become healthy.

@andrii-korotkov-verkada (Contributor Author)

I was actually testing this annotation before this fix, and it didn't work by itself. I don't recall whether my test with the fix was on a branch with the annotation or not.

@andrii-korotkov-verkada (Contributor Author)

There was another config map update labeled for a different application, and a ~30 sec delay before an app refresh was requested by the original config map. I'll do more testing.

@andrii-korotkov-verkada (Contributor Author)

In another test it took ~5 minutes - the app refresh based on the config map of interest arrived late for an unclear reason.

@andrii-korotkov-verkada (Contributor Author)

Interestingly, the update above happened a couple of seconds after the watch for ConfigMaps was restarted.

@andrii-korotkov-verkada (Contributor Author)

I did another experiment; this time it took ~3 min for the config map update to trigger an app refresh, and no ConfigMap watch restart happened during that time. I'm curious whether the number of config maps and the use of a lock during event processing could contribute to the slowness.

@daftping (Contributor)

This is not directly related to this PR, but I want to mention that we observe similar behavior on large clusters where changes to ConfigMaps (or other objects) are significantly delayed. The delay between when Argo CD receives watch events and when it processes them can be as long as 7 minutes or more.

We've found that this issue is particularly pronounced in large clusters with frequently modified objects. Unfortunately, vertically scaling the application controller or tuning its parameters does not help. Splitting one cluster into multiple shards is also not feasible. The event processing logic appears to be a bottleneck.

Here are some metrics showing how Argo CD choked and stopped processing events: Conversation in Slack

I have reproduction steps for this issue, but I haven't filed it yet.

@andrii-korotkov-verkada (Contributor Author)

> This is not directly related to this PR, but I want to mention that we observe similar behavior on large clusters where changes to ConfigMaps (or other objects) are significantly delayed. The delay between when Argo CD receives watch events and when it processes them can be as long as 7 minutes or more.
>
> We've found that this issue is particularly pronounced in large clusters with frequently modified objects. Unfortunately, vertically scaling the application controller or tuning its parameters does not help. Splitting one cluster into multiple shards is also not feasible. The event processing logic appears to be a bottleneck.
>
> Here are some metrics showing how Argo CD choked and stopped processing events: Conversation in Slack
>
> I have reproduction steps for this issue, but I haven't filed it yet.

I wonder if some of the processing delays are due to the same issue this PR tries to address - the race condition and processing the app operation on stale data. If you can and want to, feel free to test this PR in your environment and see if it changes anything. From my observations so far, waiting times for config maps are usually within 1-3 minutes, though sometimes a bit longer, like 5-7 minutes. But I don't see the 10+ or 20+ minutes that were happening before.

@andrii-korotkov-verkada (Contributor Author)

One more idea is to reduce the watch timeout from 10 min to 5 min or so. I'm not entirely sure it'd help much, but the one case where the update came right after a watch restart hints that it might.

@andrii-korotkov-verkada (Contributor Author)

With a watch resync of 5 min, most observed waits finished within 1-3 min, rarely ~5 min. I think it's a good change.
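
For illustration, this is roughly what capping a watch's server-side lifetime looks like with plain client-go (the real knob lives in the controller's cluster cache layer; the 5 min value mirrors the experiment above):

```go
package sketch

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// watchConfigMaps asks the API server to close the watch after ~5 minutes,
// forcing a periodic re-establish so a silently stale watch can't delay
// events for long.
func watchConfigMaps(ctx context.Context, client kubernetes.Interface, namespace string) error {
	timeout := int64((5 * time.Minute).Seconds())
	w, err := client.CoreV1().ConfigMaps(namespace).Watch(ctx, metav1.ListOptions{
		TimeoutSeconds: &timeout,
	})
	if err != nil {
		return err
	}
	defer w.Stop()
	for event := range w.ResultChan() {
		_ = event // handle the event ...
	}
	return nil
}
```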

@andrii-korotkov-verkada (Contributor Author) commented Jun 30, 2024

I've updated the PR to revert part of the changes and rebased it on the latest master.

@andrii-korotkov-verkada (Contributor Author)

@agaudreault, can you take another look at the updated PR when you have time, please?

@agaudreault (Member)

From a code review perspective, the code change seems to make sense. However, since this changes something in the main reconciliation/sync loop, I think it would be important to be able to visualize the effect that this change has. Apart from manual observations, are there any logs/metrics that would allow us to see the improvement and monitor the effect of the change?

From the comments above, it would seem there is a race condition, but this change does not fix it, only reduces its occurrence. How can we tell when the race condition happens? Is there any observable data (logs) that allows us to detect it and count how many times it happens?

Without objective data, it is hard to judge whether this change improves the sync behavior without side effects.

@andrii-korotkov-verkada (Contributor Author)

> From a code review perspective, the code change seems to make sense. However, since this changes something in the main reconciliation/sync loop, I think it would be important to be able to visualize the effect that this change has. Apart from manual observations, are there any logs/metrics that would allow us to see the improvement and monitor the effect of the change?
>
> From the comments above, it would seem there is a race condition, but this change does not fix it, only reduces its occurrence. How can we tell when the race condition happens? Is there any observable data (logs) that allows us to detect it and count how many times it happens?
>
> Without objective data, it is hard to judge whether this change improves the sync behavior without side effects.

I've been testing this with logs in Datadog, with filters like service:argocd,container_name:argocd-application-controller,"waiting for healthy state of /ConfigMap",env:staging. Then I'd look for logs with "Updating operation state" and "Resuming in-progress operation" for some components/applications. These roughly show a timeline of the status changing from waiting for many resources, to waiting for a config map, to operations running. For example:
[screenshot: Datadog log timeline, 2024-07-02]
It won't show cases where the wait was almost non-existent, but it still reveals the max waiting times.

Regarding the race condition, I think this particular instance of it is most likely fixed. Since there's refresh and operation queue deduplication for the same app key, that may not be 100%, but there also might not be much we can do. However, the extra waiting time for config maps seems to come from a delay in the Kubernetes watch - based on "Requesting app refresh" logs whose structured data corresponds to a config map. Workqueue depth is pretty low, so we might just have many objects in the cluster with many updates.

@andrii-korotkov-verkada (Contributor Author)

FWIW, the code has been running for us for a few weeks now, and I haven't gotten new complaints about applications getting stuck waiting for a config map to become healthy.

@andrii-korotkov-verkada force-pushed the 18500-do-app-operation-after-refresh branch 11 times, most recently from f63d829 to bac368c on July 26, 2024 16:55
@andrii-korotkov-verkada (Contributor Author)

I've added an e2e test for the scenario and also tested it without my change in #19251. The test passes; I'll do a few more runs without my change, trying to make the race condition trigger, and probably add a loop.

@andrii-korotkov-verkada (Contributor Author)

I've tried #19251 with 10 iterations and still didn't see it time out. It seems getting stuck is hard to reproduce, but at least the e2e test covers a complete breakage of the flow.

@andrii-korotkov-verkada (Contributor Author)

I've added a small optimization for the common case where the delay on app update is 0 (the default value) - removing the extra enqueue to the app operation queue. I've also reinstated the additional enqueue to the operation queue on app addition, with the same logic - first refresh, then sync on fresher data.
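
A rough sketch of the shape of that fast path (function and parameter names are hypothetical; see the diff for the merged version):

```go
package sketch

import (
	"time"

	"k8s.io/client-go/util/workqueue"
)

// requestRefresh sketches the delay == 0 fast path: with no delay, the
// refresh runs right away and its worker enqueues the operation on
// completion, so a separate immediate operation enqueue would be redundant.
func requestRefresh(refreshQ, operationQ workqueue.DelayingInterface, key string, delay time.Duration) {
	if delay > 0 {
		refreshQ.AddAfter(key, delay)
		operationQ.AddAfter(key, delay)
	} else {
		refreshQ.Add(key) // refresh worker enqueues the operation when done
	}
}
```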

@andrii-korotkov-verkada (Contributor Author)

I've also swapped the order of adding to the app operation queue and marking the refresh task done.

@andrii-korotkov-verkada (Contributor Author)

I can't repro the issue on master with a simple e2e test, even when running it 100 times.

@andrii-korotkov-verkada force-pushed the 18500-do-app-operation-after-refresh branch 4 times, most recently from d425dd9 to 71a4913 on July 27, 2024 13:51
… avoid race condition - Fixes [argoproj#18500]

Signed-off-by: Andrii Korotkov <andrii.korotkov@verkada.com>
@andrii-korotkov-verkada (Contributor Author) commented Jul 29, 2024

A summary after the recent updates:

  • I've added a simple e2e test that creates several config maps with different sync waves, and tested it separately against current master for 100 iterations (chore: Integration test for creating config map and expecting updates to happen #19251). I was not able to reproduce the sync getting stuck with it. Note that I've also updated the health check to cover the case where health is not present, which previously caused a panic and which seems to be a common case for config maps once they reach a good, non-missing state.
  • In the common case of a resource event, the number of items put into the refresh and operation queues remains the same; they are just sequenced now.
  • When an app gets added, the number of items put into the refresh and operation queues remains the same; they are just sequenced now.
  • When an app gets updated, the number of items put into the refresh queue remains the same, while for the operation queue it is the same or less (if the delay is 0, we don't enqueue an extra item).
  • In the rare case of an app getting deleted, there'd be one extra enqueue to the operation queue, but it should be processed quickly thanks to a check that the app exists (see the sketch after this list).
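
A sketch of that existence check in the operation worker (names are hypothetical; the real worker uses the informer-backed lister):

```go
package sketch

import apierrors "k8s.io/apimachinery/pkg/api/errors"

// processOperationItem bails out cheaply when the app no longer exists, so
// the single extra enqueue on app deletion costs only a cache lookup.
// getApp stands in for the controller's informer-backed lister.
func processOperationItem(getApp func(key string) (interface{}, error), key string) {
	app, err := getApp(key)
	if apierrors.IsNotFound(err) || app == nil {
		return // app was deleted; nothing to operate on
	}
	if err != nil {
		return // transient error; skip this round
	}
	// ... run or resume the app's operation ...
}
```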

@alexmt, please take a look when you have time and see if concerns were addressed well.

@alexmt (Collaborator) left a comment

LGTM.

Notes from our discussion: adding the app to both queues at the same time was a mistake, since we definitely don't want to reconcile the sync status and the operation at the same time.

I'm not concerned about performance either. requestAppRefresh is executed very frequently, e.g. when a child resource like a Pod or ReplicaSet changes status, so this will reduce the number of app operation reconciliations.

@alexmt merged commit 267f243 into argoproj:master on Jul 31, 2024
28 of 29 checks passed
mpelekh added a commit to mpelekh/argo-cd that referenced this pull request Aug 1, 2024
… avoid race condition - Fixes [argoproj#18500] (argoproj#18694)

Signed-off-by: Andrii Korotkov <andrii.korotkov@verkada.com>
rhyswilliamsza pushed a commit to rhyswilliamsza/argo-cd that referenced this pull request Aug 12, 2024
… avoid race condition - Fixes [argoproj#18500] (argoproj#18694)

Signed-off-by: Andrii Korotkov <andrii.korotkov@verkada.com>
Signed-off-by: Rhys Williams <rhys.williams@electrum.co.za>
mpelekh added a commit to mpelekh/argo-cd that referenced this pull request Sep 10, 2024
… avoid race condition - Fixes [argoproj#18500] (argoproj#18694)

Signed-off-by: Andrii Korotkov <andrii.korotkov@verkada.com>