fix: Deadlock issue caused in listResources (argoproj/argo-cd#18902) #599

Closed
jdroot wants to merge 2 commits

Conversation


@jdroot jdroot commented Jul 3, 2024

Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this does not need to be in the release notes.
  • The title of the PR states what changed and the related issues number (used for the release note).
  • The title of the PR conforms to the Toolchain Guide
  • I've included "Closes [ISSUE #]" or "Fixes [ISSUE #]" in the description to automatically close the associated issue.
  • I've updated both the CLI and UI to expose my feature, or I plan to submit a second PR with them.
  • Does this PR require documentation updates?
  • I've updated documentation as required by this PR.
  • I have signed off all my commits as required by DCO
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • My build is green (troubleshooting builds).
  • My new feature complies with the feature status guidelines.
  • I have added a brief description of why this PR is necessary and/or what this PR solves.
  • Optional. My organization is added to USERS.md.
  • Optional. For bug fixes, I've indicated what older releases this fix should be cherry-picked into (this may or may not happen depending on risk/complexity).

Fixes argoproj/argo-cd#18902

It is possible for Argo to reach a state where the maximum number
of goroutines have acquired the semaphore and are waiting on
the lock in loadInitialState while another goroutine is holding
the lock and waiting on the semaphore. This change allows the
goroutine holding the lock to bypass the semaphore and complete
its execution. The downside of this approach is that more list
operations could be running than intended, but that is preferable
to the cache deadlocking.

Signed-off-by: James Root <jroot@indeed.com>
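
The cycle described above can be reproduced in a few lines. The following is a minimal sketch, not the gitops-engine code: the `cache` type, the `simulate` helper, and the buffered channel standing in for the `c.listSemaphore.Acquire(ctx, 1)` call are all assumptions made for illustration, and a context timeout stands in for what would be an indefinite hang in the real cache.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// cache is a hypothetical stand-in for clusterCache: one lock plus a
// counting semaphore (a buffered channel) that limits concurrent lists.
type cache struct {
	lock    sync.Mutex
	listSem chan struct{}
}

// listResources runs callback, optionally taking a semaphore slot first.
// A caller that already holds c.lock can pass takeSemaphore=false.
func (c *cache) listResources(ctx context.Context, takeSemaphore bool, callback func() error) error {
	if takeSemaphore {
		select {
		case c.listSem <- struct{}{}:
			defer func() { <-c.listSem }()
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return callback()
}

// simulate sets up the cycle with a semaphore limit of 1 and returns the
// error seen by the goroutine that holds the lock.
func simulate(lockHolderTakesSemaphore bool) error {
	ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
	defer cancel()

	c := &cache{listSem: make(chan struct{}, 1)}
	c.lock.Lock() // this goroutine plays the lock holder

	slotTaken := make(chan struct{})
	done := make(chan struct{})
	go func() {
		defer close(done)
		// Worker in the style of loadInitialState: takes the only
		// semaphore slot, then needs the lock inside the callback.
		_ = c.listResources(ctx, true, func() error {
			close(slotTaken)
			c.lock.Lock()
			defer c.lock.Unlock()
			return nil
		})
	}()
	<-slotTaken // the semaphore is now exhausted

	// The lock holder lists while still holding the lock.
	err := c.listResources(ctx, lockHolderTakesSemaphore, func() error { return nil })
	c.lock.Unlock()
	<-done
	return err
}

func main() {
	fmt.Println("lock holder takes semaphore:", simulate(true))
	fmt.Println("lock holder bypasses semaphore:", simulate(false))
}
```

In this sketch the first call only returns because the context times out, while the second completes immediately. The bypass call is not counted against the limit, which is the downside the description mentions.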
pkg/cache/cluster.go (outdated)
…/argo-cd#18902)

Signed-off-by: James Root <jroot@indeed.com>
-func (c *clusterCache) listResources(ctx context.Context, resClient dynamic.ResourceInterface, callback func(*pager.ListPager) error) (string, error) {
-	if err := c.listSemaphore.Acquire(ctx, 1); err != nil {
-		return "", err
+func (c *clusterCache) listResources(ctx context.Context, resClient dynamic.ResourceInterface, takeSemaphore bool, callback func(*pager.ListPager) error) (string, error) {
Member

@agaudreault agaudreault Jul 11, 2024

I was working on the "same" issue. I had the idea to skip the semaphore, but while it would solve the problem, it disables the feature added by 605958d

See PR suggestion #604.

Both PRs solve the issue; let's discuss in this thread which option is best. @gdsoumya you added the replaceResourceCache call as part of the listResources callback (#532). Any specific reason?
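
The question about the callback points at the other half of the cycle: a goroutine that takes the lock while it still holds a semaphore slot. The sketch below illustrates that general idea only and is not taken from #604. The `cache` type, `list`, `loadInitialState`, `run`, and the `fetch` function are hypothetical names, and a buffered channel again stands in for the real semaphore.

```go
package main

import (
	"fmt"
	"sync"
)

// cache is a hypothetical stand-in for clusterCache.
type cache struct {
	lock      sync.Mutex
	listSem   chan struct{}
	resources map[string][]string
}

// list fetches items while holding a semaphore slot and releases the slot
// before returning. It never touches c.lock.
func (c *cache) list(kind string, fetch func(string) []string) []string {
	c.listSem <- struct{}{}
	defer func() { <-c.listSem }()
	return fetch(kind)
}

// loadInitialState lists first, then takes the lock only to store the
// result, so no goroutine waits on the lock while holding a slot.
func (c *cache) loadInitialState(kind string, fetch func(string) []string) {
	items := c.list(kind, fetch)
	c.lock.Lock()
	defer c.lock.Unlock()
	c.resources[kind] = items
}

// run starts workers against a semaphore limit of 1 while the calling
// goroutine holds the lock and also lists through the semaphore.
func run() map[string][]string {
	c := &cache{listSem: make(chan struct{}, 1), resources: map[string][]string{}}
	fetch := func(kind string) []string { return []string{kind + "-a", kind + "-b"} }

	c.lock.Lock() // this goroutine plays the lock holder
	var wg sync.WaitGroup
	for _, kind := range []string{"pods", "services"} {
		wg.Add(1)
		go func(kind string) {
			defer wg.Done()
			c.loadInitialState(kind, fetch)
		}(kind)
	}

	// The lock holder still respects the semaphore. It cannot deadlock
	// because slot holders never need the lock to release their slot.
	c.resources["crds"] = c.list("crds", fetch)
	c.lock.Unlock()

	wg.Wait()
	return c.resources
}

func main() {
	fmt.Println(len(run()), "kinds loaded")
}
```

The cost in this sketch is that listed items stay in memory after the slot is released, so the semaphore no longer bounds how many result sets are buffered at once.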

@agaudreault
Member

Superseded by #604. If memory consumption is too unpredictable with that change, let's revisit and check whether we should allow a mechanism to skip the semaphore 👍 Good job on finding the deadlock nonetheless; I wish I had seen your issue sooner!

Development

Successfully merging this pull request may close these issues.

Adding CRDs to a cluster causes deadlock
3 participants