
[Merged by Bors] - Fix flaky TestQueued #6097

Closed · wants to merge 7 commits into develop from fix-flaky-testqueued
Conversation

@fasmat (Member) commented Jul 3, 2024

Motivation

This should once and for all fix the flaky TestQueued test that keeps preventing merges of PRs. 🙂

Description

I fixed several races in the test that made it so unstable:

  • Split the test into two:
    • Test_Queued checks that when too many requests are in flight at the same time, new requests are not accepted (same goal as the original test).
    • Test_RequestInterval checks that requests are correctly rate limited.
  • Moved the initialization code for server.Server from its Run method into the New function. This prevents clients within a test from making a request before the server is even ready to accept it.
  • Replaced the buffered channel that represented the queue with an unbuffered channel and a semaphore (a sketch of the pattern follows this list).
    • Before, incoming requests would fill up the queue channel while the worker routine simultaneously drained it again. This led to an off-by-one error in how many items could be in the queue at the same time.
    • The only reason the receiving side of the queue blocked when it was full was that the underlying errgroup.Group had a limit set. With a semaphore this becomes more straightforward and easier to understand, and it also fixes the off-by-one error.
    • Shutdown now also cleans up properly. Before, requests in the queue channel were not handled when the server was shut down; now those requests are handled on the sender side of the channel during shutdown.
  • Changed the test so that the first 10 requests fill the queue and block, then asserted that every additional request fails (as expected). The successful requests only unblock after the assertions for the failing ones complete. This removes another potential issue where, depending on the test runner's speed, more requests could fail than expected; the test now passes with the exact number of expected successful and failing requests regardless of the runner's speed.
  • With the smaller queue size the test now runs much faster than before (~10 ms on my machine), which makes verifying that it is now stable with go test -failfast -count=2000 -run ^Test_Queued$ github.com/spacemeshos/go-spacemesh/p2p/server much easier 🙂
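
A minimal, self-contained sketch of the unbuffered-channel-plus-semaphore pattern, with a hypothetical request type and sizes (this is not the actual server.Server code):

```go
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/sync/semaphore"
)

// request stands in for the server's queue items.
type request struct {
	id       int
	received time.Time
}

// serve drains the unbuffered queue. The weighted semaphore bounds how
// many requests are in flight; TryAcquire and Release are paired in the
// same block, so the accounting cannot drift.
func serve(ctx context.Context, queue <-chan request, queueSize int64) {
	sem := semaphore.NewWeighted(queueSize)
	for {
		select {
		case <-ctx.Done():
			return
		case req := <-queue:
			if !sem.TryAcquire(1) {
				fmt.Printf("request %d dropped: queue full\n", req.id)
				continue
			}
			go func(req request) {
				defer sem.Release(1) // pairs with TryAcquire above
				time.Sleep(5 * time.Millisecond) // simulate handling
				fmt.Printf("request %d handled\n", req.id)
			}(req)
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	queue := make(chan request) // unbuffered: no hidden extra slot
	go serve(ctx, queue, 10)

	for i := 0; i < 20; i++ {
		queue <- request{id: i, received: time.Now()}
	}
	time.Sleep(20 * time.Millisecond) // let in-flight handlers finish
}
```

Because the channel is unbuffered, the semaphore alone decides how many requests can be pending, which is what removes the off-by-one of the old buffered queue.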

Test Plan

  • tests are now stable

TODO

  • Explain motivation or link existing issue(s)
  • Test changes and document test plan
  • Update documentation as needed
  • Update changelog as needed

@fasmat self-assigned this Jul 3, 2024

codecov bot commented Jul 3, 2024

Codecov Report

Attention: Patch coverage is 71.42857% with 8 lines in your changes missing coverage. Please review.

Project coverage is 81.5%. Comparing base (0ec1b2e) to head (c268a18).

Files                 Patch %   Lines
p2p/server/server.go  71.4%     5 missing and 3 partials ⚠️
Additional details and impacted files
@@            Coverage Diff            @@
##           develop   #6097     +/-   ##
=========================================
- Coverage     81.6%   81.5%   -0.1%     
=========================================
  Files          302     302             
  Lines        32454   32466     +12     
=========================================
+ Hits         26487   26492      +5     
- Misses        4235    4239      +4     
- Partials      1732    1735      +3     


p2p/server/server.go:

	srv.metrics.targetRps.Set(float64(srv.limit.Limit()))
}
srv.h.SetStreamHandler(protocol.ID(srv.protocol), func(stream network.Stream) {
	srv.queue <- request{stream: stream, received: time.Now()}
Contributor:

There was a select with a default case in the block that was moved. Any particular reason to remove the default case? Ideally there would be a context here that could be selected on. The problem remains the rate limiting, which could result in messages being dropped; also, goroutines could pile up waiting to write to this channel, which would have other repercussions, like messages timing out on the other end. It might be useful to dwell on this a bit longer.

A bit of food for thought: maybe doing rate limiting at the p2p level is not a good approach. Some protocols might be more aggressive with resources than others, which could starve the rest. That means rate limiting at the p2p level won't necessarily solve the resource problem, and it also removes the possibility of prioritizing some protocols over others.
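
For reference, the select-with-default pattern referred to here looks roughly like this (an illustration using the request type from the snippet above, not the exact code that was removed):

```go
// enqueue offers the stream to the worker without blocking; the
// default case drops the request immediately when the queue is full.
func enqueue(queue chan<- request, stream network.Stream) bool {
	select {
	case queue <- request{stream: stream, received: time.Now()}:
		return true
	default:
		_ = stream.Close() // queue full: drop the request
		return false
	}
}
```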

@fasmat (Member, Author):

Check the reading side of the channel: every incoming request is handled immediately, and if it exceeds the queue size it is dropped immediately. I did this so that the TryAcquire and Release calls to the semaphore are in the same code block.

I could add the TryAcquire to the callback of SetStreamHandler and only send the request to the queue (to be handled in Run) if that check passes, but then the Release would be disconnected from the TryAcquire call 🤷

Contributor:

If the reader goes away, then the goroutines pile up on shutdown. I'm not sure what repercussions this may have (or not have). As a rule of thumb, I think we should try to avoid writing directly to channels (without selects). But that's just me...
It could be that it is inconsequential, but it could potentially have some side effects. I'm not that familiar with the code.

@fasmat (Member, Author):

I updated the SetStreamHandler callback so that the handling happens on the sender side of the channel instead of the receiver side, see b0b6cf9. This has the advantage of dropping requests on the sending side of the channel if a) the context is already done or b) the semaphore is full. But since the callback now has to be set in the Run method, I need an additional check with the Ready method; otherwise the first request might be sent from the client to the server before the SetStreamHandler callback has been set, causing the test to halt.

So it is either:

  • waiting on Ready after calling Run, or
  • handling limiting and context checking in the Run method instead of in the SetStreamHandler callback.

A rough sketch of the sender-side variant follows below.
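
Roughly what such a sender-side callback might look like (illustrative only; the actual change is in b0b6cf9):

```go
// Set inside Run, so ctx is the server's run context. Requests are
// rejected at the point of entry: when the context is done or the
// semaphore is exhausted, the stream is closed immediately and nothing
// is sent on the channel.
srv.h.SetStreamHandler(protocol.ID(srv.protocol), func(stream network.Stream) {
	select {
	case <-ctx.Done():
		_ = stream.Close() // shutting down: drop the request
		return
	default:
	}
	if !srv.sem.TryAcquire(1) {
		_ = stream.Close() // too many requests in flight: drop
		return
	}
	srv.queue <- request{stream: stream, received: time.Now()}
})
```

The matching Release then happens on the receiving side once the request has been handled, which is exactly the TryAcquire/Release split discussed above.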

@fasmat (Member, Author):

OK, never mind: I found a way to do it without needing the Ready method 🙂

@fasmat requested a review from acud July 4, 2024 00:14
@acud (Contributor) left a comment:

Nice, thanks!

@fasmat (Member, Author) commented Jul 4, 2024

bors merge

spacemesh-bors bot pushed a commit that referenced this pull request Jul 4, 2024

spacemesh-bors bot commented Jul 4, 2024

Build failed:

  • systest-status

p2p/server/server.go:

srv.limit = rate.NewLimiter(
	rate.Every(srv.interval/time.Duration(srv.requestsPerInterval)),
	srv.requestsPerInterval,
)
srv.sem = semaphore.NewWeighted(int64(srv.queueSize))
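
For context on the snippet above: a limiter constructed with rate.Every this way admits requests at a steady pace via its Wait method. A minimal sketch with hypothetical values:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	// Refill one token every 100ms, with a burst capacity of 10,
	// i.e. at most 10 requests per second in steady state.
	interval := time.Second
	requestsPerInterval := 10
	limiter := rate.NewLimiter(
		rate.Every(interval/time.Duration(requestsPerInterval)),
		requestsPerInterval,
	)

	start := time.Now()
	for i := 0; i < 20; i++ {
		// Wait blocks until the next token is available.
		if err := limiter.Wait(context.Background()); err != nil {
			return
		}
		fmt.Printf("request %d admitted after %v\n", i, time.Since(start))
	}
}
```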
Contributor:

What do you think about using a goroutine pool like https://github.com/panjf2000/ants? It can automatically control how many tasks are executed concurrently, reuse goroutines instead of creating a new one for each request (which is cheap but not free), scale the workers down when idle, etc.
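
For reference, basic usage of that library looks roughly like this (a sketch with hypothetical sizes):

```go
package main

import (
	"fmt"
	"sync"
	"time"

	"github.com/panjf2000/ants/v2"
)

func main() {
	// A pool of 10 reusable workers; in the default (blocking) mode,
	// Submit waits for a free worker instead of spawning a new
	// goroutine per task.
	pool, err := ants.NewPool(10)
	if err != nil {
		panic(err)
	}
	defer pool.Release()

	var wg sync.WaitGroup
	for i := 0; i < 20; i++ {
		wg.Add(1)
		i := i
		if err := pool.Submit(func() {
			defer wg.Done()
			time.Sleep(5 * time.Millisecond) // simulate stream handling
			fmt.Println("handled task", i)
		}); err != nil {
			wg.Done() // task was not scheduled
		}
	}
	wg.Wait()
}
```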

@fasmat (Member, Author) commented Jul 4, 2024:

I took a quick look at ants, but I'm not sure it is a good fit here. There are two situations we want to limit:

  • If the queue is full, we want to drop incoming requests immediately. I don't quite see how to make this possible with Pool.Submit(): it would have to be WithNonblocking(true), but then we don't know whether the submit actually started a routine or not (and if it didn't, we need to close the stream and update metrics accordingly).
  • A maximum rate of incoming requests: an ants.Pool will always process as many requests concurrently as it has capacity for, but we actually want to limit to at most x requests per timeframe. If requests come in at a higher rate, they should block for as long as necessary until the rate drops below the target.

I don't see ants as a good fit for these requirements. As far as I understand the package, it just ensures that no more than a certain number of requests are in flight at the same time, while any new incoming request either blocks until a running one finishes or fails immediately, without the more fine-grained behaviour we have now 😕

EDIT: the given requirements can probably be met with ants plus some extra code, but then it would only replace the semaphore while we'd still need the rate limiter.

Contributor:

I was thinking of a goroutine pool only to achieve two goals:

  • reusing goroutines
  • controlling how many streams are processed concurrently (via the pool size)

We would need to use the pool in blocking mode and keep the rate limiter and the queue (for buffering waiting requests) as we do now; see the sketch below.
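
A rough sketch of that combination (hypothetical sizes; the buffering queue is omitted for brevity):

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/panjf2000/ants/v2"
	"golang.org/x/time/rate"
)

func main() {
	// The blocking pool bounds concurrency; the limiter paces
	// admission, mirroring the current rate limiter + semaphore setup.
	pool, err := ants.NewPool(10)
	if err != nil {
		panic(err)
	}
	defer pool.Release()

	limiter := rate.NewLimiter(rate.Every(100*time.Millisecond), 10)

	for i := 0; i < 20; i++ {
		if err := limiter.Wait(context.Background()); err != nil {
			return
		}
		i := i
		if err := pool.Submit(func() {
			fmt.Println("processing stream", i)
		}); err != nil {
			fmt.Println("dropped stream", i, err)
		}
	}
	time.Sleep(50 * time.Millisecond) // let the last workers finish
}
```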

@fasmat (Member, Author) commented Jul 4, 2024

bors merge

spacemesh-bors bot pushed a commit that referenced this pull request Jul 4, 2024

spacemesh-bors bot commented Jul 4, 2024

Build failed (retrying...):

spacemesh-bors bot pushed a commit that referenced this pull request Jul 4, 2024

spacemesh-bors bot commented Jul 4, 2024

Build failed:

@fasmat (Member, Author) commented Jul 4, 2024

bors merge

spacemesh-bors bot pushed a commit that referenced this pull request Jul 4, 2024

spacemesh-bors bot commented Jul 4, 2024

Build failed:

@fasmat (Member, Author) commented Jul 4, 2024

bors merge

spacemesh-bors bot pushed a commit that referenced this pull request Jul 4, 2024

spacemesh-bors bot commented Jul 4, 2024

Pull request successfully merged into develop.

Build succeeded:

spacemesh-bors bot changed the title Fix flaky TestQueued → [Merged by Bors] - Fix flaky TestQueued Jul 4, 2024
spacemesh-bors bot closed this Jul 4, 2024
spacemesh-bors bot deleted the fix-flaky-testqueued branch July 4, 2024 17:08