
[Merged by Bors] - Fix flaky TestQueued #6097

Closed · wants to merge 7 commits into develop from fix-flaky-testqueued
Conversation

@fasmat (Member) commented Jul 3, 2024

Motivation

This should once and for all fix the flaky TestQueued test that keeps preventing merges of PRs. 🙂

Description

I fixed several races in the test that made it so unstable:

  • Split the test into two:
    • Test_Queued checks that when too many requests are in flight at the same time, new requests are not accepted (same goal as the original test).
    • Test_RequestInterval checks that requests are correctly rate limited.
  • Moved the initialization code for server.Server from its Run method into the New function. This prevents clients within a test from making a request before the server is even ready to accept it.
  • Replaced the buffered channel that represented the queue with an unbuffered channel and a semaphore (a sketch of the pattern follows this list).
    • Before, incoming requests would fill up the queue channel while the worker routine simultaneously drained it again. This led to an off-by-one error in how many items could be in the queue at the same time.
    • The only reason the receiving side of the queue blocked when it was full was that the underlying errgroup.Group had a limit set. With a semaphore this becomes more straightforward and easier to understand, and it also fixes the off-by-one error.
    • Shutdown now also cleans up properly. Before, requests in the queue channel were not handled when the server was shut down; now those requests are handled on the sender side of the channel during shutdown.
  • Changed the test so that the first 10 requests fill the queue and block, then asserted that every additional request fails (as expected). The successful requests only unblock after the assertions for the failing ones complete. This removes another potential issue where, depending on the test runner's speed, more requests could fail than expected; the test now passes with the exact number of expected successful and failing requests regardless of the runner's speed.
  • With the smaller queue size the test now runs much faster than before (~10 ms on my machine), which makes verifying that it is now stable with go test -failfast -count=2000 -run ^Test_Queued$ github.com/spacemeshos/go-spacemesh/p2p/server much easier 🙂
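
A minimal, self-contained sketch of the unbuffered-channel-plus-semaphore pattern, with a hypothetical request type and sizes (this is not the actual server.Server code):

```go
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/sync/semaphore"
)

// request stands in for the server's queue items.
type request struct {
	id       int
	received time.Time
}

// serve drains the unbuffered queue. The weighted semaphore bounds how
// many requests are in flight; TryAcquire and Release are paired in the
// same block, so the accounting cannot drift.
func serve(ctx context.Context, queue <-chan request, queueSize int64) {
	sem := semaphore.NewWeighted(queueSize)
	for {
		select {
		case <-ctx.Done():
			return
		case req := <-queue:
			if !sem.TryAcquire(1) {
				fmt.Printf("request %d dropped: queue full\n", req.id)
				continue
			}
			go func(req request) {
				defer sem.Release(1) // pairs with TryAcquire above
				time.Sleep(5 * time.Millisecond) // simulate handling
				fmt.Printf("request %d handled\n", req.id)
			}(req)
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	queue := make(chan request) // unbuffered: no hidden extra slot
	go serve(ctx, queue, 10)

	for i := 0; i < 20; i++ {
		queue <- request{id: i, received: time.Now()}
	}
	time.Sleep(20 * time.Millisecond) // let in-flight handlers finish
}
```

Because the channel is unbuffered, the semaphore alone decides how many requests can be pending, which is what removes the off-by-one of the old buffered queue.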

Test Plan

  • tests are now stable

TODO

  • Explain motivation or link existing issue(s)
  • Test changes and document test plan
  • Update documentation as needed
  • Update changelog as needed

@fasmat self-assigned this Jul 3, 2024

codecov bot commented Jul 3, 2024

Codecov Report

Attention: Patch coverage is 71.42857% with 8 lines in your changes missing coverage. Please review.

Project coverage is 81.5%. Comparing base (0ec1b2e) to head (c268a18).

Files                 Patch %   Lines
p2p/server/server.go  71.4%     5 missing and 3 partials ⚠️
Additional details and impacted files
@@            Coverage Diff            @@
##           develop   #6097     +/-   ##
=========================================
- Coverage     81.6%   81.5%   -0.1%     
=========================================
  Files          302     302             
  Lines        32454   32466     +12     
=========================================
+ Hits         26487   26492      +5     
- Misses        4235    4239      +4     
- Partials      1732    1735      +3     


p2p/server/server.go:

	srv.metrics.targetRps.Set(float64(srv.limit.Limit()))
}
srv.h.SetStreamHandler(protocol.ID(srv.protocol), func(stream network.Stream) {
	srv.queue <- request{stream: stream, received: time.Now()}
Contributor:

There was a select with a default case in the block that was moved. Any particular reason to remove the default case? Ideally there would be a context here that could be selected on. The problem remains the rate limiting, which could result in messages being dropped; also, goroutines could pile up waiting to write to this channel, which would have other repercussions, like messages timing out on the other end. It might be useful to dwell on this a bit longer.

A bit of food for thought: maybe doing rate limiting at the p2p level is not a good approach. Some protocols might be more aggressive with resources than others, which could starve the rest. That means rate limiting at the p2p level won't necessarily solve the resource problem, and it also removes the possibility of prioritizing some protocols over others.
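
For reference, the select-with-default pattern referred to here looks roughly like this (an illustration using the request type from the snippet above, not the exact code that was removed):

```go
// enqueue offers the stream to the worker without blocking; the
// default case drops the request immediately when the queue is full.
func enqueue(queue chan<- request, stream network.Stream) bool {
	select {
	case queue <- request{stream: stream, received: time.Now()}:
		return true
	default:
		_ = stream.Close() // queue full: drop the request
		return false
	}
}
```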

@fasmat (Member, Author):

Check the reading side of the channel: every incoming request is handled immediately, and if it exceeds the queue size it is dropped immediately. I did this so that the TryAcquire and Release calls to the semaphore are in the same code block.

I could add the TryAcquire to the callback of SetStreamHandler and only send the request to the queue (to be handled in Run) if that check passes, but then the Release would be disconnected from the TryAcquire call 🤷

Contributor:

If the reader goes away, then the goroutines pile up on shutdown. I'm not sure what repercussions this may have (or not have). As a rule of thumb, I think we should try to avoid writing directly to channels (without selects). But that's just me...
It could be that it is inconsequential, but it could potentially have some side effects. I'm not that familiar with the code.

@fasmat (Member, Author):

I updated the SetStreamHandler callback so that the handling happens on the sender side of the channel instead of the receiver side, see b0b6cf9. This has the advantage of dropping requests on the sending side of the channel if a) the context is already done or b) the semaphore is full. But since the callback now has to be set in the Run method, I need an additional check with the Ready method; otherwise the first request might be sent from the client to the server before the SetStreamHandler callback has been set, causing the test to halt.

So it is either:

  • waiting on Ready after calling Run, or
  • handling limiting and context checking in the Run method instead of in the SetStreamHandler callback.

A rough sketch of the sender-side variant follows below.
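
Roughly what such a sender-side callback might look like (illustrative only; the actual change is in b0b6cf9):

```go
// Set inside Run, so ctx is the server's run context. Requests are
// rejected at the point of entry: when the context is done or the
// semaphore is exhausted, the stream is closed immediately and nothing
// is sent on the channel.
srv.h.SetStreamHandler(protocol.ID(srv.protocol), func(stream network.Stream) {
	select {
	case <-ctx.Done():
		_ = stream.Close() // shutting down: drop the request
		return
	default:
	}
	if !srv.sem.TryAcquire(1) {
		_ = stream.Close() // too many requests in flight: drop
		return
	}
	srv.queue <- request{stream: stream, received: time.Now()}
})
```

The matching Release then happens on the receiving side once the request has been handled, which is exactly the TryAcquire/Release split discussed above.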

@fasmat (Member, Author):

OK, never mind: I found a way to do it without needing the Ready method 🙂

@fasmat requested a review from acud July 4, 2024 00:14
@acud (Contributor) left a comment:

Nice, thanks!

@fasmat (Member, Author) commented Jul 4, 2024

bors merge

spacemesh-bors bot pushed a commit that referenced this pull request Jul 4, 2024

spacemesh-bors bot commented Jul 4, 2024

Build failed:

  • systest-status

p2p/server/server.go:

srv.limit = rate.NewLimiter(
	rate.Every(srv.interval/time.Duration(srv.requestsPerInterval)),
	srv.requestsPerInterval,
)
srv.sem = semaphore.NewWeighted(int64(srv.queueSize))
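
For context on the snippet above: a limiter constructed with rate.Every this way admits requests at a steady pace via its Wait method. A minimal sketch with hypothetical values:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	// Refill one token every 100ms, with a burst capacity of 10,
	// i.e. at most 10 requests per second in steady state.
	interval := time.Second
	requestsPerInterval := 10
	limiter := rate.NewLimiter(
		rate.Every(interval/time.Duration(requestsPerInterval)),
		requestsPerInterval,
	)

	start := time.Now()
	for i := 0; i < 20; i++ {
		// Wait blocks until the next token is available.
		if err := limiter.Wait(context.Background()); err != nil {
			return
		}
		fmt.Printf("request %d admitted after %v\n", i, time.Since(start))
	}
}
```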
Contributor:

What do you think about using a goroutine pool like https://github.com/panjf2000/ants? It can automatically control how many tasks are executed concurrently, reuse goroutines instead of creating a new one for each request (which is cheap but not free), scale the workers down when idle, etc.
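
For reference, basic usage of that library looks roughly like this (a sketch with hypothetical sizes):

```go
package main

import (
	"fmt"
	"sync"
	"time"

	"github.com/panjf2000/ants/v2"
)

func main() {
	// A pool of 10 reusable workers; in the default (blocking) mode,
	// Submit waits for a free worker instead of spawning a new
	// goroutine per task.
	pool, err := ants.NewPool(10)
	if err != nil {
		panic(err)
	}
	defer pool.Release()

	var wg sync.WaitGroup
	for i := 0; i < 20; i++ {
		wg.Add(1)
		i := i
		if err := pool.Submit(func() {
			defer wg.Done()
			time.Sleep(5 * time.Millisecond) // simulate stream handling
			fmt.Println("handled task", i)
		}); err != nil {
			wg.Done() // task was not scheduled
		}
	}
	wg.Wait()
}
```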

@fasmat (Member, Author) commented Jul 4, 2024:

I took a quick look at ants, but I'm not sure it is a good fit here. There are two situations we want to limit:

  • If the queue is full, we want to drop incoming requests immediately. I don't quite see how to make this possible with Pool.Submit(): it would have to be WithNonblocking(true), but then we don't know whether the submit actually started a routine or not (and if it didn't, we need to close the stream and update metrics accordingly).
  • A maximum rate of incoming requests: an ants.Pool will always process as many requests concurrently as it has capacity for, but we actually want to limit to at most x requests per timeframe. If requests come in at a higher rate, they should block for as long as necessary until the rate drops below the target.

I don't see ants as a good fit for these requirements. As far as I understand the package, it just ensures that no more than a certain number of requests are in flight at the same time, while any new incoming request either blocks until a running one finishes or fails immediately, without the more fine-grained behaviour we have now 😕

EDIT: the given requirements can probably be met with ants plus some extra code, but then it would only replace the semaphore while we'd still need the rate limiter.

Contributor:

I was thinking of a goroutine pool only to achieve two goals:

  • reusing goroutines
  • controlling how many streams are processed concurrently (via the pool size)

We would need to use the pool in blocking mode and keep the rate limiter and the queue (for buffering waiting requests) as we do now; see the sketch below.
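
A rough sketch of that combination (hypothetical sizes; the buffering queue is omitted for brevity):

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/panjf2000/ants/v2"
	"golang.org/x/time/rate"
)

func main() {
	// The blocking pool bounds concurrency; the limiter paces
	// admission, mirroring the current rate limiter + semaphore setup.
	pool, err := ants.NewPool(10)
	if err != nil {
		panic(err)
	}
	defer pool.Release()

	limiter := rate.NewLimiter(rate.Every(100*time.Millisecond), 10)

	for i := 0; i < 20; i++ {
		if err := limiter.Wait(context.Background()); err != nil {
			return
		}
		i := i
		if err := pool.Submit(func() {
			fmt.Println("processing stream", i)
		}); err != nil {
			fmt.Println("dropped stream", i, err)
		}
	}
	time.Sleep(50 * time.Millisecond) // let the last workers finish
}
```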

@fasmat (Member, Author) commented Jul 4, 2024

bors merge

spacemesh-bors bot pushed a commit that referenced this pull request Jul 4, 2024

spacemesh-bors bot commented Jul 4, 2024

Build failed (retrying...):

spacemesh-bors bot pushed a commit that referenced this pull request Jul 4, 2024

spacemesh-bors bot commented Jul 4, 2024

Build failed:

@fasmat (Member, Author) commented Jul 4, 2024

bors merge

spacemesh-bors bot pushed a commit that referenced this pull request Jul 4, 2024

spacemesh-bors bot commented Jul 4, 2024

Build failed:

@fasmat (Member, Author) commented Jul 4, 2024

bors merge

spacemesh-bors bot pushed a commit that referenced this pull request Jul 4, 2024

spacemesh-bors bot commented Jul 4, 2024

Pull request successfully merged into develop.

Build succeeded:

spacemesh-bors bot changed the title Fix flaky TestQueued → [Merged by Bors] - Fix flaky TestQueued Jul 4, 2024
spacemesh-bors bot closed this Jul 4, 2024
spacemesh-bors bot deleted the fix-flaky-testqueued branch July 4, 2024 17:08