Simplify delivery loop logic #1743
Conversation
Dependency Review
✅ No vulnerabilities, license issues, or OpenSSF Scorecard issues found.
Codecov Report
Attention: Patch coverage is …
Additional details and impacted files:

@@            Coverage Diff             @@
##              main    #1743      +/-   ##
==========================================
+ Coverage    85.31%   85.34%   +0.03%
==========================================
  Files          464      464
  Lines        10783    10775       -8
  Branches      1596     1591       -5
==========================================
- Hits          9199     9196       -3
  Misses         870      870
+ Partials       714      709       -5
Looking at the code, I think simplifying is the right choice.
But looking at this logic, I think we can simplify even further: instead of calling createPayloadQueue to return all the outstanding payloads, we just need the next one, since the queue will be refreshed at every iteration anyway.
Basically, this means we are managing the loop via jobs submitted to schedulingWorker, which will ensure that at every iteration we look at the latest state of the in-memory cache, which seems fine? If there's an excess of jobs that run and find that nothing is to be delivered, it's not so bad. There shouldn't even be much backpressure built up as a result, given how fast each job should run.
-    deliveryTracer?.onStartDeliveryLoop(false)
+    deliveryTracer?.onStartDeliveryLoop()
     schedulingWorker.submit {
         deliveryLoop()
     }
 }
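For concreteness, here is a minimal, self-contained sketch of the job-driven loop described above. Every name in it (DeliveryLoopSketch, nextPayload, the String payloads) is a stand-in for illustration, not the SDK's actual code:

import java.util.concurrent.Executors

// Sketch of the self-scheduling loop: each iteration is a fresh job on
// the scheduling worker, so it always reads the latest state and bails
// when there is nothing left to deliver.
object DeliveryLoopSketch {
    private val schedulingWorker = Executors.newSingleThreadExecutor()
    private val cache = ArrayDeque(listOf("payload-1", "payload-2"))

    // Stand-in for createPayloadQueue(): return the next payload, or null.
    private fun nextPayload(): String? =
        synchronized(cache) { cache.removeFirstOrNull() }

    fun deliveryLoop() {
        val payload = nextPayload() ?: return // nothing to deliver: bail
        println("delivering $payload")
        // Schedule the next iteration as a new job instead of looping here;
        // redundant jobs that find nothing to send are cheap.
        schedulingWorker.submit { deliveryLoop() }
    }
}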
Change this method to return only the next payload to be delivered. The loop will basically poll at every iteration and bail when there's nothing left to send.
See this comment inline on Graphite.
I attempted this, but it became problematic: taking only the first payload means it's necessary to track which payloads were attempted and which weren't. Without adding that state, I simply ran into GC problems, as the proposed change would loop indefinitely.
So the in-progress payload should be filtered out by the filter on activeSends inside StoredTelemetryMetadata.shouldSendPayload(), which is called inside createPayloadQueue. Doesn't that happen if you simply changed createPayloadQueue to return not the whole LinkedList but only the first element?
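For illustration, the shape of that check might be something like the sketch below; the SDK's actual shouldSendPayload() logic and the type of activeSends aren't shown in this thread, so treat both as assumptions:

// Hypothetical shape of the filter described above: a payload whose send
// is already in flight (tracked in activeSends) is skipped.
private fun StoredTelemetryMetadata.shouldSendPayload(): Boolean =
    this !in activeSends // plus whatever other readiness checks apply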
e.g. change that function to:
private fun createPayloadQueue(exclude: Set<StoredTelemetryMetadata> = emptySet()): StoredTelemetryMetadata? {
    val payloadsByPriority = storageService.getPayloadsByPriority()
    val payloadToSend = payloadsByPriority
        .filter { it.shouldSendPayload() && it !in exclude }
        .sortedWith(storedTelemetryComparator)
        .firstOrNull() // null when nothing is deliverable, rather than throwing
    deliveryTracer?.onPayloadQueueCreated(
        payloadsByPriority,
        payloadToSend,
    )
    return payloadToSend
}
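A caller could then drive the loop by accumulating the payloads it has already attempted, along the lines of this hypothetical snippet (queueDelivery is assumed to hand the payload off for sending; none of this is code from the PR):

val attempted = mutableSetOf<StoredTelemetryMetadata>()
while (true) {
    // Poll for the next deliverable payload, skipping ones already tried.
    val payload = createPayloadQueue(exclude = attempted) ?: break // none left: bail
    attempted += payload
    queueDelivery(payload)
}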
It doesn't work for the case where a payload isn't enqueued (e.g. if an endpoint is rate-limited), and it hangs the application. We could track those payloads too, but I'm wary that might add more complexity on top. We can chat synchronously about options later today.
Disregard my comment above: I think I just made a mistake in the original implementation of these changes, and things now work fine in my latest iteration. There are only two states I can see (failed vs. active), so this approach should work fine.
LGTM. One comment about not returning the entire queue but just the next payload since we'll never need more than that now.
Force-pushed from 24fd780 to 220a57a
Awesome. We'll watch out for new flakes after this is merged
Force-pushed from 220a57a to 6c214b9
Goal
Simplifies the delivery loop logic to avoid checking state, as the state checks were causing flakes in our integration tests.
It's worth noting that (1) deliveryLoop can now be called multiple times, and (2) queueDelivery is much more likely to fail as a consequence of this change, once old payloads are deleted from disk. FWIW, both of these conditions could occur with the old approach too, via scheduleDeliveryLoopForNextRetry.

One alternative, given that deliveryWorker is only ever used in this class, would be to have deliveryWorker infinitely poll a BlockingQueue, and to have schedulingWorker add to that queue. This adds complications around checking for uniqueness of the elements in the queue and is a larger change, however.

Alternatively, given this is just an issue in testing and not really with deliverability, we could just update our test harness to flush any payloads that are left hanging.
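For illustration, here is a minimal, self-contained sketch of that BlockingQueue alternative; all names are stand-ins rather than the SDK's actual workers:

import java.util.concurrent.LinkedBlockingQueue
import kotlin.concurrent.thread

object BlockingQueueSketch {
    private val queue = LinkedBlockingQueue<String>()
    // Tracking what's already enqueued is the uniqueness complication
    // mentioned above.
    private val enqueued = mutableSetOf<String>()

    // Stand-in for deliveryWorker: blocks until a payload arrives.
    private val deliveryWorker = thread(isDaemon = true) {
        while (true) {
            val payload = queue.take() // blocks while the queue is empty
            synchronized(enqueued) { enqueued -= payload }
            println("delivering $payload")
        }
    }

    // Stand-in for the schedulingWorker side: enqueue only if not present.
    fun schedule(payload: String) {
        synchronized(enqueued) {
            if (enqueued.add(payload)) {
                queue.put(payload)
            }
        }
    }
}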
Testing
I ran the integration tests ~15 times and they passed consistently with these changes; without them, one test case usually fails on every other run.