
Simplify delivery loop logic #1743

Merged 1 commit into main on Dec 9, 2024

Conversation

@fractalwrench (Contributor) commented Dec 4, 2024

Goal

Simplifies the delivery loop logic to avoid checking state, as this was causing flakes in our integration tests.

It's worth noting two consequences of this change: (1) deliveryLoop can now be called multiple times, and (2) queueDelivery is much more likely to fail once old payloads are deleted from disk. FWIW, both of these conditions could also occur with the old approach via scheduleDeliveryLoopForNextRetry.

One alternative, given that deliveryWorker is only ever used in this class, would be to have deliveryWorker poll a BlockingQueue indefinitely, with schedulingWorker adding to that queue (a rough sketch follows below). However, this adds complications around enforcing uniqueness of the elements in the queue and is a larger change.
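
A rough sketch of what that alternative could look like; the PendingPayload type, sendPayload hook, and class name are placeholders for illustration, not the SDK's actual API:

import java.util.concurrent.ExecutorService
import java.util.concurrent.Executors
import java.util.concurrent.LinkedBlockingQueue

// Placeholder for the stored payload metadata; the real SDK uses StoredTelemetryMetadata.
data class PendingPayload(val id: String)

class BlockingQueueDeliverySketch(
    private val sendPayload: (PendingPayload) -> Unit,
) {
    private val deliveryWorker: ExecutorService = Executors.newSingleThreadExecutor()
    private val schedulingWorker: ExecutorService = Executors.newSingleThreadExecutor()

    // The queue decouples scheduling from delivery, but a plain LinkedBlockingQueue
    // accepts duplicates, so an auxiliary set is needed to enforce uniqueness - the
    // complication mentioned above.
    private val queue = LinkedBlockingQueue<PendingPayload>()
    private val pending = mutableSetOf<PendingPayload>()

    init {
        // deliveryWorker polls the queue indefinitely, blocking until work arrives.
        deliveryWorker.submit {
            while (!Thread.currentThread().isInterrupted) {
                val payload = queue.take()
                synchronized(pending) { pending.remove(payload) }
                sendPayload(payload)
            }
        }
    }

    // schedulingWorker enqueues payloads for delivery; duplicates are dropped.
    fun schedule(payload: PendingPayload) {
        schedulingWorker.submit {
            val added = synchronized(pending) { pending.add(payload) }
            if (added) {
                queue.put(payload)
            }
        }
    }
}

The uniqueness bookkeeping above is exactly the extra complexity that makes this a larger change than the approach taken in this PR.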

Alternatively, given this is just an issue in testing and not really with deliverability, we could just update our test harness to flush any payloads that are left hanging.

Testing

I ran the integration tests ~15 times and they passed consistently with these changes, whereas without these changes one test case fails on roughly every other run.

github-actions bot commented Dec 4, 2024

Dependency Review

✅ No vulnerabilities, license issues, or OpenSSF Scorecard issues found.


codecov bot commented Dec 4, 2024

Codecov Report

Attention: Patch coverage is 90.90909% with 1 line in your changes missing coverage. Please review.

Project coverage is 85.34%. Comparing base (9b5b93e) to head (6c214b9).
Report is 2 commits behind head on main.

Files with missing lines | Patch % | Lines
...esdk/internal/delivery/debug/DeliveryTraceState.kt | 50.00% | 1 Missing ⚠️
Additional details and impacted files


@@            Coverage Diff             @@
##             main    #1743      +/-   ##
==========================================
+ Coverage   85.31%   85.34%   +0.03%     
==========================================
  Files         464      464              
  Lines       10783    10775       -8     
  Branches     1596     1591       -5     
==========================================
- Hits         9199     9196       -3     
  Misses        870      870              
+ Partials      714      709       -5     
Files with missing lines | Coverage Δ
...bracesdk/internal/delivery/debug/DeliveryTracer.kt | 82.60% <100.00%> (ø)
...ernal/delivery/scheduling/SchedulingServiceImpl.kt | 87.82% <100.00%> (+0.02%) ⬆️
...esdk/internal/delivery/debug/DeliveryTraceState.kt | 55.10% <50.00%> (ø)

... and 3 files with indirect coverage changes

@bidetofevil (Collaborator) left a comment


Looking at the code, I think simplifying is the right choice.

But looking at this logic, I think we can simplify even further - instead of calling createPayloadQueue to return all the outstanding payloads, we just need the next one - the queue will be refreshed at every iteration anyway.

Basically, this means we are managing the loop via jobs submitted to schedulingWorker, which ensures that at every iteration we look at the latest state of the in-memory cache, which seems fine? If there's an excess of jobs that run and find nothing to deliver, it's not so bad. There shouldn't even be that much backpressure built up as a result, given how fast the job should run.

-    deliveryTracer?.onStartDeliveryLoop(false)
+    deliveryTracer?.onStartDeliveryLoop()
     schedulingWorker.submit {
         deliveryLoop()
     }
 }

Collaborator

[Re: line +78]

Change this method to return only the next payload to be delivered. It'll basically poll at every iteration and bail when there are none to be delivered.

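To illustrate the loop shape this suggests, here is a minimal, self-contained sketch. The names createPayloadQueue, schedulingWorker, and deliveryLoop mirror this thread, but the payload type and send logic are placeholders rather than the SDK's implementation:

import java.util.concurrent.Executors

object DeliveryLoopSketch {
    private val schedulingWorker = Executors.newSingleThreadExecutor { r ->
        Thread(r).apply { isDaemon = true }
    }
    private val outstanding = ArrayDeque(listOf("session-1", "log-batch-2"))

    // Stand-in for createPayloadQueue(): returns the next eligible payload, or null.
    private fun createPayloadQueue(): String? =
        synchronized(outstanding) { outstanding.removeFirstOrNull() }

    fun deliveryLoop() {
        val payload = createPayloadQueue() ?: return // nothing eligible: bail out
        println("delivering $payload")               // placeholder for the real send
        schedulingWorker.submit { deliveryLoop() }   // re-poll for the next payload
    }
}

fun main() {
    DeliveryLoopSketch.deliveryLoop()
    Thread.sleep(500) // give the daemon worker time to drain the sketch queue
}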

Contributor (Author)

I attempted this but it became problematic: only taking the first payload means it would be necessary to track which payloads had and hadn't been attempted. Without adding that state I simply ran into GC problems, as the proposed change would loop indefinitely.

Collaborator

So the in-progress payload should be filtered out by the filter on activeSends inside StoredTelemetryMetadata.shouldSendPayload(), which is called inside createPayloadQueue. Doesn't that happen if you simply change createPayloadQueue to return only the first element rather than the whole LinkedList?

Collaborator

e.g. change that function to:

private fun createPayloadQueue(exclude: Set<StoredTelemetryMetadata> = emptySet()): StoredTelemetryMetadata? {
    val payloadsByPriority = storageService.getPayloadsByPriority()
    // Pick the single highest-priority payload that is still eligible to send,
    // or null so the caller can bail when nothing remains.
    val payloadToSend = payloadsByPriority
        .filter { it.shouldSendPayload() && !exclude.contains(it) }
        .sortedWith(storedTelemetryComparator)
        .firstOrNull()
    deliveryTracer?.onPayloadQueueCreated(
        payloadsByPriority,
        payloadToSend,
    )
    return payloadToSend
}

Contributor (Author)

It doesn't work for the case where a payload isn't enqueued (e.g. if an endpoint is rate-limited), and it hangs the application. We could track those payloads too, but I'm wary that might add more complexity on top. We can chat synchronously about options later today.

Contributor (Author)

Disregard my comment above - I think I just made a mistake in the original implementation of these changes, and things now work fine in my latest iteration. There are only two states I can see (failed vs active), so this approach should work fine.

@bidetofevil (Collaborator) left a comment


LGTM. One comment about returning just the next payload rather than the entire queue, since we'll never need more than that now.

@fractalwrench fractalwrench force-pushed the simplify-delivery-loop branch from 24fd780 to 220a57a Compare December 6, 2024 16:28
@fractalwrench fractalwrench marked this pull request as ready for review December 6, 2024 16:30
@fractalwrench fractalwrench requested a review from a team as a code owner December 6, 2024 16:30
@bidetofevil (Collaborator) left a comment


Awesome. We'll watch out for new flakes after this is merged.

@fractalwrench fractalwrench force-pushed the simplify-delivery-loop branch from 220a57a to 6c214b9 Compare December 9, 2024 10:03
@fractalwrench fractalwrench merged commit f3c3d7f into main Dec 9, 2024
7 checks passed
@fractalwrench fractalwrench deleted the simplify-delivery-loop branch December 9, 2024 10:18