switch event exporter to bulk api #47369
Merged
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This PR switches over the event exporter to use the new bulk event export API. Most of the actual changes in this PR relate to reworking the old logic to behave more like the new event exporter helpers introduced in previous PRs so that it is easier to switch from the old API to the new one once the control plane has been upgraded. The one standout exception is a change to how export of session events works.
Prior to this change, the event exporter's session event processing was entirely asynchronous. Each session whose events needed to be processed had its initial cursor state written to disk, and a background goroutine was spawned that would eventually kick off processing of the session when capacity opened up. The new bulk event export API is way too fast for this strategy to work, and audit data with high session density (e.g. massive automated workloads) ends up producing an ever-growing number of cursors on disk and background coroutines which will eventually swamp the host system. In order to mitigate this effect, there is now a fixed sized backlog for unstated sessions and backpressure is exerted on primary event processing if the backlog is filled.
The overall effect of this PR in dedicated test cases is a roughly 8x to 10x increase in event processing throughput, depending on how session-heavy the data is, and what the specified
--concurrency
value is. On a 128 core machine I found the optimal--concurrency
value to be about 256 when churning through a large backlog of session heavy audit log data, though note that the vast majority of teleport clusters do not produce enough audit log data to merit a--concurrency
value nearly this high.Note: this PR (and the preceding bulk event export API work in general) represents a bit of a philosophical shift. A lot less work is done to avoid duplicate and out of order event emission. In general, the event handler now prioritizes event throughput over ordering/deduplication.
Status: PR is good to review, but some manual testing is still ongoing so expect minor ongoing changes.
Closes #46193
Changelog: reworked the
teleport-event-handler
integration to significantly improve performance, especially when running with larger--concurrency
values.