switch event exporter to bulk api #47369

Merged (2 commits) on Oct 16, 2024

fspmarshall (Contributor)

This PR switches the event exporter over to the new bulk event export API. Most of the changes rework the old logic to behave more like the new event exporter helpers introduced in previous PRs, so that it is easier to switch from the old API to the new one once the control plane has been upgraded. The one standout exception is a change to how session events are exported.

Prior to this change, the event exporter's session event processing was entirely asynchronous. Each session whose events needed to be processed had its initial cursor state written to disk, and a background goroutine was spawned that would eventually kick off processing of the session when capacity opened up. The new bulk event export API is far too fast for this strategy to work: audit data with high session density (e.g. massive automated workloads) ends up producing an ever-growing number of on-disk cursors and background goroutines, which will eventually swamp the host system. To mitigate this, there is now a fixed-size backlog for unstarted sessions, and backpressure is exerted on primary event processing when the backlog is full.

The overall effect of this PR in dedicated test cases is a roughly 8x to 10x increase in event processing throughput, depending on how session-heavy the data is and what the specified --concurrency value is. On a 128-core machine I found the optimal --concurrency value to be about 256 when churning through a large backlog of session-heavy audit log data, though note that the vast majority of Teleport clusters do not produce enough audit log data to merit a --concurrency value nearly this high.

Note: this PR (and the preceding bulk event export API work in general) represents a bit of a philosophical shift. A lot less work is done to avoid duplicate and out-of-order event emission. In general, the event handler now prioritizes event throughput over ordering/deduplication.
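One practical consequence is that anything consuming the exported events may need to tolerate repeats. As a rough sketch of what that could look like downstream (this is not part of the PR, and the `event` type, its `ID` field, and `deduper` are hypothetical names chosen for illustration), a consumer can remember a bounded window of recently seen event IDs and drop duplicates:

```go
package main

import "fmt"

// event is a hypothetical, simplified audit event. Only the ID matters here.
type event struct {
	ID   string
	Type string
}

// deduper remembers up to `limit` recent event IDs and rejects repeats
// that fall inside that window.
type deduper struct {
	limit int
	seen  map[string]struct{}
	order []string // IDs in insertion order, oldest first
}

func newDeduper(limit int) *deduper {
	return &deduper{limit: limit, seen: make(map[string]struct{})}
}

// Accept reports whether the event is new within the remembered window.
func (d *deduper) Accept(e event) bool {
	if _, dup := d.seen[e.ID]; dup {
		return false
	}
	// Evict the oldest ID once the window is full (a non-positive limit
	// disables eviction, so the window grows without bound).
	if d.limit > 0 && len(d.order) >= d.limit {
		oldest := d.order[0]
		d.order = d.order[1:]
		delete(d.seen, oldest)
	}
	d.seen[e.ID] = struct{}{}
	d.order = append(d.order, e.ID)
	return true
}

func main() {
	d := newDeduper(1024)
	stream := []event{
		{ID: "a", Type: "session.start"},
		{ID: "b", Type: "session.end"},
		{ID: "a", Type: "session.start"}, // duplicate delivery
	}
	for _, e := range stream {
		fmt.Println(e.ID, d.Accept(e))
	}
}
```

A window like this only catches duplicates that arrive close together; a sink that needs stronger guarantees would have to deduplicate on a stable key in its own storage.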

Status: PR is good to review, but some manual testing is still ongoing so expect minor ongoing changes.

Closes #46193

Changelog: reworked the teleport-event-handler integration to significantly improve performance, especially when running with larger --concurrency values.

fspmarshall force-pushed the fspmarshall/bulk-event-export-final branch from b4bc6e7 to 3003694 on October 15, 2024 16:10
fspmarshall requested a review from tigrato on October 15, 2024 16:12
fspmarshall enabled auto-merge on October 15, 2024 21:30
fspmarshall added this pull request to the merge queue on Oct 16, 2024
Merged via the queue into master with commit 1238300 on Oct 16, 2024
39 checks passed
fspmarshall deleted the fspmarshall/bulk-event-export-final branch on October 16, 2024 15:48
@public-teleport-github-review-bot

@fspmarshall See the table below for backport results.

| Branch | Result |
| --- | --- |
| branch/v14 | Failed |
| branch/v15 | Failed |
| branch/v16 | Create PR |

Successfully merging this pull request may close these issues.

Provide a mechanism to export audit events more efficiently and reliably