Add fallback to no kretprobes #582

Merged
merged 4 commits into from
Jan 29, 2024

@grcevski grcevski commented Jan 26, 2024

We used to track HTTP events with a socket filter, which was difficult to work with for the cases we cared about: it was not easy to find matching PIDs or to read more than 250 bytes. We replaced the socket filter with kprobes/kretprobes on tcp_sendmsg and tcp_recvmsg.

However, as discussed in the bug report #573, kretprobes can be dropped when the probed function takes too long to return, which makes the kretprobe on tcp_recvmsg a prime candidate under heavy network load.
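To make the failure mode concrete, here is a minimal sketch of how a fixed pool of kretprobe instances runs out when slow calls overlap. The `retprobePool` type and its methods are hypothetical and only model the behavior; they are not kernel or Beyla code.

```go
package main

import "fmt"

// retprobePool models the fixed pool of kretprobe instances (maxactive).
// Hypothetical type for illustration only.
type retprobePool struct {
	maxActive int // total instances available
	inFlight  int // probed calls that have entered but not yet returned
	missed    int // entries that found no free instance
}

// enter is called when the probed function is entered. It reports whether
// the return of this call will be observed.
func (p *retprobePool) enter() bool {
	if p.inFlight >= p.maxActive {
		p.missed++
		return false
	}
	p.inFlight++
	return true
}

// leave is called when a tracked call returns, freeing its instance.
func (p *retprobePool) leave() {
	if p.inFlight > 0 {
		p.inFlight--
	}
}

func main() {
	pool := &retprobePool{maxActive: 2}
	// Three slow receives overlap: the third finds no free instance.
	a, b, c := pool.enter(), pool.enter(), pool.enter()
	fmt.Println(a, b, c, pool.missed) // true true false 1
	// Once one call returns, new calls are tracked again.
	pool.leave()
	fmt.Println(pool.enter()) // true
}
```

The longer each call holds its instance, the more likely a concurrent call is to find the pool empty, which is why a blocking receive path is more exposed than short functions.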

To fix this issue, this PR brings back the socket filter, used only to create fallback information for HTTP requests. It doesn't do the full work it did before; it only captures the essential information to use if the kretprobe on tcp_recvmsg was dropped.
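The fallback idea can be sketched in user space roughly as follows. This is an assumption-laden illustration, not the code in this PR: `connKey`, `requestInfo`, `tracker`, and `resolve` are hypothetical names, and the real implementation keeps this state in eBPF maps.

```go
package main

import "fmt"

// connKey identifies a connection. Hypothetical and simplified.
type connKey struct {
	SrcPort, DstPort uint16
}

// requestInfo holds the essential HTTP details for one request.
type requestInfo struct {
	Method, Path string
	Fallback     bool // true when the data came from the socket filter
}

// tracker keeps what each source captured, keyed by connection.
type tracker struct {
	fromKretprobe    map[connKey]requestInfo
	fromSocketFilter map[connKey]requestInfo
}

// resolve prefers the kretprobe data and only falls back to the socket
// filter data when the kretprobe never fired for this connection.
func (t *tracker) resolve(k connKey) (requestInfo, bool) {
	if info, ok := t.fromKretprobe[k]; ok {
		return info, true
	}
	if info, ok := t.fromSocketFilter[k]; ok {
		info.Fallback = true
		return info, true
	}
	return requestInfo{}, false
}

func main() {
	t := &tracker{
		fromKretprobe:    map[connKey]requestInfo{},
		fromSocketFilter: map[connKey]requestInfo{},
	}
	k := connKey{SrcPort: 40000, DstPort: 8080}
	// Only the socket filter saw this request; the kretprobe was dropped.
	t.fromSocketFilter[k] = requestInfo{Method: "GET", Path: "/users"}
	info, ok := t.resolve(k)
	fmt.Println(info.Method, info.Path, info.Fallback, ok) // GET /users true true
}
```

Keeping the fallback path minimal means the socket filter only pays for itself when the primary path loses an event.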

I should also mention that we have other kretprobes, on sock_alloc and accept4. Those functions are typically very fast, so their kretprobes are unlikely to be dropped, unlike the one on receiving network buffers.

To test this scenario, I manually removed the code from the kretprobe on tcp_recvmsg and tested with Apache2 to confirm that we still see the routes.

codecov-commenter commented Jan 26, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (1b9bbd7) 79.26% compared to head (cbf5174) 79.64%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #582      +/-   ##
==========================================
+ Coverage   79.26%   79.64%   +0.37%     
==========================================
  Files          70       70              
  Lines        5907     5909       +2     
==========================================
+ Hits         4682     4706      +24     
+ Misses       1001      978      -23     
- Partials      224      225       +1     
Flag               Coverage Δ
integration-test   69.32% <100.00%> (+0.51%) ⬆️
unittests          44.87% <0.00%> (+0.08%) ⬆️


// We should make RetprobeMaxActive configurable when we make the number of concurrent requests configurable.
// By default the kernel sets it to max(10, 2 * num CPUs) on preemptible kernels (num CPUs otherwise), which is low for something like tcp_recvmsg. Max value is 4096.
// https://elixir.bootlin.com/linux/v5.19/source/kernel/kprobes.c#L2202
kp, err := link.Kretprobe(funcName, programs.End, &link.KprobeOptions{RetprobeMaxActive: 1024})

Other eBPF projects set this to either 1024 or the maximum. Let's start with 1024 and see how it works.
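For comparison with the chosen 1024, the kernel default can be sketched as below. This mirrors my reading of `register_kretprobe` in the v5.19 source linked in the code comment; `defaultMaxActive` is a hypothetical helper, and the CPU counts in `main` are example inputs rather than measurements.

```go
package main

import "fmt"

// defaultMaxActive approximates the value the kernel picks when maxactive
// is left unset: max(10, 2*CPUs) on preemptible kernels, CPUs otherwise.
// Hypothetical helper for illustration only.
func defaultMaxActive(numCPUs int, preemptible bool) int {
	if !preemptible {
		return numCPUs
	}
	if 2*numCPUs > 10 {
		return 2 * numCPUs
	}
	return 10
}

func main() {
	fmt.Println(defaultMaxActive(4, true))   // 10
	fmt.Println(defaultMaxActive(64, true))  // 128
	fmt.Println(defaultMaxActive(64, false)) // 64
}
```

Even on a 64-CPU preemptible host the default is 128 instances, well below 1024, so a busy server with many overlapping tcp_recvmsg calls could exhaust it.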

@grcevski grcevski merged commit 78e18f4 into grafana:main Jan 29, 2024
4 checks passed
@grcevski grcevski deleted the handle_missing_kretprobe branch January 29, 2024 14:22
@grcevski
Thanks Mario!
