Fleet runs out of memory when listing software in the hosts collection endpoint for several concurrent requests involving hosts with large software lists #22291
Comments
Additional thought. Do we as a practice need to be passing in
Assigning to Endpoint Ops team for further investigation.
This also happened to another cloud customer with 4GB of memory allocated. This one had fewer than 3k hosts. Adding the customer label.
@sharon-fdm I'm elevating this to a P2 because we're seeing an upward trend in memory usage since 4.56.0.
@lukeheath, we will try to assign someone this sprint, with some risk that we can't get to it.
@georgekarrv Getting this on your radar as well, in case anyone on MDM has capacity.
@rfairburn Is this bug related to #22122?
Not that I know of.
Hey team! Please add your planning poker estimate with Zenhub @mostlikelee @iansltx @lucasmrod @getvictor
@rfairburn Have we seen any further incidents of this issue since it was last reported?
Looking at the original event (with a bit of extra context not posted here, for one of the tagged customers), this particular CPU/RAM usage spike appears to have been due to an enumeration script that pulled the following for all teams:
Those requests were pretty concurrent, and we returned large payloads (50-105 MB) on those responses. My bet is that a large number of hosts times a large number of software packages per host (with software packages needing to be repeated per host) times a large number of teams being hit with the same request pushed RAM usage over the edge, and probably spun the database a bit as well. Going to take a quick look at how this endpoint works/how data shape (host count, package count) affects response payload size. Will wait for confirmation that this is the most significant RAM issue we've seen (and wait for some more supporting info on the shape of the data we're working with) before digging super deep here, but my hunch is that we either need to optimize the existing hosts endpoint when software is expanded, or add an endpoint/query option that allows us to deliver this data more efficiently.
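For illustration, here is a minimal Go sketch of the kind of concurrent per-team request pattern described above. The endpoint path, query parameter names (`team_id`, `per_page`, `populate_software`), and all constants are assumptions for the sake of the example, not taken from the customer's actual script:

```go
// Hedged sketch: concurrently fetch hosts (with software expanded) for many
// teams, approximating the enumeration pattern described above.
// Endpoint path, query parameters, and constants are assumptions.
package main

import (
	"fmt"
	"io"
	"net/http"
	"sync"
)

func main() {
	const (
		baseURL = "https://fleet.example.com" // hypothetical server URL
		token   = "API_TOKEN"                 // hypothetical API token
	)
	teamIDs := []int{1, 2, 3, 4, 5} // hypothetical team IDs

	var wg sync.WaitGroup
	for _, teamID := range teamIDs {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			// One in-flight request per team; responses in the 50-105 MB range
			// were observed when software is expanded for 100 hosts at a time.
			url := fmt.Sprintf("%s/api/v1/fleet/hosts?team_id=%d&per_page=100&populate_software=true", baseURL, id)
			req, err := http.NewRequest(http.MethodGet, url, nil)
			if err != nil {
				fmt.Println("build request:", err)
				return
			}
			req.Header.Set("Authorization", "Bearer "+token)
			resp, err := http.DefaultClient.Do(req)
			if err != nil {
				fmt.Println("request failed:", err)
				return
			}
			defer resp.Body.Close()
			body, _ := io.ReadAll(resp.Body)
			fmt.Printf("team %d: %d bytes\n", id, len(body))
		}(teamID)
	}
	wg.Wait()
}
```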
Of note, I'll probably rescope this ticket to address a unique source of memory exhaustion if there are multiple different scenarios, so we can close issues as they're resolved without missing that there's still work to do elsewhere in the app.
Taking a quick look at dogfood, our average host size in the response when including software is 250KB (25MB for 100 hosts). The average climbs to 340KB per host when looking at a smaller set that's a bit more Linux-y, and pulling a single random Linux host's software endpoint (slightly more verbose but same ballpark) hands back ~600KB when I grab all results rather than just the first 20. So ~100 MB of data on 100 hosts is within the realm of reason. Looking at the endpoint implementation itself, we have an N+1 when retrieving software, and data structures are built independently for identical pieces of software across multiple hosts. Given that we want to maintain the same response format for this endpoint, my guess is that we could aggregate software/paths, reuse structs that have the same data inside, and fan out to JSON as late as possible in the process. I'd have to profile memory usage to see whether that's actually where the problem is, and my bet is that JSON response building isn't streamed, but we should be able to get RAM usage down here a bit, and also make this endpoint a lot lighter on the DB if we want to support efficiently dispatching this use case.
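As a rough illustration of the struct-reuse and late-fan-out ideas (these are hypothetical types, not Fleet's actual data model or endpoint code), a sketch in this direction would cache each distinct software package once, share it by pointer across hosts, and encode hosts one at a time rather than materializing the whole payload:

```go
// Hedged sketch of reusing one Software value across hosts instead of
// rebuilding an identical struct per host. Types and fields are illustrative.
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

type Software struct {
	ID      uint   `json:"id"`
	Name    string `json:"name"`
	Version string `json:"version"`
}

type Host struct {
	ID       uint        `json:"id"`
	Software []*Software `json:"software"`
}

func main() {
	// Cache keyed by software ID: each distinct package is allocated once and
	// shared by pointer across every host that reports it.
	cache := map[uint]*Software{}
	lookup := func(id uint, name, version string) *Software {
		if s, ok := cache[id]; ok {
			return s
		}
		s := &Software{ID: id, Name: name, Version: version}
		cache[id] = s
		return s
	}

	hosts := []Host{
		{ID: 1, Software: []*Software{lookup(10, "openssl", "3.0.13")}},
		{ID: 2, Software: []*Software{lookup(10, "openssl", "3.0.13")}}, // reuses the same struct
	}

	// Encode one host at a time so the full response never has to sit in a
	// single in-memory buffer (a rough stand-in for streaming the JSON out).
	enc := json.NewEncoder(os.Stdout)
	for _, h := range hosts {
		if err := enc.Encode(h); err != nil {
			fmt.Println("encode:", err)
			return
		}
	}
}
```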
Based on internal discussions, rescoping this issue to the one we have documentation/logs readily available for, which is the one I'm covering in my comments. I'm sure I'd be able to repro this as a spike in RAM usage (and probably CPU) by grabbing e.g. 500-1000 hosts plus software in a single request (vs. concurrently requesting 100 hosts per team across multiple teams). Just a question of prioritization given that lowering concurrency on the one-off script would have worked around the issue. If we wind up with other OOMs from this issue, that can push priority here upward. If we wind up with other OOMs for other reasons, those can be ticketed separately, as while we're probably using more RAM now due to e.g. additional MDM crons, a specific OOM will show us where the low-hanging fruit is.
Sorry, I missed you here (which comments?). Are we updating documentation? (I'm OK with that, just checking.)
Chatted with @lucasmrod on this and yep, makes sense to document that this particular endpoint may be RAM-intensive, so users should tread lightly with page size/concurrency when asking for embedded software. Will get a PR up for that here in a bit.
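To make "tread lightly" concrete, here is a hedged sketch of the gentler consumption pattern such documentation might suggest: sequential paging with a modest page size instead of large concurrent requests. The endpoint path, parameter names, and stop condition are assumptions for illustration only:

```go
// Hedged sketch: walk the hosts collection one page at a time with a small
// per_page value. Endpoint path, parameters, and stop condition are assumed.
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	const baseURL = "https://fleet.example.com" // hypothetical server URL
	const token = "API_TOKEN"                   // hypothetical API token

	for page := 0; page < 10; page++ {
		url := fmt.Sprintf("%s/api/v1/fleet/hosts?populate_software=true&per_page=25&page=%d", baseURL, page)
		req, err := http.NewRequest(http.MethodGet, url, nil)
		if err != nil {
			fmt.Println("build request:", err)
			return
		}
		req.Header.Set("Authorization", "Bearer "+token)
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			fmt.Println("request failed:", err)
			return
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()
		fmt.Printf("page %d: %d bytes\n", page, len(body))
		if len(body) < 100 { // hypothetical stop condition; real code would parse the JSON and stop on an empty hosts array
			break
		}
	}
}
```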
#22291 Co-authored-by: Rachael Shaw <r@rachael.wtf>
After a bit more investigation, this looks to be due to high-concurrency calls on /hosts/{id}, which is being tracked as part of #23078. @rfairburn feel free to continue dropping OOM issues here as long as this is open and I'll triage them into the proper ticket as needed, either creating new ones or tacking them onto existing tickets. Acceptance criteria for this ticket will remain constrained to the one endpoint/query string combo mentioned in the (edited) ticket title so there's a chance that this gets closed vs. becoming an ocean-boiling Epic :)
A customer also reported this issue using large (or no) pagination sizes on
So, #23078 and this ticket have actually converged (just caught up on the customer back-and-forth there), as the original scope on this one was determined to be "hosts collection endpoint with software set to true", and due to various clarifications/discussions #23078 is now scoped as the same thing. To deduplicate and avoid things getting lost, I'll get bugs filed for
Issues have been split out. Closing this as it's now effectively a duplicate of #23078, which is a P2 and so will land this sprint.
Large software lists, |
Fleet version: v4.56.0 (but also at least the previous 2 versions)
Web browser and operating system: N/A
💥 Actual behavior
Fleet container crashed with 6GB of memory allocated.
Details:
The APM results show increased container CPU and request latency around the time of the issue. There are a number of HTTP requests around the time of the issue that retrieve hosts in batches of 100 and were returning over 100MB of data each; it is possible that these are related. See the private Slack thread https://fleetdm.slack.com/archives/C03EG80BM2A/p1726910791814149?thread_ts=1726909537.718839&cid=C03EG80BM2A for details (contains customer data, so not pasted here).
🧑💻 Steps to reproduce
Unknown, but it appears that iterating over a large number of hosts (85-100k) with a lot of software and retrieving that software could be at least partially at play.
🕯️ More info (optional)
I haven't been able to isolate this to a specific API call. It does not currently appear to be related to uploading software installers.
Hopefully this can be reproduced in load testing, even if that requires using 4GB instead of 6 (we recommend 4GB per container in all cases).
Scope:
@lukeheath, @rfairburn, I am suggesting this scope. Feel free to comment.