
Fleet runs out of memory when listing software in the hosts collection endpoint for several concurrent requests involving hosts with large software lists #22291

Closed
rfairburn opened this issue Sep 21, 2024 · 22 comments
Labels: bug (Something isn't working as documented), customer-faltona, customer-rocher, #g-endpoint-ops (Endpoint ops product group), :release (Ready to write code. Scheduled in a release. See "Making changes" in handbook.), ~released bug (This bug was found in a stable release.)
Milestone: 4.59.0-tentative

Comments

@rfairburn
Contributor

rfairburn commented Sep 21, 2024

Fleet version: v4.56.0 (but also at least the previous 2 versions)

Web browser and operating system: N/A


💥  Actual behavior

The Fleet container crashed even with 6GB of memory allocated.

Details:

[Screenshots: container memory/CPU metrics and APM graphs around the time of the crash]

The APM results show increased container CPU and request latency around the time of the issue. There were a number of HTTP requests around that time retrieving hosts in batches of 100, each returning over 100MB of data; these could be related. See the private Slack thread https://fleetdm.slack.com/archives/C03EG80BM2A/p1726910791814149?thread_ts=1726909537.718839&cid=C03EG80BM2A for details (it contains customer data, so it is not pasted here).

🧑‍💻  Steps to reproduce

Unknown, but it appears that iterating over a large number of hosts (85-100k), each with a lot of software, and retrieving that software could be at least partially at play.

🕯️ More info (optional)

I haven't been able to isolate this to a specific API call. It does not currently appear to be related to uploading software installers.

Hopefully this can be reproduced in load testing, even if it requires using 4GB instead of 6 (as we recommend 4GB per container in all cases).

Scope:
@lukeheath, @rfairburn, I am suggesting this scope. Feel free to comment.

  • Create an environment and reproduce.
  • Find the root cause.
  • Suggest a fix (or fix it, if simple).
@rfairburn rfairburn added bug Something isn't working as documented ~released bug This bug was found in a stable release. customer-rocher labels Sep 21, 2024
@rfairburn
Contributor Author

Additional thought: as a practice, do we need to be passing in GOMEMLIMIT? If Go is not automatically detecting the container's memory limit, it may keep allocating because it thinks more memory is available. Setting the limit might help garbage collection kick in sooner and prevent this.
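For what it's worth, a minimal sketch of that idea, assuming the deployment exposes its container memory limit via an environment variable. GOMEMLIMIT is the standard Go 1.19+ soft-limit variable; FLEET_CONTAINER_MEMORY_BYTES and the 10% headroom figure are hypothetical, not anything we currently set:

```go
package main

import (
	"os"
	"runtime/debug"
	"strconv"
)

// setSoftMemoryLimit illustrates applying a Go soft memory limit at startup
// when GOMEMLIMIT isn't (or can't be) set in the container environment.
// FLEET_CONTAINER_MEMORY_BYTES is a hypothetical variable standing in for
// however the deployment exposes its cgroup memory limit; the 10% headroom
// is an assumption, not a measured recommendation.
func setSoftMemoryLimit() {
	if os.Getenv("GOMEMLIMIT") != "" {
		return // the runtime already honors GOMEMLIMIT from the environment
	}
	raw := os.Getenv("FLEET_CONTAINER_MEMORY_BYTES")
	limit, err := strconv.ParseInt(raw, 10, 64)
	if err != nil || limit <= 0 {
		return // no limit known; leave the runtime default in place
	}
	// Leave ~10% headroom so the GC works harder before the container's
	// hard limit is reached.
	debug.SetMemoryLimit(limit * 9 / 10)
}

func main() {
	setSoftMemoryLimit()
	// ... start the Fleet server as usual ...
}
```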

@sharon-fdm sharon-fdm added :release Ready to write code. Scheduled in a release. See "Making changes" in handbook. #g-endpoint-ops Endpoint ops product group :reproduce Involves documenting reproduction steps in the issue labels Sep 23, 2024
@sharon-fdm
Collaborator

sharon-fdm commented Sep 23, 2024

Assigning to Endpoint Ops team for further investigation.
Unassigning @lukeheath

@rfairburn
Contributor Author

This also happened to another cloud customer with 4GB of memory allocated; that environment had fewer than 3k hosts. Adding the customer label.

@lukeheath
Member

@sharon-fdm I'm elevating this to a P2 because we're seeing an upward trend in memory usage since 4.56.0.

@sharon-fdm
Collaborator

@lukeheath, we will try to assign someone this sprint, with some risk that we can't get to it.

@lukeheath
Member

@georgekarrv Getting this on your radar, as well, in case anyone on MDM has capacity.

@lukeheath
Member

@rfairburn Is this bug related to #22122?

@rfairburn
Contributor Author

Not that I know of.

@lukeheath lukeheath removed the P2 Prioritize as urgent label Sep 26, 2024
@sharon-fdm
Collaborator

Hey team! Please add your planning poker estimate with Zenhub @mostlikelee @iansltx @lucasmrod @getvictor

@sharon-fdm sharon-fdm added this to the 4.59.0-tentative milestone Oct 7, 2024
@iansltx
Member

iansltx commented Oct 8, 2024

@rfairburn Have we seen any further incidence of this issue since last reported?

@iansltx
Member

iansltx commented Oct 11, 2024

Looking at the original event (with a bit of extra context not posted here, for one of the tagged customers), this particular CPU/RAM usage spike appears to have been due to an enumeration script that pulled the following for all teams:

GET /api/v1/fleet/hosts?page=0&per_page=100&populate_software=True&team_id={id}

Those requests were highly concurrent, and we returned large payloads (50-105 MB) on those responses. My bet is that a large number of hosts, times a large number of software packages per host (with each package repeated per host in the response), times a large number of teams being hit with the same request pushed RAM usage over the edge, and probably spun the database a bit as well.
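For illustration, a rough reconstruction of that request pattern, sketched in Go; the base URL, token, and team IDs are placeholders, and the customer's actual script wasn't shared here:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"sync"
)

// Hypothetical reconstruction of the enumeration pattern: one request per
// team, all in flight at once, each asking for 100 hosts with software
// expanded. Base URL, token, and team IDs are placeholders.
func main() {
	const base = "https://fleet.example.com"
	const token = "API_TOKEN" // placeholder
	teamIDs := []int{1, 2, 3, 4, 5}

	var wg sync.WaitGroup
	for _, id := range teamIDs {
		wg.Add(1)
		go func(teamID int) {
			defer wg.Done()
			url := fmt.Sprintf(
				"%s/api/v1/fleet/hosts?page=0&per_page=100&populate_software=true&team_id=%d",
				base, teamID)
			req, _ := http.NewRequest("GET", url, nil)
			req.Header.Set("Authorization", "Bearer "+token)
			resp, err := http.DefaultClient.Do(req)
			if err != nil {
				return
			}
			defer resp.Body.Close()
			// In the original event, each of these responses was 50-105 MB.
			n, _ := io.Copy(io.Discard, resp.Body)
			fmt.Printf("team %d: %d bytes\n", teamID, n)
		}(id)
	}
	wg.Wait()
}
```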

Going to take a quick look at how this endpoint works/how data shape (host count, package count) affects response payload size. Will wait for confirmation that this is the most significant RAM issue we've seen (and wait for some more supporting info on the shape of the data we're working with) before digging super deep here, but my hunch is that we either need to optimize the existing hosts endpoint when software is expanded, or add an endpoint/query option that allows us to deliver this data more efficiently.

@iansltx
Member

iansltx commented Oct 11, 2024

Of note, I'll probably rescope this ticket to address a unique source of memory exhaustion if there are multiple different scenarios, so we can close issues as they're resolved without missing that there's still work to do elsewhere in the app.

@iansltx
Member

iansltx commented Oct 11, 2024

Taking a quick look at dogfood, our average host size in the response when including software is 250KB (25MB for 100 hosts). The average climbs to 340KB per host when looking at a smaller set that's a bit more Linux-y, and pulling a single random Linux host's software endpoint (slightly more verbose but same ballpark) hands back ~600KB when I grab all results rather than just the first 20.

So ~100 MB of data on 100 hosts is within the realm of reason.

Looking at the endpoint implementation itself, we have an N+1 query when retrieving software, and data structures are built independently for identical pieces of software across multiple hosts. Given that we want to maintain the same response format for this endpoint, my guess is that we could aggregate software/paths, reuse structs that hold the same data, and fan out to JSON as late as possible in the process. I'd have to profile memory usage to confirm that's actually where the problem is, and my bet is that JSON response building isn't streamed. Either way, we should be able to get RAM usage down here a bit, and also make this endpoint a lot lighter on the DB if we want to support this use case efficiently.
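To make that direction concrete, here's a rough sketch using simplified stand-in types (these are not Fleet's actual models or the endpoint's exact response framing): build each distinct software entry once, point every host at the shared value, and stream the encoding rather than materializing the whole response in memory.

```go
package hosts

import (
	"encoding/json"
	"net/http"
)

// Simplified stand-ins for Fleet's models; names and fields are illustrative.
type Software struct {
	ID      uint   `json:"id"`
	Name    string `json:"name"`
	Version string `json:"version"`
}

type HostRow struct {
	ID       uint
	Hostname string
}

// HostSoftwareRow is one row of a single joined query covering all requested
// hosts, replacing the per-host (N+1) software lookups.
type HostSoftwareRow struct {
	HostID     uint
	SoftwareID uint
	Name       string
	Version    string
}

type hostWithSoftware struct {
	ID       uint        `json:"id"`
	Hostname string      `json:"hostname"`
	Software []*Software `json:"software"`
}

// writeHosts shares one *Software value per distinct software ID across all
// hosts instead of allocating a copy per host, and encodes hosts one at a
// time so the full response is never buffered. The JSON array framing is
// written by hand so the output is still a single array.
func writeHosts(w http.ResponseWriter, hostRows []HostRow, rows []HostSoftwareRow) error {
	cache := make(map[uint]*Software)    // software ID -> shared struct
	byHost := make(map[uint][]*Software) // host ID -> its software slice
	for _, r := range rows {
		sw, ok := cache[r.SoftwareID]
		if !ok {
			sw = &Software{ID: r.SoftwareID, Name: r.Name, Version: r.Version}
			cache[r.SoftwareID] = sw
		}
		byHost[r.HostID] = append(byHost[r.HostID], sw)
	}

	enc := json.NewEncoder(w)
	if _, err := w.Write([]byte("[")); err != nil {
		return err
	}
	for i, h := range hostRows {
		if i > 0 {
			if _, err := w.Write([]byte(",")); err != nil {
				return err
			}
		}
		out := hostWithSoftware{ID: h.ID, Hostname: h.Hostname, Software: byHost[h.ID]}
		if err := enc.Encode(out); err != nil {
			return err
		}
	}
	_, err := w.Write([]byte("]"))
	return err
}
```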

@iansltx iansltx removed customer-honoria :reproduce Involves documenting reproduction steps in the issue labels Oct 15, 2024
@iansltx iansltx changed the title Fleet can run out of memory even with 6GB Allocated Fleet runs out of memory when listing software in the hosts collection endpoint for several concurrent requests involving hosts with large software lists Oct 15, 2024
@iansltx
Member

iansltx commented Oct 15, 2024

Based on internal discussions, rescoping this issue to the one we have documentation/logs readily available for, which is the one I'm covering in my comments. I'm sure I'd be able to repro this as a spike in RAM usage (and probably CPU) by grabbing e.g. 500-1000 hosts plus software in a single request (vs. concurrently requesting 100 hosts per team across multiple teams). Just a question of prioritization given that lowering concurrency on the one-off script would have worked around the issue.

If we wind up with other OOMs from this issue, that can push priority here upward. If we wind up with other OOMs for other reasons, those can be ticketed separately, as while we're probably using more RAM now due to e.g. additional MDM crons, a specific OOM will show us where the low-hanging fruit is.

@lucasmrod
Member

Based on internal discussions, rescoping this issue to the one we have documentation/logs readily available for, which is the one I'm covering in my comments.

Sorry, I missed you here (which comments?). Are we updating documentation? (Am ok with that, just checking.)

@iansltx
Member

iansltx commented Oct 15, 2024

Chatted with @lucasmrod on this and yep, makes sense to document that this particular endpoint may be RAM-intensive, so users should tread lightly with page size/concurrency when asking for embedded software. Will get a PR up for that here in a bit.
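As a companion to that documentation, the guidance could be illustrated with a client-side pattern like the following sketch: page sequentially with a modest per_page instead of one huge or many concurrent requests. The base URL, token, per_page of 25, and the "hosts" response field name are assumptions for illustration, not tested recommendations.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

// fetchTeamHosts pages sequentially with a small per_page so no single
// response (or burst of concurrent responses) forces the server to build an
// enormous payload in memory.
func fetchTeamHosts(base, token string, teamID int) error {
	for page := 0; ; page++ {
		url := fmt.Sprintf(
			"%s/api/v1/fleet/hosts?page=%d&per_page=25&populate_software=true&team_id=%d",
			base, page, teamID)
		req, err := http.NewRequest("GET", url, nil)
		if err != nil {
			return err
		}
		req.Header.Set("Authorization", "Bearer "+token)
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			return err
		}
		body, err := io.ReadAll(resp.Body)
		resp.Body.Close()
		if err != nil {
			return err
		}
		var out struct {
			Hosts []json.RawMessage `json:"hosts"`
		}
		if err := json.Unmarshal(body, &out); err != nil {
			return err
		}
		if len(out.Hosts) == 0 {
			return nil // no more pages
		}
		// ... process out.Hosts before requesting the next page ...
	}
}

func main() {
	// Placeholder values; substitute a real Fleet URL, token, and team ID.
	_ = fetchTeamHosts("https://fleet.example.com", "API_TOKEN", 1)
}
```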

@rfairburn
Contributor Author

It looks like this may have just happened on dogfood while the vuln dashboard was iterating over hosts.

[Screenshot: memory spike at the time of the container crash]

The container crashed as it hit that memory spike, so it either ran out of memory or had another error at the time, but memory usage seems the most likely candidate.

Hopefully this is already addressed by what is above. I can't really isolate this to a specific HTTP request, but figured it needed inclusion.

@iansltx
Member

iansltx commented Nov 4, 2024

After a bit more investigation, this looks to be due to high-concurrency calls on /hosts/{id}, which is being tracked as part of #23078. @rfairburn feel free to keep dropping OOM reports here as long as this is open, and I'll triage them into the proper ticket, either creating new ones or tacking them onto existing tickets as needed. Acceptance criteria for this ticket will remain constrained to the one endpoint/query-string combo mentioned in the (edited) ticket title, so there's a chance that this gets closed vs. becoming an ocean-boiling epic :)

@mostlikelee
Contributor

A customer also reported this issue when using large (or no) pagination sizes on /software/versions.

@iansltx
Member

iansltx commented Nov 9, 2024

So, #23078 and this ticket have actually converged (I just caught up on the customer back-and-forth there). The original scope on this one was determined to be "hosts collection endpoint with software set to true," and after various clarifications/discussions #23078 is now scoped as the same thing. /software/versions is its own mess and is now out of scope for the other issue, and high-concurrency calls to /hosts/{id} are now also out of scope for the other issue.

To deduplicate and avoid things getting lost, I'll get bugs filed for /hosts/{id} (currently only an issue for dogfood) and software/versions (which will probably come up again, potentially sooner rather than later, with the tagged customer), then close this one.

@iansltx
Member

iansltx commented Nov 10, 2024

Issues have been split out. Closing this as it's now effectively a duplicate of #23078, which is a P2 so will land this sprint.

@iansltx iansltx closed this as completed Nov 10, 2024
@fleet-release
Contributor

Large software lists,
Now unburdened, Fleet soars high,
Cloud city breathes free.
