
Fleet runs out of memory when listing software in the hosts collection endpoint for several concurrent requests involving hosts with large software lists #22291

Closed
rfairburn opened this issue Sep 21, 2024 · 22 comments
Labels: bug (Something isn't working as documented), customer-faltona, customer-rocher, #g-endpoint-ops (Endpoint ops product group), :release (Ready to write code. Scheduled in a release. See "Making changes" in handbook.), ~released bug (This bug was found in a stable release.)
Milestone: 4.59.0-tentative

Comments

@rfairburn
Contributor

rfairburn commented Sep 21, 2024

Fleet version: v4.56.0 (but also at least the previous 2 versions)

Web browser and operating system: N/A


💥  Actual behavior

The Fleet container crashed even with 6GB of memory allocated.

Details:

[Screenshots: container memory/CPU metrics and APM graphs around the time of the crash]

The APM results show increased container CPU and request latency around the time of the issue. There were a number of HTTP requests around that time retrieving hosts in batches of 100, each returning over 100MB of data; these could be related. See the private Slack thread https://fleetdm.slack.com/archives/C03EG80BM2A/p1726910791814149?thread_ts=1726909537.718839&cid=C03EG80BM2A for details (it contains customer data, so it is not pasted here).

🧑‍💻  Steps to reproduce

Unknown, but it appears that iterating over a large number of hosts (85-100k), each with a lot of software, and retrieving that software could be at least partially at play.

🕯️ More info (optional)

I haven't been able to isolate this to a specific API call. It does not currently appear to be related to uploading software installers.

Hopefully this can be reproduced in load testing, even if it requires using 4GB instead of 6 (as we recommend 4GB per container in all cases).

Scope:
@lukeheath, @rfairburn, I am suggesting this scope. Feel free to comment.

  • Create an environment and reproduce.
  • Find the root cause.
  • Suggest a fix (or fix it, if simple).
@rfairburn rfairburn added bug Something isn't working as documented ~released bug This bug was found in a stable release. customer-rocher labels Sep 21, 2024
@rfairburn
Contributor Author

Additional thought: as a practice, do we need to be passing in GOMEMLIMIT? If Go is not automatically detecting the container's memory limit, it may keep allocating because it thinks more memory is available. Setting the limit might help garbage collection kick in sooner and prevent this.
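For what it's worth, a minimal sketch of that idea, assuming the deployment exposes its container memory limit via an environment variable. GOMEMLIMIT is the standard Go 1.19+ soft-limit variable; FLEET_CONTAINER_MEMORY_BYTES and the 10% headroom figure are hypothetical, not anything we currently set:

```go
package main

import (
	"os"
	"runtime/debug"
	"strconv"
)

// setSoftMemoryLimit illustrates applying a Go soft memory limit at startup
// when GOMEMLIMIT isn't (or can't be) set in the container environment.
// FLEET_CONTAINER_MEMORY_BYTES is a hypothetical variable standing in for
// however the deployment exposes its cgroup memory limit; the 10% headroom
// is an assumption, not a measured recommendation.
func setSoftMemoryLimit() {
	if os.Getenv("GOMEMLIMIT") != "" {
		return // the runtime already honors GOMEMLIMIT from the environment
	}
	raw := os.Getenv("FLEET_CONTAINER_MEMORY_BYTES")
	limit, err := strconv.ParseInt(raw, 10, 64)
	if err != nil || limit <= 0 {
		return // no limit known; leave the runtime default in place
	}
	// Leave ~10% headroom so the GC works harder before the container's
	// hard limit is reached.
	debug.SetMemoryLimit(limit * 9 / 10)
}

func main() {
	setSoftMemoryLimit()
	// ... start the Fleet server as usual ...
}
```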

@sharon-fdm sharon-fdm added :release Ready to write code. Scheduled in a release. See "Making changes" in handbook. #g-endpoint-ops Endpoint ops product group :reproduce Involves documenting reproduction steps in the issue labels Sep 23, 2024
@sharon-fdm
Collaborator

sharon-fdm commented Sep 23, 2024

Assigning to Endpoint Ops team for further investigation.
Unassigning @lukeheath

@rfairburn
Contributor Author

This also happened to another cloud customer with 4GB of memory allocated; that environment had fewer than 3k hosts. Adding the customer label.

@lukeheath
Member

@sharon-fdm I'm elevating this to a P2 because we're seeing an upward trend in memory usage since 4.56.0.

@sharon-fdm
Collaborator

@lukeheath, we will try to assign someone this sprint, with some risk that we can't get to it.

@lukeheath
Member

@georgekarrv Getting this on your radar, as well, in case anyone on MDM has capacity.

@lukeheath
Member

@rfairburn Is this bug related to #22122?

@rfairburn
Contributor Author

Not that I know of.

@lukeheath lukeheath removed the P2 Prioritize as urgent label Sep 26, 2024
@sharon-fdm
Collaborator

Hey team! Please add your planning poker estimate with Zenhub @mostlikelee @iansltx @lucasmrod @getvictor

@sharon-fdm sharon-fdm added this to the 4.59.0-tentative milestone Oct 7, 2024
@iansltx
Member

iansltx commented Oct 8, 2024

@rfairburn Have we seen any further incidence of this issue since last reported?

@iansltx
Member

iansltx commented Oct 11, 2024

Looking at the original event (with a bit of extra context not posted here, for one of the tagged customers), this particular CPU/RAM usage spike appears to have been due to an enumeration script that pulled the following for all teams:

GET /api/v1/fleet/hosts?page=0&per_page=100&populate_software=True&team_id={id}

Those requests were highly concurrent, and we returned large payloads (50-105 MB) on those responses. My bet is that a large number of hosts, times a large number of software packages per host (with each package repeated per host in the response), times a large number of teams being hit with the same request pushed RAM usage over the edge, and probably spun the database a bit as well.
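For illustration, a rough reconstruction of that request pattern, sketched in Go; the base URL, token, and team IDs are placeholders, and the customer's actual script wasn't shared here:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"sync"
)

// Hypothetical reconstruction of the enumeration pattern: one request per
// team, all in flight at once, each asking for 100 hosts with software
// expanded. Base URL, token, and team IDs are placeholders.
func main() {
	const base = "https://fleet.example.com"
	const token = "API_TOKEN" // placeholder
	teamIDs := []int{1, 2, 3, 4, 5}

	var wg sync.WaitGroup
	for _, id := range teamIDs {
		wg.Add(1)
		go func(teamID int) {
			defer wg.Done()
			url := fmt.Sprintf(
				"%s/api/v1/fleet/hosts?page=0&per_page=100&populate_software=true&team_id=%d",
				base, teamID)
			req, _ := http.NewRequest("GET", url, nil)
			req.Header.Set("Authorization", "Bearer "+token)
			resp, err := http.DefaultClient.Do(req)
			if err != nil {
				return
			}
			defer resp.Body.Close()
			// In the original event, each of these responses was 50-105 MB.
			n, _ := io.Copy(io.Discard, resp.Body)
			fmt.Printf("team %d: %d bytes\n", teamID, n)
		}(id)
	}
	wg.Wait()
}
```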

Going to take a quick look at how this endpoint works/how data shape (host count, package count) affects response payload size. Will wait for confirmation that this is the most significant RAM issue we've seen (and wait for some more supporting info on the shape of the data we're working with) before digging super deep here, but my hunch is that we either need to optimize the existing hosts endpoint when software is expanded, or add an endpoint/query option that allows us to deliver this data more efficiently.

@iansltx
Member

iansltx commented Oct 11, 2024

Of note, I'll probably rescope this ticket to address a unique source of memory exhaustion if there are multiple different scenarios, so we can close issues as they're resolved without missing that there's still work to do elsewhere in the app.

@iansltx
Member

iansltx commented Oct 11, 2024

Taking a quick look at dogfood, our average host size in the response when including software is 250KB (25MB for 100 hosts). The average climbs to 340KB per host when looking at a smaller set that's a bit more Linux-y, and pulling a single random Linux host's software endpoint (slightly more verbose but same ballpark) hands back ~600KB when I grab all results rather than just the first 20.

So ~100 MB of data on 100 hosts is within the realm of reason.

Looking at the endpoint implementation itself, we have an N+1 query when retrieving software, and data structures are built independently for identical pieces of software across multiple hosts. Given that we want to maintain the same response format for this endpoint, my guess is that we could aggregate software/paths, reuse structs that hold the same data, and fan out to JSON as late as possible in the process. I'd have to profile memory usage to confirm that's actually where the problem is, and my bet is that JSON response building isn't streamed. Either way, we should be able to get RAM usage down here a bit, and also make this endpoint a lot lighter on the DB if we want to support this use case efficiently.
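To make that direction concrete, here's a rough sketch using simplified stand-in types (these are not Fleet's actual models or the endpoint's exact response framing): build each distinct software entry once, point every host at the shared value, and stream the encoding rather than materializing the whole response in memory.

```go
package hosts

import (
	"encoding/json"
	"net/http"
)

// Simplified stand-ins for Fleet's models; names and fields are illustrative.
type Software struct {
	ID      uint   `json:"id"`
	Name    string `json:"name"`
	Version string `json:"version"`
}

type HostRow struct {
	ID       uint
	Hostname string
}

// HostSoftwareRow is one row of a single joined query covering all requested
// hosts, replacing the per-host (N+1) software lookups.
type HostSoftwareRow struct {
	HostID     uint
	SoftwareID uint
	Name       string
	Version    string
}

type hostWithSoftware struct {
	ID       uint        `json:"id"`
	Hostname string      `json:"hostname"`
	Software []*Software `json:"software"`
}

// writeHosts shares one *Software value per distinct software ID across all
// hosts instead of allocating a copy per host, and encodes hosts one at a
// time so the full response is never buffered. The JSON array framing is
// written by hand so the output is still a single array.
func writeHosts(w http.ResponseWriter, hostRows []HostRow, rows []HostSoftwareRow) error {
	cache := make(map[uint]*Software)    // software ID -> shared struct
	byHost := make(map[uint][]*Software) // host ID -> its software slice
	for _, r := range rows {
		sw, ok := cache[r.SoftwareID]
		if !ok {
			sw = &Software{ID: r.SoftwareID, Name: r.Name, Version: r.Version}
			cache[r.SoftwareID] = sw
		}
		byHost[r.HostID] = append(byHost[r.HostID], sw)
	}

	enc := json.NewEncoder(w)
	if _, err := w.Write([]byte("[")); err != nil {
		return err
	}
	for i, h := range hostRows {
		if i > 0 {
			if _, err := w.Write([]byte(",")); err != nil {
				return err
			}
		}
		out := hostWithSoftware{ID: h.ID, Hostname: h.Hostname, Software: byHost[h.ID]}
		if err := enc.Encode(out); err != nil {
			return err
		}
	}
	_, err := w.Write([]byte("]"))
	return err
}
```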

@iansltx iansltx removed customer-honoria :reproduce Involves documenting reproduction steps in the issue labels Oct 15, 2024
@iansltx iansltx changed the title Fleet can run out of memory even with 6GB Allocated Fleet runs out of memory when listing software in the hosts collection endpoint for several concurrent requests involving hosts with large software lists Oct 15, 2024
@iansltx
Member

iansltx commented Oct 15, 2024

Based on internal discussions, rescoping this issue to the one we have documentation/logs readily available for, which is the one I'm covering in my comments. I'm sure I'd be able to repro this as a spike in RAM usage (and probably CPU) by grabbing e.g. 500-1000 hosts plus software in a single request (vs. concurrently requesting 100 hosts per team across multiple teams). Just a question of prioritization given that lowering concurrency on the one-off script would have worked around the issue.

If we wind up with other OOMs from this issue, that can push priority here upward. If we wind up with other OOMs for other reasons, those can be ticketed separately, as while we're probably using more RAM now due to e.g. additional MDM crons, a specific OOM will show us where the low-hanging fruit is.

@lucasmrod
Member

Based on internal discussions, rescoping this issue to the one we have documentation/logs readily available for, which is the one I'm covering in my comments.

Sorry, I missed you here (which comments?). Are we updating documentation? (Am ok with that, just checking.)

@iansltx
Member

iansltx commented Oct 15, 2024

Chatted with @lucasmrod on this and yep, makes sense to document that this particular endpoint may be RAM-intensive, so users should tread lightly with page size/concurrency when asking for embedded software. Will get a PR up for that here in a bit.
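As a companion to that documentation, the guidance could be illustrated with a client-side pattern like the following sketch: page sequentially with a modest per_page instead of one huge or many concurrent requests. The base URL, token, per_page of 25, and the "hosts" response field name are assumptions for illustration, not tested recommendations.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

// fetchTeamHosts pages sequentially with a small per_page so no single
// response (or burst of concurrent responses) forces the server to build an
// enormous payload in memory.
func fetchTeamHosts(base, token string, teamID int) error {
	for page := 0; ; page++ {
		url := fmt.Sprintf(
			"%s/api/v1/fleet/hosts?page=%d&per_page=25&populate_software=true&team_id=%d",
			base, page, teamID)
		req, err := http.NewRequest("GET", url, nil)
		if err != nil {
			return err
		}
		req.Header.Set("Authorization", "Bearer "+token)
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			return err
		}
		body, err := io.ReadAll(resp.Body)
		resp.Body.Close()
		if err != nil {
			return err
		}
		var out struct {
			Hosts []json.RawMessage `json:"hosts"`
		}
		if err := json.Unmarshal(body, &out); err != nil {
			return err
		}
		if len(out.Hosts) == 0 {
			return nil // no more pages
		}
		// ... process out.Hosts before requesting the next page ...
	}
}

func main() {
	// Placeholder values; substitute a real Fleet URL, token, and team ID.
	_ = fetchTeamHosts("https://fleet.example.com", "API_TOKEN", 1)
}
```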

@rfairburn
Contributor Author

It looks like this may have just happened on dogfood while the vuln dashboard was iterating over hosts.

[Screenshot: memory spike at the time of the container crash]

The container crashed as it hit that memory spike, so it either ran out of memory or had another error at the time, but memory usage seems the most likely candidate.

Hopefully this is already addressed by what is above. I can't really isolate this to a specific HTTP request, but figured it needed inclusion.

@iansltx
Member

iansltx commented Nov 4, 2024

After a bit more investigation, this looks to be due to high-concurrency calls on /hosts/{id}, which is being tracked as part of #23078. @rfairburn feel free to keep dropping OOM reports here as long as this is open, and I'll triage them into the proper ticket, either creating new ones or tacking them onto existing tickets as needed. Acceptance criteria for this ticket will remain constrained to the one endpoint/query-string combo mentioned in the (edited) ticket title, so there's a chance that this gets closed vs. becoming an ocean-boiling epic :)

@mostlikelee
Contributor

A customer also reported this issue when using large (or no) pagination sizes on /software/versions.

@iansltx
Member

iansltx commented Nov 9, 2024

So, #23078 and this ticket have actually converged (I just caught up on the customer back-and-forth there). The original scope on this one was determined to be "hosts collection endpoint with software set to true," and after various clarifications/discussions #23078 is now scoped as the same thing. /software/versions is its own mess and is now out of scope for the other issue, and high-concurrency calls to /hosts/{id} are now also out of scope for the other issue.

To deduplicate and avoid things getting lost, I'll get bugs filed for /hosts/{id} (currently only an issue for dogfood) and software/versions (which will probably come up again, potentially sooner rather than later, with the tagged customer), then close this one.

@iansltx
Member

iansltx commented Nov 10, 2024

Issues have been split out. Closing this as it's now effectively a duplicate of #23078, which is a P2 so will land this sprint.

@iansltx iansltx closed this as completed Nov 10, 2024
@fleet-release
Contributor

Large software lists,
Now unburdened, Fleet soars high,
Cloud city breathes free.
