Sustained High CPU when scaling loadtest #22367

Open
mostlikelee opened this issue Sep 25, 2024 · 1 comment
Assignees
Labels
  • ~eng-priority: Engineering-initiated story that was prioritized.
  • ~engineering-initiated: Engineering-initiated story, such as a bug, refactor, or contributor experience improvement.
  • :product: Product Design department (shows up on 🦢 Drafting board)
  • story: A user story defining an entire feature

Comments

@mostlikelee
Contributor

Goal

User story
As an engineer running Fleet loadtests
I want to create a stable large loadtest environment in a short amount of time
so that I can receive loadtest feedback quickly and at lower cost.

Context

  • Requestor(s): @mostlikelee
  • Product designer: _________________________

Using the existing script to scale up 100K osquery-perf agents over ~30 minutes resulted in sustained high Fleet CPU and DB reader CPU. As a workaround, I updated the script to scale up over 2 hours, which succeeded. The original cadence worked in the past, so there may be a performance regression when adding too many agents too quickly to Fleet. I suspect this is not affecting large customers because they are not adding agents this quickly, but it has a high impact on test velocity.
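For illustration, here is a minimal sketch of the staggered ramp-up idea in Go. This is not the actual loadtest script: the `startAgent` helper, batch size, and the 2-hour ramp window are assumptions made for the sketch.

```go
// Sketch: enroll simulated agents in batches spread over a ramp window,
// rather than all at once, to avoid sustained high CPU on Fleet and the
// DB reader. Values and the startAgent helper are hypothetical.
package main

import (
	"fmt"
	"time"
)

func main() {
	const (
		totalAgents  = 100_000
		batchSize    = 1_000
		rampDuration = 2 * time.Hour // workaround cadence; ~30 min triggered the issue
	)

	batches := totalAgents / batchSize
	pause := rampDuration / time.Duration(batches)

	for i := 0; i < batches; i++ {
		for j := 0; j < batchSize; j++ {
			go startAgent(i*batchSize + j) // launch one simulated agent
		}
		fmt.Printf("started batch %d/%d (%d agents total)\n", i+1, batches, (i+1)*batchSize)
		time.Sleep(pause) // spread enrollment load across the ramp window
	}
	select {} // keep the simulated agents running
}

// startAgent is a placeholder for whatever the real script does to launch
// an osquery-perf instance pointed at the Fleet server under test.
func startAgent(id int) {
	_ = id
}
```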

Changes

Product

  • UI changes: TODO
  • CLI (fleetctl) usage changes: TODO
  • YAML changes: TODO
  • REST API changes: TODO
  • Fleet's agent (fleetd) changes: TODO
  • Activity changes: TODO
  • Permissions changes: TODO
  • Changes to paid features or tiers: TODO
  • Other reference documentation changes: TODO
  • Once shipped, requester has been notified

Engineering

  • Feature guide changes: TODO
  • Database schema migrations: TODO
  • Load testing: TODO

ℹ️  Please read this issue carefully and understand it. Pay special attention to UI wireframes, especially "dev notes".

QA

Risk assessment

  • Requires load testing: TODO
  • Risk level: Low / High TODO
  • Risk description: TODO

Manual testing steps

  1. Step 1
  2. Step 2
  3. Step 3

Testing notes

Confirmation

  1. Engineer (@____): Added comment to user story confirming successful completion of QA.
  2. QA (@____): Added comment to user story confirming successful completion of QA.
@mostlikelee added the story and ~engineering-initiated labels on Sep 25, 2024
@lukeheath added the :product and ~eng-priority labels on Sep 25, 2024
@lukeheath assigned sharon-fdm and unassigned lukeheath on Sep 25, 2024
@lukeheath
Member

@mostlikelee Thanks for filing this. I am prioritizing to the drafting board and assigning to @sharon-fdm for estimation.

so there may be a performance regression when adding too many agents too quickly to Fleet

I'm seeing a series of new performance issue reports come through since 4.56.0 (#22291, #22122) that seem related. It's important we dig in and figure out why this is happening before it results in a production issue.
