Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Fleet Detection Plugin #6151

Merged
merged 17 commits into from
Dec 3, 2024
Merged

Add Fleet Detection Plugin #6151

merged 17 commits into from
Dec 3, 2024

Conversation

jonathanrainer
Copy link
Contributor

Adds an initial plugin, that loads at startup and emits metrics for three simple cases: cpus, cpu_freq and total_memory. Not sure this is the correct approach for this, especially as this plugin will expand over time so willing to take any pointers in that regard.

Tested via using OpenTelemetry Collector with the following router config and OTEL config

telemetry:
  apollo:
    experimental_otlp_endpoint: "http://0.0.0.0:4317"
  instrumentation:
    spans:
      mode: "spec_compliant"
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  debug:
    verbosity: detailed

processors:
  batch:
  filter:
    metrics:
      include:
        match_type: regexp
        metric_names:
          - apollo.router.instance.*

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [filter]
      exporters: [debug]

And produced the following

Screenshot 2024-10-15 at 14 09 42


Checklist

Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.

  • Changes are compatible1
  • Documentation2 completed
  • Performance impact assessed and acceptable
  • Tests added and passing3
    • Unit Tests
    • Integration Tests
    • Manual Tests

Notes

Footnotes

  1. It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this.

  2. Configuration is an important part of many changes. Where applicable please try to document configuration examples.

  3. Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions.

@jonathanrainer jonathanrainer requested review from a team as code owners October 15, 2024 13:11
@svc-apollo-docs
Copy link
Collaborator

svc-apollo-docs commented Oct 15, 2024

✅ Docs Preview Ready

No new or changed pages found.

This comment has been minimized.

@router-perf
Copy link

router-perf bot commented Oct 15, 2024

CI performance tests

  • connectors-const - Connectors stress test that runs with a constant number of users
  • const - Basic stress test that runs with a constant number of users
  • demand-control-instrumented - A copy of the step test, but with demand control monitoring and metrics enabled
  • demand-control-uninstrumented - A copy of the step test, but with demand control monitoring enabled
  • enhanced-signature - Enhanced signature enabled
  • events - Stress test for events with a lot of users and deduplication ENABLED
  • events_big_cap_high_rate - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity
  • events_big_cap_high_rate_callback - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity using callback mode
  • events_callback - Stress test for events with a lot of users and deduplication ENABLED in callback mode
  • events_without_dedup - Stress test for events with a lot of users and deduplication DISABLED
  • events_without_dedup_callback - Stress test for events with a lot of users and deduplication DISABLED using callback mode
  • extended-reference-mode - Extended reference mode enabled
  • large-request - Stress test with a 1 MB request payload
  • no-tracing - Basic stress test, no tracing
  • reload - Reload test over a long period of time at a constant rate of users
  • step-jemalloc-tuning - Clone of the basic stress test for jemalloc tuning
  • step-local-metrics - Field stats that are generated from the router rather than FTV1
  • step-with-prometheus - A copy of the step test with the Prometheus metrics exporter enabled
  • step - Basic stress test that steps up the number of users over time
  • xlarge-request - Stress test with 10 MB request payload
  • xxlarge-request - Stress test with 100 MB request payload

apollo-router/src/plugins/fleet_detector.rs Outdated Show resolved Hide resolved
apollo-router/src/plugins/fleet_detector.rs Outdated Show resolved Hide resolved
apollo-router/src/plugins/fleet_detector.rs Outdated Show resolved Hide resolved
@jonathanrainer jonathanrainer force-pushed the jr/task/FLEET-19 branch 2 times, most recently from 7e971cf to cad95b7 Compare October 16, 2024 06:23
Copy link
Contributor

@bnjjj bnjjj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last but not least, could you make sure to make a perf test with this plugin enabled. We have a benchmark system in place called router-scale (check docs in our own confluence space), it would be great to create a flamegraph with these benchmarks to make sure it's not something that takes a lot of resources. Feel free to ask if you need help

Comment on lines 3565 to 3590
"IntrospectionMode": {
"description": "Which implementation of GraphQL schema introspection to use, if enabled",
"oneOf": [
{
"description": "Use the new Rust-based implementation.",
"enum": [
"new"
],
"type": "string"
},
{
"description": "Use the old JavaScript-based implementation.",
"enum": [
"legacy"
],
"type": "string"
},
{
"description": "Use Rust-based and Javascript-based implementations side by side, logging warnings if the implementations disagree.",
"enum": [
"both"
],
"type": "string"
}
]
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it part of this PR ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we kept having CI failures because of changes to the JSON schema for the config I think? And I thought this would fix it but it doesn't! Any ideas much appreciated!

apollo-router/src/plugins/fleet_detector.rs Outdated Show resolved Hide resolved
apollo-router/src/plugins/fleet_detector.rs Outdated Show resolved Hide resolved
@jonathanrainer
Copy link
Contributor Author

jonathanrainer commented Oct 29, 2024

Ok @bnjjj I've run the perf tests, building the router from this branch and further enabling metrics so that we can see these now being emitted. I'll upload the top file for the router because that's probably the most instructive, I see the memory usage goes up by about 200Mi over the entire course of the test but I don't have any baselines so its hard to compare. I'll also attach the router logs. That doesn't seem like a terrible thing to me but it's hard to compare without a baseline.

I have a few further questions as well to push us forward on this:

  1. We want to ensure that customers can turn this off, and we were thinking of re-using the APOLLO_TELEMETRY_DISABLED env var, are you folks happy with that?
  2. We'll need to add documentation around what's being collected and how to turn it off, should we do that on this branch too for documentation or is it better to do that separately?
  3. The CI on this branch seems to be failing because of generating the config schema, and try as I might I can't see how that might be fixed, any pointers?

otel.log
router.log
top.router.txt

@jonathanrainer jonathanrainer requested a review from bnjjj October 29, 2024 10:28
@jonathanrainer jonathanrainer force-pushed the jr/task/FLEET-19 branch 2 times, most recently from 25ce977 to 492ab28 Compare November 1, 2024 10:50
Copy link
Contributor

@bnjjj bnjjj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jonathanrainer to make a comparison could you just run the same benchmark on dev and you'll get something to compare with this branch

Comment on lines 65 to 60
// We have to store a reference to the gauge otherwise it will be dropped once the plugin is
// initialised, even though it still has data to emit
freq_gauge: ObservableGauge<u64>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BrynCooke Could you confirm it will handle the hot reload properly ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I believe it will, but do test

@jonathanrainer
Copy link
Contributor Author

jonathanrainer commented Nov 6, 2024

@bnjjj Ah yes, apologies should have thought of that, have done that below and redid the tests I posted above just so the comparison is easier. From looking at the memory figures it looks like the plugin does increase memory but it's not the constant increase we were seeing before, also the baseline it starts from in the test appears higher in the branched case which presumably isn't plugin related because it won't be running in the early part of the test (I imagine). Have attached the logs again, let me know if you think there's anything we need to worry about

Dev Files
otel_dev.log
router_dev.log
top.router_dev.txt

Branch Files
otel_branch.log
router_branch.log
top.router_branch.txt

@jonathanrainer jonathanrainer requested a review from bnjjj November 6, 2024 10:34
Copy link
Contributor

@bnjjj bnjjj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jonathanrainer LGTM

Copy link
Contributor

@BrynCooke BrynCooke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CPU and memory should be gauges. We also talked about activate.

@abernix
Copy link
Member

abernix commented Nov 11, 2024

@BrynCooke I think this might be ready for another review from you?

if quota == "-1" {
system_cpus
} else {
quota.parse::<u64>().unwrap() / period.parse::<u64>().unwrap()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to make sure, are we feeling confident about these unwrap ?

} else {
// If it's not max then divide the two to get an integer answer
let (a, b) = readings.split_once(' ').unwrap();
a.parse::<u64>().unwrap() / b.parse::<u64>().unwrap()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to make sure, are we feeling confident about these unwrap ?

Adds an initial plugin, that loads at startup and emits metrics
for three simple cases: cpus, cpu_freq and total_memory.
Also improve how we handle refreshing the System
object, rather than doing it in either the callback
or the async task, contain that within a struct
and do it there.
Ensures that gauges will survive hot reload
1. activate is made non-async. It must never fail and it must complete. async fns can halt execution before completion.
2. The spawned thread and channel for fleet detector is removed. There's no need for these. Gauges will be called periodically for export.
3. Telemetry is converted to PrivatePlugin to allow uniform calls to activate.
@BrynCooke
Copy link
Contributor

@jonathanrainer I've pushed a commit that makes things simpler. In particular activate is now handled uniformly and the signature is changed to ensure that activate will complete.

Please check that you are happy with the change and also do some manual testing to ensure that things still work.

@jonathanrainer jonathanrainer merged commit e8cf512 into dev Dec 3, 2024
13 checks passed
@jonathanrainer jonathanrainer deleted the jr/task/FLEET-19 branch December 3, 2024 09:37
@BrynCooke BrynCooke mentioned this pull request Dec 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants