-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[NPM] Use IMDSv2 for vpc_id (network_id) lookups #29027
Conversation
Regression DetectorRegression Detector ResultsRun ID: 603dfbd1-cd4e-4222-aa5a-c369e107d910 Metrics dashboard Target profiles Baseline: 6b60c2c Performance changes are noted in the perf column of each table:
No significant changes in experiment optimization goalsConfidence level: 90.00% There were no significant changes in experiment optimization goals at this confidence level and effect size tolerance.
|
perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
---|---|---|---|---|---|---|
➖ | uds_dogstatsd_to_api_cpu | % cpu utilization | +1.10 | [+0.31, +1.89] | 1 | Logs |
➖ | basic_py_check | % cpu utilization | +0.76 | [-1.96, +3.48] | 1 | Logs |
➖ | otel_to_otel_logs | ingress throughput | +0.09 | [-0.72, +0.90] | 1 | Logs |
➖ | uds_dogstatsd_to_api | ingress throughput | +0.00 | [-0.00, +0.00] | 1 | Logs |
➖ | tcp_dd_logs_filter_exclude | ingress throughput | -0.00 | [-0.01, +0.01] | 1 | Logs |
➖ | pycheck_lots_of_tags | % cpu utilization | -0.52 | [-3.07, +2.03] | 1 | Logs |
➖ | idle | memory utilization | -0.71 | [-0.75, -0.67] | 1 | Logs |
➖ | file_tree | memory utilization | -1.19 | [-1.31, -1.08] | 1 | Logs |
➖ | tcp_syslog_to_blackhole | ingress throughput | -2.71 | [-15.07, +9.65] | 1 | Logs |
Bounds Checks
perf | experiment | bounds_check_name | replicates_passed |
---|---|---|---|
✅ | idle | memory_usage | 10/10 |
Explanation
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
-
Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
-
Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
-
Its configuration does not mark it "erratic".
Test changes on VMUse this command from test-infra-definitions to manually test this PR changes on a VM: inv create-vm --pipeline-id=43752296 --os-family=ubuntu Note: This applies to commit 3e0bc69 |
pkg/util/ec2/network_test.go
Outdated
w.WriteHeader(http.StatusNotFound) | ||
switch r.Method { | ||
case http.MethodPut: // token request | ||
io.WriteString(w, "AQAAAFKw7LyqwVmmBMkqXHpDBuDWw2GnfGswTHi2yiIOGvzD7OMaWw==") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can you add a comment explaining where you got this/confirming that this is in fact safe to check in
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I copied this from another test :) I will refactor this to a constant so it's clearer. I'm guessing it's a fake token, these tests don't need a real one, just any string (since the IMDS server is local as well and doesn't do auth)
if err != nil { | ||
return "", err | ||
} | ||
return id, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should cache the id here to avoid another call to the metadata endpoint.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe either way this will only get called once, since this is hit via the Init()
method of the PA check which happens only at startup
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
although if multiple checks are calling this it will indeed be called multiple times, so we can save 2 RPCs by caching this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
upon further reflection I think I would rather not cache this, because:
- we are maintaining the same call pattern as before
- we are only calling max 3 times on startup (if process/container/network checks are all enabled)
- anecdotally I have seen a failure to fetch the IMDSv2 token sometimes when testing. I am going to follow up with ASC about this, but I am wary to not populate this value if the first request fails
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 for ASC owned files
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
approved for (1) wkit file
/merge |
🚂 MergeQueue: pull request added to the queue The median merge time in Use |
What does this PR do?
new EC2 instances created will now have IMDSv1 disabled by default. This PR shifts the process-agent / system-probe usage of the IMDS endpoint for vpc_id to v2. Additionally, because the IMDSv2 endpoint validates the TCP hop count of the request, we fall back to an explicit call to system-probe which makes the request using the rootNetNS. In containerized environments this will ensure the hop count is 1.
Motivation
Additional Notes
Testing
This was tested on multiple EC2 instances in the sandbox environment. They both had IMDSv2 required. One had a hop count limit of 1, the other 2.
The agent was run directly on both hosts via a
deb
package. This worked on both instances without the fallback to system-probe's new endpointNext it was run via docker on the host:
This simulated a more typical containerized environment. In this case on the system with a hop count limit of 2, nothing changed and the initial IMDS request works. On the instance with the hop count limit of 1, the fallback is triggered as the initial request fails. The fallback does work and there are no error logs indicating failure to fetch the vpc_id in sysprobe or in process-agent.
Possible Drawbacks / Trade-offs
Describe how to test/QA your changes