fix(vmm): call KVMCLOCK_CTRL when pausing #4460

Merged (8 commits) on Oct 24, 2024

Conversation

kalyazin
Contributor

@kalyazin kalyazin commented Feb 19, 2024

Changes

Call KVM_KVMCLOCK_CTRL when pausing vCPUs. This lets the guest kernel's soft lockup watchdog know that it was the hypervisor that paused the vCPUs, so it does not trigger an exception.

Fixes: #1859

Reason

This avoids a guest kernel panic on the resume path caused by soft lockup detection.
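
As a rough illustration of the change described above, the sketch below derives the `KVM_KVMCLOCK_CTRL` request number (defined in `<linux/kvm.h>` as `_IO(KVMIO, 0xad)`) and issues it from a pause hook. The `VcpuIoctl` trait, `on_vcpu_paused`, and `RecordingVcpu` are hypothetical names introduced only for this example; they are not Firecracker's actual types, and the real code goes through a vCPU file descriptor rather than a recording stub.

```rust
/// ioctl "type" byte shared by KVM requests (`KVMIO` in <linux/kvm.h>).
const KVMIO: u64 = 0xAE;

/// Encode a no-argument ioctl request the way the kernel's `_IO(type, nr)`
/// macro does on x86_64, where the direction and size bits are zero.
const fn io(ty: u64, nr: u64) -> u64 {
    (ty << 8) | nr
}

/// `KVM_KVMCLOCK_CTRL` is `_IO(KVMIO, 0xad)`.
const KVM_KVMCLOCK_CTRL: u64 = io(KVMIO, 0xAD);

/// Hypothetical abstraction over a vCPU file descriptor that can issue
/// ioctls which take no argument.
trait VcpuIoctl {
    fn ioctl_no_arg(&mut self, request: u64) -> Result<(), i32>;
}

/// Hypothetical pause hook: tell the guest that the host stopped this vCPU,
/// so the guest watchdog can discount the time spent paused.
fn on_vcpu_paused<V: VcpuIoctl>(vcpu: &mut V) -> Result<(), i32> {
    vcpu.ioctl_no_arg(KVM_KVMCLOCK_CTRL)
}

/// Stub vCPU that records the requests it receives instead of calling
/// into the kernel.
struct RecordingVcpu {
    requests: Vec<u64>,
}

impl VcpuIoctl for RecordingVcpu {
    fn ioctl_no_arg(&mut self, request: u64) -> Result<(), i32> {
        self.requests.push(request);
        Ok(())
    }
}
```

Calling `on_vcpu_paused` on a `RecordingVcpu` appends `0xAEAD` to its `requests` vector, which is a convenient way to check the pause path without a real KVM device.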

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

  • If a specific issue led to this PR, this PR closes the issue.
  • The description of changes is clear and encompassing.
  • Any required documentation changes (code and docs) are included in this
    PR.
  • [ ] API changes follow the Runbook for Firecracker API changes.
  • User-facing changes are mentioned in CHANGELOG.md.
  • All added/changed functionality is tested.
  • [ ] New TODOs link to an issue.
  • Commits meet
    contribution quality standards.

  • This functionality cannot be added in rust-vmm.


codecov bot commented Feb 19, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.08%. Comparing base (0b9cf39) to head (d496d05).
Report is 6 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #4460   +/-   ##
=======================================
  Coverage   84.07%   84.08%           
=======================================
  Files         251      251           
  Lines       28052    28060    +8     
=======================================
+ Hits        23586    23594    +8     
  Misses       4466     4466           
Flag Coverage Δ
5.10-c5n.metal 84.71% <100.00%> (-0.01%) ⬇️
5.10-m5n.metal 84.69% <100.00%> (-0.01%) ⬇️
5.10-m6a.metal 84.00% <100.00%> (-0.01%) ⬇️
5.10-m6g.metal 80.70% <100.00%> (+<0.01%) ⬆️
5.10-m6i.metal 84.69% <100.00%> (-0.01%) ⬇️
5.10-m7g.metal 80.70% <100.00%> (+<0.01%) ⬆️
6.1-c5n.metal 84.71% <100.00%> (-0.01%) ⬇️
6.1-m5n.metal 84.69% <100.00%> (-0.01%) ⬇️
6.1-m6a.metal 84.00% <100.00%> (-0.01%) ⬇️
6.1-m6g.metal 80.70% <100.00%> (+<0.01%) ⬆️
6.1-m6i.metal 84.69% <100.00%> (+<0.01%) ⬆️
6.1-m7g.metal 80.70% <100.00%> (+<0.01%) ⬆️


@kalyazin kalyazin force-pushed the dbg_kvmclock branch 3 times, most recently from 3294d86 to a8ad920 Compare February 19, 2024 16:19
@bchalios
Contributor

I created a test that should trigger the guest kernel to detect lockups. I will let the CI run once so we can make sure that the test works on our CI, and then reapply the commits that call the KVM_KVMCLOCK_CTRL ioctl when we pause vCPUs. These should make the failure go away.

@bchalios
Contributor

The test does trigger the lockup: https://buildkite.com/firecracker/firecracker-pr/builds/11546.

@bchalios bchalios force-pushed the dbg_kvmclock branch 2 times, most recently from ea28a13 to 18b2f34 Compare October 22, 2024 15:03
@bchalios
Contributor

I've re-pushed Nikita's commits that add the call to the KVM_KVMCLOCK_CTRL ioctl. Tests should be fixed now.

@bchalios bchalios force-pushed the dbg_kvmclock branch 5 times, most recently from 1d6683a to 886eb2e Compare October 23, 2024 10:30
@bchalios bchalios marked this pull request as ready for review October 23, 2024 11:02
@bchalios bchalios added the Status: Awaiting review Indicates that a pull request is ready to be reviewed label Oct 23, 2024
@bchalios
Contributor

The last commit is not related to this PR per se. However, it fixes an intermittent CI issue that I was hitting in this PR's pipeline runs, and I was too lazy to open a separate PR.

@bchalios bchalios changed the title [WIP] fix(vmm): call KVMCLOCK_CTRL when pausing fix(vmm): call KVMCLOCK_CTRL when pausing Oct 23, 2024
src/vmm/src/logger/metrics.rs (review thread resolved)
CHANGELOG.md (review thread outdated, resolved)
bchalios and others added 5 commits October 23, 2024 14:06
Launch a script in the guest that continuously calls `ls -R /` and on
the host side, continuously pause and resume the microVM trying to cause
an RCU soft lockup.

Signed-off-by: Babis Chalios <bchalios@amazon.es>
This is to avoid guest kernel panic on resume path
due to softlockup detection.

Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
Signed-off-by: Babis Chalios <bchalios@amazon.es>
This is to be able to call KVMCLOCK_CTRL ioctl in
a vCPU thread.

Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
Signed-off-by: Babis Chalios <bchalios@amazon.es>
Add a metric that counts KVM_KVMCLOCK_CTRL ioctl call failures.
These failures might happen because the guest simply doesn't use the KVM
clock. In these cases, an increase in the metric is expected. Otherwise,
non-zero values indicate that something is actually going wrong.

Signed-off-by: Babis Chalios <bchalios@amazon.es>
Mention that we now call KVM_KVMCLOCK_CTRL ioctl on x86_64 after pausing
vCPUs. Clarify that failures to call this ioctl are not fatal and that
we log the failure and increase a metric when these happen.

Signed-off-by: Babis Chalios <bchalios@amazon.es>
`pidfd_open` will fail if there is no process with the requested PID.
According to `man pidfd_open(2)`, it will return `EINVAL` when `PID` is
not valid and `ESRCH` when the `PID` does not exist. Until now, we were
checking only for the latter condition. Change the logic to also handle
the former, which materializes as an OSError exception with
errno == EINVAL.

Signed-off-by: Babis Chalios <bchalios@amazon.es>
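
The commit messages above say that a failed KVM_KVMCLOCK_CTRL call is not fatal: the failure is logged and a metric is incremented. A minimal sketch of that pattern follows, assuming a hypothetical `PauseMetrics` struct and `notify_guest_paused` helper (neither is Firecracker's real API) and modelling the ioctl as a closure that returns an errno on failure. The errno value 22 (`EINVAL`) used in the usage note is only an illustrative choice.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical metrics holder: counts failed KVM_KVMCLOCK_CTRL calls.
struct PauseMetrics {
    kvmclock_ctrl_fails: AtomicU64,
}

/// Hypothetical helper run after a vCPU has been paused.
/// `kvmclock_ctrl` stands in for the real ioctl call and returns the errno
/// on failure. A failure is logged and counted, but never propagated,
/// so pausing itself still succeeds.
fn notify_guest_paused<F>(kvmclock_ctrl: F, metrics: &PauseMetrics)
where
    F: FnOnce() -> Result<(), i32>,
{
    if let Err(errno) = kvmclock_ctrl() {
        // Expected when the guest does not use the KVM clock.
        eprintln!("KVM_KVMCLOCK_CTRL call failed (errno {})", errno);
        metrics.kvmclock_ctrl_fails.fetch_add(1, Ordering::Relaxed);
    }
}
```

With a `PauseMetrics` whose counter starts at zero, `notify_guest_paused(|| Ok(()), &metrics)` leaves the counter untouched, while `notify_guest_paused(|| Err(22), &metrics)` raises it by one and returns normally.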
Contributor Author

@kalyazin kalyazin left a comment

Unfortunately I can't approve the change, apparently because the PR originated from me, but it does LGTM.

@bchalios bchalios merged commit 5f73d2b into firecracker-microvm:main Oct 24, 2024
6 of 7 checks passed
Labels
Status: Awaiting review Indicates that a pull request is ready to be reviewed
Development

Successfully merging this pull request may close these issues.

Look into notifying the guest when vcpus are paused
4 participants