-
Notifications
You must be signed in to change notification settings - Fork 174
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
host-sp-comms: do not reboot host on boot failure #1618
base: master
Are you sure you want to change the base?
Conversation
This is in response to #1613. If the host reports a boot failure (such as a phase mismatch, but not limited to that reason) simply rebooting it blindly is unlikely to fix the problem. We need intervention from a higher power (the control plane) to fix the issue. So to avoid a bootloop that wastes energy and overwrites our circular buffers with spam, this change alters the response to the IPCC Request Reboot message if received shortly after a Boot Failed message -- it is interpreted as a power off request. Fixes #1614.
let response = match request { | ||
HostToSp::_Unused => { | ||
Some(SpToHost::DecodeFailure(DecodeFailureReason::Deserialize)) | ||
} | ||
HostToSp::RequestReboot => { | ||
action = Some(Action::RebootHost); | ||
if core::mem::take(&mut self.host_boot_just_failed) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The use of mem::take
here is very cute! But, IIUC, since we clear the flag later if action
is Some
, we don't actually need to clear it when we read the flag's value here, do we? Since both branches set action
to Some
regardless?
Up to you whether this actually ought to change, but at a first glance it kind of appeared that this was the only case where we clear the flag, so it was a bit less obvious that any branch where action
is Some
clears the flag (despite the comment). 🤷♀️
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I wound up doing both here to make it really really unlikely that we forget to clear the flag. Arguably unnecessary. What do you think would be clearest?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think maybe leaving it as is, but commenting that this also gets cleared because we've set action = Some(...)
? Honestly, I think it's not really a big deal --- mostly, I wanted to make sure my understanding was correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fact that it threw you suggests that I need to make it more clear, so I'll see about adding some comments.
|
||
// Record that this failure happened, which will change the | ||
// behavior of the reboot request we expect to follow shortly. | ||
self.host_boot_just_failed = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we care about the reason
(which we're ..
'ing out in the match arm), or should all boot failure reasons be treated equally? Maybe a question for @citrus-it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be my strong preference to have the same behavior regardless of reported reason, but I'm open to changing that if somebody wants to make an argument!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't ever expect to see any boot failures apart from in the lab where they're triggered pretty often in my experience. It just takes somebody to do something like configure net boot and forget to start the boot server on the butler, or start it with a typo in the options. We should probably send a heads-up about this new behaviour.
It is very unlikely that any of the boot failure reasons indicate that there would be a different result if retried, I think they should all be treated equally as requiring intervention.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, I did just hit this exact case the other day personally because I wasn't fast enough in the lab with net boot/partially distracted. Is there something that makes it very clear from the sequencer state or related that we're in this case as a human looking at the system?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The behavior as implemented in this commit won't affect the sequencer state -- the machine will just power itself down on boot failure and wait to be turned back on. We don't currently have a great way of exposing "the last problem with this was _____" data in production builds. As a user, what sort of experience would you hope for?
action = Some(Action::RebootHost); | ||
if core::mem::take(&mut self.host_boot_just_failed) { | ||
// Do _not_ reboot. Reinterpret this as a poweroff request. | ||
action = Some(Action::PowerOffHost); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy to look at this more tomorrow, but we might need to address the TODO
I left in handle_jefe_notification
. There may be a sequence of reboot operations where we still reboot even if we set this action.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good to know, let's talk tomorrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I have convinced myself this is okay (where "this" is really "the handling of self.reboot_state
"):
self.reboot_state
starts outNone
- It can only go from
None
toSome(_)
inside a call topower_off_host(true)
- Once it goes to
Some(_)
, it will eventually cause the host to come back on through one of these paths:
- If
power_off_host(true)
succeeds in asking the sequencer to go to A2, it's set toSome(WaitingForA2)
. Once we get a notification from jefe that we've entered A2, it will change toSome(WaitingInA2RebootDelay)
and start theTimers::WaitingInA2ToReboot
timer. Once that timer fires, we'll enterhandle_reboot_waiting_in_a2_timer
, set the state to A0, and setself.reboot_state
back to `None. - If
power_off_host(true)
fails to ask the sequencer to go to A2 because we're already in A2, it's set toSome(WaitingInA2RebootDelay)
and we start theTimers::WaitingInA2ToReboot
timer. Once that timer fires, we'll enterhandle_reboot_waiting_in_a2_timer
, set the state to A0, and setself.reboot_state
back to `None.
power_off_host(true)
is only called in two places: if we get a jefe notification that we've gone to A0Reset
, or if the host sends us HostToSp::RequestReboot
. If either of those happened and before we go through the above timer-based process (~5 seconds) to reboot the host we landed here with a host boot fail, the PowerOffHost
would be effectively ignored because we'd already be in the process of rebooting the host. But that doesn't seem likely, and if it happens it's maybe fine to honor the "first request" that triggered the reboot?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, if the host sends back-to-back "power off" and "reboot" requests, we'll behave ... rather strangely. I wonder if it might be worth adding code to stop listening to IPCC entirely starting at any reboot/powerdown request and continuing up until we believe the host to have been powered back on.
This is in response to #1613. If the host reports a boot failure (such as a phase mismatch, but not limited to that reason) simply rebooting it blindly is unlikely to fix the problem. We need intervention from a higher power (the control plane) to fix the issue.
So to avoid a bootloop that wastes energy and overwrites our circular buffers with spam, this change alters the response to the IPCC Request Reboot message if received shortly after a Boot Failed message -- it is interpreted as a power off request.
Fixes #1614.