host-sp-comms: do not reboot host on boot failure #1618

cbiffle · 2024-02-14T21:52:14Z

This is in response to #1613. If the host reports a boot failure (such as a phase mismatch, but not limited to that reason) simply rebooting it blindly is unlikely to fix the problem. We need intervention from a higher power (the control plane) to fix the issue.

So to avoid a bootloop that wastes energy and overwrites our circular buffers with spam, this change alters the response to the IPCC Request Reboot message if received shortly after a Boot Failed message -- it is interpreted as a power off request.

Fixes #1614.

This is in response to #1613. If the host reports a boot failure (such as a phase mismatch, but not limited to that reason) simply rebooting it blindly is unlikely to fix the problem. We need intervention from a higher power (the control plane) to fix the issue. So to avoid a bootloop that wastes energy and overwrites our circular buffers with spam, this change alters the response to the IPCC Request Reboot message if received shortly after a Boot Failed message -- it is interpreted as a power off request. Fixes #1614.

hawkw · 2024-02-14T22:11:47Z

task/host-sp-comms/src/main.rs

        let response = match request {
            HostToSp::_Unused => {
                Some(SpToHost::DecodeFailure(DecodeFailureReason::Deserialize))
            }
            HostToSp::RequestReboot => {
-                action = Some(Action::RebootHost);
+                if core::mem::take(&mut self.host_boot_just_failed) {


The use of mem::take here is very cute! But, IIUC, since we clear the flag later if action is Some, we don't actually need to clear it when we read the flag's value here, do we? Since both branches set action to Some regardless?

Up to you whether this actually ought to change, but at a first glance it kind of appeared that this was the only case where we clear the flag, so it was a bit less obvious that any branch where action is Some clears the flag (despite the comment). 🤷‍♀️

Yeah, I wound up doing both here to make it really really unlikely that we forget to clear the flag. Arguably unnecessary. What do you think would be clearest?

I think maybe leaving it as is, but commenting that this also gets cleared because we've set action = Some(...)? Honestly, I think it's not really a big deal --- mostly, I wanted to make sure my understanding was correct.

The fact that it threw you suggests that I need to make it more clear, so I'll see about adding some comments.

jgallagher · 2024-02-14T23:04:23Z

task/host-sp-comms/src/main.rs

+
+                // Record that this failure happened, which will change the
+                // behavior of the reboot request we expect to follow shortly.
+                self.host_boot_just_failed = true;


Do we care about the reason (which we're ..'ing out in the match arm), or should all boot failure reasons be treated equally? Maybe a question for @citrus-it

It would be my strong preference to have the same behavior regardless of reported reason, but I'm open to changing that if somebody wants to make an argument!

We don't ever expect to see any boot failures apart from in the lab where they're triggered pretty often in my experience. It just takes somebody to do something like configure net boot and forget to start the boot server on the butler, or start it with a typo in the options. We should probably send a heads-up about this new behaviour.

It is very unlikely that any of the boot failure reasons indicate that there would be a different result if retried, I think they should all be treated equally as requiring intervention.

FWIW, I did just hit this exact case the other day personally because I wasn't fast enough in the lab with net boot/partially distracted. Is there something that makes it very clear from the sequencer state or related that we're in this case as a human looking at the system?

The behavior as implemented in this commit won't affect the sequencer state -- the machine will just power itself down on boot failure and wait to be turned back on. We don't currently have a great way of exposing "the last problem with this was _____" data in production builds. As a user, what sort of experience would you hope for?

jgallagher · 2024-02-14T23:08:04Z

task/host-sp-comms/src/main.rs

-                action = Some(Action::RebootHost);
+                if core::mem::take(&mut self.host_boot_just_failed) {
+                    // Do _not_ reboot. Reinterpret this as a poweroff request.
+                    action = Some(Action::PowerOffHost);


I'm happy to look at this more tomorrow, but we might need to address the TODO I left in handle_jefe_notification. There may be a sequence of reboot operations where we still reboot even if we set this action.

Good to know, let's talk tomorrow.

I think I have convinced myself this is okay (where "this" is really "the handling of self.reboot_state"):

self.reboot_state starts out None

It can only go from None to Some(_) inside a call to power_off_host(true)

Once it goes to Some(_), it will eventually cause the host to come back on through one of these paths:

If power_off_host(true) succeeds in asking the sequencer to go to A2, it's set to Some(WaitingForA2). Once we get a notification from jefe that we've entered A2, it will change to Some(WaitingInA2RebootDelay) and start the Timers::WaitingInA2ToReboot timer. Once that timer fires, we'll enter handle_reboot_waiting_in_a2_timer, set the state to A0, and set self.reboot_state back to `None.

If power_off_host(true) fails to ask the sequencer to go to A2 because we're already in A2, it's set to Some(WaitingInA2RebootDelay) and we start the Timers::WaitingInA2ToReboot timer. Once that timer fires, we'll enter handle_reboot_waiting_in_a2_timer, set the state to A0, and set self.reboot_state back to `None.

power_off_host(true) is only called in two places: if we get a jefe notification that we've gone to A0Reset, or if the host sends us HostToSp::RequestReboot. If either of those happened and before we go through the above timer-based process (~5 seconds) to reboot the host we landed here with a host boot fail, the PowerOffHost would be effectively ignored because we'd already be in the process of rebooting the host. But that doesn't seem likely, and if it happens it's maybe fine to honor the "first request" that triggered the reboot?

Currently, if the host sends back-to-back "power off" and "reboot" requests, we'll behave ... rather strangely. I wonder if it might be worth adding code to stop listening to IPCC entirely starting at any reboot/powerdown request and continuing up until we believe the host to have been powered back on.

cbiffle requested review from jgallagher and citrus-it February 14, 2024 21:52

hawkw reviewed Feb 14, 2024

View reviewed changes

jgallagher reviewed Feb 14, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

host-sp-comms: do not reboot host on boot failure #1618

host-sp-comms: do not reboot host on boot failure #1618

cbiffle commented Feb 14, 2024

hawkw Feb 14, 2024

cbiffle Feb 14, 2024

hawkw Feb 15, 2024

cbiffle Feb 16, 2024

jgallagher Feb 14, 2024

cbiffle Feb 14, 2024

citrus-it Feb 15, 2024

rmustacc Feb 15, 2024

cbiffle Feb 16, 2024

jgallagher Feb 14, 2024

cbiffle Feb 14, 2024

jgallagher Feb 15, 2024

cbiffle Feb 16, 2024

host-sp-comms: do not reboot host on boot failure #1618

Are you sure you want to change the base?

host-sp-comms: do not reboot host on boot failure #1618

Conversation

cbiffle commented Feb 14, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment