Spurious update failures #1908
Comments
We saw this issue on dogfood too (!!!!!) and I did some experiments there. Power cycling via ignition ( I'm moderately suspicious this is an error that takes an extended time to show up, but I think this at least gives a path forward to do updates.

It turns out ignition will work. I used manual power off/power on and gave it an extra few seconds. We also saw this on a

For reference, if we poke a different area in the system region, we get the error code that matches

Turns out, if you look in the latest Cortex M7 manual https://developer.arm.com/documentation/ddi0489/f/memory-system/speculative-accesses/considerations-for-system-design, the chip does speculate. We're going with #1905 as our workaround.
On both colo and dogfood, we've seen SP update failures when updating to R11.
https://github.com/oxidecomputer/colo/issues/88
Logs and Hubris dumps are in `/staff/rack3/BRM42220064/2024-10-18`
This failure is common, but does not occur 100% of the time. When force-updating from R11 to R11 on the bench, @jgallagher and @labbott could not reproduce the issue.
The failure logs consistently show the same thing:
This represents `SpCommsError::UpdateFailed(UpdateError::CommunicationError(CommunicationError::SpError(SpError::UpdatedFailed(7))))`

The `7` code is not strongly typed, but from auditing error types that get cast into the `u32`, it's most likely `UpdateError::ReadProtErr` (this is Hubris's internal `UpdateError` type, not the MGS `UpdateError`).

This agrees with the ringbuf, which shows no progress after `EraseEnd`:

At the end of `bank_erase`, Hubris checks the status flags for bank 2 (`bank2_status`) and returns an error if any of them are set.

In other words, it seems likely that the RDPERR bit is set in the bank 2 status bits.
We never enable read protection, so it's unclear how this bit could end up being set.
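To make "which flag is set" concrete when staring at a raw status word, here is a minimal sketch of a decoder for the error flags in a flash status register. The bit positions are my reading of the `FLASH_SR1`/`FLASH_SR2` layout in RM0433 and should be double-checked against the manual; `decode_flash_errors` and `ERROR_FLAGS` are hypothetical names, not something taken from Hubris.

```rust
/// Error flags in FLASH_SR1 / FLASH_SR2 as (name, bit position).
/// Assumption: positions are transcribed from RM0433 and are not
/// copied from the Hubris source; verify before relying on them.
const ERROR_FLAGS: [(&str, u32); 9] = [
    ("WRPERR", 17),
    ("PGSERR", 18),
    ("STRBERR", 19),
    ("INCERR", 21),
    ("OPERR", 22),
    ("RDPERR", 23),
    ("RDSERR", 24),
    ("SNECCERR", 25),
    ("DBECCERR", 26),
];

/// Returns the names of all error flags set in a raw status register value,
/// in ascending bit order.
pub fn decode_flash_errors(sr: u32) -> Vec<&'static str> {
    ERROR_FLAGS
        .iter()
        .filter(|&&(_, bit)| sr & (1u32 << bit) != 0)
        .map(|&(name, _)| name)
        .collect()
}
```

Feeding it a value read back from the status register (for example via a debugger) gives the flag names directly, which makes it easier to tell `RDPERR` apart from its neighbours.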
Spontaneously set flags have been reported on the ST forums and among other embedded OSes.
The Zephyr issue at zephyrproject-rtos/zephyr#60449 is a good summary.
Zephyr manages to see this issue by just sleeping (see `main.c`). Note that the sleep syscall goes into the kernel, so there's stuff happening under the hood, but not much!

In the ST forum, the issue is diagnosed as follows:
This raises more questions than answers:
`RDPERR`, not `RDSERR`. Do they have the same root cause?

Zephyr eliminated the error by dedicating an MPU region to system memory, which is evidence for this theory. It's unclear whether that would be feasible for us (some of our tasks are already using every MPU region), or whether we should expect our usual memory protection to have the same effect.
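For a sense of what dedicating an MPU region would cost, here is a rough sketch of computing the ARMv7-M `MPU_RBAR`/`MPU_RASR` values for a region that permits no access. This is not Zephyr's code and not a proposal for the Hubris kernel: `no_access_region` is a hypothetical helper, the attribute choice (XN, AP = no access, TEX/C/B all zero, i.e. Strongly-ordered) is an assumption about what would stop speculative reads, and the base and size of system memory have to come from the reference manual.

```rust
/// Builds (RBAR, RASR) for an ARMv7-M MPU region that allows no access.
/// Returns None if the region cannot be encoded.
pub fn no_access_region(base: u32, size_bytes: u32, region: u32) -> Option<(u32, u32)> {
    // Regions must be a power of two, at least 32 bytes, naturally aligned,
    // and the region number must fit in the 4-bit REGION field.
    if !size_bytes.is_power_of_two()
        || size_bytes < 32
        || base % size_bytes != 0
        || region > 15
    {
        return None;
    }
    // The SIZE field encodes a region of 2^(SIZE + 1) bytes.
    let size_field = size_bytes.trailing_zeros() - 1;
    // RBAR: base address, VALID bit (bit 4), region number (bits 3:0).
    let rbar = base | (1 << 4) | region;
    // RASR: XN (bit 28), AP = 0b000 (no access, bits 26:24),
    // TEX/C/B left at zero (Strongly-ordered), SIZE (bits 5:1), ENABLE (bit 0).
    let rasr = (1 << 28) | (0b000 << 24) | (size_field << 1) | 1;
    Some((rbar, rasr))
}
```

For example, `no_access_region(0x1FF0_0000, 128 * 1024, 7)` yields the pair to write for region 7; the 128 KiB size there is only a placeholder. The point is mostly that this consumes a whole region, which is exactly the resource we're short on.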
We can kinda reproduce the issue by issuing reads to system memory using `humility readmem`.

Here's an example of reading from system flash (`0x1FF02000`) then seeing a flag set in `FLASH_SR2` (`0x52002110`):

Note that this sets the `RDSERRIE`, not the `RDPERR` bit, so it's not quite the same as our test.

There are two flash status registers: `FLASH_SR1` and `FLASH_SR2`. The error flag is set in either `FLASH_SR1` or `FLASH_SR2`, depending on whether we are running in bank-swapped mode; this is one of the few cases where bank swapping is visible. This also means that our Hubris check is wrong, because it always looks at `FLASH_SR2`.

Running the same test after switching into the other bank, the flag is set in `FLASH_SR1` (`0x52002010`) instead of `FLASH_SR2`:

Miscellaneous observations
The Proprietary code readout protection functionality is noted to raise error flags without generating bus errors:
RM0433, § 4.5.4
We're not using it, but who knows!
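Coming back to the `FLASH_SR1` / `FLASH_SR2` observation above: one way to make a status check independent of bank swapping would be to look at both registers. A minimal sketch, assuming the error bits sit at positions 17 to 19 and 21 to 26 (my reading of RM0433, worth verifying); `flash_errors` and `error_mask` are hypothetical names, and this is not the workaround we went with.

```rust
/// Bit positions of the error flags in FLASH_SR1 / FLASH_SR2.
/// Assumption: taken from my reading of RM0433, not from Hubris.
const ERROR_BITS: [u32; 9] = [17, 18, 19, 21, 22, 23, 24, 25, 26];

/// Mask covering every error flag listed above.
pub fn error_mask() -> u32 {
    ERROR_BITS.iter().fold(0, |mask, &bit| mask | (1u32 << bit))
}

/// Returns the union of error flags set in either status register, so the
/// result does not depend on which bank is currently mapped where.
pub fn flash_errors(sr1: u32, sr2: u32) -> u32 {
    (sr1 | sr2) & error_mask()
}
```

A non-zero result means some error flag is set in at least one bank's status register. The trade-off is that an erase of one bank would then also fail on a stale flag belonging to the other bank.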