-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AP_ESC_Telem: fix RPM timeout race #28987
AP_ESC_Telem: fix RPM timeout race #28987
Conversation
The RPM telemetry data structure can get updated by another thread at any time. This can cause (now - last_update_us) to be negative, which looks like the data is nearly UINT32_MAX microseconds stale. To fix this, we need to use signed integers for the time difference. The data_valid flag prevents us from accidentally counting extremely stale data as valid when the signed time difference wraps around, so this is safe.
I don't think this is easily fixable using signed values. The real fix is to introduce locking, but the code is in a hot path so we should avoid that. A simple fix might be to set data_valid to false temporarily in update_rpm, or alternatively have some kind of "in_update" flag which causes the check to fail temporarily. A better fix would be to take a copy of the data before getting the current time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't think this works, need a thread-safe based solution I think
@@ -811,16 +811,18 @@ void AP_ESC_Telem::update() | |||
const uint32_t now_us = AP_HAL::micros(); | |||
for (uint8_t i = 0; i < ESC_TELEM_MAX_ESCS; i++) { | |||
// Invalidate RPM data if not received for too long | |||
if ((now_us - _rpm_data[i].last_update_us) > ESC_RPM_DATA_TIMEOUT_US) { | |||
// (last_update_us might be fresher than now_us, so we use a signed difference) | |||
if (int32_t(now_us - _rpm_data[i].last_update_us) > ESC_RPM_DATA_TIMEOUT_US) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately I don't think this works, because the same comparison happens at the wrap point, and it needs to register a timeout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you talking about when the time difference wraps around? After the difference wraps around (after 36 minutes being stale), this comparison stops setting data_valid
to false, but that's okay because it has been pointlessly re-registering the timeout for 36 minutes and now it just stops setting it to false; it doesn't start setting it to true. And nothing sets it true until fresh data arrives.
Or are you talking about the time when now_us
itself wraps around? Because that also works fine. Integer differences pretty much work fine with wraparound as long as both are the same type.
Agreed; I saw your very nice comment paragraph in the code explaining why locking is a bad idea. Like you said, because this is such a hot path, I wanted to do something with almost zero overhead. For me, it is easier to prove that this logic works than to prove that I don't impact any tight timing requirements anywhere. The fact that it's tricky to tell that this works by inspection is a real flaw of my approach though. I'm open to alternatives, but I have hacked together a little demo to show that this logic works at least. I made an int22_t, which goes through a full cycle every 4.2 seconds (so the time difference wraps after it hits 2.1s), and I made the RPM telemetry switch between updating and not-updating every 5 seconds. Here's a plot of the return value of We can see the value displaying valid for 6 seconds (it's valid for 5 seconds, then times out after 1s of inactivity), then invalid for 4 seconds, just like it should. The Here's the timeout conditional, The validity drops to 0 right when the timeout conditional first transitions to 1, as we expect, but then a second later, the timeout conditional drops to 0 when the time difference wraps around. The flipping of the conditional doesn't matter though, because the validity flag remains false. Here's the actual time difference, as cast to All of these patterns happen exactly the same each iteration, despite the fact that now_us and last_updated are all wrapping around at different times. Their wraparounds don't impact the math for the time difference. Here's On the very left-hand side, we see (I also have a picture putting |
It's just too subtle for my liking and against the pattern we have pretty much everywhere else in the code. If you want to pursue this I think you will need to add helpers and unit tests to AP_HAL::timeout_expired() and friends, but I really don't think this is a good idea. We already have a subtle change in this code to deal with a similar problem from @magicrub and the changes are getting more involved ... |
The RPM telemetry data structure can get updated by another thread at any time. This can cause
(now - last_update_us)
to be negative, which looks like the data is nearlyUINT32_MAX
microseconds stale, causing perfectly fine ESCs to be marked as timed-out. This causes lua scripts to get nil values when they callesc_telem:get_rpm
even when the RPM data is fine. I'd imagine this can cause minor problems elsewhere too, but not sure.There are a few ways we could fix this. Using a signed time difference is the simplest and has the least overhead. We don't need to worry about this signed time difference wrapping around when the rpm is 36 minutes stale, because the
data_valid
flag gets set false as soon as the data is 1s stale, and it can't get set to true until new data comes in.The easiest way to demonstrate the problem (and fix) is with the following lua script running in SITL at 100x speed.
Note
AP_ESC_Telem_Backend::TelemetryData::stale
is just as vulnerable to this, butTelemetryData
doesn't have the same wraparound protection (though it's not as bad since it's using millis), so I can't use this same trick to fix it.