Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AP_Math: prevent FPE in SITL when limitting accel, vectors #28981

Merged
merged 3 commits into from
Jan 7, 2025

Conversation

peterbarker
Copy link
Contributor

Co-authored-by: Leonard Hall leonardthall@gmail.com

the cross-product code can produce something slightly negative. Stop passing that into safe_sqrt; it isn't that safe on SITL!

@khancyr
Copy link
Contributor

khancyr commented Jan 1, 2025

Shouldn't be this done in safe_sqrt ? I would assume that we have some other place that can have this issue

@peterbarker
Copy link
Contributor Author

Shouldn't be this done in safe_sqrt ? I would assume that we have some other place that can have this issue

Well... we really want to know about where this actually does happen. This PR isn't really a satisfactory solution to what we're seeing IMO - it would be nice to trace back the maths to work out how the rhs ens up larger than the lhs in here. #28969 is the issue if you missed it

@IamPete1
Copy link
Member

IamPete1 commented Jan 1, 2025

I made a Geogebra to visualise the maths.

https://www.geogebra.org/calculator/ktxy2uyf

I don't see any issue with the maths. Must be floating point stuff.

If I were to guess it would be vel.is_zero(). It checks the individual X and Y components are not zero. This does not mean that the length is none zero (when using floats). Squaring a very small number which is not zero might result in a number which is zero. So you get zeros before the addition and square root that would make the number larger again.

Maybe we need a new method length_is_zero we could compare components against sqrt(FLT_EPSILON) rather than FLT_EPSILON.

@tpwrules

This comment was marked as outdated.

@tpwrules
Copy link
Contributor

tpwrules commented Jan 1, 2025

Pete is close, the problem is imprecision in Vector2f.limit_length() on line 387.

The vector is (99.9791489, -557.290039) and the length limit is 566.187256. length() of the vector is 566.187256 so limit_length() doesn't do anything and the safe_sqrt runs. But length_squared() is 320568.031250 while accel_max*accel_max is 320568.000000, a change of the least significant bit. The difference in these quantities is exactly the value that was passed to safe_sqrt. Lucky shot!

  • We could implement a MAX with 0, that seems okay
  • We could adjust the safe_sqrt call to not square accel_max, but that would require an extra square root
  • We could adjust limit_length to check the squared length, or have an alternate method for where we care about the squared limit

I'm not sure how to guarantee the vector length quantity is actually <= the limit after the math in limit_length() no matter how it's done. I am guessing that could have errors too if they are very close. Anybody have any idea what rounding mode we use?

@tpwrules
Copy link
Contributor

tpwrules commented Jan 1, 2025

Doing some research and fiddling suggests this is the correct solution. We need to audit other uses of limit_length() to make sure they won't explode if the vector ends up slightly over the limit.

It's not feasible due to the vagaries of floating point math to guarantee it's always equal or under.

@IamPete1
Copy link
Member

IamPete1 commented Jan 1, 2025

  • We could adjust limit_length to check the squared length, or have an alternate method for where we care about the squared limit

We could limit to the smaller of the the existing and the squared so we always conservative.

On the whole I tend to think know we know what is going on we probably don't need to panic. I'd even be tempted not to fix this case so we can see if a recent change had made this more likely.

We do have a test for the function, maybe we should add some cases there.

TEST(Vector2Test, length)
{
EXPECT_FLOAT_EQ(25, Vector2f(3, 4).length_squared());
Vector2f v_float1(1.0f, 1.0f);
EXPECT_TRUE(v_float1.limit_length(1.0f));
EXPECT_FALSE(Vector2f(-0.0f, 0.0f).limit_length(1.0f));
}

@peterbarker
Copy link
Contributor Author

Closes #28969

@tpwrules
Copy link
Contributor

tpwrules commented Jan 1, 2025

I don't actually think limit_length() can be entirely fixed unfortunately. I think it's better to just be mindful of its limitations.

And it has happened again here: https://github.com/ArduPilot/ardupilot/actions/runs/12572260649/job/35044375709?pr=28857

@peterbarker peterbarker added the WIP label Jan 2, 2025
@peterbarker peterbarker force-pushed the pr/prevent-fpe-control branch from 6d764d3 to a73e086 Compare January 2, 2025 05:22
@peterbarker
Copy link
Contributor Author

I've written a test for this now so other people can play with things in gdb easily.

I've also pushed up a patch reverting the original fix and proposing an alternative fix. Basically stop using the apples comparison to determine if we do the oranges maths - use a single fruit in both cases.

@peterbarker
Copy link
Contributor Author

And it has happened again here: https://github.com/ArduPilot/ardupilot/actions/runs/12572260649/job/35044375709?pr=28857

Well, that's coincidence... third is presumed to be enemy action...

I've had a quick look through the stack and code changes and haven't found any smoking gun. @robertlong13 changed things up in this test a bit recently be adding in the ICE frames. Is it possible that the slightly different mission is causing this? 3982d57

@tpwrules
Copy link
Contributor

tpwrules commented Jan 2, 2025

So this both fixes the problem and doesn't. The problem is fixed because we won't execute a square root with a negative number; we now trust that the limit function says it wanted to limit the vector and then set accel_dir to 0 instead of calling safe_sqrt.

But it doesn't fix the problem because the vector is still in some sense too long. Calling accel_cross.limit_length_squared(accel_max) again will still return true. The test (len_squared <= max_length_squared) is still false, but len == max_length so x and y get multiplied by precisely one and the vector doesn't change at all despite the function returning that it tried to change it.

I think adding a MAX(0, <old value>) to the safe_sqrt now is the best solution to stop CI flapping, then do some more detailed studies of who uses limit_length and what are the problems for going over.

The current fix is sort of worse because now limit_length_squared is lying. It said it limited the vector but didn't. I haven't been able to prove limit_length can't lie, it's not trivial to make it do so, but I suspect it's possible with subnormals or something.

Is it possible that the slightly different mission is causing this?

Yes, I expect so. It just happens to sometimes hit this exact point now. Here's the enemy action: https://github.com/ArduPilot/ardupilot/actions/runs/12581405966/job/35065761383?pr=28991 . The numbers are again identical.

@tpwrules
Copy link
Contributor

tpwrules commented Jan 4, 2025

Another failure: https://github.com/ArduPilot/ardupilot/actions/runs/12591859248/job/35097592628?pr=28996

Going to try to replicate locally and see if that mission change was the instigator.

Still prefer the MAX fix.

@tpwrules
Copy link
Contributor

tpwrules commented Jan 4, 2025

I was able to replicate it (running only the FlyEachFrame test and quadplane-tilt frame) but sometimes it took 50-60 tries and sometimes it was one or two. It would be hard to say what particular commit did it. I imagine it's kind of dependent on system load and stuff too.

@tpwrules
Copy link
Contributor

tpwrules commented Jan 5, 2025

Now it's infected copter: https://github.com/ArduPilot/ardupilot/actions/runs/12616584539/job/35158079748?pr=28857

accel_cross = {x = 290.641205, y = -485.89682}, accel_max = 566.187256

@peterbarker
Copy link
Contributor Author

Now it's infected copter: https://github.com/ArduPilot/ardupilot/actions/runs/12616584539/job/35158079748?pr=28857

We're probably going to need fire by the end of this.

@peterbarker peterbarker force-pushed the pr/prevent-fpe-control branch from a73e086 to f06ef62 Compare January 5, 2025 23:57
@peterbarker
Copy link
Contributor Author

I've popped off my alternative fix now.

Copy link
Contributor

@tpwrules tpwrules left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix looks good to me, thanks! Tested that the test fails when the fix is reverted.

Some super minor nits:

  • never quite sure what "limiting accel, vectors" meant, that could be clearer
  • squash last two commits into one as they are both needed together to fix the test
  • Co-authored-by: tag needs to be the last line in the commit message, after two blank lines for GitHub to pick it up

@peterbarker peterbarker force-pushed the pr/prevent-fpe-control branch from f06ef62 to db167a3 Compare January 6, 2025 06:57
@peterbarker
Copy link
Contributor Author

Fix looks good to me, thanks! Tested that the test fails when the fix is reverted.

Some super minor nits:

* never quite sure what "limiting accel, vectors" meant, that could be clearer

* squash last two commits into one as they are both needed together to fix the test

* `Co-authored-by:` tag needs to be the last line in the commit message, after two blank lines for GitHub to pick it up

All fixed, thanks!

@tridge
Copy link
Contributor

tridge commented Jan 7, 2025

I'd prefer to restore the original intention of safe_sqrt() which is what a negative input is not an error

@IamPete1
Copy link
Member

IamPete1 commented Jan 7, 2025

Looks like this is the same problem and used the same fix as proposed here:

// todo: remove MAX of zero when safe_sqrt fixed
return safe_sqrt(MAX(pos_error.length_squared() - sq(dot_error), 0.0));

@tpwrules tpwrules force-pushed the pr/prevent-fpe-control branch from 3b16f3f to a97c03d Compare January 7, 2025 02:15
@tpwrules
Copy link
Contributor

tpwrules commented Jan 7, 2025

A little clever but should serve the purpose :)

Checked the tests too, but CI will tell all. Should we backport? I don't think it's necessary, unless we broke the 4.6 CI.

@tpwrules tpwrules force-pushed the pr/prevent-fpe-control branch from a97c03d to 84e7344 Compare January 7, 2025 02:28
peterbarker and others added 2 commits January 6, 2025 20:30
The cross-product code can produce something slightly negative. Fix
safe_sqrt to avoid barfing on that, as was originally intended, and
clarify why it's being used.

Co-authored-by: Leonard Hall <leonardthall@gmail.com>
Co-authored-by: Thomas Watson <twatson52@icloud.com>
@tpwrules tpwrules force-pushed the pr/prevent-fpe-control branch from 84e7344 to 5f420e7 Compare January 7, 2025 02:30
if (isnan(ret)) {
return 0;
// cast before checking so we sqrtf the same value we check
const float val = static_cast<float>(v);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, we went through this entire discussion without noting that we're squashing any double which might be passed in....

Copy link
Contributor

@tpwrules tpwrules Jan 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had noticed that. But we do it in several places, I'm not too concerned specifically here. Also having looked at the binary I think it only ever gets called with a float even on double boards.

@peterbarker peterbarker merged commit f236a6c into ArduPilot:master Jan 7, 2025
99 checks passed
@peterbarker peterbarker deleted the pr/prevent-fpe-control branch January 7, 2025 05:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants