AP_Math: prevent FPE in SITL when limitting accel, vectors #28981

peterbarker · 2025-01-01T10:53:26Z

Co-authored-by: Leonard Hall leonardthall@gmail.com

the cross-product code can produce something slightly negative. Stop passing that into safe_sqrt; it isn't that safe on SITL!

khancyr · 2025-01-01T13:06:44Z

Shouldn't be this done in safe_sqrt ? I would assume that we have some other place that can have this issue

peterbarker · 2025-01-01T14:07:04Z

Shouldn't be this done in safe_sqrt ? I would assume that we have some other place that can have this issue

Well... we really want to know about where this actually does happen. This PR isn't really a satisfactory solution to what we're seeing IMO - it would be nice to trace back the maths to work out how the rhs ens up larger than the lhs in here. #28969 is the issue if you missed it

IamPete1 · 2025-01-01T17:48:37Z

I made a Geogebra to visualise the maths.

https://www.geogebra.org/calculator/ktxy2uyf

I don't see any issue with the maths. Must be floating point stuff.

If I were to guess it would be vel.is_zero(). It checks the individual X and Y components are not zero. This does not mean that the length is none zero (when using floats). Squaring a very small number which is not zero might result in a number which is zero. So you get zeros before the addition and square root that would make the number larger again.

Maybe we need a new method length_is_zero we could compare components against sqrt(FLT_EPSILON) rather than FLT_EPSILON.

tpwrules · 2025-01-01T18:37:14Z

Pete is close, the problem is imprecision in Vector2f.limit_length() on line 387.

The vector is (99.9791489, -557.290039) and the length limit is 566.187256. length() of the vector is 566.187256 so limit_length() doesn't do anything and the safe_sqrt runs. But length_squared() is 320568.031250 while accel_max*accel_max is 320568.000000, a change of the least significant bit. The difference in these quantities is exactly the value that was passed to safe_sqrt. Lucky shot!

We could implement a MAX with 0, that seems okay
We could adjust the safe_sqrt call to not square accel_max, but that would require an extra square root
We could adjust limit_length to check the squared length, or have an alternate method for where we care about the squared limit

I'm not sure how to guarantee the vector length quantity is actually <= the limit after the math in limit_length() no matter how it's done. I am guessing that could have errors too if they are very close. Anybody have any idea what rounding mode we use?

tpwrules · 2025-01-01T20:18:29Z

Doing some research and fiddling suggests this is the correct solution. We need to audit other uses of limit_length() to make sure they won't explode if the vector ends up slightly over the limit.

It's not feasible due to the vagaries of floating point math to guarantee it's always equal or under.

libraries/AP_Math/control.cpp

IamPete1 · 2025-01-01T21:12:27Z

We could adjust limit_length to check the squared length, or have an alternate method for where we care about the squared limit

We could limit to the smaller of the the existing and the squared so we always conservative.

On the whole I tend to think know we know what is going on we probably don't need to panic. I'd even be tempted not to fix this case so we can see if a recent change had made this more likely.

We do have a test for the function, maybe we should add some cases there.

ardupilot/libraries/AP_Math/tests/test_vector2.cpp

Lines 125 to 131 in 5726c03

    
           TEST(Vector2Test, length) 
        
           { 
        
               EXPECT_FLOAT_EQ(25, Vector2f(3, 4).length_squared()); 
        
               Vector2f v_float1(1.0f, 1.0f); 
        
               EXPECT_TRUE(v_float1.limit_length(1.0f)); 
        
               EXPECT_FALSE(Vector2f(-0.0f, 0.0f).limit_length(1.0f)); 
        
           }

peterbarker · 2025-01-01T21:22:17Z

Closes #28969

tpwrules · 2025-01-01T22:36:50Z

I don't actually think limit_length() can be entirely fixed unfortunately. I think it's better to just be mindful of its limitations.

And it has happened again here: https://github.com/ArduPilot/ardupilot/actions/runs/12572260649/job/35044375709?pr=28857

peterbarker · 2025-01-02T05:28:15Z

I've written a test for this now so other people can play with things in gdb easily.

I've also pushed up a patch reverting the original fix and proposing an alternative fix. Basically stop using the apples comparison to determine if we do the oranges maths - use a single fruit in both cases.

peterbarker · 2025-01-02T05:41:44Z

And it has happened again here: https://github.com/ArduPilot/ardupilot/actions/runs/12572260649/job/35044375709?pr=28857

Well, that's coincidence... third is presumed to be enemy action...

I've had a quick look through the stack and code changes and haven't found any smoking gun. @robertlong13 changed things up in this test a bit recently be adding in the ICE frames. Is it possible that the slightly different mission is causing this? 3982d57

tpwrules · 2025-01-02T16:51:56Z

So this both fixes the problem and doesn't. The problem is fixed because we won't execute a square root with a negative number; we now trust that the limit function says it wanted to limit the vector and then set accel_dir to 0 instead of calling safe_sqrt.

But it doesn't fix the problem because the vector is still in some sense too long. Calling accel_cross.limit_length_squared(accel_max) again will still return true. The test (len_squared <= max_length_squared) is still false, but len == max_length so x and y get multiplied by precisely one and the vector doesn't change at all despite the function returning that it tried to change it.

I think adding a MAX(0, <old value>) to the safe_sqrt now is the best solution to stop CI flapping, then do some more detailed studies of who uses limit_length and what are the problems for going over.

The current fix is sort of worse because now limit_length_squared is lying. It said it limited the vector but didn't. I haven't been able to prove limit_length can't lie, it's not trivial to make it do so, but I suspect it's possible with subnormals or something.

Is it possible that the slightly different mission is causing this?

Yes, I expect so. It just happens to sometimes hit this exact point now. Here's the enemy action: https://github.com/ArduPilot/ardupilot/actions/runs/12581405966/job/35065761383?pr=28991 . The numbers are again identical.

tpwrules · 2025-01-04T05:06:26Z

Another failure: https://github.com/ArduPilot/ardupilot/actions/runs/12591859248/job/35097592628?pr=28996

Going to try to replicate locally and see if that mission change was the instigator.

Still prefer the MAX fix.

tpwrules · 2025-01-04T21:32:38Z

I was able to replicate it (running only the FlyEachFrame test and quadplane-tilt frame) but sometimes it took 50-60 tries and sometimes it was one or two. It would be hard to say what particular commit did it. I imagine it's kind of dependent on system load and stuff too.

tpwrules · 2025-01-05T05:41:52Z

Now it's infected copter: https://github.com/ArduPilot/ardupilot/actions/runs/12616584539/job/35158079748?pr=28857

accel_cross = {x = 290.641205, y = -485.89682}, accel_max = 566.187256

peterbarker · 2025-01-05T09:37:31Z

Now it's infected copter: https://github.com/ArduPilot/ardupilot/actions/runs/12616584539/job/35158079748?pr=28857

We're probably going to need fire by the end of this.

peterbarker · 2025-01-05T23:58:05Z

I've popped off my alternative fix now.

tpwrules

Fix looks good to me, thanks! Tested that the test fails when the fix is reverted.

Some super minor nits:

never quite sure what "limiting accel, vectors" meant, that could be clearer
squash last two commits into one as they are both needed together to fix the test
Co-authored-by: tag needs to be the last line in the commit message, after two blank lines for GitHub to pick it up

peterbarker · 2025-01-06T06:58:18Z

Fix looks good to me, thanks! Tested that the test fails when the fix is reverted.

Some super minor nits:

* never quite sure what "limiting accel, vectors" meant, that could be clearer

* squash last two commits into one as they are both needed together to fix the test

* `Co-authored-by:` tag needs to be the last line in the commit message, after two blank lines for GitHub to pick it up

All fixed, thanks!

tridge · 2025-01-07T01:10:12Z

I'd prefer to restore the original intention of safe_sqrt() which is what a negative input is not an error

IamPete1 · 2025-01-07T01:15:38Z

Looks like this is the same problem and used the same fix as proposed here:

ardupilot/libraries/AC_AttitudeControl/AC_PosControl.cpp

Lines 1415 to 1416 in 01345e5

    
           // todo: remove MAX of zero when safe_sqrt fixed 
        
           return safe_sqrt(MAX(pos_error.length_squared() - sq(dot_error), 0.0));

tpwrules · 2025-01-07T02:16:27Z

A little clever but should serve the purpose :)

Checked the tests too, but CI will tell all. Should we backport? I don't think it's necessary, unless we broke the 4.6 CI.

libraries/AP_Math/AP_Math.cpp

The cross-product code can produce something slightly negative. Fix safe_sqrt to avoid barfing on that, as was originally intended, and clarify why it's being used. Co-authored-by: Leonard Hall <leonardthall@gmail.com> Co-authored-by: Thomas Watson <twatson52@icloud.com>

… case

libraries/AP_Math/AP_Math.cpp

peterbarker · 2025-01-07T05:07:25Z

libraries/AP_Math/AP_Math.cpp

-    if (isnan(ret)) {
-        return 0;
+    // cast before checking so we sqrtf the same value we check
+    const float val = static_cast<float>(v);


Wow, we went through this entire discussion without noting that we're squashing any double which might be passed in....

I had noticed that. But we do it in several places, I'm not too concerned specifically here. Also having looked at the binary I think it only ever gets called with a float even on double boards.

This comment was marked as outdated.

Sign in to view

tpwrules reviewed Jan 1, 2025

View reviewed changes

libraries/AP_Math/control.cpp Outdated Show resolved Hide resolved

peterbarker added the WIP label Jan 2, 2025

peterbarker force-pushed the pr/prevent-fpe-control branch from 6d764d3 to a73e086 Compare January 2, 2025 05:22

AP_Math: add test for FPE in control code

f523c88

peterbarker force-pushed the pr/prevent-fpe-control branch from a73e086 to f06ef62 Compare January 5, 2025 23:57

tpwrules approved these changes Jan 6, 2025

View reviewed changes

peterbarker force-pushed the pr/prevent-fpe-control branch from f06ef62 to db167a3 Compare January 6, 2025 06:57

peterbarker added DevCallTopic and removed WIP labels Jan 6, 2025

peterbarker force-pushed the pr/prevent-fpe-control branch from db167a3 to 3b16f3f Compare January 6, 2025 07:00

CraigElder removed the DevCallTopic label Jan 7, 2025

lthall approved these changes Jan 7, 2025

View reviewed changes

tpwrules force-pushed the pr/prevent-fpe-control branch from 3b16f3f to a97c03d Compare January 7, 2025 02:15

peterbarker commented Jan 7, 2025

View reviewed changes

libraries/AP_Math/AP_Math.cpp Outdated Show resolved Hide resolved

tpwrules force-pushed the pr/prevent-fpe-control branch from a97c03d to 84e7344 Compare January 7, 2025 02:28

peterbarker and others added 2 commits January 6, 2025 20:30

AP_Math: tests: we no longer expect an exception when limiting corner…

5f420e7

… case

tpwrules force-pushed the pr/prevent-fpe-control branch from 84e7344 to 5f420e7 Compare January 7, 2025 02:30

peterbarker commented Jan 7, 2025

View reviewed changes

libraries/AP_Math/AP_Math.cpp Outdated Show resolved Hide resolved

peterbarker added the MergeOnCIPass label Jan 7, 2025

peterbarker commented Jan 7, 2025

View reviewed changes

peterbarker merged commit f236a6c into ArduPilot:master Jan 7, 2025
99 checks passed

peterbarker deleted the pr/prevent-fpe-control branch January 7, 2025 05:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

AP_Math: prevent FPE in SITL when limitting accel, vectors #28981

AP_Math: prevent FPE in SITL when limitting accel, vectors #28981

peterbarker commented Jan 1, 2025

khancyr commented Jan 1, 2025

peterbarker commented Jan 1, 2025

IamPete1 commented Jan 1, 2025

This comment was marked as outdated.

tpwrules commented Jan 1, 2025 •

edited

Loading

tpwrules commented Jan 1, 2025

IamPete1 commented Jan 1, 2025

peterbarker commented Jan 1, 2025

tpwrules commented Jan 1, 2025

peterbarker commented Jan 2, 2025

peterbarker commented Jan 2, 2025

tpwrules commented Jan 2, 2025 •

edited

Loading

tpwrules commented Jan 4, 2025

tpwrules commented Jan 4, 2025

tpwrules commented Jan 5, 2025

peterbarker commented Jan 5, 2025

peterbarker commented Jan 5, 2025

tpwrules left a comment •

edited

Loading

peterbarker commented Jan 6, 2025

tridge commented Jan 7, 2025

IamPete1 commented Jan 7, 2025

tpwrules commented Jan 7, 2025 •

edited

Loading

peterbarker Jan 7, 2025

tpwrules Jan 7, 2025 •

edited

Loading

AP_Math: prevent FPE in SITL when limitting accel, vectors #28981

AP_Math: prevent FPE in SITL when limitting accel, vectors #28981

Conversation

peterbarker commented Jan 1, 2025

khancyr commented Jan 1, 2025

peterbarker commented Jan 1, 2025

IamPete1 commented Jan 1, 2025

This comment was marked as outdated.

tpwrules commented Jan 1, 2025 • edited Loading

tpwrules commented Jan 1, 2025

IamPete1 commented Jan 1, 2025

peterbarker commented Jan 1, 2025

tpwrules commented Jan 1, 2025

peterbarker commented Jan 2, 2025

peterbarker commented Jan 2, 2025

tpwrules commented Jan 2, 2025 • edited Loading

tpwrules commented Jan 4, 2025

tpwrules commented Jan 4, 2025

tpwrules commented Jan 5, 2025

peterbarker commented Jan 5, 2025

peterbarker commented Jan 5, 2025

tpwrules left a comment • edited Loading

Choose a reason for hiding this comment

peterbarker commented Jan 6, 2025

tridge commented Jan 7, 2025

IamPete1 commented Jan 7, 2025

tpwrules commented Jan 7, 2025 • edited Loading

peterbarker Jan 7, 2025

Choose a reason for hiding this comment

tpwrules Jan 7, 2025 • edited Loading

Choose a reason for hiding this comment

tpwrules commented Jan 1, 2025 •

edited

Loading

tpwrules commented Jan 2, 2025 •

edited

Loading

tpwrules left a comment •

edited

Loading

tpwrules commented Jan 7, 2025 •

edited

Loading

tpwrules Jan 7, 2025 •

edited

Loading