Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Numeric matching to support full range of float64 #188

Merged
merged 6 commits into from
Sep 19, 2024
Merged

Conversation

baldawar
Copy link
Collaborator

@baldawar baldawar commented Sep 18, 2024

Issue #, if available:

179

Description of changes:

This change follows the guidance from #179 on using 10 byte base-128 encoded format for numbers similar to how Quamina does it.

Didn't see any performance implications of supporting the new range, but had to fix a bunch of tests. I will be changing the numbers we use for testing to better test the new range of numbers before merging.

During debugging, I found it challenging to make sense of the numbers to I've also added a helper method in ComparableNumbers and modified toString() methods in few places.

Benchmark / Performance (for source code changes):

Benchmark                                                  (benchmark)   Mode  Cnt       Score       Error  Units
CityLots2JmhBenchmarks.group01Simple                             EXACT  thrpt   10  413930.986 ± 11450.753  ops/s
CityLots2JmhBenchmarks.group01Simple                          WILDCARD  thrpt   10  352480.174 ±  6750.778  ops/s
CityLots2JmhBenchmarks.group01Simple                            PREFIX  thrpt   10  418264.642 ± 19729.068  ops/s
CityLots2JmhBenchmarks.group01Simple   PREFIX_EQUALS_IGNORE_CASE_RULES  thrpt   10  415142.905 ± 18319.570  ops/s
CityLots2JmhBenchmarks.group01Simple                            SUFFIX  thrpt   10  419590.039 ± 15463.837  ops/s
CityLots2JmhBenchmarks.group01Simple   SUFFIX_EQUALS_IGNORE_CASE_RULES  thrpt   10  410364.141 ± 12610.649  ops/s
CityLots2JmhBenchmarks.group01Simple                EQUALS_IGNORE_CASE  thrpt   10  382570.869 ±  5316.951  ops/s
CityLots2JmhBenchmarks.group01Simple                           NUMERIC  thrpt   10  256233.479 ±  3662.693  ops/s
CityLots2JmhBenchmarks.group01Simple                      ANYTHING-BUT  thrpt   10  276676.782 ±   958.215  ops/s
CityLots2JmhBenchmarks.group01Simple          ANYTHING-BUT-IGNORE-CASE  thrpt   10  274642.171 ±  5587.937  ops/s
CityLots2JmhBenchmarks.group01Simple               ANYTHING-BUT-PREFIX  thrpt   10  288814.025 ±  1302.395  ops/s
CityLots2JmhBenchmarks.group01Simple               ANYTHING-BUT-SUFFIX  thrpt   10  284559.927 ±  3258.949  ops/s
CityLots2JmhBenchmarks.group01Simple             ANYTHING-BUT-WILDCARD  thrpt   10  303269.268 ±   922.894  ops/s
CityLots2JmhBenchmarks.group02Complex                   COMPLEX_ARRAYS  thrpt    6   54274.020 ±   567.870  ops/s
CityLots2JmhBenchmarks.group02Complex                    PARTIAL_COMBO  thrpt    6  102749.040 ±  2765.021  ops/s
CityLots2JmhBenchmarks.group02Complex                            COMBO  thrpt    6   35555.689 ±   312.623  ops/s

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@baldawar baldawar marked this pull request as ready for review September 19, 2024 00:38
Copy link
Collaborator

@timbray timbray left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, a couple of nonblocking comments to look at

* {@code 0.30d - 0.10d = 0.19999999999999998} instead of {@code 0.2d}. When extended to {@code 1e10}, the test
* results show that only 5 decimal places of precision can be guaranteed when using doubles.
* The numbers are first parsed as a Java {@code BigDecimal} as there is a well known issue
* where parsing directly to {@code Double} can loose precision when parsing doubles. It's
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"loose" should be "lose"

* results show that only 5 decimal places of precision can be guaranteed when using doubles.
* The numbers are first parsed as a Java {@code BigDecimal} as there is a well known issue
* where parsing directly to {@code Double} can loose precision when parsing doubles. It's
* probably possible to wider ranges with our current implementation of parsing strings to
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

grammar, "possible to wider ranges"

throw new IllegalArgumentException("Value must be between " + -ComparableNumber.HALF_TRILLION +
" and " + ComparableNumber.HALF_TRILLION + ", inclusive");
// make sure we have the comparable numbers and haven't eaten up decimals values
if(Double.isNaN(doubleValue) || Double.isInfinite(doubleValue) ||
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this happen? JSON doesn't allow numeric values of NaN and Inf

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not in JSON but there's interest in supporting other formats so I'm bullet proofing this now than be in for a surprise in future.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can check both in one go based on the bits-representation. With a constant long NOT_FINITE = 0x7ff0L << 52: (bits & NOT_FINITE) == NOT_FINITE will be true for only NaN and +/- Infinity.

Maybe the JVM automagically already inlines the two calls like that. If not, it could help, here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JVM does inline it. I've left this as it for readability for now.

* <br/>
* As shown in Quamina's numbits.go, it's possible to use variable length encoding
* to reduce storage for simple (often common) numbers but it's not done here to
* keep range comparisons simple for now.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha, ever since Quamina got variable-width numbits I've been worrying about the range comparison code. sigh now I guess I have to figure that out.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. I gave it a shot but found it hard to get right. I wanted to get back to building $and support, so took the easier route. I figured I can come back to cleaning up my branch later.

* @param value the long value to be converted
* @return the Base128 encoded string representation of the input value
*/
public static String numbits(long value) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Each char in a java String is 2 bytes because they are UTF-16 codepoints. You could save a lot of memory if you could work on the UTF-8 bytes. But I realize that this comment applies to all of Ruler I guess.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's fair. It would be worth a look when we dig into memory & cpu improvements down the line.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are Strings-by-chars sorted in Event Ruler? Binary value or with some collation? If it's binary... you can still use the same scheme but store 2 7-bit bytes per char. Might require shuffling the bytes around if the endianness of the char 2-byte-tuples is different. But as long as leaving the 0x8080 bits out stays clear of combiners or possibly causing illegal codepoints... you should be good.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... or use the lower 15 bits. simpler than a 0x8080 mask. Still, check that you get by with only one codepoint and that it's not reserved or illegal in UTF16

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comparing ranges got trickier with using 2 7-bit bytes per char, so I've parked the idea for now. Will come back to it later.

In case anyone takes a stab at it in future, here's the reference code for this :

static final int MAX_LENGTH_IN_BYTES = 5;
...
public static String numbits(long value) {
        char[] result = new char[MAX_LENGTH_IN_BYTES];
        int index = MAX_LENGTH_IN_BYTES - 1;

        while (value != 0 && index >= 0) {
            int lowBits = (int)(value & BASE_128_BITMASK);
            value >>= 7;
            int highBits = (int)(value & BASE_128_BITMASK);
            value >>= 7;

            result[index--] = (char)((highBits << 7) | lowBits);
        }

        return new String(result);
    }

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... Tim told me quamina internally works different. Quamina needs the UTF8 validity because it uses UTF8-illegal byte values as markers.
Is ruler different enough so you don't depend on that? In that case, you could also just cut it into 4 different shorts. Unless that destroys some string operations you need, e.g by producing hanging combiners etc.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For range comparison, Ruler has a lookup table that changes incoming numbers into a sequence of digits https://github.com/aws/event-ruler/blob/main/src/main/software/amazon/event/ruler/Range.java#L128 . Changing this requires more effort and I suspect not as beneficial compared to supporting variable length for numbers. So I chose to work on variable length first and then come back to compressing the numbers later (though I'm hopeful that I'd be able to do both once I get more into refactoring this part of the code more)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quamina needs the UTF8 validity because it uses UTF8-illegal byte values as markers.
Is ruler different enough so you don't depend on that

I don't know the answer for this on the top of mind. Will check on it later but I was originally suspecting it should be fine as long as I address any edge cases within the look-up table part of the code.

@baldawar baldawar merged commit 1406b64 into main Sep 19, 2024
4 checks passed
Copy link

@arnehormann arnehormann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@baldawar/@timbray my 2cts as promised... ping me if you need more details

* 1. It saves 3 bytes of memory per number compared to decimal representation.
* 2. It is lexicographically comparable, which is useful for maintaining sorted order of numbers.
* 2. It aligns with the radix used for IP addresses.
* We use Base128 encoding offers a compact representation of decimal numbers

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/We use/The used/

* other data types. The higher the number, the lower the accuracy that can be maintained. For example,
* {@code 0.30d - 0.10d = 0.19999999999999998} instead of {@code 0.2d}. When extended to {@code 1e10}, the test
* results show that only 5 decimal places of precision can be guaranteed when using doubles.
* The numbers are first parsed as a Java {@code BigDecimal} as there is a well known issue

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you provide a reference or details for that?

Double handling is finicky and has lots of potential foot guns. But I never ran into a loss of precision that's not mediated by decimal-binary conversion issues. If you e.g. use Double.toHexString and the "p" representation of exponents in your numerical tests, you should be able to avoid these issues.
This only applies to where you want to use numbers and not the strings, of course. But at some point, the JSON number strings will be converted to double and you'll run into the same issue.
Unless the JSON parsers also avoid Double. But then, you should use the same code path the JSON parser uses.

You won't lose fidelity. But you'll gain some performance.

Copy link
Collaborator Author

@baldawar baldawar Sep 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed the issue with37.807807921694092 when parsing Doubles from strings (so Double.valueOf(...) and when doing double y = 37.807807921694092. Both return 37.80780792169409, so its probably tied to the decimal-binary conversion problems. I wasn't sure that JSON number strings will have the same issue or not as some systems using favor direct string transformations and bypass JSON parsing altogether.

Copy link

@arnehormann arnehormann Sep 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A double represented as a decimal stops being accurate after 15 digits (unless it ends in 5). this number has 17.
A good way to keep it in mind: 10 bits fit 3 decimal digits (1000 < 1024). a double has 52 mantissa bits, so this buys you a fractional amount more than 15 digits with 50 bits. the other 2 (or 3 if you include the implicit one) are not enough for a 16th digit. So it will... distribute... the rest - a fraction more than 8 alternatives - over 10 candidate digits. As powers of two are binary, a number ending in 5 will be at an advantage, here.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And BTW this is a problem with the citylots file, those floats in there have more precision than is safe with JSON. So it's probably not useful for supporting numeric-match testing.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh wow. I did not realize that. Let me look into changing city lots to have a healthy mix of numbers with less than 15 decimals.

// if high bit is 0, we want to xor with sign bit 1 << 63, else negate (xor with ^0). Meaning,
// bits >= 0, mask = 1000000000000000000000000000000000000000000000000000000000000000
// bits < 0, mask = 1111111111111111111111111111111111111111111111111111111111111111
final long mask = ((bits >>> 63) * 0xFFFFFFFFFFFFFFFFL) | (1L << 63);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Raph Levien proposed an even nicer way on Mastodon (not yet in Quamina afaik).
(bits >> 63) | (1L << 63) is sufficient. For negative numbers, the sign bit will be set. >> is sign extending and will fill all the left bits with the sign bit. No need for another constant.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sweet. Thanks. If you have the link to the tweet (toot?) from Raph, I'd love to link it for attribution.

Copy link

@arnehormann arnehormann Sep 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://mastodon.online/@raph/113071041069390831 ... his was for Go, not Java - but it's the same idea

* @param value the long value to be converted
* @return the Base128 encoded string representation of the input value
*/
public static String numbits(long value) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are Strings-by-chars sorted in Event Ruler? Binary value or with some collation? If it's binary... you can still use the same scheme but store 2 7-bit bytes per char. Might require shuffling the bytes around if the endianness of the char 2-byte-tuples is different. But as long as leaving the 0x8080 bits out stays clear of combiners or possibly causing illegal codepoints... you should be good.

throw new IllegalArgumentException("Value must be between " + -ComparableNumber.HALF_TRILLION +
" and " + ComparableNumber.HALF_TRILLION + ", inclusive");
// make sure we have the comparable numbers and haven't eaten up decimals values
if(Double.isNaN(doubleValue) || Double.isInfinite(doubleValue) ||

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can check both in one go based on the bits-representation. With a constant long NOT_FINITE = 0x7ff0L << 52: (bits & NOT_FINITE) == NOT_FINITE will be true for only NaN and +/- Infinity.

Maybe the JVM automagically already inlines the two calls like that. If not, it could help, here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants