Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhance json index to support regexp and range predicate evaluation #12568

Merged
merged 6 commits into from
Mar 8, 2024

Conversation

saurabhd336
Copy link
Contributor

@saurabhd336 saurabhd336 commented Mar 6, 2024

Added support for REGEXP_LIKE and RANGE predicate evaluation on a json path's values using json index

Release notes

  1. Find all matching docs where a given json key matches a regex
SELECT * FROM table WHERE JSON_MATCH(json_column, 'REGEXP_LIKE("<jsonPath>", '<regexp>')
  1. Find all matching docs where a given json key satisfies a range predicate
SELECT * FROM table WHERE JSON_MATCH(json_column, '"<jsonPath>" > <value>')

@codecov-commenter
Copy link

codecov-commenter commented Mar 6, 2024

Codecov Report

Attention: Patch coverage is 69.78417% with 42 lines in your changes are missing coverage. Please review.

Project coverage is 61.73%. Comparing base (59551e4) to head (5114f2e).
Report is 79 commits behind head on master.

Files Patch % Lines
...local/realtime/impl/json/MutableJsonIndexImpl.java 74.13% 8 Missing and 7 partials ⚠️
...main/java/org/apache/pinot/spi/data/FieldSpec.java 7.69% 11 Missing and 1 partial ⚠️
...t/index/readers/json/ImmutableJsonIndexReader.java 77.55% 6 Missing and 5 partials ⚠️
...ot/common/request/context/RequestContextUtils.java 81.25% 3 Missing ⚠️
...mmon/request/context/predicate/RangePredicate.java 66.66% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #12568      +/-   ##
============================================
- Coverage     61.75%   61.73%   -0.02%     
- Complexity      207      211       +4     
============================================
  Files          2436     2451      +15     
  Lines        133233   133853     +620     
  Branches      20636    20752     +116     
============================================
+ Hits          82274    82639     +365     
- Misses        44911    45106     +195     
- Partials       6048     6108      +60     
Flag Coverage Δ
custom-integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration <0.01% <0.00%> (-0.01%) ⬇️
integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration2 0.00% <0.00%> (ø)
java-11 61.69% <69.78%> (-0.02%) ⬇️
java-21 61.60% <69.78%> (-0.02%) ⬇️
skip-bytebuffers-false 61.72% <69.78%> (-0.03%) ⬇️
skip-bytebuffers-true 61.58% <69.78%> (+33.85%) ⬆️
temurin 61.73% <69.78%> (-0.02%) ⬇️
unittests 61.73% <69.78%> (-0.02%) ⬇️
unittests1 46.89% <11.51%> (+<0.01%) ⬆️
unittests2 27.68% <58.27%> (-0.05%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@saurabhd336
Copy link
Contributor Author

cc: @itschrispeck and @chenboat for reviewing changes related to mutable json index moving to TreeMap since it changes the implementation of json_extract_index functions. IMO, iterating over all the keys of the mutable json index's map is wasteful, and we can do better by filtering the keys to iterate over by using a TreeMap. The implementation of range and regexp evaluation also benefits from this move.

Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome!


Pattern pattern = ((RegexpLikePredicate) predicate).getPattern();
MutableRoaringBitmap result = null;
char nextChar = BaseJsonIndexCreator.KEY_VALUE_SEPARATOR + 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(minor) We can make this char constant

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

Pattern pattern = ((RegexpLikePredicate) predicate).getPattern();
MutableRoaringBitmap result = null;
char nextChar = BaseJsonIndexCreator.KEY_VALUE_SEPARATOR + 1;
SortedMap<String, RoaringBitmap> subMap =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can extract a method to calculate the subMap.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

@@ -291,6 +297,88 @@ private RoaringBitmap getMatchingFlattenedDocIds(Predicate predicate) {
} else {
return new RoaringBitmap();
}
} else if (predicateType == Predicate.Type.REGEXP_LIKE) {
String ceilingKey = _postingListMap.ceilingKey(key + BaseJsonIndexCreator.KEY_VALUE_SEPARATOR);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(minor) This is actually the lower boundary, consider naming it properly. Currently it is slightly confusing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

@@ -291,6 +297,88 @@ private RoaringBitmap getMatchingFlattenedDocIds(Predicate predicate) {
} else {
return new RoaringBitmap();
}
} else if (predicateType == Predicate.Type.REGEXP_LIKE) {
String ceilingKey = _postingListMap.ceilingKey(key + BaseJsonIndexCreator.KEY_VALUE_SEPARATOR);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we post key along? If so, we can directly try to find the key, and when calculating the subMap, exclude the lower bound

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry didn't follow this. Can you elaborate?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was suggesting to handle it similar to the immutable one, where we lookup the key from the map (_postingListMap.get(key)), if it returns null it means the key does not exist; if it exists, we can use key as lower bound to calculate the subMap, and set the lower bound to be exclusive

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

continue;
}
if (result == null) {
result = entry.getValue().toMutableRoaringBitmap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest not converting it to MutableRoaringBitmap. RoaringBitmap itself is mutable, and is faster

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

MutableRoaringBitmap result = null;
char nextChar = BaseJsonIndexCreator.KEY_VALUE_SEPARATOR + 1;
SortedMap<String, RoaringBitmap> subMap =
_postingListMap.subMap(ceilingKey, true, _postingListMap.floorKey(key + nextChar), true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to calculate floor here? Can we directly put key + "\u0001" and make higher bound exclusive? This way we don't need to worry about key ending with "\u0001"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

key + "\u0001" doesn't exist in the map correct? The idea here is to find the key just below key + "\u0001" such that it's the that last entry for the key we're searchig.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When calling subMap(), does it require the lower and upper bound exist in the map? If not, we can simply use key as lower bound, key + '\u0001' as upper bound, exclusive on both side to get the sub-map

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. subMap does not require the key to be present

@@ -67,15 +69,17 @@ public RangePredicate(ExpressionContext lhs, String range) {
int upperLength = upper.length();
_upperInclusive = upper.charAt(upperLength - 1) == UPPER_INCLUSIVE;
_upperBound = upper.substring(0, upperLength - 1);
_rangeDataType = null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider making this UNKNOWN? We should also put some comments to prevent people from using the data type if the predicate is constructed this way (which is the case most of the time).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

_postingListMap.subMap(ceilingKey, true, _postingListMap.floorKey(key + nextChar), true);

RangePredicate rangePredicate = (RangePredicate) predicate;
FieldSpec.DataType rangeDataType = rangePredicate.getRangeDataType();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In JSON, only number (double) and string are supported. We can probably simplify the handling to only differentiate double comparison and string comparison. This can prevent the corner case where value is double (e.g. 1.5) in JSON, but being converted to integer when doing predicate key > 1.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. I've simplified to differentiate b/w numeric and string types

@itschrispeck
Copy link
Collaborator

IMO, iterating over all the keys of the mutable json index's map is wasteful, and we can do better by filtering the keys to iterate over by using a TreeMap.

+1 we thought about this but weren't sure how much insert/exact match performance would be impacted. I am happy to see this change, TreeMap fits our typical use cases much better!

Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with minor comments

case JSON:
return StringUtils.compare((String) value1, (String) value2);
case BYTES:
return new ByteArray((byte[]) value1).compareTo(new ByteArray((byte[]) value2));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return new ByteArray((byte[]) value1).compareTo(new ByteArray((byte[]) value2));
return ByteArray.compare((byte[]) value1, (byte[]) value2);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

return Long.compare((long) value1, (long) value2);
case STRING:
case JSON:
return StringUtils.compare((String) value1, (String) value2);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return StringUtils.compare((String) value1, (String) value2);
return ((String) value1).compareTo((String) value2);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

Comment on lines 428 to 429
return _postingListMap.subMap(key + BaseJsonIndexCreator.KEY_VALUE_SEPARATOR, false,
key + BaseJsonIndexCreator.KEY_VALUE_SEPARATOR_NEXT_CHAR, false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return _postingListMap.subMap(key + BaseJsonIndexCreator.KEY_VALUE_SEPARATOR, false,
key + BaseJsonIndexCreator.KEY_VALUE_SEPARATOR_NEXT_CHAR, false);
return _postingListMap.subMap(key + JsonIndexCreator.KEY_VALUE_SEPARATOR, false,
key + JsonIndexCreator.KEY_VALUE_SEPARATOR_NEXT_CHAR, false);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also modify ImmutableJsonIndexReader to use this constant

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants