Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce raw fwd index version V5 containing implicit num doc length, improving space efficiency #14105

Merged
merged 104 commits into from
Oct 17, 2024

Conversation

jackluo923
Copy link
Contributor

@jackluo923 jackluo923 commented Sep 27, 2024

The data layout of the multi-value fixed byte raw forward index can be optimized to enhance storage efficiency.

Consider the following multi-value document as an example: [int(1), int(2), int(3)]. The current binary data layout in MVFixedBytesRawFwdIndex is as follows: 0x00000010 0x00000003 0x00000001 0x00000002 0x00000003.

  1. The first 4 bytes 0x00000010 is an integer representing the total payload length of the byte array containing the multi-value document content, which in this case is 16 bytes.
  2. The next 4 bytes 0x00000003 is an integer explicitly representing the number of elements in the multi-value document (i.e., 3).
  3. The remaining 12 bytes 0x00000001 0x00000002 0x00000003 are 3 integers representing the 3 values of the multi-value document: 1, 2, and 3.

In Pinot, the fixed byte raw forward index can only contain one specific fixed-length data type: int, long, float, or double. Rather than explicitly specifying the number of elements for each document using an integer, this value can be omitted and instead inferred implicitly using the following calculation:

number of elements = buffer payload length / size of data type

If the forward index uses the passthrough chunk compression type (i.e., no compression), we can save 4 bytes per document by omitting the explicit element count. This results in the following savings:

  • For documents with 0 elements, we save 50%.
  • For documents with 1 element, we save 33%.
  • For documents with 2 elements, we save 25%.
  • As the number of elements increases, the percentage of space saved decreases accordingly.

For forward indexes that leverage compression to reduce data size, the savings can be even more significant in some scenarios. This PR includes a unit test, VarByteChunkV5Test#validateCompressionRatioIncrease, which demonstrates this. In particular, we used ZStandard as the chunk compressor and inserted 1 million short multi-value (MV) documents, where the length follows a Gaussian distribution. In this experiment, the values of each integer in the MV documents were also somewhat repetitive. Under these conditions, we observed 50%+ reduction in on-disk file size compared to V4 fwd index writer version

This PR introduces the implicit length optimization via VarByteChunkForwardIndexWriterV5 on the write path, alongside VarByteChunkForwardIndexReaderV5 on the read path. MultiValueFixedByteRawIndexCreator and ForwardIndexReaderFactory are also modified accordingly to use the new index writer and reader when the index version is set to 5 or greater. After this PR is merged, other composite forward indexes such as the CLPForwardIndexCreatorV1 forward index can leverage these new classes to significantly improve the overall compression ratio.

@codecov-commenter
Copy link

codecov-commenter commented Sep 27, 2024

Codecov Report

Attention: Patch coverage is 60.00000% with 18 lines in your changes missing coverage. Please review.

Project coverage is 63.76%. Comparing base (59551e4) to head (6fe4517).
Report is 1195 commits behind head on master.

Files with missing lines Patch % Lines
...ders/forward/VarByteChunkForwardIndexReaderV5.java 27.27% 8 Missing ⚠️
.../writer/impl/VarByteChunkForwardIndexWriterV5.java 45.45% 6 Missing ⚠️
...gment/index/forward/ForwardIndexReaderFactory.java 33.33% 1 Missing and 1 partial ⚠️
.../writer/impl/VarByteChunkForwardIndexWriterV4.java 75.00% 1 Missing ⚠️
...ders/forward/VarByteChunkForwardIndexReaderV4.java 80.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #14105      +/-   ##
============================================
+ Coverage     61.75%   63.76%   +2.01%     
- Complexity      207     1535    +1328     
============================================
  Files          2436     2626     +190     
  Lines        133233   144646   +11413     
  Branches      20636    22136    +1500     
============================================
+ Hits          82274    92239    +9965     
- Misses        44911    45596     +685     
- Partials       6048     6811     +763     
Flag Coverage Δ
custom-integration1 100.00% <ø> (+99.99%) ⬆️
integration 100.00% <ø> (+99.99%) ⬆️
integration1 100.00% <ø> (+99.99%) ⬆️
integration2 0.00% <ø> (ø)
java-11 55.38% <20.00%> (-6.33%) ⬇️
java-21 63.66% <60.00%> (+2.03%) ⬆️
skip-bytebuffers-false 63.75% <60.00%> (+2.00%) ⬆️
skip-bytebuffers-true 63.64% <60.00%> (+35.91%) ⬆️
temurin 63.76% <60.00%> (+2.01%) ⬆️
unittests 63.76% <60.00%> (+2.01%) ⬆️
unittests1 55.42% <20.00%> (+8.53%) ⬆️
unittests2 34.33% <60.00%> (+6.60%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@jackluo923 jackluo923 changed the title WIP: Enable toggling of explicit MV entry size for MVFixedBytesRawFwdIndex Introduce MultiValueFixedByteRawIndexCreatorV2 and corresponding forward index reader path, improving space efficiency over MultiValueFixedByteRawIndexCreator Oct 2, 2024
@Jackie-Jiang Jackie-Jiang added enhancement feature release-notes Referenced by PRs that need attention when compiling the next release notes Configuration Config changes (addition/deletion/change in behavior) labels Oct 2, 2024
Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's introduce a new forward index version v5 for this new format

/**
Same as MultiValueFixedByteRawIndexCreator, but without storing the number of elements for each row.
*/
public class MultiValueFixedByteRawIndexCreatorV2 extends MultiValueFixedByteRawIndexCreator {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't want to add a new creator because creator is used to handle creation of different version of forward index. Instead, we want to add a new raw index version v5 for this new format

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

/**
Same as FixedByteChunkMVForwardIndexReader, but the number of elements for each row is inferred
*/
public final class FixedByteChunkMVForwardIndexReaderV2 extends FixedByteChunkMVForwardIndexReader {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's call it V5 to be consistent with index version

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@jackluo923 jackluo923 changed the title Introduce MultiValueFixedByteRawIndexCreatorV2 and corresponding forward index reader path, improving space efficiency over MultiValueFixedByteRawIndexCreator Introduce raw fwd index version V5 containing implicit num doc length, improving space efficiency Oct 8, 2024
@jackluo923 jackluo923 changed the title Introduce raw fwd index version V5 containing implicit num doc length, improving space efficiency WIP: Introduce raw fwd index version V5 containing implicit num doc length, improving space efficiency Oct 8, 2024
Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM



/**
* Forward index writer that extends {@link VarByteChunkForwardIndexWriterV4} with the only difference being the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not the only difference. Let's also document the value format difference

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -81,6 +80,12 @@ public VarByteChunkForwardIndexReaderV4(PinotDataBuffer dataBuffer, FieldSpec.Da
_isSingleValue = isSingleValue;
}

public void validateIndexVersion(PinotDataBuffer dataBuffer) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant we can add a getVersion() method into this class, and override it in v5 reader

Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only a few minor comments

@@ -76,11 +76,13 @@
public class VarByteChunkForwardIndexWriterV4 implements VarByteChunkWriter {
public static final int VERSION = 4;

private static final Logger LOGGER = LoggerFactory.getLogger(VarByteChunkForwardIndexWriterV4.class);
// Use the run-time concrete class to retrieve the logger
protected final Logger _logger = LoggerFactory.getLogger(this.getClass());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
protected final Logger _logger = LoggerFactory.getLogger(this.getClass());
protected final Logger _logger = LoggerFactory.getLogger(getClass());

@Jackie-Jiang Jackie-Jiang merged commit ad37bd8 into apache:master Oct 17, 2024
20 of 21 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Configuration Config changes (addition/deletion/change in behavior) enhancement feature release-notes Referenced by PRs that need attention when compiling the next release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants