Updated javadoc for VarByteChunkForwardIndexWriterV5
jackluo923 committed Oct 16, 2024
1 parent cfbc9ee commit 0da0ca7
Showing 1 changed file with 46 additions and 6 deletions.
@@ -26,14 +26,54 @@


/**
- * Forward index writer that extends {@link VarByteChunkForwardIndexWriterV4} with the only difference being the
- * version tag is now bumped from 4 to 5.
* Forward index writer that extends {@link VarByteChunkForwardIndexWriterV4} and overrides the data layout for
* multi-value fixed byte operations to improve space efficiency.
*
- * <p>The {@code VERSION} tag is a {@code static final} class variable set to {@code 5}. Since static variables
- * are shadowed in the child class thus associated with the class that defines them, care must be taken to ensure
- * that the parent class can correctly observe the child class's {@code VERSION} value at runtime.</p>
* <p>Consider the following multi-value document as an example: {@code [int(1), int(2), int(3)]}.
* The current binary data layout in {@code VarByteChunkForwardIndexWriterV4} is as follows:</p>
* <pre>
* 0x00000010 0x00000003 0x00000001 0x00000002 0x00000003
* </pre>
*
- * <p>To achieve this, the {@code getVersion()} method is overridden to return the concrete subclass's
* <ol>
* <li>The first 4 bytes ({@code 0x00000010}) represent the total payload length of the byte array
* containing the multi-value document content, which in this case is 16 bytes.</li>
*
* <li>The next 4 bytes ({@code 0x00000003}) represent the number of elements in the multi-value document
* (i.e., 3).</li>
*
* <li>The remaining 12 bytes ({@code 0x00000001 0x00000002 0x00000003}) represent the 3 integer values of the
* multi-value document: 1, 2, and 3.</li>
* </ol>
*
* <p>In Pinot, a fixed byte raw forward index stores values of exactly one fixed-length data type:
* {@code int}, {@code long}, {@code float}, or {@code double}. The number of elements in each multi-value
* document therefore does not need to be stored explicitly; it can be inferred as:</p>
* <pre>
* number of elements = buffer payload length / size of data type
* </pre>
*
* <p>If the forward index uses the passthrough chunk compression type (i.e., no compression), we save
* 4 bytes per document by omitting the explicit element count. For 4-byte element types ({@code int} or
* {@code float}), and counting the 4-byte length prefix, this leads to the following space savings:</p>
*
* <ul>
* <li>For documents with 0 elements, we save 50%.</li>
* <li>For documents with 1 element, we save 33%.</li>
* <li>For documents with 2 elements, we save 25%.</li>
* <li>As the number of elements increases, the percentage of space saved decreases.</li>
* </ul>
*
* <p>For forward indexes that use compression to reduce data size, the savings can be even more significant
* in certain cases. This is demonstrated in the unit test {@link VarByteChunkV5Test#validateCompressionRatioIncrease},
* where ZStandard was used as the chunk compressor. In the test, 1 million short multi-value (MV) documents
* were inserted, with document lengths following a Gaussian distribution and somewhat repetitive integer
* values within each document. Under these conditions, we observed a reduction of more than 50% in on-disk
* file size compared to the V4 forward index writer version.</p>
*
* <p>Note that the {@code VERSION} tag is a {@code static final} class variable set to {@code 5}. A static field
* redeclared in a child class hides the parent's field rather than overriding it, and each field stays associated
* with the class that defines it, so care must be taken to ensure that the parent class can correctly observe the
* child class's {@code VERSION} value at runtime. To handle this, the {@code getVersion()} method is overridden to
* return the concrete subclass's
* {@code VERSION} value, ensuring that the correct version number is returned even when using a reference
* to the parent class.</p>
*
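The layout change described in the Javadoc can be illustrated with a small standalone sketch. This is not Pinot's writer code: the class and method names below are hypothetical, it assumes 4-byte `int` values and big-endian `ByteBuffer` encoding (matching the hex example above), and it assumes the V5 layout simply drops the element-count word and keeps everything else.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration only; not part of Pinot.
final class MvLayoutSketch {

  // V4-style layout: [payload length][element count][values...]
  static byte[] encodeV4Style(int[] values) {
    int payloadLength = Integer.BYTES + values.length * Integer.BYTES;
    ByteBuffer buffer = ByteBuffer.allocate(Integer.BYTES + payloadLength);
    buffer.putInt(payloadLength);
    buffer.putInt(values.length);
    for (int value : values) {
      buffer.putInt(value);
    }
    return buffer.array();
  }

  // Assumed V5-style layout: [payload length][values...], no explicit count
  static byte[] encodeV5Style(int[] values) {
    int payloadLength = values.length * Integer.BYTES;
    ByteBuffer buffer = ByteBuffer.allocate(Integer.BYTES + payloadLength);
    buffer.putInt(payloadLength);
    for (int value : values) {
      buffer.putInt(value);
    }
    return buffer.array();
  }

  // number of elements = buffer payload length / size of data type
  static int[] decodeV5Style(byte[] bytes) {
    ByteBuffer buffer = ByteBuffer.wrap(bytes);
    int payloadLength = buffer.getInt();
    int numElements = payloadLength / Integer.BYTES;
    int[] values = new int[numElements];
    for (int i = 0; i < numElements; i++) {
      values[i] = buffer.getInt();
    }
    return values;
  }

  // Fraction of bytes saved per uncompressed document for 4-byte elements,
  // counting the 4-byte length prefix in both layouts.
  static double savedFraction(int numElements) {
    double v4Size = 2.0 * Integer.BYTES + (double) numElements * Integer.BYTES;
    return Integer.BYTES / v4Size;
  }

  public static void main(String[] args) {
    int[] doc = {1, 2, 3};
    System.out.println("V4-style bytes: " + encodeV4Style(doc).length);
    System.out.println("V5-style bytes: " + encodeV5Style(doc).length);
    System.out.println("Saved for 1 element: " + savedFraction(1));
  }
}
```

For the document `[1, 2, 3]`, the V4-style encoding occupies 20 bytes and the V5-style encoding 16 bytes; `savedFraction` gives 0.5, about 0.33, and 0.25 for 0, 1, and 2 elements, matching the percentages listed in the Javadoc.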
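The `VERSION` note in the Javadoc relies on a general Java rule: static fields are resolved at compile time against the declaring class, while instance methods are dispatched at runtime. The sketch below shows that rule with hypothetical stand-in classes (`WriterV4Sketch` and `WriterV5Sketch` are not the Pinot classes, and `versionViaField`/`versionViaMethod` are invented for illustration).

```java
// Hypothetical stand-ins for the parent and child writers; not Pinot code.
class WriterV4Sketch {
  public static final int VERSION = 4;

  // Overridable accessor: runtime dispatch picks the subclass implementation.
  public int getVersion() {
    return VERSION;
  }

  // Reads the static field directly; this always refers to WriterV4Sketch.VERSION.
  public int versionViaField() {
    return VERSION;
  }

  // Goes through the accessor, so it observes the concrete subclass's value.
  public int versionViaMethod() {
    return getVersion();
  }
}

class WriterV5Sketch extends WriterV4Sketch {
  // Hides (does not override) WriterV4Sketch.VERSION.
  public static final int VERSION = 5;

  @Override
  public int getVersion() {
    return VERSION;
  }
}
```

With `WriterV4Sketch writer = new WriterV5Sketch();`, `writer.versionViaField()` returns 4 while `writer.versionViaMethod()` and `writer.getVersion()` return 5, which is why parent-class code that writes the version tag should call the accessor rather than read the field.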
