-
Notifications
You must be signed in to change notification settings - Fork 0
Algorithm Description
- The encoding result of an empty list is an empty list. Assume the list is not empty from now on.
- The encoded result is presented as a list of integers, but is to be considered as a continuous bit stream. That is, for the purpose of the algorithm description, the integers can be thought as concatenated together.
- The encoding result consists of one or more chunks of variable length. Every chunk starts with a header part that contains information about the length of the chunk so that one can read a chunk to the end after examining header information.
- Chunks are independent of each other. That is, each chunk encodes some part of the input list and the result of decoding is obtained by concatenating the result of each chunk.
A chunk is a bit stream consisting of header bits and a (possibly empty) body part. The chunk header consists of the following fields:
length
bitsize
base
first
The chunk body is empty in case of either length = 0
or bitsize = 0
. Otherwise it is the result of the Buffered Run-length encoding procedure applied to a stream of commensurate bit-blocks. The length
value determines the number of blocks and the bitsize
value the size of each block. The stream encodes the differential of a non-empty integer list n[0], ..., n[k]
, where n[i+1] > n[i]
for i = 0 .. length-1
, in the following way:
n[0] = first
-
n[i] = d[i-1] + n[i-1] + base
for0 < i < k
, whered[i]
is obtained from the bit-block at thei
-th position in the stream.
The base
value is set to the smallest difference between two consecutive integers, while bitsize
is the number of bits needed to represent each d[i]
, i.e.:
base = min(n[i] - n[i-1])
-
bitsize = ceil(log2(max([n[i]-n[i-1]]) - base))
wherelog2(0) = 0
Thus, the bit stream encodes a sequence of length+1
strictly increasing integers where bitsize
vanishes if the sequence is equidistant, i.e., has constant differences.
The crucial part in the encoding procedure is to find a partition of the given list in a way that the total number of bits needed in all chunks together is minimal. For example, encoding a list like 0,1,2, ..., 999, 1000000 with a single chunk would require a large bitsize 1000 times, while most of the elements are equidistant and could thus be stored with zero bitsize. In this case, an optimal partition is obvious, while the problem is not trivial in general.
While determining an optimal solution would be quite complex and expensive in terms of run time, the Delta Chunk Split algorithm is intended to provide a simple and fast approximation to the optimal solution.
Let cost
denote the total number of bits needed to store a chunk, i.e.:
cost = cost(header) + length * bitsize
A Delta Point is a position within the given input stream, where the number bits need to store the differences changes, i.e. where log2(max(n[i] - n[i-1]))
increases. Only Delta Points come in question for split up chunks at a position. On every Delta Point we do the following:
- If the
bitsize
was 0 before, split and start a new chunk. - Otherwise:
- If there was an earlier Delta Point, compare the cost of the current chunk with the sum of the two chunks that would arise if we would split at the earlier Delta Point. If the sum is less then the current cost, split at the earlier Delta Point.
- Otherwise, remember this Delta Point for the next comparison.
By way of an increment function, that expands an existing chunk with a new difference, we can accomplish the procedure efficiently by maintaining three chunks simultaneously: the current chunk as well as the two chunks that would arise if we would split at the last remembered Delta Point. Thus, the algorithm runs nearly within a single loop, with some re-locations in the case of splitting at a remembered Delta Point.
Note that the algorithm of chunk splitting does not effect the encoding format and thus not the decoding procedure, so further improvements to the algorithm can be implemented without breaking the compatibility with data that was encoded with an earlier version.
IntZip - Fast compression of integer sets. Copyright 2015 Sebastian Böthin