Skip to content

Algorithm Description

Sebastian Böthin edited this page May 1, 2015 · 11 revisions

Encoding Format

  • The encoding result of an empty list is an empty list. Assume the list is not empty from now on.
  • The encoded result is presented as a list of integers, but is to be considered as a continuous bit stream. That is, for the purpose of the algorithm description, the integers can be thought as concatenated together.
  • The encoding result consists of one or more chunks of variable length. Every chunk starts with a header part that contains information about the length of the chunk so that one can read a chunk to the end after examining header information.
  • Chunks are independent of each other. That is, each chunk encodes some part of the input list and the result of decoding is obtained by concatenating the result of each chunk.

Chunk Format

A chunk is a bit stream consisting of header bits and a (possibly empty) body part. The chunk header consists of the following fields:

  • length
  • bitsize
  • base
  • first

The chunk body is empty in case of either length = 0 or bitsize = 0. Otherwise it is the result of the Buffered Run-length encoding procedure applied to a stream of commensurate bit-blocks. The length value determines the number of blocks and the bitsize value the size of each block. The stream encodes the differential of a non-empty integer list n[0], ..., n[k], where n[i+1] > n[i] for i = 0 .. length-1, in the following way:

  • n[0] = first
  • n[i] = d[i-1] + n[i-1] + base for 0 < i < k, where d[i] is obtained from the bit-block at the i-th position in the stream.

The base value is set to the smallest difference between two consecutive integers, while bitsize is the number of bits needed to represent each d[i], i.e.:

  • base = min(n[i] - n[i-1])
  • bitsize = ceil(log2(max([n[i]-n[i-1]]) - base)) where log2(0) = 0

Thus, the bit stream encodes a sequence of length+1 strictly increasing integers where bitsize vanishes if the sequence is equidistant, i.e., has constant differences.

Delta Chunk Splitting

The crucial part in the encoding procedure is to find a partition of the given list in a way that the total number of bits needed in all chunks together is minimal. For example, encoding a list like 0,1,2, ..., 999, 1000000 with a single chunk would require a large bitsize 1000 times, while most of the elements are equidistant and could thus be stored with zero bitsize. In this case, an optimal partition is obvious, while the problem is not trivial in general.

While determining an optimal solution would be quite complex and expensive in terms of run time, the Delta Chunk Split algorithm is intended to provide a simple and fast approximation to the optimal solution.

Let cost denote the total number of bits needed to store a chunk, i.e.:

  • cost = cost(header) + length * bitsize

A Delta Point is a position within the given input stream, where the number bits need to store the differences changes, i.e. where log2(max(n[i] - n[i-1])) increases. Only Delta Points come in question for split up chunks at a position. On every Delta Point we do the following:

  • If the bitsize was 0 before, split and start a new chunk.
  • Otherwise:
    • If there was an earlier Delta Point, compare the cost of the current chunk with the sum of the two chunks that would arise if we would split at the earlier Delta Point. If the sum is less then the current cost, split at the earlier Delta Point.
    • Otherwise, remember this Delta Point for the next comparison.

By way of an increment function, that expands an existing chunk with a new difference, we can accomplish the procedure efficiently by maintaining three chunks simultaneously: the current chunk as well as the two chunks that would arise if we would split at the last remembered Delta Point. Thus, the algorithm runs nearly within a single loop, with some re-locations in the case of splitting at a remembered Delta Point.

Note that the algorithm of chunk splitting does not effect the encoding format and thus not the decoding procedure, so further improvements to the algorithm can be implemented without breaking the compatibility with data that was encoded with an earlier version.

Buffered Run-length encoding

Number compression