A C++ implementation of sliced indexes [3,4], that can be used to compress bitmaps and inverted lists.
Also refer to the CSUR paper [5] for further experiments and comparisons (Section 6).
This guide is meant to provide a brief overview of the library and to illustrate its functionalities through some examples.
- Compiling the code
- Quick Start
- Building a collection of sequences
- Operations
- Testing
- Tools
- An example microbenchmark
- Authors
- References
The code is tested on Linux with gcc
7.3.0 and on Mac 10.14 with clang
10.0.0.
To build the code, CMake
is required.
The code has few external dependencies (for testing, serialization and memory-mapping facilities), so clone the repository with
git clone --recursive https://github.com/jermp/s_indexes.git
If you have cloned the repository without --recursive
, you will need to perform the following commands before
compiling:
git submodule init
git submodule update
To compile the code for a release environment (see file CMakeLists.txt
for the used compilation flags), it is sufficient to do the following:
mkdir build
cd build
cmake ..
make
Hint: Use make -j4
to compile the library in parallel using, e.g., 4 jobs.
For a testing environment, use the following instead:
mkdir debug_build
cd debug_build
cmake .. -DCMAKE_BUILD_TYPE=Debug -DUSE_SANITIZERS=On
make
For a quick start, see the source file src/example.cpp
.
After compilation, run this example with
./example < ../data/test_sequence
which will:
- read from the standard input using a test
sequence from the directory
data
; - build the data structure in memory and perform some operations (decode and select).
By specifying an output file name, it is possible to serialize the data structure on disk. To perform the operations, the data structure is then memory mapped from such file. To do so, type
./example -o out.bin < ../data/test_sequence
#include <iostream>
#include "../external/essentials/include/essentials.hpp"
#include "builder.hpp"
#include "s_sequence.hpp"
#include "select.hpp"
#include "decode.hpp"
using namespace sliced;
int main(int argc, char** argv) {
int mandatory = 1;
char const* output_filename = nullptr;
for (int i = mandatory; i != argc; ++i) {
if (std::string(argv[i]) == "-o") {
++i;
output_filename = argv[i];
} else if (std::string(argv[i]) == "-h") {
std::cout << argv[0] << " -o output_filename < input" << std::endl;
return 1;
} else {
std::cout << "unknown option '" << argv[i] << "'" << std::endl;
return 1;
}
}
std::vector<uint32_t> input;
{ // read input from std::in
uint32_t n, x;
std::cin >> n;
input.reserve(n);
for (uint32_t i = 0; i != n; ++i) {
std::cin >> x;
input.push_back(x);
}
}
// build the sequence and print statistics
s_sequence::builder builder;
auto stats = builder.build(input.data(), input.size());
stats.print();
mm::file_source<uint8_t> mm_file;
uint8_t const* data = nullptr;
if (output_filename) { // if an output file is specified, then serialize
essentials::print_size(builder);
essentials::save<s_sequence::builder>(builder, output_filename);
// mmap
int advice = mm::advice::normal; // can be also random and sequential
mm_file.open(output_filename, advice);
// skip first 8 bytes storing the number of written bytes
data = mm_file.data() + 8;
} else { // otherwise work directly in memory
data = builder.data();
}
// initialize a s_sequence from data, regardless the source
s_sequence ss(data);
uint32_t size = ss.size();
// decode whole list to an output buffer
std::vector<uint32_t> out(size);
ss.decode(out.data());
// check written values
uint32_t value = 0;
for (uint32_t i = 0; i != size; ++i) {
if (input[i] != out[i]) {
std::cout << "got " << out[i] << " but expected " << input[i]
<< std::endl;
return 1;
}
ss.select(i, value); // select i-th element
if (value != out[i]) {
std::cout << "got " << value << " but expected " << out[i]
<< std::endl;
return 1;
}
}
return 0;
}
Typically, we want to build all the sequences from
a collection.
In this case, we assume that the input collection
is a binary file with all the sequences being written
as 32-bit integers, as popular for also other libraries
such as ds2i
.
In particular, each sequence is prefixed by an additional
32-bit integer representing the size of the sequence.
The collection file starts with a singleton sequence
containing the universe of representation of the sequences, i.e., the maximum representable value.
For example, an test input collection with 100 sequences drawn from a universe of size 1,000,000 can be generated with
./gen_clustered_data 100 1000000 test_collection --binary
To build an index from such collection, then use
./build test_collection --density 0.01 --out test_collection.out
with a density threshold of 0.01 and an output file
test_collection.out
onto which the data structure is serialized.
You should get an output like:
universe size: 1000000
processed 100 sequences, 45911859 integers
chunks: 1572
full chunks: 466 (66.5183% of ints)
empty chunks: 310 (19.7201% of chunks)
dense chunks: 513 (30.3916% of ints)
sparse chunks: 283 (3.09016% of ints)
blocks: 23395
empty blocks: 14 (0.0598418% of blocks)
dense blocks: 7614 (2.53826% of ints)
sparse blocks: 15767 (0.551905% of ints)
0.00179405 [bpi] for chunks' headers
0.00540078 [bpi] for blocks' headers
0.732272 [bpi] for dense chunks
0.0424549 [bpi] for dense blocks
0.0468998 [bpi] for sparse blocks
total bytes: 4757416
total bpi: 0.828965
from which you can see some statistics about the built data structure.
Given a single sliced sequence, it is possible to execute the
following operations (see also include/s_sequence.hpp
):
/* decode the sequence to the output buffer */
size_t decode(uint32_t* out) const;
/* convert the sequence to an output bitmap */
size_t uncompress(uint64_t* out) const;
/* select the i-th value */
bool select(uint32_t i, uint32_t& value) const;
/* check if value is present in the sequence */
bool contains(uint32_t value) const;
/* returns the minimum value that is >= lower_bound
if found, otherwise a "not found" value is returned */
uint32_t next_geq(uint32_t lower_bound) const;
Given a collection of (at least 2) sliced sequences, it is possible to perform intersection and merging of the sequences:
/* writes the result of the intersection between l and s to the output buffer,
returning the size of the result */
size_t pairwise_intersection(s_sequence const& l, s_sequence const& r, uint32_t* out);
/* writes the result of the union between l and s to the output buffer,
returning the size of the result */
size_t pairwise_union(s_sequence const& l, s_sequence const& r, uint32_t* out);
/* writes the result of the intersection between the
sequences to the output buffer, returning the size of the result */
size_t intersection(std::vector<s_sequence>& sequences, uint32_t* out);
/* writes the result of the union between the
sequences to the output buffer, returning the size of the result */
size_t union_many(std::vector<s_sequence>& sequences, uint32_t* out);
The source src
folder contains programs to benchmark such operations.
Use:
./decode test_collection.out
to decode all the sequences in the collection. You should get something like:
decoded 100 sequences
decoded 45911859 integers
Elapsed time: 0.034721 [sec]
Mean per sequence: 347.21 [musec]
Mean per integer: 0.756253 [ns]
To execute some intersection operations, first generate some queries with
./gen_random_pairwise_queries 1000 100 > test_pairwise_queries
and then run
./intersect test_collection.out 1000 < test_pairwise_queries
You should get something like:
performing 1000 pairwise-intersections...
Mean per run: 136562 [musec]
Mean per query: 136.562 [musec]
The subfolder test
contains testing programs to maintain
the correctness of the implementation.
To run a test, just run the corresponding program without argument to see the required ones.
For example, to test decoding correctness, use
./test_decode test_collection.out ../data/test_collection 0.01
which will check every decoded integer against the original input collection (note that you must provide the correct original input collection as well as the density level it was used during building).
The subfolder tools
contains some programs generating
synthetic data to test the code.
For example, the sequence data/test_sequence
was generated with
./gen_clustered_data 1 1000000 test_sequence
A test collection can be generated with
./gen_clustered_data 100 1000000 test_collection --binary
A test query log can be generated with
./gen_random_pairwise_queries 1000 100 > test_pairwise_queries
In the following microbenchmark we show the number of bits per integer (bpi) and average microseconds per list intersection query with 2 sequences.
We compare Slicing with Roaring [1] and Partitioned Elias-Fano [2].
We use the datasets Census-Income, Census-1881, Weather and Wikileaks shipped with the CRoaring Library (see directory benchmarks/realdata
).
See [1] for a description of such datasets.
To measure the bpi rate, we serialize the data structures and take the written number of bytes. To measure query timings, we compute 1,000 intersections between random pairs of lists for 10 times and report the average.
The benchmark was executed on a Linux 4.4.0 server machine with
an Intel i7-7700 CPU (@3.6 GHz) and 64 GB of RAM.
The code was compiled with gcc 7.3.0 with all optimizations
(see also CMakeLists.txt
).
Dataset | Roaring | Slicing | PEF |
---|---|---|---|
Census-Income | 2.74 | 2.23 | 2.03 |
Census-1881 | 15.93 | 10.83 | 7.28 |
Weather | 5.43 | 4.05 | 3.13 |
Wikileaks | 16.30 | 10.18 | 8.87 |
Dataset | Roaring | Slicing | PEF |
---|---|---|---|
Census-Income | 4.68 | 11.56 | 115.17 |
Census-1881 | 0.15 | 0.18 | 0.92 |
Weather | 13.37 | 25.70 | 213.00 |
Wikileaks | 0.86 | 0.47 | 2.51 |
- [1] Daniel Lemire, Owen Kaser, Nathan Kurz, Luca Deri, Chris O’Hara, François Saint-Jacques, and Gregory Ssi-Yan-Kai. 2018. Roaring bitmaps: Implementation of an optimized software library. Software: Practice and Experience 48, 4, 867–895.
- [2] Giuseppe Ottaviano and Rossano Venturini. Partitioned Elias-Fano Indexes. 2014. In Proceedings of the 37th International Conference on Research and Development in Information Retrieval. 273–282.
- [3] Giulio Ermanno Pibiri. Fast and Compact Set Intersection through Recursive Universe Partitioning. 2021. IEEE Data Compression Conference (DCC).
- [4] Giulio Ermanno Pibiri. On Slicing Sorted Integer Sequences. 2019. arXiv preprint. https://arxiv.org/abs/1907.01032
- [5] Giulio Ermanno Pibiri and Rossano Venturini. Techniques for Inverted Index Compression. 2020. ACM Computing Surveys (CSUR). https://arxiv.org/abs/1908.10598v2