Skip to content

This repository contains implementations of the Count-Min and Count-Median sketches. Furthermore it contains 6 different implementations of approximate heavy hitters algorithms.

License

Notifications You must be signed in to change notification settings

mortzdk/heavy-hitters

Repository files navigation

Estimating Frequencies and finding Heavy Hitters

In this project...

Main Papers

List of papers, authors etc.

How to use

The

Dependencies

All code in this library is dependent on some sort of linux distibution. It has been run and tested the newest versions of Arch Linux and Ubuntu.

To be able to run the unit tests for all implementations the Criterion unit testing library (https://github.com/Snaipe/Criterion) must be downloaded and installed.

Furthermore to run some of the bash scripts, the terminal calculator bc must be installed. This is due to the fact that bash cannot handle floating values.

Modules

To run the benchmark scripts the module libmeasure must be installed and compiled. This can be done the following way:

git submodule init
git submodule update --remote
cd modules/libmeasure
make clean && make
cd ../..

Or simply by checking that the modules/libmeasure folder contains the libmeasure source code and the shared object file libmeasure.so.

Note that libmeasure dependent on the PAPI library (http://icl.cs.utk.edu/papi/), which is used to measure low level stuff, such as L1/L2 cache misses/accesses, branch mispredictions, cycles, instructions, running times, TLB misses/accesses and plenty more.

External Contribution

To generate data according to some Zipfian distribution fast, the Alias Method more or less as implemented in the R library is used. Moreover to generate uniform random numbers between [0;1) throughout the project, the unif_rand has likewise been used.

To be able to compare with existing implementations, some of the implementations was made to fit our heavy hitter abstraction, which enabled testing of the implementation on the same terms as our own implementations. The expectations and results of the authors of those implementations can be found in the paper: Finding Frequent Items in Data Streams

To compute the median in the Count-Median Sketch, the two implementations that had been found the fastest among a lot is used.

Finally to generate the absolute errors of the sketches when the δ-percent of the badest estimates are removed, a quicksort is performed to determine the point to cut off.

Unit Tests

To run the unit tests simply type:

make clean && make && make test

This will run all unit tests in this project. To run a specific set of unittests type:

make clean && make && make run_test_{TESTNAME}

where the {TESTNAME} is the name of a specific test set.

About

This repository contains implementations of the Count-Min and Count-Median sketches. Furthermore it contains 6 different implementations of approximate heavy hitters algorithms.

Resources

License

Stars

Watchers

Forks

Packages

No packages published