HDBSCAN (Hierarchical Density-Based Clustering)


Setting this up

mkdir build && cd build
cmake -DCMAKE_CXX_FLAGS="-march=native" -G Ninja ..
ninja        # building binaries
ninja check  # running tests

CMake arguments

Note that CMake caches all of these arguments: once set, an option keeps its value across later CMake runs until you explicitly reset it (e.g. to 0).

General arguments

  • -DHDBSCAN_VERBOSE: enable or disable (default) verbose mode by setting to 1 or 0; do not enable when benchmarking!
  • -DHDBSCAN_DATA_DIR: absolute path to folder with generated input files. Defaults to /data
  • -DHDBSCAN_INSTRUMENT: enable or disable (default) cost analysis instrumentation. Do not enable when benchmarking as it adds extra operations!
  • -DCMAKE_CXX_COMPILER: choose your compiler (e.g. g++, clang++). Defaults to g++
  • -DOPT_LEVEL: choose your compiler optimization level (e.g. O0, O1). Defaults to O3
  • -DBENCHMARK_AMD: enable when benchmarking on an AMD system (correctness not yet fully verified). See the comments in benchmark_util.h
  • -DFINE_GRAINED_BENCH: enable fine-grained benchmarking of the distance computation. When enabled, run the regular main binaries instead of the regular benchmark suite.

Arguments to determine algorithm version

  • -DHDBSCAN_PRECOMPUTE_DIST_TRIANG: enable or disable the upper-triangular version of the precomputed pairwise distance matrix. If set to 0, hdbscan_basic uses the full distance matrix of size n*n. If set to 1, it uses a triangular matrix laid out linearly in memory with n*(n+1)/2 entries.
  • -DHDBSCAN_QUICKSELECT: use vectorized quickselect instead of get_kth_neighbor (only affects vectorized version)
  • -DSPECIALIZED_DISTANCE: enable distance calculations specialized for specific dimensions (2 and 4)
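
To illustrate the triangular layout that -DHDBSCAN_PRECOMPUTE_DIST_TRIANG refers to, here is a minimal sketch of one possible index mapping. It assumes a row-major upper triangle that includes the diagonal (which matches the n*(n+1)/2 size) and Euclidean distances; the names tri_size, tri_index and pairwise_distances are hypothetical and not taken from the project, whose actual layout may differ.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Number of stored entries for n points: upper triangle including the diagonal.
std::size_t tri_size(std::size_t n) { return n * (n + 1) / 2; }

// Map (i, j) to a linear offset in an assumed row-major upper-triangular layout.
// Row i starts after rows 0..i-1, which hold n, n-1, ..., n-i+1 entries,
// i.e. at offset i * (2n - i + 1) / 2.
std::size_t tri_index(std::size_t i, std::size_t j, std::size_t n) {
    if (i > j) std::swap(i, j);  // distances are symmetric, so store each pair once
    assert(j < n);
    return i * (2 * n - i + 1) / 2 + (j - i);
}

// Fill the triangular matrix with pairwise Euclidean distances.
// Diagonal entries stay at 0.0.
std::vector<double> pairwise_distances(const std::vector<std::vector<double>>& pts) {
    const std::size_t n = pts.size();
    std::vector<double> dist(tri_size(n), 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = i + 1; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t d = 0; d < pts[i].size(); ++d) {
                const double diff = pts[i][d] - pts[j][d];
                sum += diff * diff;
            }
            dist[tri_index(i, j, n)] = std::sqrt(sum);
        }
    }
    return dist;
}
```

A lookup such as dist[tri_index(j, i, n)] works for either argument order, at the cost of one comparison per access compared to the full n*n matrix.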

Print CMake arguments at runtime

Including the header file config.h (generated by CMake) exposes CMake variables as preprocessor macros, so the build configuration can be printed at runtime (see main.cpp). To expose further CMake variables, add them to config.h.in using #cmakedefine ....
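
As a rough sketch of how such macros can be reported at runtime, the snippet below builds a summary string using #ifdef guards. The three #define lines are hypothetical stand-ins for what a generated config.h might contain (assuming the variables are exported with values); in the project they would come from including config.h instead. The function name describe_config is also an assumption and is not taken from main.cpp.

```cpp
#include <sstream>
#include <string>

// Hypothetical stand-ins for the contents of a generated config.h.
// In the real project these would come from `#include "config.h"`.
#define HDBSCAN_VERBOSE 1
#define HDBSCAN_DATA_DIR "/data"
/* #undef HDBSCAN_INSTRUMENT */  // what CMake emits for an option that is off

// Collect the build configuration into a printable string.
std::string describe_config() {
    std::ostringstream out;
#ifdef HDBSCAN_VERBOSE
    out << "HDBSCAN_VERBOSE=" << HDBSCAN_VERBOSE << '\n';
#else
    out << "HDBSCAN_VERBOSE=<unset>\n";
#endif
#ifdef HDBSCAN_DATA_DIR
    out << "HDBSCAN_DATA_DIR=" << HDBSCAN_DATA_DIR << '\n';
#else
    out << "HDBSCAN_DATA_DIR=<unset>\n";
#endif
#ifdef HDBSCAN_INSTRUMENT
    out << "HDBSCAN_INSTRUMENT=" << HDBSCAN_INSTRUMENT << '\n';
#else
    out << "HDBSCAN_INSTRUMENT=<unset>\n";
#endif
    return out.str();
}
```

Writing the result to std::cout at the start of main records which options a binary was built with, which helps when comparing benchmark runs from different builds.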

Ninja targets

  • all: builds hdbscan and hdbscan_vec (default)
  • benchmark: build and run non-vectorized benchmark
  • benchmark_vec: build and run vectorized benchmark
  • build_bench: build only non-vectorized benchmark but do not run it
  • build_bench_vec: build only vectorized benchmark but do not run it
  • check: build and run unit tests

Benchmarking

You might need to install the perf tool to run the benchmarks. Benchmarking is currently set up for the Intel Skylake architecture and AMD processor family 17h. If you use a different CPU, adjust the event and umask values in benchmark_util.h. To run the benchmark with the current build: ninja benchmark

Note! If all flop counts stay at zero, your user may lack permission to read kernel PMU events. To fix this, try:

sudo sysctl -w kernel.perf_event_paranoid=-1

This gives the user access to (almost) all events. Reference here.

Scripts

We have prepared the following scripts to run multiple benchmarks and plot the performance measurements:

  • benchmarks.sh is the entry point for running a specific comparison or all experiments. It should generate data, build the project with different flags, execute the benchmarks and save the plotted results (i.e. it calls all scripts outlined below).
    • Usage: ./benchmarks.sh [intel|amd] [baseline_flags|...|all] (add new comparisons here).
  • run_perf_measurements.sh is called by benchmarks.sh. It executes a specific binary with different CSV inputs.
    • Usage: run_perf_measurements.sh <benchm_name> <binary_name> <filebase> <n>.
  • helper_scripts/generate_clusters.py generates data and stores it as data/perf_data_d<x>_<n>.csv, for example perf_data_d2_7.csv, where d is the number of feature dimensions. The input sizes are roughly evenly spaced from 32 to 14436.
    • Usage: generate_clusters.py <relative_data_folder_path> <n_clusters> <d_dimensions>
  • helper_scripts/plot_performance_alt.py creates a performance plot that closely follows the lecture guidelines on benchmarking.
    • Arguments:
      • --system [intel|amd]
      • --data-path <data/timings/>
      • --files <x1.csv> ... <xn.csv>
      • --save-path <file>
      • --metric ['fp/c', 'cycles', 'time']
      • --x-scale ['log', 'linear']
  • Other useful scripts, e.g. plotting clusters: See helper_scripts/.

Please see the outline and TODOs in benchmarks.sh.

References

R. J. G. B. Campello, D. Moulavi, A. Zimek, and J. Sander, “Hierarchical density estimates for data clustering, visualization, and outlier detection”

L. McInnes and J. Healy, “Accelerated hierarchical density based clustering”

https://github.com/ojmakhura/hdbscan

https://github.com/rohanmohapatra/hdbscan-cpp

Team 35

© Tobia Claglüna, Martin Erhart, Alexander Hägele, Tom Lausberg