Skip to content

Small tool for aggregating and grouping data. Focused on simplicity, speed and memory efficiency. Written in C++. Created as offline interview task.

License

Notifications You must be signed in to change notification settings

przemek83/data-explorer

Repository files navigation

Build & test CodeQL codecov

Quality Gate Status Bugs Code Smells Coverage Duplicated Lines (%)

About

Data Explorer is a tool for performing various operations on datasets. It supports operations like aggregation (average, minimum, and maximum) and grouping, providing a simple interface for querying data.

Building:

Clone and use CMake with GCC/Clang/MSVC to compile the project and tests from an IDE or command line. CMake should:

  • configure everything automatically,
  • download and build dependency,
  • compile and create binaries.

Used tools and libs

Tool Windows Ubuntu
OS version 10 22H2 24.04
GCC 13.1.0 13.2.0
CMake 3.30.2 3.28.3
Git 2.46.0 2.43.0
cpputils 1.0.0 1.0.0
GoogleTest 1.15.2 1.15.2

Usage:

Launching:

Launch the binary and pass a single parameter, which is the file name containing data:

$ data-explorer sample.txt

Interface

Application expects a series of commands as text lines. Each line should have the following structure:

<operation> <aggregation column> <grouping column>  

where:

  • operation is an aggregation function of one of the following types:
    • avg: calculates the average of a set of values.
    • min: finds the minimum value in a set of values.
    • max: finds the maximum value in a set of values.
  • aggregation column is a numerical data column using which aggregation will be done.
  • grouping column is a column used for grouping results.

Usage Example:

$ avg score movie_name

Example output:

Execute: AVG score GROUPED BY movie_name
Execution time: 12us
Results:
ender's_game 8
pulp_fiction 6
inception 8

License

This project is licensed under the MIT License. See the LICENSE file for details.

The project uses the following open-source software:

Name License Home Description
cpputils MIT https://github.com/przemek83/cpputils collection of C++ utility classes
GoogleTest BSD-3-Clause https://github.com/google/googletest testing framework

Testing

For testing purposes, gtest framework is used. Build the project first. Make sure that the data-explorer-tests target is built. Modern IDEs supporting CMake also support running tests with monitoring of failures. But in case you would like to run it manually, go to the build/tests directory, where the⁣ binary data-explorer-tests should be available. Launching it should produce the following output on Linux:

$ ./data-explorer-tests
Running main() from <path>/data-explorer/build/_deps/googletest-src/googletest/src/gtest_main.cc
[==========] Running 44 tests from 8 test suites.
[----------] Global test environment set-up.
[----------] 1 test from DataExplorer
[ RUN      ] DataExplorer.executeQuery
[       OK ] DataExplorer.executeQuery (0 ms)

(...)

[ RUN      ] UserInterfaceTest.GetQueryOperationWrongAggregatingColumn
[       OK ] UserInterfaceTest.GetQueryOperationWrongAggregatingColumn (0 ms)
[----------] 7 tests from UserInterfaceTest (0 ms total)

[----------] Global test environment tear-down
[==========] 44 tests from 8 test suites ran. (1 ms total)
[  PASSED  ] 44 tests.

As an alternative, CTest can be used to run tests from build directory:

$ ctest
Test project <path>/data-explorer/build
    Start  1: DataExplorer.executeQuery
1/44 Test  #1: DataExplorer.executeQuery ...................................   Passed    0.00 sec

(...)

    Start 44: UserInterfaceTest.GetQueryOperationWrongAggregatingColumn
44/44 Test #44: UserInterfaceTest.GetQueryOperationWrongAggregatingColumn ...   Passed    0.00 sec

100% tests passed, 0 tests failed out of 44

Total Test time (real) =   0.10 sec

Additional info

As speed is the most important expectation from the task, there was some optimization was performed. Ones with the biggest impact:

  • Used std::unordered_map instead of std::map.
  • Used std::vectors to store data and passed by const reference.
  • Storing strings as mapped values (std::string <-> unsigned int) and usage of indexes for operations (performance and significant memory optimization).
  • Minimized copying.

Potential further improvements

For bigger datasets and more sophisticated operations, the following enhancements might be viable:

  • Usage for multithreading in calculations using std::async + std::future.
  • Usage of MPI (make sense with more sophisticated calculations).
  • GPU calculations (in case of more complex calculations).

About

Small tool for aggregating and grouping data. Focused on simplicity, speed and memory efficiency. Written in C++. Created as offline interview task.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published