Skip to content

gmaldona/optimized-parsing

Repository files navigation

Fast Data Processing Using Tries and State Machines

CS 547 High Performance Computing @ Binghamton University

release date 2024-02-11 (piazza)

Standard Implementation

  • The standard implementation uses the C++17 STL and Regexes to compute results. While convenient, the C++ stdlib can be slow at times.

Optimized Implementation

  • The optmized implementation uses a state machine for parsing and a table-driven trie for storage and comparison of parse results.

Results

Note: 3 runs per test file and approximated the execution speed.

Lines File Size standard.cpp parser (s) optimized.cpp (ms)
3 76 B ~1 ms 183 ms
1_000 29 KB 4 ms 201 ms
10_000 290 KB 41 ms 424 ms
1_000_000 29 MB 3_181 ms 1_152 ms
10_000_000 289.7 MB 32_130 ms 6_793 ms

Parsing Times for a Regex Implementation vs. Table Driven Trie Implementation.

Submission

  • DEADLINE:
    • 2024-03-12 11:59PM (piazza)
    • 2024-03-03 11:59PM (piazza)
    • 2024-02-29 11:59PM

Execution

Standard Implementation

To compile and execute the standard implementation:

make standard && ./build/standard test/data

Optimized Implementation

To compile and execute the optimized implementation:

make optimized && ./build/optimized test/data

Generate Test Data

A C++ program was provided by the professor to generate test data for our program.

The same file was downloaded and added to this repo and renamed to generate.cpp. To generate a test file:

g++ -std=c++17 -o generate generate.cpp && lines= && ./generate "${lines}" > "data_${lines}"

For testing, testing data was redirected into test/. For further description, within the test directory an extension was added to describe which implementation {.optimized, .standard} generated the results.

Source

About

Optimized parsing using state machines and tries.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published