Something regarding the performance #164

KRJackLee · 2024-06-26T14:29:06Z

    // Start timing
    auto start = std::chrono::high_resolution_clock::now();
    rapidcsv::Document doc(filename);
    // Stop timing
    auto stop = std::chrono::high_resolution_clock::now();
    // Calculate the duration
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(stop - start);

    logInfo("Time taken by function {}: {} seconds", "rapidcsv read csv", duration.count()/1000000.0);

I measured how long rapidcsv takes to read a CSV file. The file is about 430 MB (1,050,000 rows x 9 columns), and loading it took 81.633379 seconds. Is this normal performance for rapidcsv? If not, is there something I should change in how I use rapidcsv?
I am using MSVC from VS 2022, with CMake and Ninja, on Windows 10.
The hardware is a laptop with an Intel i7-10870H @ 2.2 GHz and 32 GB of memory.

d99kris · 2024-06-27T13:47:40Z

Maintainer

Hi @KRJackLee - rapidcsv was mainly designed to be easy to use and to enable rapid development. It provides simple high-level access for reading and modifying CSV data, and this comes with some performance cost (the whole CSV file is read into a vector of vectors of strings, so you can imagine it's not superfast). The number you shared is probably reasonable for rapidcsv. For comparison, I downloaded a random large CSV file from https://www.stats.govt.nz/large-datasets/csv-files-for-download/ - unzipped it was 818 MB (6 columns, 34959673 rows), and rapidcsv (built with clang at the O2 optimization level) took 7 seconds to read it on my MacBook Pro (M2 Pro).
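To make the "vector of vectors of strings" cost concrete, here is a minimal sketch of a loader with that storage layout. It is an illustration only, not rapidcsv's actual code: `Table` and `loadCsvNaive` are hypothetical names, and quoting and escaped delimiters are ignored. The point is that every cell becomes its own `std::string`, so a 1,050,000 x 9 file produces about 9.45 million string objects.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Illustrative only: stores every cell as its own std::string.
// Ignores quoting and escaped delimiters; NOT rapidcsv's implementation.
using Table = std::vector<std::vector<std::string>>;

Table loadCsvNaive(std::istream& in, char delimiter = ',')
{
    Table table;
    std::string line;
    while (std::getline(in, line))
    {
        // Strip a trailing '\r' so files with Windows line endings parse cleanly.
        if (!line.empty() && line.back() == '\r') line.pop_back();

        std::vector<std::string> row;
        std::istringstream fields(line);
        std::string cell;
        while (std::getline(fields, cell, delimiter))
        {
            row.push_back(cell);  // one string object (and often one heap allocation) per cell
        }
        table.push_back(std::move(row));
    }
    return table;
}
```

Unoptimized builds tend to make this kind of code disproportionately slow, because the many small container and string operations are not inlined.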

There are currently no performance-improving flags to use with rapidcsv. I once wanted to add a read-only mode, which could improve performance quite a bit, but I have not needed it myself, so I haven't looked into it yet. For maximum performance, a custom-written parser, or a library that leaves more of the handling to the application, will be faster.
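As a rough idea of what "leaving more handling up to the application" could look like, the sketch below scans CSV text that is already in memory and passes each field to a callback as a `std::string_view`, without storing anything per cell. `scanCsv` and `sumIntColumn` are hypothetical helpers, not part of rapidcsv, and the scanner assumes there are no quoted fields.

```cpp
#include <charconv>
#include <cstddef>
#include <string_view>
#include <system_error>

// Hypothetical helper, not part of rapidcsv: walks CSV text in memory and hands
// each field to `onField(row, col, field)` as a view into the original buffer.
// No quoting/escaping support; '\n' ends a row and a preceding '\r' is dropped.
template <typename FieldCallback>
std::size_t scanCsv(std::string_view text, FieldCallback onField, char delimiter = ',')
{
    std::size_t row = 0;
    std::size_t col = 0;
    std::size_t fieldStart = 0;
    for (std::size_t i = 0; i <= text.size(); ++i)
    {
        const bool atEnd = (i == text.size());
        const bool atNewline = !atEnd && text[i] == '\n';
        const bool atDelimiter = !atEnd && text[i] == delimiter;
        if (!(atEnd || atNewline || atDelimiter)) continue;

        // Nothing left after a final newline: do not emit a phantom empty row.
        if (atEnd && fieldStart == i && col == 0) break;

        std::string_view field = text.substr(fieldStart, i - fieldStart);
        if (!atDelimiter && !field.empty() && field.back() == '\r') field.remove_suffix(1);
        onField(row, col, field);

        fieldStart = i + 1;
        if (atDelimiter) { ++col; }
        else { ++row; col = 0; }
    }
    return row;  // number of rows seen
}

// Example consumer: sums one integer column, skipping the header row.
long long sumIntColumn(std::string_view text, std::size_t column)
{
    long long sum = 0;
    scanCsv(text, [&](std::size_t row, std::size_t col, std::string_view field) {
        if (row == 0 || col != column) return;
        long long value = 0;
        // std::from_chars parses in place, without building a std::string.
        const auto result = std::from_chars(field.data(), field.data() + field.size(), value);
        if (result.ec == std::errc()) sum += value;
    });
    return sum;
}
```

The trade-off is the one described above: the application gets speed but has to handle typing, quoting, and column lookup itself, and it cannot modify the data the way a fully loaded document allows.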

2 replies

KRJackLee Jun 27, 2024
Author

Thanks for your numbers, which reminded me that there may be optimization options available for my environment and compiler. They also give me a baseline for how much the read time could improve.

KRJackLee Jun 27, 2024
Author

My mistake. I didn't realize that I had been building all my CMake projects in Debug mode rather than Release, so I was missing O2 the whole time. With O2, the read time drops to 3.9 seconds for the 430 MB file. Case closed then. Thanks so much, @d99kris.
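For anyone hitting the same issue, a small guard like the following can flag an unoptimized build before a benchmark is trusted. This is a sketch under assumptions: `buildModeHint` and `secondsTaken` are hypothetical helpers, and the check relies on `NDEBUG` being defined for optimized configurations, which is what CMake's default Release flags do; a hand-written build may differ.

```cpp
#include <chrono>
#include <string>

// Reports whether this translation unit looks like an optimized build.
// Assumption: NDEBUG is defined for Release-style configurations (CMake's
// default flags do this). _DEBUG is set by MSVC's debug runtime options.
inline std::string buildModeHint()
{
#if defined(_DEBUG) || !defined(NDEBUG)
    return "debug-like build (_DEBUG defined or NDEBUG not defined)";
#else
    return "release-like build (NDEBUG defined)";
#endif
}

// Runs a callable once and returns the elapsed wall time in seconds.
// steady_clock is monotonic, so the result is unaffected by clock adjustments.
template <typename Callable>
double secondsTaken(Callable&& work)
{
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Usage would be along the lines of `secondsTaken([&] { rapidcsv::Document doc(filename); })`, logged together with `buildModeHint()`. With a single-config generator such as Ninja, the build type is chosen at configure time, for example with `-DCMAKE_BUILD_TYPE=Release`.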
