I use the marathon data from the New York Times article "What Good Marathons and Bad Investments Have in Common."
They provide links to the full data set of almost ten million records as CSV files on box.com. I have removed a few columns and made the result available in two formats on Dropbox.
You can find the same data in `.feather` and `.parquet` formats in this repository's `arrow` folder.
- initial_setup.R provides the script that drops columns from the original source.
- create_arrow.R provides an example of converting a large file from `.sas7bdat` to `.feather` and `.parquet`. The results are in `arrow`.
- data_digest.R provides the size and parsing time for each format.
- create_arrow.py provides an example of converting a large file from `.sas7bdat` to `.feather` and `.parquet`. The results are in `py_arrow`.
- data_digest.R provides sizes and parsing times for `.sas7bdat` and `.parquet`.
The explore_bigdata.R file provides a short example.