cpp-parquet

This is a Parquet file writer in C++ that I wrote in 2013. There's a newer, more complete, and better supported alternative here: https://github.com/apache/parquet-cpp. A frequent question I get about this is why I didn't use C++ templates. The primary reason is each Parquet data type does not require much specialization for outputting or handling. The output logic has conditionals depending on whether the column is part of another message, repeated, nullable, etc, but not a lot of dependencies on the specific data type other than its size.

March 7th, 2015

You will need the following dependencies pre-installed:

Thrift (which requires Boost)
Avro (just the C++ API)
CMake 3.0.2

From the root of this repo's clone on your machine:

$ cmake .
$ make

Examples

parquet-file-driver.cc is an example program that uses the API provided by CPP-Parquet. Specify an output filename and the number of items.
avro-schema-walker-example.cc is an example program that takes an AVRO schema in JSON and translates it to the structures that this library needs to write Parquet files.
parquet-file-writer.cc isn't dependent on this project, but uses Parquet Thrift definitions directly to write a Parquet file - I wrote it to "learn by doing" and may be useful to read to understand Parquet itself.

parquet-file-driver

cd examples
rm test.parquet
./cppparquet test.parquet 500

avro-schema-walker

This takes a simple AVRO schema in JSON (provided in simple.json) and writes a parquet-file with the same schema. No data is dumped, but you can view the schema using parquet-schema:

cd examples
rm test.parquet
./avro-schema-walker simple.json
parquet-schema test.parquet

Unit Tests

To run unit tests with logging:

$ GLOG_v=2 GLOG_alsologtostderr=true ./parquet-file/parquet-file-test

If you have parquet-tools or parquet-mr installed (parquet-tools is a project that was merged into parquet, so your version of parquet-mr has to be somewhat recent), you can also run it like this:

$ PARQUET_DUMP_PATH=<path to parquet-dump executable> GLOG_alsologtostderr=true ./parquet-file-test

Which will generate 'golden files' in the current directory. Currently these files aren't used for anything, but they can be useful to view the output of parquet-dump on the temp files the test cases create. In the future, they will be part of the repo and will be used to validate each test case run by comparing each test case run with the golden file.

Setting GLOG_v=3 might produce useful output in some cases, but is pretty verbose (i.e. each individual rep/def level).

Name		Name	Last commit message	Last commit date
Latest commit History 103 Commits
avro-schema		avro-schema
cmake/Modules		cmake/Modules
examples		examples
parquet-file		parquet-file
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

cpp-parquet

Examples

parquet-file-driver

avro-schema-walker

Unit Tests

About

Releases

Packages

Languages

nealsid/cpp-parquet

Folders and files

Latest commit

History

Repository files navigation

cpp-parquet

Examples

parquet-file-driver

avro-schema-walker

Unit Tests

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages