Append to a parquet file #56

gaborcsardi · 2024-06-04T14:42:58Z

Or more generally, concatenate Parquet files and data frames. It should be pretty simple to implement if we can settle on a reasonable API.

gaborcsardi · 2024-06-06T19:33:43Z

It would be nice not to introduce new functions, I guess? But is having an append argument in write_parquet() better? I am not sure.

Is there an API that we can use to concatenate multiple files and also do appending?

Appending to a file potentially needs specific parameters, e.g. for matching columns. So maybe a new function that deals with both appending and concatenation is best?
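To make the "matching columns" concern concrete, here is a minimal sketch in Python (used purely for illustration, not the package's language). The function name `match_columns` and the `by` parameter are hypothetical and are not part of any proposed API. It only shows the kind of choice an append operation would have to expose: match the incoming columns to the file's columns by name or by position.

```python
def match_columns(file_columns, df_columns, by="name"):
    """Return, for each column in the file, the index of the matching
    column in the incoming data frame.

    `by` is a hypothetical parameter: "name" or "position".
    """
    if by == "position":
        if len(file_columns) != len(df_columns):
            raise ValueError("column counts differ")
        return list(range(len(df_columns)))

    if by != "name":
        raise ValueError("by must be 'name' or 'position'")

    # Matching by name: both sides must have exactly the same column names.
    missing = [c for c in file_columns if c not in df_columns]
    extra = [c for c in df_columns if c not in file_columns]
    if missing or extra:
        raise ValueError(f"missing columns: {missing}, extra columns: {extra}")
    return [df_columns.index(c) for c in file_columns]
```

With name matching, a data frame whose columns are in a different order can still be appended; with position matching, only the column count is checked here, and type checks would have to follow.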

gaborcsardi · 2024-06-15T10:03:04Z

We could have

append_parquet(file, ..., options = parquet_options())

where file is the output file to append to (it may or may not exist yet), and ... are data frames or Parquet files to append to it.

As for the row groups and pages to create, we can do something like this:

if the last existing row group plus the new rows is not too many rows, add the new rows to that row group;
if the last existing page within that row group plus the new rows is not too many rows, add them to that page.

Otherwise, create a new row group or a new page.
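The rule above can be sketched as a small decision function. This is an illustration in Python under stated assumptions, not an implementation: the two thresholds are made-up values (the thread does not say what "too many rows" means), the names `LastRowGroup` and `plan_append` are hypothetical, and pages are treated as if they spanned all columns, which simplifies the real per-column layout.

```python
from dataclasses import dataclass

# Hypothetical limits; the thread does not define "too many rows".
MAX_ROWS_PER_ROW_GROUP = 1_000_000
MAX_ROWS_PER_PAGE = 20_000


@dataclass
class LastRowGroup:
    num_rows: int        # rows already stored in the last row group
    last_page_rows: int  # rows already stored in its last page


def plan_append(last: LastRowGroup, new_rows: int) -> str:
    """Decide where `new_rows` rows should be written."""
    # Too many rows for the last row group: start a new row group.
    if last.num_rows + new_rows > MAX_ROWS_PER_ROW_GROUP:
        return "new_row_group"
    # Fits in the row group but not in its last page: start a new page.
    if last.last_page_rows + new_rows > MAX_ROWS_PER_PAGE:
        return "new_page"
    # Fits in both: extend the last existing page.
    return "extend_last_page"
```

For example, `plan_append(LastRowGroup(num_rows=500_000, last_page_rows=19_990), 100)` picks a new page inside the existing row group, because the row group has room but the last page does not.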

Some types will be difficult to merge, e.g. how do we merge factor levels for factor columns? Similarly, what do we do with ENUM columns in Parquet files? Merge dictionaries if we are merging pages?
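One possible answer to the factor question, sketched in Python for illustration only: keep the existing levels in their current order, append any levels that are new, and remap the incoming codes to the merged list. The function name `merge_levels` is hypothetical, and the codes are zero-based here (R factor codes are one-based). This is one option among several, not a decision made in this thread.

```python
def merge_levels(old_levels, new_levels, new_codes):
    """Merge two level sets and remap the incoming codes.

    old_levels: levels already stored in the file
    new_levels: levels of the incoming factor column
    new_codes:  zero-based codes into new_levels, None for missing values
    """
    merged = list(old_levels)
    index = {level: i for i, level in enumerate(merged)}

    # Append unseen levels at the end so existing codes stay valid.
    for level in new_levels:
        if level not in index:
            index[level] = len(merged)
            merged.append(level)

    # Translate each incoming code into a position in the merged list.
    remapped = [None if c is None else index[new_levels[c]] for c in new_codes]
    return merged, remapped
```

Because old levels keep their positions, data already encoded against the old level list would not need to be rewritten; only the incoming codes change.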

We can start with something simple:

Add a single data frame to an existing Parquet file.
It does not have to be able to merge all column types.
Always create a new page in the last row group.

gaborcsardi added the feature (a feature request or enhancement) label on Jun 4, 2024
