Append to a parquet file #56

Open
gaborcsardi opened this issue Jun 4, 2024 · 2 comments
Labels
feature a feature request or enhancement

Comments

@gaborcsardi
Member

gaborcsardi commented Jun 4, 2024

Or more generally, concatenate Parquet files and data frames. Should be pretty simple to implement, if we can have a reasonable API.

@gaborcsardi gaborcsardi added the feature a feature request or enhancement label Jun 4, 2024
@gaborcsardi
Member Author

It would be nice not to introduce new functions, I guess? But is having an append argument in write_parquet() better? I am not sure.

Is there an API that we can use to concatenate multiple files and also do appending?

Appending to a file potentially needs specific parameters, e.g. for matching columns. So maybe a new function that deals with both appending and concatenation is best?
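
To make the column-matching concern concrete, here is a minimal base R sketch of a compatibility check that an append function might run before writing anything. The function name `check_append_compatible()` and the shape of its return value are hypothetical; nothing like it is proposed in this thread, and a real implementation would compare Parquet schema types rather than R classes.

```r
# Hypothetical helper, not part of any existing package API.
# Compares the columns of the data already in the file (`existing`)
# with the data frame to be appended (`new`).
check_append_compatible <- function(existing, new) {
  stopifnot(is.data.frame(existing), is.data.frame(new))

  # Columns present on one side only
  missing_cols <- setdiff(names(existing), names(new))
  extra_cols <- setdiff(names(new), names(existing))

  # Shared columns whose R classes differ
  common <- intersect(names(existing), names(new))
  differs <- vapply(common, function(nm) {
    !identical(class(existing[[nm]]), class(new[[nm]]))
  }, logical(1))
  type_mismatch <- common[differs]

  list(
    ok = length(missing_cols) == 0 &&
      length(extra_cols) == 0 &&
      length(type_mismatch) == 0,
    missing = missing_cols,
    extra = extra_cols,
    type_mismatch = type_mismatch
  )
}

existing <- data.frame(id = 1:2, x = c(1.5, 2.5))
new <- data.frame(id = 3L, x = "a", y = TRUE)
res <- check_append_compatible(existing, new)
res$ok             # FALSE
res$extra          # "y"
res$type_mismatch  # "x"
```

Whether a mismatch should be an error, or something the caller can relax through an option, is exactly the kind of parameter the question above is about.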

@gaborcsardi
Copy link
Member Author

We could have

append_parquet(file, ..., options = parquet_options())

where file is the output file to append to (might or might not exist), and ... are data frames or parquet files to append to it.

As for the row groups and pages to create, we can do something like

  • if the last existing row group plus the new rows is not too large, add the rows to that row group;
  • if the last page of that row group plus the new rows is not too large, add the rows to that page.

Otherwise create a new row group or a new page.
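
The placement rule above could be sketched as a small decision function. This only illustrates the logic: the function name `choose_append_target()` and both thresholds are assumptions, since no concrete limits are given here.

```r
# Hypothetical decision rule; the default limits are placeholders,
# not values taken from any Parquet writer.
choose_append_target <- function(new_rows, last_group_rows, last_page_rows,
                                 max_group_rows = 1000000L,
                                 max_page_rows = 20000L) {
  stopifnot(new_rows >= 0, last_group_rows >= 0, last_page_rows >= 0)

  # Too many rows for the last row group: start a new row group
  if (last_group_rows + new_rows > max_group_rows) {
    return("new_row_group")
  }
  # Fits in the row group but not in its last page: start a new page
  if (last_page_rows + new_rows > max_page_rows) {
    return("new_page")
  }
  # Fits in both: extend the last page of the last row group
  "extend_last_page"
}

choose_append_target(100, 5000, 1000)    # "extend_last_page"
choose_append_target(100, 5000, 19950)   # "new_page"
choose_append_target(100, 999950, 1000)  # "new_row_group"
```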

Some types will be difficult to merge, e.g. how do we merge factor levels for factor columns? Similarly, what do we do with ENUM columns in Parquet files? Merge dictionaries if we are merging pages?
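
For the factor case, one possible approach at the R level is to take the union of the levels, keeping the existing levels first so that the integer codes of the existing data do not change. A minimal sketch, with `merge_factors()` as a hypothetical helper name; it does not address how the merged levels would be written back into Parquet dictionaries or ENUM columns:

```r
# Hypothetical helper: combine two factors under a shared level set.
# Existing levels come first, so codes of the old values stay the same.
merge_factors <- function(old, new) {
  stopifnot(is.factor(old), is.factor(new))
  lev <- union(levels(old), levels(new))
  factor(c(as.character(old), as.character(new)), levels = lev)
}

old <- factor(c("a", "b"), levels = c("a", "b"))
new <- factor(c("c", "a"), levels = c("a", "c"))
merged <- merge_factors(old, new)

levels(merged)      # "a" "b" "c"
as.integer(merged)  # 1 2 3 1
```

If new levels are only ever added at the end, a dictionary that already exists for the column would remain a prefix of the merged one. That is an observation about this sketch, not a statement about how the file format has to be handled.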

We can start with something simple:

  • add a single data frame to an existing Parquet file;
  • it does not have to be able to merge all column types;
  • always create a new page in the last row group.
