It would be nice not to introduce new functions, I guess? But is having an `append` argument in `write_parquet()` better? I am not sure.
Is there an API that we can use to concatenate multiple files and also do appending?
Appending to a file potentially needs specific parameters, e.g. for matching columns. So maybe a new function that deals with both appending and concatenation is best?
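A possible signature could look like this; note that `append_parquet` is only a placeholder name here, not an existing nanoparquet function:

```r
# Hypothetical API sketch, name and shape are up for discussion.
append_parquet(file, ...)
```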
where `file` is the output file to append to (it might or might not exist yet), and `...` are data frames or Parquet files to append to it.
As for the row groups and pages to create, we can do something like this (see the sketch below):
- if the combined number of rows is not too large, add the new rows to the last existing row group;
- similarly, if they still fit, add them to the last existing page within that row group;
- otherwise create a new row group or a new page.
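A minimal sketch of the row-group part of that decision; `last_group_rows` and `max_rows` are assumed names for illustration, not nanoparquet parameters:

```r
# Decide where newly appended rows should go. `last_group_rows` is the
# row count of the file's last row group; `max_rows` is an assumed limit.
plan_append <- function(new_rows, last_group_rows, max_rows = 1e7) {
  if (last_group_rows + new_rows <= max_rows) {
    "extend last row group"  # rewrite the last row group with the new rows
  } else {
    "new row group"          # keep existing row groups, write a fresh one
  }
}

plan_append(new_rows = 1000, last_group_rows = 5000)
#> [1] "extend last row group"
```

The same check would apply one level down for pages within the affected row group.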
Some types will be difficult to merge, e.g. how do we merge factor levels for factor columns? Similarly, what do we do with ENUM columns in Parquet files? Merge dictionaries if we are merging pages?
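For factor columns, one plausible (but not the only) merge rule is to take the union of the two level sets and re-encode the combined values; a sketch:

```r
# Merge two factor columns by taking the union of their levels.
merge_factors <- function(old, new) {
  levs <- union(levels(old), levels(new))
  factor(c(as.character(old), as.character(new)), levels = levs)
}

f1 <- factor(c("a", "b"))
f2 <- factor(c("c", "a"))
merge_factors(f1, f2)
#> [1] a b c a
#> Levels: a b c
```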
We can start with something simple (see the sketch below):
- append a single data frame to an existing Parquet file;
- it does not have to be able to merge all column types.
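As a stopgap that only demonstrates the intended semantics, reading the existing file and rewriting it works, although it has none of the performance benefits of true appending:

```r
library(nanoparquet)

# Naive append: read the existing file, row-bind the new data frame,
# and write everything back out. Columns must already match.
append_df <- function(file, df) {
  old <- read_parquet(file)
  write_parquet(rbind(old, df), file)
}
```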
Or, more generally, concatenate Parquet files and data frames. This should be pretty simple to implement if we can come up with a reasonable API.