Optimize parquet output for remote reading #263

Open
H-Plus-Time opened this issue Aug 2, 2024 · 2 comments

@H-Plus-Time

TL;DR: the primary pain point here is huge row groups (in terms of total uncompressed byte size). Writing the PageIndex, reducing row group sizes, or ideally both would help a lot.

Basically, the default row group size in pyarrow (and most parquet implementations) of 1 million rows per row group is predicated on assumptions about what a typical parquet file looks like: lots of numerics, booleans, and relatively short strings amenable to dictionary, RLE, and delta encoding. Wide text datasets are very much not typical, and at the default row group size you end up with ~2GB per row group (and nearly 4GB uncompressed, just for the text column).

The simplest change would be to default the row_group_size parameter to 100k rows, which is more or less the inflection point of DuckDB's row group size benchmark (the file size overhead is about 0.5%).

Setting write_page_index to True should help a great deal (arguably much more than smaller row groups), since readers can use the PageIndex to narrow reads down to individual data pages (it's not unusual for a point lookup to touch only ~0.1% of a file).
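
For concreteness, a minimal sketch of both knobs with pyarrow (the table contents, column names, and output path are placeholders, and write_page_index assumes a reasonably recent pyarrow, roughly 13.0+):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Placeholder wide-text table standing in for the real pipeline output.
table = pa.table({
    "id": pa.array(range(1_000_000), type=pa.int64()),
    "text": pa.array(["a fairly long document body ..."] * 1_000_000),
})

pq.write_table(
    table,
    "output.parquet",
    row_group_size=100_000,   # ~10 row groups for this table instead of 1
    write_page_index=True,    # emit the column/offset indexes (PageIndex)
)
```

Readers that issue byte-range requests against remote files can then combine row group statistics with the PageIndex to fetch only the data pages they actually need, rather than whole multi-GB row groups.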

@guipenedo
Collaborator

Thanks for bringing this up. Would you be willing to implement this change in a PR?

@H-Plus-Time
Author

Yep, should have something over the weekend
