Parquet scanner doesn't do predicate pushdown for categoricals/enums #18868

Open
2 tasks done
deanm0000 opened this issue Sep 23, 2024 · 1 comment
Labels
A-io (Area: reading and writing data), bug (Something isn't working), P-medium (Priority: medium), performance (Performance issues or improvements), python (Related to Python Polars), rust (Related to Rust Polars)

Comments

@deanm0000
Collaborator

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

# One row per row group, so each group's min/max statistics cover a single value.
pl.DataFrame([pl.Series('a', ['a', 'b', 'c'], pl.Categorical)]).write_parquet(
    'catpa.parquet', row_group_size=1, use_pyarrow=True
)
with pl.Config(verbose=True):
    pl.scan_parquet("catpa.parquet").filter(pl.col('a') == 'a').collect()

Log output

parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.

Issue description

This is separate from but highly related to #18867. Even when using a file written by pyarrow where the statistics are correct, predicate pushdown doesn't work.

If I explicitly make the right-hand side a Categorical, I get no verbose message at all, so I can't tell whether pushdown is silently working or silently not working.

with pl.Config(verbose=True):
    print(pl.scan_parquet("catpa.parquet").filter(pl.col('a')==pl.lit('a',pl.Categorical)).collect())

Even with a StringCache enabled, there is still no verbose output.

with pl.Config(verbose=True), pl.StringCache():
    print(pl.scan_parquet("catpa.parquet").filter(pl.col('a')==pl.lit('a',pl.Categorical)).collect())

Expected behavior

Partition pruning should work: row groups whose statistics exclude the filter value should be skipped.
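For an equality filter, the expected pruning reduces to a range check against each row group's min/max statistics. The following is a rough sketch in plain Python, not Polars' actual implementation; `RowGroupStats` and `can_skip_for_equality` are hypothetical names, and the statistics are assumed to be compared as strings:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RowGroupStats:
    """Hypothetical container for one row group's column statistics."""

    min_value: Optional[str]
    max_value: Optional[str]


def can_skip_for_equality(stats: RowGroupStats, value: str) -> bool:
    """Return True when min/max prove that no row in the group equals `value`."""
    if stats.min_value is None or stats.max_value is None:
        # Missing statistics prove nothing, so the row group must be read.
        return False
    return value < stats.min_value or value > stats.max_value


# Mirrors the example file: three row groups holding 'a', 'b' and 'c'.
row_groups = [
    RowGroupStats("a", "a"),
    RowGroupStats("b", "b"),
    RowGroupStats("c", "c"),
]
to_read = [
    i for i, rg in enumerate(row_groups) if not can_skip_for_equality(rg, "a")
]
print(to_read)  # [0]
```

Under this check, only the first row group would need to be read for `pl.col('a') == 'a'`, whereas the log output above shows all three being read.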

Installed versions

--------Version info---------
Polars:              1.7.1
Index type:          UInt32
Platform:            Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Python:              3.11.9 (main, Apr  6 2024, 17:59:24) [GCC 11.4.0]

----Optional dependencies----
adbc_driver_manager  <not installed>
altair               <not installed>
cloudpickle          <not installed>
connectorx           0.3.2
deltalake            0.18.2
fastexcel            <not installed>
fsspec               2024.3.1
gevent               <not installed>
great_tables         <not installed>
matplotlib           <not installed>
nest_asyncio         1.6.0
numpy                1.26.4
openpyxl             <not installed>
pandas               2.2.2
pyarrow              17.0.0
pydantic             <not installed>
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>
deanm0000 added the bug, python, and needs triage labels on Sep 23, 2024
@aberres
Contributor

aberres commented Sep 23, 2024

The same happens with enum columns.

deanm0000 changed the title from "Parquet scanner doesn't do predicate pushdown for categoricals" to "Parquet scanner doesn't do predicate pushdown for categoricals/enums" on Sep 23, 2024
deanm0000 added the P-medium, A-io, performance, and rust labels and removed the needs triage label on Sep 23, 2024