Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider removing unnecessary projection for single key join followed by .select(pl.len()) #18839

Open
henryharbeck opened this issue Sep 21, 2024 · 1 comment
Labels
A-optimizer Area: plan optimization accepted Ready for implementation enhancement New feature or an improvement of an existing feature performance Performance issues or improvements

Comments

@henryharbeck
Copy link
Contributor

Description

I'm unsure why the optimised plan projects 2/3 columns instead of one for this single key join followed by .select(pl.len())

This example shows a case where I believe only a single column needs to be projected. If I am missing something, or there is some other reason why this needs to happen, then feel free to close this.

a = pl.LazyFrame({"i": range(100)})
b = pl.LazyFrame({"i": range(100), "j": range(100), "k": range(100)})
print(
    a.join(b, on="i")
    .select(pl.len())
    .explain()
)
 SELECT [len()] FROM
  INNER JOIN:
  LEFT PLAN ON: [col("i")]
    DF ["i"]; PROJECT 1/1 COLUMNS; SELECTION: None
  RIGHT PLAN ON: [col("i")]
    DF ["i", "j", "k"]; PROJECT 2/3 COLUMNS; SELECTION: None    <---- can 1/3 columns be projected?
  END INNER JOIN

DuckDB only projects a single column (the i column) from both tables

query = """
CREATE OR REPLACE TABLE A AS SELECT range as i, range as j FROM range(100);
CREATE OR REPLACE TABLE B AS SELECT range as i, range as j, range as k FROM range(100);
EXPLAIN SELECT COUNT(*) FROM A JOIN B USING(i);
"""
print(duckdb.sql(query).pl()["explain_value"].item())
┌───────────────────────────┐
│    UNGROUPED_AGGREGATE    │
│    ────────────────────   │
│        Aggregates:        │
│        count_star()       │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         HASH_JOIN         │
│    ────────────────────   │
│      Join Type: INNER     │
│     Conditions: i = i     │
│        Build Min: 0       ├──────────────┐
│       Build Max: 99       │              │
│                           │              │
│         ~102 Rows         │              │
└─────────────┬─────────────┘              │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         SEQ_SCAN          ││         SEQ_SCAN          │
│    ────────────────────   ││    ────────────────────   │
│             A             ││             B             │
│                           ││                           │
│       Projections: i      ││       Projections: i      │
│                           ││                           │
│         ~100 Rows         ││         ~100 Rows         │
└───────────────────────────┘└───────────────────────────┘

@henryharbeck henryharbeck added the enhancement New feature or an improvement of an existing feature label Sep 21, 2024
@ritchie46
Copy link
Member

I think we can improve that. Currently the len triggers a read of at least the first column of the schema. We need somehow to backtrack later if that column is really required or not.

@nameexhaustion nameexhaustion added performance Performance issues or improvements accepted Ready for implementation A-optimizer Area: plan optimization labels Sep 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-optimizer Area: plan optimization accepted Ready for implementation enhancement New feature or an improvement of an existing feature performance Performance issues or improvements
Projects
Status: Ready
Development

No branches or pull requests

3 participants