refactor: Change join_where semantics #18640

ritchie46 · 2024-09-09T17:49:17Z

@cmdlineluser we decided to go for your initial intuition.

That is unambiguous and more versatile. For one, it does not depend on the order of the comparisons.

crates/polars-plan/src/plans/conversion/join.rs

codecov · 2024-09-09T18:28:56Z

Codecov Report

Attention: Patch coverage is 85.00000% with 21 lines in your changes missing coverage. Please review.

Project coverage is 79.92%. Comparing base (433d6c0) to head (52bf5bc).
Report is 52 commits behind head on main.

Files with missing lines	Patch %	Lines
crates/polars-plan/src/dsl/expr.rs	28.57%	15 Missing ⚠️
crates/polars-plan/src/plans/conversion/join.rs	94.95%	6 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #18640      +/-   ##
==========================================
- Coverage   79.93%   79.92%   -0.01%     
==========================================
  Files        1506     1506              
  Lines      203053   203139      +86     
  Branches     2891     2891              
==========================================
+ Hits       162306   162367      +61     
- Misses      40197    40222      +25     
  Partials      550      550


kszlim · 2024-09-09T18:35:33Z

Curious if there could be a pl.Expr.right and/or pl.Expr.left method that marks that your expression should be evaluated on a theoretical left or right table in a join context. This could then also be usable in a regular join's on, instead of left_on and right_on (as well as for join_asof). Alternatively, it might be possible to tie expressions to tables via something like #18592, but I don't know if that's a direction Polars would want to go. My proposal would instead tie the expression to the context it's evaluated in.

Having it work with column selectors would also be nice. In the case of conflicting columns, it's a little annoying needing to do my_expr.name.suffix("_right") as opposed to something that does the above. It'd be nice if it doesn't affect the output name too.

The currently proposed API feels a little implicit, as it requires that the user knows that the right table's columns will be implicitly suffixed with "_right" in the case of conflicts.

Specifically, I think that pl.Expr.right and pl.Expr.left are better than pl.right() or pl.left() (as an alternative to pl.col), as they will let expressions be more reusable.

ritchie46 · 2024-09-09T19:24:11Z

> The currently proposed API feels a little implicit, as it requires that the user knows that the right table's columns will be implicitly suffixed with "_right" in the case of conflicts.

Think of it as applying the predicates post join. It's got clear semantics, is non-ambiguous and doesn't require a new type for only this context.

> my_expr.name.suffix("_right")

I don't follow, you should never use a suffix expression here.

kszlim · 2024-09-09T19:34:30Z

Sorry, that was a mistake; I initially thought all columns of the right table would be suffixed by default.

Though you will have to manually suffix expressions that you're reusing.

I.e., imagine you're not just doing a selection but running some computation. I'll post an example when I get to my computer.

kszlim · 2024-09-09T20:12:59Z

two_b = pl.col("b")
two_a = pl.col("a")
df = pl.DataFrame({"a": [1,2,3], "b": [4,5,6]})
df.join_where(df, two_a > two_a.alias("a_right"), two_b > two_b.alias("b_right"))

And then you're more likely to make a mistake with your alias, whereas you could potentially have this instead:

two_b = pl.col("b")
two_a = pl.col("a")
df = pl.DataFrame({"a": [1,2,3], "b": [4,5,6]})
df.join_where(df, two_a.left() > two_a.right(), two_b.left() > two_b.right())

But I think the main benefit is that you could start using left and right in other joins, reducing the chance of making mistakes with the left_on and right_on keys and making the join more expressive. It would also potentially let you unify join_where and join (not sure about this).

b = pl.col("b")
a = pl.col("a")
df = pl.DataFrame({"a": [1,2,3], "b": [2,3,4]})
df.join(df, on=[a.left() == b.right()], how='semi')

Getting you

┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 2   ┆ 3   │
│ 3   ┆ 4   │
└─────┴─────┘

Also, if you're doing something funky where your join keys are computed programmatically, it's easier to handle:

df_0 = pl.DataFrame({"a": [1,2,3], "b": [2,3,4]})
df_1 = pl.DataFrame({"c": [1,2,3], "b": [2,3,4]})

left_table_exprs = [pl.col("a"), pl.col("b")]
right_table_exprs = [pl.col("b"), pl.col("c")]
join_exprs = []
for (left, right) in zip(left_table_exprs, right_table_exprs):
  if right.meta.output_name() in df_0.columns:
    right = right.name.suffix("_right")
  join_exprs.append(left.gt(right)) # this is just an arbitrary example
  
df_0.join(df_1, *join_exprs)

With pl.Expr.right and pl.Expr.left, you can avoid the conditional check for conflicts.
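
For reference, the conflict check in the loop above amounts to a small name-resolution rule. A minimal sketch in plain Python, assuming the default "_right" suffix and that only right-table columns clashing with a left-table column are renamed; `post_join_name` is a hypothetical helper, not a Polars function:

```python
def post_join_name(name, left_columns, suffix="_right"):
    """Return the name a right-table column is assumed to have after the join.

    Hypothetical helper: only names that clash with a left-table column
    receive the suffix; all other names are kept unchanged.
    """
    return name + suffix if name in left_columns else name


left_columns = ["a", "b"]   # columns of the left table (df_0 above)
right_keys = ["b", "c"]     # right-table columns used in the predicates

resolved = [post_join_name(key, left_columns) for key in right_keys]
print(resolved)
```

Here "b" clashes with the left table and resolves to "b_right", while "c" is left untouched.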

ritchie46 · 2024-09-10T05:56:57Z

Calling left on a whole expression is not future-proof: you are (or will be) allowed to refer to different tables in one expression.

I will go with this as it's unambiguous, and I cannot overstate how important that is.

I also like that it now follows the SQL semantics. That is, semantically you do a nested loop join and then filter by the predicates.

The predicates are applied post join, so on the column names after the join operation. That's what the semantics will be.
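
The "nested loop join, then filter on post-join names" reading can be sketched in plain Python. This is only an illustration of the semantics described above, not how Polars implements join_where; `nested_loop_join_where` is a hypothetical name, rows are modelled as dicts, and the "_right" suffix is assumed to apply only to clashing column names:

```python
from itertools import product


def nested_loop_join_where(left, right, predicate, suffix="_right"):
    """Pair every left row with every right row, then keep the pairs
    for which `predicate` holds on the post-join column names."""
    out = []
    for l_row, r_row in product(left, right):
        joined = dict(l_row)
        for name, value in r_row.items():
            # Right-table columns that clash with a left column get the suffix.
            key = name + suffix if name in l_row else name
            joined[key] = value
        if predicate(joined):
            out.append(joined)
    return out


rows = [{"a": 1, "b": 4}, {"a": 2, "b": 5}, {"a": 3, "b": 6}]

# The predicate refers to the names as they exist after the join.
result = nested_loop_join_where(
    rows,
    rows,
    lambda r: r["a"] > r["a_right"] and r["b"] > r["b_right"],
)
for row in result:
    print(row)
```

Because the predicate sees the joined row, the order in which the comparisons are written does not matter, and "a_right" can only ever mean the right table's "a".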

refactor: Change join_where semantics

4dc582c

ritchie46 requested review from c-peters, alexander-beedie, MarcoGorelli, reswqa and orlp as code owners September 9, 2024 17:49

github-actions bot added the internal, python, and rust labels Sep 9, 2024

alexander-beedie reviewed Sep 9, 2024

crates/polars-plan/src/plans/conversion/join.rs (outdated)

typo

52bf5bc

cmdlineluser mentioned this pull request Sep 9, 2024

Change join_where to not allow ambiguous naming but do allow interchangeable order #18634

Closed

ritchie46 merged commit 45c8e96 into main Sep 10, 2024
27 checks passed

ritchie46 deleted the join branch September 10, 2024 05:57

AlexeyDmitriev mentioned this pull request Sep 15, 2024

Ambigiuous column names in join_where require post-join names #18752

Closed

c-peters added the accepted label Sep 16, 2024

c-peters assigned ritchie46 Sep 16, 2024
