refactor: Change join_where semantics #18640

Merged on Sep 10, 2024 (2 commits)

ritchie46
Member

@cmdlineluser we decided to go for your initial intuition.

That is unambiguous and more versatile. For one, it does not depend on the order of the comparisons.

@github-actions github-actions bot added internal An internal refactor or improvement python Related to Python Polars rust Related to Rust Polars labels Sep 9, 2024

codecov bot commented Sep 9, 2024

Codecov Report

Attention: Patch coverage is 85.00000% with 21 lines in your changes missing coverage. Please review.

Project coverage is 79.92%. Comparing base (433d6c0) to head (52bf5bc).
Report is 52 commits behind head on main.

Files with missing lines Patch % Lines
crates/polars-plan/src/dsl/expr.rs 28.57% 15 Missing ⚠️
crates/polars-plan/src/plans/conversion/join.rs 94.95% 6 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #18640      +/-   ##
==========================================
- Coverage   79.93%   79.92%   -0.01%     
==========================================
  Files        1506     1506              
  Lines      203053   203139      +86     
  Branches     2891     2891              
==========================================
+ Hits       162306   162367      +61     
- Misses      40197    40222      +25     
  Partials      550      550              

@kszlim
Contributor

kszlim commented Sep 9, 2024

Curious if there could be a pl.Expr.right and/or pl.Expr.left method that marks that your expression should be evaluated on a theoretical left or right table in a join context. (This could then also be usable in a regular join's on, instead of left_on and right_on, as well as in join_asof.) Alternatively, it might be possible to tie expressions to tables via something like #18592, but idk if that's a direction polars would want to go. My proposal would tie the expression to the context it's evaluated in instead.

Having it work with column selectors would also be nice. In the case of conflicting columns, it's a little annoying needing to do my_expr.name.suffix("_right") as opposed to something that does the above. It'd be nice if it doesn't affect the output name too.

The currently proposed api feels a little implicit, as it requires that the user knows that the right table's columns will be implicitly suffixed with "_right" in the case of conflicts.

Specifically, I think that pl.Expr.right and pl.Expr.left are better than pl.right() or pl.left() (as an alternative to pl.col), as they will let expressions be more reusable.

@ritchie46
Member Author

> The currently proposed api feels a little implicit, as it requires that the user knows that the right table's columns will be implicitly suffixed with "_right" in the case of conflicts.

Think of it as applying the predicates post join. It's got clear semantics, is non-ambiguous and doesn't require a new type for only this context.

> my_expr.name.suffix("_right")

I don't follow, you should never use a suffix expression here.

@kszlim
Contributor

kszlim commented Sep 9, 2024

Sorry, that was a mistake; I initially thought all columns would be suffixed on the right table by default.

Though you will have to manually suffix expressions that you're reusing.

I.e., imagine you're not just doing a selection, but running some computation. I'll post an example when I get to my computer.

@kszlim
Contributor

kszlim commented Sep 9, 2024

import polars as pl

two_b = pl.col("b")
two_a = pl.col("a")
df = pl.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
# the right-hand side has to be spelled out by its post-join name
df.join_where(df, two_a > pl.col("a_right"), two_b > pl.col("b_right"))

And then you're more likely to make a mistake with the suffixed name, whereas you could potentially have this instead:

two_b = pl.col("b")
two_a = pl.col("a")
df = pl.DataFrame({"a": [1,2,3], "b": [4,5,6]})
df.join_where(df, two_a.left() > two_a.right(), two_b.left() > two_b.right())

But I think the main benefit is that you could start using left and right in other joins, reducing the chance of making mistakes with the left_on and right_on keys and also making the join more expressive. It would also potentially let you unify join_where and join (not sure about this):

b = pl.col("b")
a = pl.col("a")
df = pl.DataFrame({"a": [1,2,3], "b": [2,3,4]})
df.join(df, on=[a.left() == b.right()], how='semi')

Getting you

┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 2   ┆ 3   │
│ 3   ┆ 4   │
└─────┴─────┘

Also, if you're doing something funky where your join keys are programmatically computed, it's easier to handle:

import polars as pl

df_0 = pl.DataFrame({"a": [1, 2, 3], "b": [2, 3, 4]})
df_1 = pl.DataFrame({"c": [1, 2, 3], "b": [2, 3, 4]})

left_table_exprs = [pl.col("a"), pl.col("b")]
right_table_exprs = [pl.col("b"), pl.col("c")]
join_exprs = []
for left, right in zip(left_table_exprs, right_table_exprs):
    name = right.meta.output_name()
    if name in df_0.columns:
        # a conflicting right-table column is referred to by its post-join name
        right = pl.col(f"{name}_right")
    join_exprs.append(left.gt(right))  # this is just an arbitrary example

df_0.join_where(df_1, *join_exprs)

With pl.Expr.right and pl.Expr.left, you can avoid the conditional check for conflicts.

@ritchie46
Member Author

Applying left to a whole expression is not future-proof: you are (or will be) allowed to refer to different tables in one expression.

I will go with this as it's unambiguous, and I cannot overstate how important that is.

I also like that it now follows the SQL semantics. I.e., semantically you do a nested-loop join and then filter on the predicates.

The predicates are applied post-join, so they refer to the column names after the join operation. Those are the semantics.

@ritchie46 ritchie46 merged commit 45c8e96 into main Sep 10, 2024
27 checks passed
@ritchie46 ritchie46 deleted the join branch September 10, 2024 05:57
@c-peters c-peters added the accepted Ready for implementation label Sep 16, 2024