refactor: Change join_where semantics #18640
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #18640 +/- ##
==========================================
- Coverage 79.93% 79.92% -0.01%
==========================================
Files 1506 1506
Lines 203053 203139 +86
Branches 2891 2891
==========================================
+ Hits 162306 162367 +61
- Misses 40197 40222 +25
Partials 550 550
Curious if there could be a
Having it work with column selectors would also be nice. In the case of conflicting columns, it's a little annoying needing to do
The currently proposed API feels a little implicit, as it requires that the user knows that the right table's columns will be implicitly suffixed with "_right" in the case of conflicts. Specifically, I think that
Think of it as applying the predicates post join. It's got clear semantics, is non-ambiguous and doesn't require a new type for only this context.
I don't follow; you should never use a
Sorry, that was a mistake. I initially thought all columns would be suffixed on the right table by default. You will still have to manually suffix expressions that you're reusing, though. I.e., imagine you're not just doing a selection, but running some computation. I'll post an example when I get to my computer.
two_b = pl.col("b")
two_a = pl.col("a")
df = pl.DataFrame({"a": [1,2,3], "b": [4,5,6]})
df.join_where(df, two_a > two_a.alias("a_right"), two_b > two_b.alias("b_right"))

And then you're more likely to make a mistake with your alias, whereas you could potentially have this instead:

two_b = pl.col("b")
two_a = pl.col("a")
df = pl.DataFrame({"a": [1,2,3], "b": [4,5,6]})
df.join_where(df, two_a.left() > two_a.right(), two_b.left() > two_b.right())

But I think the main benefit is that you could start using

b = pl.col("b")
a = pl.col("a")
df = pl.DataFrame({"a": [1,2,3], "b": [2,3,4]})
df.join(df, on=[a.left() == b.right()], how='semi')

Getting you

Also, if you're doing something funky where your join keys are programmatically computed, it's easier to handle:

df_0 = pl.DataFrame({"a": [1,2,3], "b": [2,3,4]})
df_1 = pl.DataFrame({"c": [1,2,3], "b": [2,3,4]})
left_table_exprs = [pl.col("a"), pl.col("b")]
right_table_exprs = [pl.col("b"), pl.col("c")]
join_exprs = []
for (left, right) in zip(left_table_exprs, right_table_exprs):
    if right.meta.output_name() in df_0.columns:
        right = right.name.suffix("_right")
    join_exprs.append(left.gt(right))  # this is just an arbitrary example
df_0.join(df_1, *join_exprs)

With
The
I will go with this as it's unambiguous, and I cannot overstate how important that is. I also like that it now follows the SQL semantics. That is, semantically you do a nested loop join and then filter on the predicates. The predicates are applied post join, so they refer to the column names after the join operation. That's what the semantics will be.
@cmdlineluser we decided to go for your initial intuition.
That is unambiguous and more versatile. For one, it does not depend on the order of the comparisons.
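
For readers who want to see the "nested loop join, then filter" semantics spelled out, here is a minimal pure-Python sketch. It is not the Polars implementation: `nested_loop_join_where` is a hypothetical helper that works on lists of dict rows, and the "_right" suffix for clashing right-hand names is assumed to match the default discussed above.

```python
from itertools import product


def nested_loop_join_where(left, right, *predicates, suffix="_right"):
    """Hypothetical helper: pair every left row with every right row,
    then keep only the joined rows that satisfy all predicates."""
    out = []
    for l_row, r_row in product(left, right):
        joined = dict(l_row)
        for name, value in r_row.items():
            # A right-hand column whose name clashes with a left-hand
            # column is stored under the suffixed name (assumed "_right").
            joined[name + suffix if name in l_row else name] = value
        # Predicates only ever see the joined row, so they must use
        # the column names as they exist after the join.
        if all(pred(joined) for pred in predicates):
            out.append(joined)
    return out


rows = [{"a": 1, "b": 4}, {"a": 2, "b": 5}, {"a": 3, "b": 6}]

result = nested_loop_join_where(
    rows, rows,
    lambda r: r["a"] > r["a_right"],
    lambda r: r["b"] > r["b_right"],
)

# Same predicates with the operands written in the opposite order.
flipped = nested_loop_join_where(
    rows, rows,
    lambda r: r["a_right"] < r["a"],
    lambda r: r["b_right"] < r["b"],
)
```

Because each predicate names its columns explicitly, `result` and `flipped` contain the same rows: which side of the comparison a column appears on does not decide which table it comes from.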