Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: handle all 'how' values in df.join #249

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

Thomzoy
Copy link

@Thomzoy Thomzoy commented Jan 17, 2025

Fixes #248

@Thomzoy Thomzoy requested a review from eakmanrq as a code owner January 17, 2025 15:02
@eakmanrq
Copy link
Owner

eakmanrq commented Jan 18, 2025

Thanks for contributing this @Thomzoy!

It would be great if we could at least get the left/right semi/anti cases added to the tests. Specifically the tests which demonstrate SQLFrame behaving the same as PySpark. See example here of a test showing that no on results in a cross join:

def test_join_no_on(
pyspark_employee: PySparkDataFrame,
pyspark_store: PySparkDataFrame,
get_df: t.Callable[[str], BaseDataFrame],
compare_frames: t.Callable,
):
# No on results in a cross. Testing that "how" is ignored
pyspark_employee = pyspark_employee.select(F.col("fname"), F.col("lname"))
pyspark_store = pyspark_store.select(F.col("store_id"), F.col("num_sales"))
employee = get_df("employee").select(SF.col("fname"), SF.col("lname"))
store = get_df("store").select(SF.col("store_id"), SF.col("num_sales"))
df = pyspark_employee.join(pyspark_store, how="inner")
dfs = employee.join(store, how="inner")
compare_frames(df, dfs)

Let me know if this is something you are willing to do and if not I will look into opening a new PR with the change + tests added.

@Thomzoy Thomzoy force-pushed the handle_all_join_cases branch from aa95d22 to ec60f7c Compare January 21, 2025 14:28
@Thomzoy
Copy link
Author

Thomzoy commented Jan 21, 2025

Included some test ! Another handled edge case: Spark treats a cross join with predicate as an inner join

@eakmanrq
Copy link
Owner

Included some test ! Another handled edge case: Spark treats a cross join with predicate as an inner join

Nice! The different implicit behavior of Spark is hard to match so thanks for fixing that!

@eakmanrq eakmanrq enabled auto-merge (squash) January 22, 2025 05:18
@eakmanrq eakmanrq disabled auto-merge January 22, 2025 05:18
@eakmanrq
Copy link
Owner

@Thomzoy Looks like there is a merge conflict. Can you fix that and we will get this in?

@Thomzoy Thomzoy force-pushed the handle_all_join_cases branch from 8e50799 to e7e527a Compare January 22, 2025 11:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

All join clauses not handled properly
2 participants