Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Synchronise with dask-expr, newer Dask and newer deltalake #69

Merged
merged 9 commits into from
Jul 17, 2024

Conversation

fjetter
Copy link
Contributor

@fjetter fjetter commented Feb 13, 2024

This migrates both read and write jobs to dask-expr. It would not respect the dask/dask switch for query-planning. Putting that switch in would not be very hard but I figured since this is still a bit of a niche project it's fine to not having that opt-out. Thoughts?

With this, the reader should support projections but not filters to IO level but I think this is something we haven't even implemented in proper dask-expr yet. Implementing a proper reader would also be possible but would require more work

@codecov-commenter
Copy link

codecov-commenter commented Feb 13, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (cbe085a) 74.06% compared to head (3eb0c53) 73.98%.

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #69      +/-   ##
==========================================
- Coverage   74.06%   73.98%   -0.09%     
==========================================
  Files           6        6              
  Lines         320      319       -1     
==========================================
- Hits          237      236       -1     
  Misses         83       83              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

}
graph = HighLevelGraph.from_collections(final_name, dsk, dependencies=(written,)) # type: ignore
result = Scalar(graph, final_name, "")
result = dask.delayed(_commit, name="deltatable-commit")(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this is doing what it's supposed to. I'll add a couple tests

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not calling optimise, so it's indeed not working properly

PR to fix: dask/dask#11231

Copy link
Collaborator

@jacobtomlinson jacobtomlinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels a little premature given that dask-expr is going to be merged into dask/dask main at some point (and I assume the repo/package will be deprecated?). Or am I misunderstanding the plan?

I agree that given this is a young project with a small user base that not having a switch to go back is fine. You can always downgrade versions if you want to go back.

@mrocklin
Copy link
Contributor

@jacobtomlinson I think that the plan is to switch the default quite soon.

As far as we know, no one uses this library, so I think that it makes sense to aim to the future.

@jacobtomlinson
Copy link
Collaborator

I know users of this library, but I don't think it's a surprise that this repo is under active development so large changes aren't an issue. They are likely hard pinning.

I was more surprised about adding the Dask-expr dependency because I assumed it would be going away in the near future.

@mrocklin
Copy link
Contributor

@fjetter can say more, but I wouldn't be surprised if development stays in that repo for the next few months, even while dask.dataframe depends on it. dask[dataframe] will likely depend on dask-expr for a while.

@jacobtomlinson
Copy link
Collaborator

Ok that makes more sense! Thanks for clarifying.

@fjetter
Copy link
Contributor Author

fjetter commented Feb 14, 2024

First of all, I'll revert this to a draft PR again because I have to figure out a couple of things around delayed.

Second, we can make this PR respect the toggle we're using in dask/dask as well if that's what we want to do. I suspect the timeline until main dask switches the default is a matter of (few) weeks so I was trying to avoid a lot of compat code. If the delayed thing can be resolved, we'll not need a lot of compat and making this all optional will be very easy.

Lastly, the dask-expr package will stay, at least for a little while. We'll have to be more careful about compatibility but by having this separate we'll have an even easier time releasing bug fixes. Once this stabilizes, the separate package will vanish. I suspect this is on a "months" rather than "weeks" timeline.

@fjetter fjetter marked this pull request as draft February 14, 2024 14:23
@jacobtomlinson
Copy link
Collaborator

That all sounds great. I wouldn't bother with compat here, it's probably not worth the investment given the size of the userbase.

@mrocklin
Copy link
Contributor

I ran into wanting this today. Just pinging here so that it doesn't get forgotten (but I understand that it's probably not a sufficiently high priority at the moment). Mostly I want to ensure that it's on some backlog somewhere.

@phofl phofl changed the title Make dask-expr mandatory Synchronise with dask-expr and newer Dask Jun 26, 2024
@phofl phofl changed the title Synchronise with dask-expr and newer Dask Synchronise with dask-expr, newer Dask and newer deltalake Jun 26, 2024
@phofl
Copy link
Collaborator

phofl commented Jun 26, 2024

So ci is green for the new deltalake version, we don't need any special casing for dask-expr anymore either.

cc @fjetter could you add more context what you were referring to here:

#69 (comment)

@phofl phofl marked this pull request as ready for review July 17, 2024 08:06
@phofl
Copy link
Collaborator

phofl commented Jul 17, 2024

This is now ready for review

@fjetter fjetter merged commit 7706c22 into dask-contrib:main Jul 17, 2024
13 checks passed
@fjetter fjetter deleted the dask_expr branch July 17, 2024 13:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants