-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Synchronise with dask-expr, newer Dask and newer deltalake #69
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
❗ Your organization needs to install the Codecov GitHub app to enable full functionality. Additional details and impacted files@@ Coverage Diff @@
## main #69 +/- ##
==========================================
- Coverage 74.06% 73.98% -0.09%
==========================================
Files 6 6
Lines 320 319 -1
==========================================
- Hits 237 236 -1
Misses 83 83 ☔ View full report in Codecov by Sentry. |
} | ||
graph = HighLevelGraph.from_collections(final_name, dsk, dependencies=(written,)) # type: ignore | ||
result = Scalar(graph, final_name, "") | ||
result = dask.delayed(_commit, name="deltatable-commit")( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if this is doing what it's supposed to. I'll add a couple tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's not calling optimise, so it's indeed not working properly
PR to fix: dask/dask#11231
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels a little premature given that dask-expr
is going to be merged into dask/dask
main at some point (and I assume the repo/package will be deprecated?). Or am I misunderstanding the plan?
I agree that given this is a young project with a small user base that not having a switch to go back is fine. You can always downgrade versions if you want to go back.
@jacobtomlinson I think that the plan is to switch the default quite soon. As far as we know, no one uses this library, so I think that it makes sense to aim to the future. |
I know users of this library, but I don't think it's a surprise that this repo is under active development so large changes aren't an issue. They are likely hard pinning. I was more surprised about adding the Dask-expr dependency because I assumed it would be going away in the near future. |
@fjetter can say more, but I wouldn't be surprised if development stays in that repo for the next few months, even while dask.dataframe depends on it. |
Ok that makes more sense! Thanks for clarifying. |
First of all, I'll revert this to a draft PR again because I have to figure out a couple of things around delayed. Second, we can make this PR respect the toggle we're using in dask/dask as well if that's what we want to do. I suspect the timeline until main dask switches the default is a matter of (few) weeks so I was trying to avoid a lot of compat code. If the delayed thing can be resolved, we'll not need a lot of compat and making this all optional will be very easy. Lastly, the |
That all sounds great. I wouldn't bother with compat here, it's probably not worth the investment given the size of the userbase. |
I ran into wanting this today. Just pinging here so that it doesn't get forgotten (but I understand that it's probably not a sufficiently high priority at the moment). Mostly I want to ensure that it's on some backlog somewhere. |
# Conflicts: # dask_deltatable/core.py # requirements.txt
So ci is green for the new deltalake version, we don't need any special casing for dask-expr anymore either. cc @fjetter could you add more context what you were referring to here: |
This is now ready for review |
This migrates both read and write jobs to dask-expr. It would not respect the dask/dask switch for query-planning. Putting that switch in would not be very hard but I figured since this is still a bit of a niche project it's fine to not having that opt-out. Thoughts?
With this, the reader should support projections but not filters to IO level but I think this is something we haven't even implemented in proper dask-expr yet. Implementing a proper reader would also be possible but would require more work