Finalize API for writing Delta Tables #42

Closed
j-bennet opened this issue Jul 11, 2023 · 1 comment · Fixed by #46

Comments

@j-bennet
Collaborator

j-bennet commented Jul 11, 2023

The initial API for writing Delta Lake is a little bit clunky for the user.

When reading, users have to do something like this:

from dask_deltatable import read_delta_table
ddf = read_delta_table("path_to_table")

To write, they need this:

from dask_deltatable.write import to_deltalake
out = to_deltalake("path_to_table", ddf)
out.compute()

TODO:

  • naming is not consistent: read_delta_table vs to_deltalake. Any of the following combos would be more consistent:
    1. read_delta_table/write_delta_table
    2. read_deltalake/write_deltalake
    3. read_delta_table/to_delta_table
    4. read_deltalake/to_deltalake
  • to_deltalake should be exposed on top level, same as read_delta_table
  • user shouldn't need to call compute as an extra step; add a compute: bool kwarg instead
  • Possibly remove history() from this library #17
  • Possibly remove vacuum from this library #16
@fjetter
Contributor

fjetter commented Jul 12, 2023

read_foo / to_foo is the standard terminology in dask. I believe this is true for all IO APIs we're offering, see https://docs.dask.org/en/stable/dataframe-api.html#create-dataframes and https://docs.dask.org/en/stable/dataframe-api.html#store-dataframes

I suggest read_deltalake and to_deltalake

> to_deltalake should be exposed on top level, same as read_delta_table

+1

> user shouldn't need to call compute.

We typically offer a compute kwarg to control this behavior. I'm fine adding this to to_deltalake as well.
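
To make the compute kwarg idea concrete, here is a minimal sketch of the pattern in plain Python. It is not the dask-deltatable implementation: `LazyWrite`, `_write_partitions`, and `to_deltalake_sketch` are hypothetical names made up for illustration, the "write" only counts rows, and a list of lists stands in for a partitioned dataframe.

```python
from typing import Any, Callable, List, Union


class LazyWrite:
    """Hypothetical stand-in for a lazy task handle (similar in spirit to a dask Delayed)."""

    def __init__(self, func: Callable[..., Any], *args: Any) -> None:
        self._func = func
        self._args = args

    def compute(self) -> Any:
        # Run the deferred work only when explicitly asked to.
        return self._func(*self._args)


def _write_partitions(path: str, partitions: List[list]) -> int:
    # Placeholder for the real write: it only counts rows so the sketch stays runnable.
    return sum(len(part) for part in partitions)


def to_deltalake_sketch(
    path: str, partitions: List[list], compute: bool = True
) -> Union[int, LazyWrite]:
    """Illustrative only: shows the compute switch, not an actual Delta Lake write."""
    task = LazyWrite(_write_partitions, path, partitions)
    # compute=True runs the write immediately; compute=False hands back the lazy task.
    return task.compute() if compute else task


# Default: the write happens in one call, no extra compute step.
rows_written = to_deltalake_sketch("path_to_table", [[1, 2], [3]])

# Opt-out: the caller gets a lazy handle and decides when to run it.
pending = to_deltalake_sketch("path_to_table", [[1, 2], [3]], compute=False)
rows_after_compute = pending.compute()
```

With a default of compute=True, the common case needs a single call, while compute=False keeps the current behavior available for users who want to schedule the write themselves.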
