Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle refs/convert/parquet and PR revision correctly in hffs #1712

Merged
merged 5 commits into from
Oct 5, 2023

Conversation

Wauplin
Copy link
Contributor

@Wauplin Wauplin commented Oct 5, 2023

Solves #1710 by processing refs/convert/parquet and refs/pr/(\d)+ as special revisions when resolving a path. Also make sure that the unresolve counterpart works the same. Also added some tests for it.

Use case from @Hakimovich99 now works properly without having to url-encode the revision. This time I tested with jamescalam/llama-2-arxiv-papers-chunked dataset which doesn't have parquet files on main revision:

>>> from dask import dataframe as dd
>>> dd.read_parquet("hf://datasets/jamescalam/llama-2-arxiv-papers-chunked@refs/convert/parquet")
# Dask DataFrame Structure:
#                  doi chunk-id   chunk      id   title summary  source authors categories comment journal_ref primary_category published updated references
# npartitions=1                                                                                                                                              
#               object   object  object  object  object  object  object  object     object  object      object           object    object  object     object
#                  ...      ...     ...     ...     ...     ...     ...     ...        ...     ...         ...              ...       ...     ...        ...
# Dask Name: read-parquet, 1 graph layer

>>> dd.read_parquet("hf://datasets/jamescalam/llama-2-arxiv-papers-chunked@refs/convert/parquet/default/train/0000.parquet")
# Dask DataFrame Structure:
#                   doi chunk-id   chunk      id   title summary  source authors categories comment journal_ref primary_category published updated references
# npartitions=1                                                                                                                                              
#                object   object  object  object  object  object  object  object     object  object      object           object    object  object     object
#                   ...      ...     ...     ...     ...     ...     ...     ...        ...     ...         ...              ...       ...     ...        ...
# Dask Name: read-parquet, 1 graph layer

This should be a small QOL improvements for datasets users while not degrading the experience for others.

@Wauplin Wauplin requested review from lhoestq and mariosasko October 5, 2023 10:12
@HuggingFaceDocBuilderDev
Copy link

HuggingFaceDocBuilderDev commented Oct 5, 2023

The documentation is not available anymore as the PR was closed or merged.

@lhoestq
Copy link
Member

lhoestq commented Oct 5, 2023

Btw forgot to mention this in the other issue but using the ~parquet alias works:

dd.read_parquet("hf://datasets/jamescalam/llama-2-arxiv-papers-chunked@~parquet")

@Wauplin
Copy link
Contributor Author

Wauplin commented Oct 5, 2023

Btw forgot to mention this in the other issue but using the ~parquet alias works:

Aaaah, right. Completely forgot as well. I think that's why I thought we already talked about simplifying the revision (as mentioned in #1707 (comment)). Anyway, I think handling parquets + PR refs differently is also good in addition to of the ~parquet alias.

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me !

@Wauplin
Copy link
Contributor Author

Wauplin commented Oct 5, 2023

Thanks for the review! Failing tests are unrelated so I'll merge now :)

@Wauplin Wauplin merged commit cbcd8b2 into main Oct 5, 2023
@Wauplin Wauplin deleted the 1710-handle-refs-convert-parquet-revision-correctly branch October 5, 2023 13:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants