Added DBFSPath as os.PathLike implementation #131

Merged
merged 18 commits into from
Jul 16, 2024

Conversation

asnare
Contributor

@asnare asnare commented Jul 15, 2024

This PR extends the existing WorkspacePath support with a pathlib-like implementation for DBFS paths: DBFSPath

Incidental changes include:

  • Type-hinting the implementation and interfaces to assist with code linting.
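As a rough illustration of what a type-hinted, pathlib-like `os.PathLike` implementation involves, here is a minimal sketch. `RemotePath` and all of its members are hypothetical names for illustration only; this is not the `DBFSPath` class from this PR and makes no remote calls.

```python
import os
import posixpath


class RemotePath(os.PathLike):
    """Hypothetical path-like wrapper around a POSIX-style remote path."""

    def __init__(self, *segments: str) -> None:
        # Always anchor at the root; an absolute segment resets the join.
        self._path = posixpath.join("/", *segments)

    def __fspath__(self) -> str:
        # os.fspath() calls this to obtain the string form of the path.
        return self._path

    def __truediv__(self, other: str) -> "RemotePath":
        # Support the pathlib-style `parent / "child"` operator.
        return RemotePath(self._path, other)

    @property
    def name(self) -> str:
        return posixpath.basename(self._path)


p = RemotePath("tmp") / "data.csv"
print(os.fspath(p), p.name)
```

Anything that accepts `os.PathLike` can consume such an object through `os.fspath()`, which is why implementing `__fspath__` is the minimum requirement.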

… WorkspaceAPI.

In addition, type-hint the APIs as much as possible.
@asnare asnare added the enhancement New feature or request label Jul 15, 2024
@asnare asnare self-assigned this Jul 15, 2024
@nfx nfx marked this pull request as ready for review July 15, 2024 18:03

github-actions bot commented Jul 15, 2024

✅ 31/31 passed, 2 skipped, 1m18s total

Running from acceptance #199

Collaborator

@nfx nfx left a comment

almost ready

src/databricks/labs/blueprint/paths.py (outdated)
def is_relative_to(self, other, *more_other):  # pylint: disable=arguments-differ
    other = self.with_segments(other, *more_other)
    if self.anchor != other.anchor:
def is_relative_to(self, *other: str | bytes | os.PathLike) -> bool:  # pylint: disable=arguments-differ,useless-suppression
Collaborator

remove useless suppression

src/databricks/labs/blueprint/paths.py (outdated)
Comment on lines 609 to 616
    dst = self.with_segments(target)
    if overwrite:
        with dst.open(mode="wb") as writer, self.open(mode="rb") as reader:
            shutil.copyfileobj(reader, writer, length=1024 * 1024)
        self.unlink()
    else:
        self._ws.dbfs.move(self.as_posix(), dst.as_posix())
    return dst
Collaborator

Contributor Author

I tried that, but it turns out that the overwrite argument for move_() isn't propagated/implemented for DBFS src/target pairs: if the target exists then ResourceAlreadyExists is raised.

I've added a code comment that explains what's going on here, and implemented an integration test that covers this (to aid with safely changing it in the future).
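The workaround described in this reply can be sketched against an in-memory stand-in. `FakeStore` and `replace_via_copy` are hypothetical names for illustration, not the Databricks SDK or the merged code; the stand-in's `move()` refuses to overwrite an existing target, mirroring the behaviour reported above, and raises `FileExistsError` as an assumed substitute for `ResourceAlreadyExists`.

```python
import io
import shutil


class FakeStore:
    """Hypothetical in-memory file store whose move() refuses to overwrite."""

    def __init__(self) -> None:
        self.files: dict[str, bytes] = {}

    def move(self, src: str, dst: str) -> None:
        if dst in self.files:
            # Assumed stand-in for the ResourceAlreadyExists error.
            raise FileExistsError(dst)
        self.files[dst] = self.files.pop(src)


def replace_via_copy(store: FakeStore, src: str, dst: str) -> str:
    """Overwrite dst with src by streaming the bytes across, then deleting src."""
    reader = io.BytesIO(store.files[src])
    writer = io.BytesIO()
    shutil.copyfileobj(reader, writer, length=1024 * 1024)
    store.files[dst] = writer.getvalue()
    del store.files[src]
    return dst


store = FakeStore()
store.files = {"/a": b"new", "/b": b"old"}
try:
    store.move("/a", "/b")
    moved = True
except FileExistsError:
    moved = False
result = replace_via_copy(store, "/a", "/b")
```

The copy-then-delete path is not atomic, which is one reason to keep the plain move for the non-overwriting case.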

"""Remove a file in Databricks Workspace."""
if not missing_ok and not self.exists():
    raise FileNotFoundError(f"{self.as_posix()} does not exist")
self._ws.dbfs.delete(self.as_posix())
Collaborator

Suggested change
- self._ws.dbfs.delete(self.as_posix())
+ try:
+     self._ws.dbfs.delete(self.as_posix())
+ except NotFound as e:
+     if not missing_ok:
+         raise e

save API calls

Contributor Author

It turns out that the DBFS API client doesn't report an error if the target doesn't exist, so I've left this in place with a comment explaining why.

I've updated the WorkspacePath version to follow this pattern though: an error is reported if the target doesn't exist, so we can handle it in the unhappy path. Besides saving the extra API call, this avoids a race condition between the existence check and the delete.

Integration tests now cover this behaviour.
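The two `unlink` strategies discussed in this thread can be contrasted with a small sketch. `SilentStore`, `StrictStore`, the local `NotFound` class, and both helper functions are hypothetical stand-ins, not the SDK clients or the merged code; the `calls` counter is there only to show the difference in API calls.

```python
class NotFound(Exception):
    """Stand-in for a 'resource does not exist' API error."""


class SilentStore:
    """Hypothetical store whose delete() succeeds silently on missing paths."""

    def __init__(self, paths):
        self.paths = set(paths)
        self.calls = 0

    def exists(self, path):
        self.calls += 1
        return path in self.paths

    def delete(self, path):
        self.calls += 1
        self.paths.discard(path)


class StrictStore(SilentStore):
    """Hypothetical store whose delete() raises NotFound on missing paths."""

    def delete(self, path):
        self.calls += 1
        if path not in self.paths:
            raise NotFound(path)
        self.paths.remove(path)


def unlink_check_first(store, path, missing_ok=False):
    # The store cannot report a missing path, so ask before deleting.
    if not missing_ok and not store.exists(path):
        raise FileNotFoundError(f"{path} does not exist")
    store.delete(path)


def unlink_try_first(store, path, missing_ok=False):
    # One call on the happy path; the error itself reports a missing target.
    try:
        store.delete(path)
    except NotFound as e:
        if not missing_ok:
            raise FileNotFoundError(f"{path} does not exist") from e


silent = SilentStore({"/f"})
unlink_check_first(silent, "/f")
strict = StrictStore({"/f"})
unlink_try_first(strict, "/f")
print(silent.calls, strict.calls)
```

In this sketch the check-first variant costs two calls per deletion and leaves a window between `exists()` and `delete()`, while the try-first variant costs one call and has no such window.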

@nfx nfx changed the title from "Implement pathlib-like support for DBFS" to "Added DBFSPath as os.PathLike implementation" Jul 15, 2024
For DBFS it doesn't fail if missing so we need to check before trying.
There's no common implementation, and DBFS needs to treat them differently anyway.
These overrides aren't needed with python 3.10, the only version we currently lint against.
Collaborator

@nfx nfx left a comment

lgtm

@nfx nfx merged commit 40cb3a4 into main Jul 16, 2024
8 of 9 checks passed
@nfx nfx deleted the feature/dbfspath branch July 16, 2024 14:48
nfx added a commit that referenced this pull request Jul 16, 2024
* Added `DBFSPath` as `os.PathLike` implementation ([#131](#131)). The open-source library has been updated with a new class `DBFSPath`, an implementation of `os.PathLike` for Databricks File System (DBFS) paths. This new class extends the existing `WorkspacePath` support and provides pathlib-like functionality for DBFS paths, including methods for creating directories, renaming and deleting files and directories, and reading and writing files. The addition of `DBFSPath` includes type-hinting for improved code linting and is integrated in the test suite with new and updated tests for path-like objects. The behavior of the `exists` and `unlink` methods has been updated for `WorkspacePath` to improve performance and raise appropriate errors.
* Fixed `.as_uri()` and `.absolute()` implementations for `WorkspacePath` ([#127](#127)). In this release, the `WorkspacePath` class in the `paths.py` module has been updated with several improvements to the `.as_uri()` and `.absolute()` methods. These methods now utilize `pathlib` internals, providing better cross-version compatibility. The `.as_uri()` method now uses an f-string for concatenation and returns the UTF-8 encoded string representation of the `WorkspacePath` object via a new `__bytes__()` dunder method. Additionally, the `.absolute()` method has been implemented for the trivial (no-op) case and now supports returning the absolute path of files or directories in Databricks Workspace. Furthermore, the `glob()` and `rglob()` methods have been enhanced to support case-sensitive pattern matching based on a new `case_sensitive` parameter. To ensure the integrity of these changes, two new test cases, `test_as_uri()` and `test_absolute()`, have been added, thoroughly testing the functionality of these methods.
* Fixed `WorkspacePath` support for python 3.11 ([#121](#121)). The `WorkspacePath` class in our open-source library has been updated to improve compatibility with Python 3.11. The `.expanduser()` and `.glob()` methods have been modified to address internal changes in Python 3.11. The `is_dir()` and `is_file()` methods now include a `follow_symlinks` parameter, although it is not currently used. A new method, `_scandir()`, has been added for compatibility with Python 3.11. The `expanduser()` method has also been updated to expand `~` (but not `~user`) constructs. Additionally, a new method `is_notebook()` has been introduced to check if the path points to a notebook in Databricks Workspace. These changes aim to ensure that the library functions smoothly with the latest version of Python and provides additional functionality for users working with Databricks Workspace.
* Properly verify versions of python ([#118](#118)). In this release, we have made significant updates to the pyproject.toml file to enhance project dependency and development environment management. We have added several new packages to the `dependencies` section to expand the library's functionality and compatibility. Additionally, we have removed the `python` field, as it is no longer necessary. We have also updated the `path` field to specify the location of the virtual environment, which can improve integration with popular development tools such as Visual Studio Code and PyCharm. These changes are intended to streamline the development process and make it easier to manage dependencies and set up the development environment.
* Type annotations on path-related unit tests ([#128](#128)). In this open-source library update, type annotations have been added to path-related unit tests to enhance code clarity and maintainability. The tests encompass various scenarios, including verifying if a path exists, creating, removing, and checking directories, and testing file attributes such as distinguishing directories, notebooks, and regular files. The additions also cover functionality for opening and manipulating files in different modes like read binary, write binary, read text, and write text. Furthermore, tests for checking file permissions, handling errors, and globbing (pattern-based file path matching) have been incorporated. The tests interact with a WorkspaceClient mock object, simulating file system interactions. This enhancement bolsters the library's reliability and assists developers in creating robust, well-documented code when working with file system paths.
* Updated `WorkspacePath` to support Python 3.12 ([#122](#122)). In this release, the `WorkspacePath` implementation has been updated to ensure compatibility with Python 3.12, in addition to Python 3.10 and 3.11. The class was modified to replace most of the internal implementation and add extensive tests for public interfaces, ensuring that the superclass implementations are not used unless they are known to be safe. This change is in response to the significant changes in the superclass implementations between Python 3.11 and 3.12, which were found to be incompatible with each other. The `WorkspacePath` class now includes several new methods and tests to ensure that it functions seamlessly with different versions of Python. These changes include testing for initialization, equality, hash, comparison, path components, and various path manipulations. This update enhances the library's adaptability and ensures it functions correctly with different versions of Python. Classifiers have also been updated to include support for Python 3.12.
* `WorkspacePath` fixes for the `.resolve()` implementation ([#129](#129)). The `.resolve()` method for `WorkspacePath` has been updated to improve its handling of relative paths and the `strict` argument. Previously, relative paths were not properly validated and would be returned as-is. Now, relative paths will cause the method to fail. The `strict` argument is now checked, and if set to `True` and the path does not exist, a `FileNotFoundError` will be raised. The method `.absolute()` is used to obtain the absolute path of the file or directory in Databricks Workspace and is used in the implementation of `.resolve()`. A new test, `test_resolve()`, has been added to verify these changes, covering scenarios where the path is absolute, the path exists, the path does not exist, and the path is relative. In the case of relative paths, a `NotImplementedError` is raised, as `.resolve()` is not supported for them.
* `WorkspacePath`: Fix the .rename() and .replace() implementations to return the target path ([#130](#130)). The `.rename()` and `.replace()` methods of the `WorkspacePath` class have been updated to return the target path as part of the public API, with `.rename()` no longer accepting the `overwrite` keyword argument and always failing if the target path already exists. A new private method, `._rename()`, has been added to include the `overwrite` argument and is used by both `.rename()` and `.replace()`. This update is a preparatory step for factoring out common code to support DBFS paths. The tests have been updated accordingly, combining and adding functions to test the new and updated methods. The `.unlink()` method's behavior remains unchanged. Please note that the exact error raised when `.rename()` fails due to an existing target path is yet to be defined.
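The `.rename()` / `.replace()` arrangement described in the last entry, where both public methods delegate to a private helper that takes the `overwrite` flag, might look roughly like the sketch below. `MemPath` is a hypothetical in-memory class, not the library's `WorkspacePath`, and `FileExistsError` is an assumption, since the notes say the exact error is yet to be defined.

```python
class MemPath:
    """Hypothetical path over a shared in-memory dict; not the library's class."""

    def __init__(self, files: dict[str, bytes], path: str) -> None:
        self._files = files
        self._path = path

    def _rename(self, target: str, overwrite: bool) -> "MemPath":
        # Shared implementation: only this method knows about `overwrite`.
        if not overwrite and target in self._files:
            raise FileExistsError(target)  # assumed error type
        self._files[target] = self._files.pop(self._path)
        return MemPath(self._files, target)

    def rename(self, target: str) -> "MemPath":
        # Always fails if the target exists; returns the target path.
        return self._rename(target, overwrite=False)

    def replace(self, target: str) -> "MemPath":
        # Overwrites an existing target; returns the target path.
        return self._rename(target, overwrite=True)


files = {"/a": b"1", "/b": b"2", "/c": b"3"}
dst = MemPath(files, "/a").replace("/b")
try:
    MemPath(files, "/b").rename("/c")
    rename_failed = False
except FileExistsError:
    rename_failed = True
```

Returning the target from both methods matches the `pathlib` convention and lets callers chain further operations on the new path.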

Dependency updates:

 * Bump sigstore/gh-action-sigstore-python from 2.1.1 to 3.0.0 ([#133](#133)).
@nfx nfx mentioned this pull request Jul 16, 2024