
Add UC volume support - closes #140 (#173)

Open
wants to merge 2 commits into main
Conversation

gergo-databricks

Add UC volume support - closes #140


❌ 41/49 passed, 1 flaky, 8 failed, 2 skipped, 1m48s total

❌ test_open_text_io[VolumePath]: databricks.sdk.errors.platform.NotFound: Catalog '~' does not exist. (1.485s)
databricks.sdk.errors.platform.NotFound: Catalog '~' does not exist.
18:16 DEBUG [databricks.sdk] Loaded from environment
18:16 DEBUG [databricks.sdk] Ignoring pat auth, because metadata-service is preferred
18:16 DEBUG [databricks.sdk] Ignoring basic auth, because metadata-service is preferred
18:16 DEBUG [databricks.sdk] Attempting to configure auth: metadata-service
18:16 INFO [databricks.sdk] Using Databricks Metadata Service authentication
18:16 DEBUG [databricks.sdk] PUT /api/2.0/fs/directories/Volumes/~/Y6B4hzyLFv9UM2m0/a/b/c
< 404 Not Found
< {
<   "details": [
<     {
<       "@type": "type.googleapis.com/google.rpc.ErrorInfo",
<       "domain": "filesystem.databricks.com",
<       "metadata": {
<         "unity_catalog_error_message": "Catalog '~' does not exist."
<       },
<       "reason": "FILES_API_CATALOG_NOT_FOUND"
<     }
<   ],
<   "error_code": "NOT_FOUND",
<   "message": "Catalog '~' does not exist."
< }
[gw9] linux -- Python 3.10.15 /home/runner/work/blueprint/blueprint/.venv/bin/python
❌ test_stat[VolumePath]: databricks.sdk.errors.platform.NotFound: Catalog '~' does not exist. (986ms)
databricks.sdk.errors.platform.NotFound: Catalog '~' does not exist.
18:16 DEBUG [databricks.sdk] Loaded from environment
18:16 DEBUG [databricks.sdk] Ignoring pat auth, because metadata-service is preferred
18:16 DEBUG [databricks.sdk] Ignoring basic auth, because metadata-service is preferred
18:16 DEBUG [databricks.sdk] Attempting to configure auth: metadata-service
18:16 INFO [databricks.sdk] Using Databricks Metadata Service authentication
18:16 DEBUG [databricks.sdk] PUT /api/2.0/fs/directories/Volumes/~/RvtStMfQ0UuADB4i/a/b/c
< 404 Not Found
< {
<   "details": [
<     {
<       "@type": "type.googleapis.com/google.rpc.ErrorInfo",
<       "domain": "filesystem.databricks.com",
<       "metadata": {
<         "unity_catalog_error_message": "Catalog '~' does not exist."
<       },
<       "reason": "FILES_API_CATALOG_NOT_FOUND"
<     }
<   ],
<   "error_code": "NOT_FOUND",
<   "message": "Catalog '~' does not exist."
< }
[gw0] linux -- Python 3.10.15 /home/runner/work/blueprint/blueprint/.venv/bin/python
❌ test_mkdirs[VolumePath]: AssertionError: assert not True (772ms)
AssertionError: assert not True
 +  where True = <bound method _DatabricksPath.is_absolute of VolumePath('/Volumes/~/TYu1tF23VyIFr8AN/foo/bar/baz')>()
 +    where <bound method _DatabricksPath.is_absolute of VolumePath('/Volumes/~/TYu1tF23VyIFr8AN/foo/bar/baz')> = VolumePath('/Volumes/~/TYu1tF23VyIFr8AN/foo/bar/baz').is_absolute
18:16 DEBUG [databricks.sdk] Loaded from environment
18:16 DEBUG [databricks.sdk] Ignoring pat auth, because metadata-service is preferred
18:16 DEBUG [databricks.sdk] Ignoring basic auth, because metadata-service is preferred
18:16 DEBUG [databricks.sdk] Attempting to configure auth: metadata-service
18:16 INFO [databricks.sdk] Using Databricks Metadata Service authentication
[gw7] linux -- Python 3.10.15 /home/runner/work/blueprint/blueprint/.venv/bin/python
❌ test_rename_file[VolumePath]: ValueError: Missing catalog, schema or volume name: Volumes/~/qWAQQGWkN3rPj61t (669ms)
ValueError: Missing catalog, schema or volume name: Volumes/~/qWAQQGWkN3rPj61t
18:16 DEBUG [databricks.sdk] Loaded from environment
18:16 DEBUG [databricks.sdk] Ignoring pat auth, because metadata-service is preferred
18:16 DEBUG [databricks.sdk] Ignoring basic auth, because metadata-service is preferred
18:16 DEBUG [databricks.sdk] Attempting to configure auth: metadata-service
18:16 INFO [databricks.sdk] Using Databricks Metadata Service authentication
[gw7] linux -- Python 3.10.15 /home/runner/work/blueprint/blueprint/.venv/bin/python
❌ test_unlink[VolumePath]: ValueError: Missing catalog, schema or volume name: Volumes/~/h73TWm4OPvPxhFkg (701ms)
ValueError: Missing catalog, schema or volume name: Volumes/~/h73TWm4OPvPxhFkg
18:16 DEBUG [databricks.sdk] Loaded from environment
18:16 DEBUG [databricks.sdk] Ignoring pat auth, because metadata-service is preferred
18:16 DEBUG [databricks.sdk] Ignoring basic auth, because metadata-service is preferred
18:16 DEBUG [databricks.sdk] Attempting to configure auth: metadata-service
18:16 INFO [databricks.sdk] Using Databricks Metadata Service authentication
[gw9] linux -- Python 3.10.15 /home/runner/work/blueprint/blueprint/.venv/bin/python
❌ test_open_binary_io[VolumePath]: ValueError: Missing catalog, schema or volume name: Volumes/~/1appBK9hrhZYaMjh (855ms)
ValueError: Missing catalog, schema or volume name: Volumes/~/1appBK9hrhZYaMjh
18:16 DEBUG [databricks.sdk] Loaded from environment
18:16 DEBUG [databricks.sdk] Ignoring pat auth, because metadata-service is preferred
18:16 DEBUG [databricks.sdk] Ignoring basic auth, because metadata-service is preferred
18:16 DEBUG [databricks.sdk] Attempting to configure auth: metadata-service
18:16 INFO [databricks.sdk] Using Databricks Metadata Service authentication
[gw2] linux -- Python 3.10.15 /home/runner/work/blueprint/blueprint/.venv/bin/python
❌ test_replace_file[VolumePath]: ValueError: Missing catalog, schema or volume name: Volumes/~/X2GR1ybrA2ZTha7J (598ms)
ValueError: Missing catalog, schema or volume name: Volumes/~/X2GR1ybrA2ZTha7J
18:16 DEBUG [databricks.sdk] Loaded from environment
18:16 DEBUG [databricks.sdk] Ignoring pat auth, because metadata-service is preferred
18:16 DEBUG [databricks.sdk] Ignoring basic auth, because metadata-service is preferred
18:16 DEBUG [databricks.sdk] Attempting to configure auth: metadata-service
18:16 INFO [databricks.sdk] Using Databricks Metadata Service authentication
[gw7] linux -- Python 3.10.15 /home/runner/work/blueprint/blueprint/.venv/bin/python
❌ test_resolve_is_consistent[VolumePath]: ValueError: Missing catalog, schema or volume name: /Volumes/a/d (194ms)
ValueError: Missing catalog, schema or volume name: /Volumes/a/d
18:16 DEBUG [databricks.sdk] Loaded from environment
18:16 DEBUG [databricks.sdk] Ignoring pat auth, because metadata-service is preferred
18:16 DEBUG [databricks.sdk] Ignoring basic auth, because metadata-service is preferred
18:16 DEBUG [databricks.sdk] Attempting to configure auth: metadata-service
18:16 INFO [databricks.sdk] Using Databricks Metadata Service authentication
[gw8] linux -- Python 3.10.15 /home/runner/work/blueprint/blueprint/.venv/bin/python

Flaky tests:

  • 🤪 test_stat[DBFSPath] (1.388s)

Running from acceptance #242

This code expands the `~` symbol to the full path of the user's home directory, computes the relative path from this
home directory to the previously created directory (`~/some-folder/foo/bar/baz`), and verifies it matches the expected
relative path (`some-folder/foo/bar/baz`). It then confirms that the expanded path is absolute, checks that
calling `absolute()` on this path returns the path itself, and converts the path to a FUSE-compatible path.
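The behaviour described above can be sketched with plain `pathlib` primitives; the home directory below is a hypothetical stand-in, not the value the actual test resolves.

```python
from pathlib import PurePosixPath

# Hypothetical home directory; the real test resolves the current user's home.
home = PurePosixPath("/Users/someone@example.com")
target = home / "some-folder" / "foo" / "bar" / "baz"  # expanded ~/some-folder/foo/bar/baz

# The relative path from the home directory back to the created directory.
relative = target.relative_to(home)
assert str(relative) == "some-folder/foo/bar/baz"

# The expanded path is absolute.
assert target.is_absolute()
```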
Collaborator

Please revert the irrelevant changes to this file.

@@ -42,7 +42,7 @@ def _inner(*_, **__):
     return _inner


-class _UploadIO(abc.ABC):
+class _WsUploadIO(abc.ABC):
Collaborator

Suggested change
class _WsUploadIO(abc.ABC):
class _WorkspaceUploadIO(abc.ABC):

small nit

        self._cached_is_directory = None
        self._parse_volume_name()

    def get_catalog_name(self) -> str:
Collaborator

Suggested change
def get_catalog_name(self) -> str:
@property
def catalog_name(self) -> str:

I think the os.PathLike experience implies these should be properties. What do you think?
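A minimal sketch of the property-based API being suggested; the class below is a stand-in for the PR's `VolumePath`, and the parsing logic is an assumption based on the error messages seen in the CI run.

```python
# Hypothetical stand-in for the PR's VolumePath; only the naming API is shown.
class VolumePath:
    def __init__(self, path: str) -> None:
        # Expect paths of the form /Volumes/<catalog>/<schema>/<volume>/...
        parts = path.strip("/").split("/")
        if len(parts) < 4 or parts[0] != "Volumes":
            raise ValueError(f"Missing catalog, schema or volume name: {path}")
        self._catalog_name, self._schema_name, self._volume_name = parts[1:4]

    @property
    def catalog_name(self) -> str:
        return self._catalog_name

    @property
    def schema_name(self) -> str:
        return self._schema_name

    @property
    def full_name(self) -> str:
        # Fully-qualified volume name, as used by the Unity Catalog APIs.
        return f"{self._catalog_name}.{self._schema_name}.{self._volume_name}"
```

With this shape, `path.catalog_name` reads like an attribute, matching the os.PathLike feel the reviewer is after, and a malformed path such as `/Volumes/a/d` fails early with the "Missing catalog, schema or volume name" error seen in the failing tests.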

@@ -1041,3 +1281,17 @@ def _select_children(self, path: T) -> Iterable[T]:
            if candidate not in yielded:
                yielded.add(candidate)
                yield candidate


def create_path(ws: WorkspaceClient, path: str) -> _DatabricksPath:
Collaborator

Suggested change
def create_path(ws: WorkspaceClient, path: str) -> _DatabricksPath:
def as_databricks_path(ws: WorkspaceClient, path: str) -> _DatabricksPath:

Comment on lines +1290 to +1291
if path_without_scheme.startswith("/Volumes/"):
    return VolumePath(ws, path_without_scheme)
Collaborator

Suggested change
if path_without_scheme.startswith("/Volumes/"):
    return VolumePath(ws, path_without_scheme)
parts = path_without_scheme.split('/')
if parts[0] == 'Volumes':
    return VolumePath(ws, '/' + '/'.join(parts))

The same applies to the Workspace and dbfs cases: here you need to convert a FUSE path into the relevant implementation, and the cases for dbfs and wsfs are currently incorrect, because we don't require the /Workspace prefix in the actual instance.
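A sketch of the dispatch the reviewer is describing; the path classes are minimal stand-ins (not the PR's implementations), and the prefix-stripping rules follow the reviewer's comments rather than confirmed behaviour.

```python
from dataclasses import dataclass

# Minimal stand-ins for the PR's path classes, holding only the resolved path.
@dataclass
class _Stub:
    path: str

class VolumePath(_Stub): ...
class WorkspacePath(_Stub): ...
class DBFSPath(_Stub): ...

def as_databricks_path(path: str) -> _Stub:
    # Strip an optional scheme such as "file:" or "dbfs:".
    scheme, sep, rest = path.partition(":")
    if not sep:
        scheme, rest = "", path
    parts = rest.strip("/").split("/")
    if parts[0] == "Volumes":
        # UC volume paths keep their /Volumes prefix.
        return VolumePath("/" + "/".join(parts))
    if parts[0] == "Workspace":
        # The Workspace implementation does not require the /Workspace prefix.
        return WorkspacePath("/" + "/".join(parts[1:]))
    if scheme != "dbfs" and parts[0] == "dbfs":
        # A /dbfs/... FUSE path maps to the DBFS root without the prefix.
        return DBFSPath("/" + "/".join(parts[1:]))
    return DBFSPath("/" + "/".join(parts))
```

Under these assumptions `dbfs:/Volumes/...` dispatches to `VolumePath`, while `/dbfs/...` and `file:/Workspace/...` lose their FUSE prefixes, matching the as_posix expectations the reviewer lists further down.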

@pytest.mark.parametrize("cls", DATABRICKS_PATHLIKE)
def test_exists(ws, cls):
    wsp = cls(ws, "/Users/foo/bar/baz")
    assert not wsp.exists()

Comment on lines +1151 to +1158
("/Workspace/my/path/file.ext", WorkspacePath, "/Workspace/my/path/file.ext"),
("file:/Workspace/my/path/file.ext", WorkspacePath, "/Workspace/my/path/file.ext"),
("/Volumes/my/path/to/file.ext", VolumePath, "/Volumes/my/path/to/file.ext"),
("file:/Volumes/my/path/to/file.ext", VolumePath, "/Volumes/my/path/to/file.ext"),
("dbfs:/Volumes/my/path/to/file.ext", VolumePath, "/Volumes/my/path/to/file.ext"),
("/dbfs/my/path/file.ext", DBFSPath, "/dbfs/my/path/file.ext"),
("file:/dbfs/my/path/file.ext", DBFSPath, "/dbfs/my/path/file.ext"),
("dbfs:/my/path/file.ext", DBFSPath, "/my/path/file.ext"),
Collaborator

Suggested change
("/Workspace/my/path/file.ext", WorkspacePath, "/Workspace/my/path/file.ext"),
("file:/Workspace/my/path/file.ext", WorkspacePath, "/Workspace/my/path/file.ext"),
("/Volumes/my/path/to/file.ext", VolumePath, "/Volumes/my/path/to/file.ext"),
("file:/Volumes/my/path/to/file.ext", VolumePath, "/Volumes/my/path/to/file.ext"),
("dbfs:/Volumes/my/path/to/file.ext", VolumePath, "/Volumes/my/path/to/file.ext"),
("/dbfs/my/path/file.ext", DBFSPath, "/dbfs/my/path/file.ext"),
("file:/dbfs/my/path/file.ext", DBFSPath, "/dbfs/my/path/file.ext"),
("dbfs:/my/path/file.ext", DBFSPath, "/my/path/file.ext"),
("/Workspace/my/path/file.ext", WorkspacePath, "/my/path/file.ext"),
("file:/Workspace/my/path/file.ext", WorkspacePath, "/my/path/file.ext"),
("/Volumes/my/path/to/file.ext", VolumePath, "/Volumes/my/path/to/file.ext"),
("file:/Volumes/my/path/to/file.ext", VolumePath, "/Volumes/my/path/to/file.ext"),
("dbfs:/Volumes/my/path/to/file.ext", VolumePath, "/Volumes/my/path/to/file.ext"),
("/dbfs/my/path/file.ext", DBFSPath, "/my/path/file.ext"),
("file:/dbfs/my/path/file.ext", DBFSPath, "/my/path/file.ext"),
("dbfs:/my/path/file.ext", DBFSPath, "/my/path/file.ext"),

These are the correct expected test cases for as_posix; what you've specified is as_fuse.

Comment on lines +1070 to +1076
assert path.get_catalog_name() == "a"
assert path.get_schema_name(False) == "b"
assert path.get_schema_name(True) == "a.b"
assert path.get_volume_name(True, True) == "a.b.c"
assert path.get_volume_name(True, False) == "a.b.c"
assert path.get_volume_name(False, True) == "b.c"
assert path.get_volume_name(False, False) == "c"
Collaborator

Suggested change
assert path.get_catalog_name() == "a"
assert path.get_schema_name(False) == "b"
assert path.get_schema_name(True) == "a.b"
assert path.get_volume_name(True, True) == "a.b.c"
assert path.get_volume_name(True, False) == "a.b.c"
assert path.get_volume_name(False, True) == "b.c"
assert path.get_volume_name(False, False) == "c"
assert path.catalog_name == "a"
assert path.schema_name == "b"
assert path.full_schema_name == "a.b"
assert path.full_name == "a.b.c"

get_volume_name(False, False) is not clear at all.

def test_volume_conversions() -> None:
    ws = create_autospec(WorkspaceClient)
    ws.config.host = "https://example.org"
    path = VolumePath(ws, "/Volumes/a/b/c/d/e.f")
Collaborator

@asnare do you think this API is aligned with os.PathLike?


def as_fuse(self) -> Path:
    """Return FUSE-mounted path in Databricks Runtime."""
    if "DATABRICKS_RUNTIME_VERSION" not in os.environ:
Collaborator

So, apparently, as_fuse is equal to as_posix for UC volumes.
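A minimal illustration of this observation; the `/dbfs` prefix rule reflects how DBFS is FUSE-mounted in Databricks Runtime, while the pass-through for `/Volumes` paths is exactly the point being made here. The function is a sketch, not the PR's `as_fuse`.

```python
def as_fuse(posix_path: str) -> str:
    """Map an API-style posix path to its FUSE-mounted location (sketch)."""
    if posix_path.startswith("/Volumes/"):
        # UC volumes: the FUSE mount and the API path coincide,
        # so as_fuse equals as_posix.
        return posix_path
    # DBFS paths gain the /dbfs prefix under FUSE.
    return "/dbfs" + posix_path

assert as_fuse("/Volumes/a/b/c/d/e.f") == "/Volumes/a/b/c/d/e.f"
assert as_fuse("/my/path/file.ext") == "/dbfs/my/path/file.ext"
```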

            return f"{self._catalog_name}.{self._schema_name}"
        return self._schema_name

    def get_volume_name(self, with_catalog_name: bool = False, with_schema_name: bool = False) -> str:
Collaborator

you can technically also add

@property
def volume_info(self):
    return self._ws.volumes.get(self.full_name)

@property
def schema_info(self):
    return self._ws.schemas.get(self.schema_name)


Successfully merging this pull request may close these issues.

UC Volume support
2 participants