
Add kedro catalog resolve command #2891

Merged 35 commits on Aug 18, 2023
Commits
1a81700
Implement working draft
Aug 3, 2023
0f5b79f
Remove comments
Aug 3, 2023
b70f8e6
Merge branch 'main' into feat/add-kedro-catalog-resolve-command
AhdraMeraliQB Aug 3, 2023
868671e
Remove raw_config
Aug 4, 2023
c3aec4c
Use config loader to access catalog config
Aug 4, 2023
2f956da
Use dict comprehension
Aug 4, 2023
38a0394
Merge branch 'main' into feat/add-kedro-catalog-resolve-command
AhdraMeraliQB Aug 8, 2023
04dc325
Remove pipeline filtering
Aug 9, 2023
c30c565
Prevent overwrite or param inclusion
Aug 9, 2023
1df1c3d
Trim filepath
Aug 9, 2023
5758060
Appease linter
Aug 9, 2023
473ddf1
Add test for resolve
Aug 11, 2023
1e387d4
Add test to exclude params
Aug 11, 2023
e88f695
Add test for overwrite (not working)
Aug 11, 2023
96b84de
Fix test
Aug 11, 2023
dd3117c
Remove print
Aug 11, 2023
97d85ee
Merge branch 'main' into feat/add-kedro-catalog-resolve-command
AhdraMeraliQB Aug 11, 2023
ac2cfc4
Add changes to RELEASE.md
Aug 11, 2023
92d588f
Trim unreachable code
Aug 11, 2023
09339c3
Make helper function private
Aug 11, 2023
3fac8b3
Refactor
Aug 11, 2023
7d7036f
Add changes to the docs
Aug 11, 2023
7caa1af
Refactor 2
Aug 11, 2023
271236c
Fix docs build
Aug 11, 2023
c19707d
Add suggestions from code review (docs)
Aug 15, 2023
8784a37
Add suggestions from code review (code)
Aug 15, 2023
259c5e5
Merge branch 'main' into feat/add-kedro-catalog-resolve-command
AhdraMeraliQB Aug 15, 2023
779a286
Rename catalog_config variable to explicit_datasets
Aug 17, 2023
6041ea5
Apply suggestions from code review
AhdraMeraliQB Aug 17, 2023
3dec9f4
Fix mocking in test
Aug 17, 2023
b026bfe
Change test fixture name
Aug 17, 2023
64366c8
Lint
Aug 17, 2023
a10feee
Merge branch 'main' into feat/add-kedro-catalog-resolve-command
AhdraMeraliQB Aug 18, 2023
f3884df
Merge branch 'main' into feat/add-kedro-catalog-resolve-command
astrojuanlu Aug 18, 2023
365d1d5
Merge branch 'main' into feat/add-kedro-catalog-resolve-command
AhdraMeraliQB Aug 18, 2023
1 change: 1 addition & 0 deletions RELEASE.md
@@ -13,6 +13,7 @@
## Major features and improvements
* Allowed registering of custom resolvers to `OmegaConfigLoader` through `CONFIG_LOADER_ARGS`.
* Added support for Python 3.11. This includes tackling challenges like dependency pinning and test adjustments to ensure a smooth experience. Detailed migration tips are provided below for further context.
* Added `kedro catalog resolve` CLI command that resolves dataset factories in the catalog with any explicit entries in the project pipeline.

## Bug fixes and other changes
* Updated `kedro pipeline create` and `kedro catalog create` to use new `/conf` file structure.
9 changes: 9 additions & 0 deletions docs/source/development/commands_reference.md
@@ -62,6 +62,7 @@ Here is a list of Kedro CLI commands, as a shortcut to the descriptions below. P
* [`kedro build-docs`](#build-the-project-documentation) (deprecated from version 0.19.0)
* [`kedro build-reqs`](#build-the-projects-dependency-tree) (deprecated from version 0.19.0)
* [`kedro catalog list`](#list-datasets-per-pipeline-per-type)
* [`kedro catalog resolve`](#resolve-dataset-factories-in-the-catalog)
* [`kedro catalog rank`](#rank-dataset-factories-in-the-catalog)
* [`kedro catalog create`](#create-a-data-catalog-yaml-configuration-file)
* [`kedro ipython`](#notebooks)
@@ -492,6 +493,14 @@ The command also accepts an optional `--pipeline` argument that allows you to sp

```bash
kedro catalog list --pipeline=ds,de
```

##### Resolve dataset factories in the catalog

```bash
kedro catalog resolve
```

This command resolves dataset factories in the catalog against the datasets used in the project's pipelines. The output includes all datasets explicitly declared in your catalog files, plus an entry for every pipeline dataset that matches a dataset factory pattern.
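As an illustration of what "matching a factory pattern" means here, the following standalone sketch (my own simplified stand-in, not Kedro's actual implementation; the function name `matches_pattern` is hypothetical) shows how a pattern such as `{name}_data` can be matched against a concrete pipeline dataset name:

```python
import re


def matches_pattern(pattern: str, ds_name: str) -> bool:
    """Return True if a dataset name matches a factory pattern.

    Each "{placeholder}" in the pattern is treated as a wildcard;
    the literal segments in between must match exactly.
    """
    literal_parts = re.split(r"\{[^{}]*\}", pattern)
    regex = "(.+?)".join(re.escape(part) for part in literal_parts)
    return re.fullmatch(regex, ds_name) is not None


print(matches_pattern("{name}_data", "iris_data"))  # True
print(matches_pattern("csv_{name}", "iris_data"))   # False
```

In this simplified view, `iris_data` matches `{name}_data` because `iris` fills the placeholder and the literal `_data` suffix matches exactly.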

##### Rank dataset factories in the catalog

```bash
kedro catalog rank
```
48 changes: 48 additions & 0 deletions kedro/framework/cli/catalog.py
@@ -207,3 +207,51 @@ def rank_catalog_factories(metadata: ProjectMetadata, env):
        click.echo(yaml.dump(list(catalog_factories.keys())))
    else:
        click.echo("There are no dataset factories in the catalog.")


@catalog.command("resolve")
@env_option
@click.pass_obj
def resolve_patterns(metadata: ProjectMetadata, env):
"""Resolve catalog factories against pipeline datasets"""

session = _create_session(metadata.package_name, env=env)
context = session.load_context()

data_catalog = context.catalog
catalog_config = context.config_loader["catalog"]

catalog_config = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit confusing that these variables have the exact name. In general it is best practice not to use the same variable as input and reassign the value in the same step. Maybe the second catalog_config can be re-named to explicit_dataset. That's what's stored there right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep! Changed in 779a286

ds_name: ds_config
for ds_name, ds_config in catalog_config.items()
if not data_catalog._is_pattern(ds_name)
}

target_pipelines = pipelines.keys()
datasets = set()

for pipe in target_pipelines:
AhdraMeraliQB marked this conversation as resolved.
Show resolved Hide resolved
pl_obj = pipelines.get(pipe)
if pl_obj:
datasets.update(pl_obj.data_sets())

for ds_name in datasets:
is_param = ds_name.startswith("params:") or ds_name == "parameters"
if ds_name in catalog_config or is_param:
continue

matched_pattern = data_catalog._match_pattern(
data_catalog._dataset_patterns, ds_name
)
if matched_pattern:
ds_config = data_catalog._resolve_config(ds_name, matched_pattern)
ds_config["filepath"] = _trim_filepath(
str(context.project_path) + "/", ds_config["filepath"]
)
catalog_config[ds_name] = ds_config

secho(yaml.dump(catalog_config))


def _trim_filepath(project_path: str, file_path: str):
return file_path.replace(project_path, "", 1)
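The command's overall flow can be sketched without any Kedro internals. The block below is a dependency-free approximation (the names `resolve_patterns_sketch`, `_is_pattern`, and `_pattern_matches` are my own stand-ins for the private `DataCatalog` helpers, and substitution of matched placeholder values into the config, as `_resolve_config` does, is omitted):

```python
import re


def _is_pattern(name: str) -> bool:
    # A catalog entry whose name contains "{...}" is a factory pattern.
    return "{" in name


def _pattern_matches(pattern: str, ds_name: str) -> bool:
    # Treat each "{placeholder}" as a wildcard between literal segments.
    parts = re.split(r"\{[^{}]*\}", pattern)
    return re.fullmatch("(.+?)".join(map(re.escape, parts)), ds_name) is not None


def resolve_patterns_sketch(catalog_config: dict, pipeline_datasets: set) -> dict:
    """Mimic the command's flow: keep explicit entries untouched, then
    materialise an entry for each pipeline dataset matching a pattern."""
    resolved = {k: v for k, v in catalog_config.items() if not _is_pattern(k)}
    patterns = {k: v for k, v in catalog_config.items() if _is_pattern(k)}

    for ds_name in pipeline_datasets:
        # Skip parameters and datasets the catalog already names explicitly,
        # so explicit entries are never overwritten.
        is_param = ds_name.startswith("params:") or ds_name == "parameters"
        if ds_name in resolved or is_param:
            continue
        for pattern, ds_config in patterns.items():
            if _pattern_matches(pattern, ds_name):
                resolved[ds_name] = dict(ds_config)  # copy, don't share
                break
    return resolved
```

For example, given a catalog containing `{name}_data` plus an explicit `explicit_ds` entry, and a pipeline using `iris_data`, `explicit_ds`, and `params:ratio`, the sketch returns entries for `iris_data` and `explicit_ds` only, with `explicit_ds` keeping its original config.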
135 changes: 135 additions & 0 deletions tests/framework/cli/test_catalog.py
@@ -67,6 +67,23 @@ def fake_catalog_with_overlapping_factories():
    return config


@pytest.fixture
def fake_catalog_config_with_overwrite():
    config = {
        "parquet_{factory_pattern}": {
            "type": "pandas.ParquetDataSet",
            "filepath": "test.pq",
        },
        "csv_{factory_pattern}": {"type": "pandas.CSVDataSet", "filepath": "test.csv"},
        "explicit_ds": {"type": "pandas.CSVDataSet", "filepath": "test.csv"},
        "{factory_pattern}_ds": {
            "type": "pandas.ParquetDataSet",
            "filepath": "test.pq",
        },
    }
    return config

Review comment (Member): What do you mean with "config_with_overwrite"?

Reply (Author): In this context this means a catalog configuration where one of the explicit entries in the catalog also matches a pattern in the catalog, and the test is ensuring the explicit entry is not overwritten. Perhaps fake_catalog_config_with_overwritable_dataset() could be more intuitive? Or fake_catalog_config_with_resolvable_dataset()?

Review comment (Member): I think fake_catalog_config_with_resolvable_dataset makes more sense to me.


@pytest.mark.usefixtures(
    "chdir_to_dummy_project", "fake_load_context", "mock_pipelines"
)
@@ -441,3 +458,121 @@ def test_rank_catalog_factories_with_no_factories(
    assert not result.exit_code
    expected_output = "There are no dataset factories in the catalog."
    assert expected_output in result.output


@pytest.mark.usefixtures(
    "chdir_to_dummy_project", "fake_load_context", "mock_pipelines"
)
def test_catalog_resolve(
    fake_project_cli,
    fake_metadata,
    fake_load_context,
    mocker,
    mock_pipelines,
    fake_catalog_config,
):
    """Test that dataset factories are correctly resolved to the explicit datasets in the pipeline."""
    yaml_dump_mock = mocker.patch("yaml.dump", return_value="Result YAML")
    mocked_context = fake_load_context.return_value
    mocked_context.catalog = DataCatalog.from_config(fake_catalog_config)

    placeholder_ds = mocked_context.catalog._data_sets.keys()
    explicit_ds = {"csv_example", "parquet_example"}

    mocker.patch.object(
        mock_pipelines[PIPELINE_NAME],
        "data_sets",
        return_value=explicit_ds,
    )

    result = CliRunner().invoke(
        fake_project_cli, ["catalog", "resolve"], obj=fake_metadata
    )

    assert not result.exit_code
    assert yaml_dump_mock.call_count == 1

    output = yaml_dump_mock.call_args[0][0]

    for ds in placeholder_ds:
        assert ds not in output

    for ds in explicit_ds:
        assert ds in output


@pytest.mark.usefixtures(
    "chdir_to_dummy_project", "fake_load_context", "mock_pipelines"
)
def test_no_overwrite(
    fake_project_cli,
    fake_metadata,
    fake_load_context,
    mocker,
    mock_pipelines,
    fake_catalog_config_with_overwrite,
):
    """Test that explicit catalog entries are not overwritten by factory config."""
    yaml_dump_mock = mocker.patch("yaml.dump", return_value="Result YAML")
    mocked_context = fake_load_context.return_value

    mocked_context.config_loader = {"catalog": fake_catalog_config_with_overwrite}
    mocked_context.catalog = DataCatalog.from_config(fake_catalog_config_with_overwrite)

    mocker.patch.object(
        mock_pipelines[PIPELINE_NAME],
        "data_sets",
        return_value=mocked_context.catalog._data_sets.keys()
        | {"csv_example", "parquet_example"},
    )

    result = CliRunner().invoke(
        fake_project_cli, ["catalog", "resolve"], obj=fake_metadata
    )

    assert not result.exit_code
    assert yaml_dump_mock.call_count == 1

    assert (
        yaml_dump_mock.call_args[0][0]["explicit_ds"]
        == fake_catalog_config_with_overwrite["explicit_ds"]
    )


@pytest.mark.usefixtures(
    "chdir_to_dummy_project", "fake_load_context", "mock_pipelines"
)
def test_no_param_datasets_in_resolve(
    fake_project_cli, fake_metadata, fake_load_context, mocker, mock_pipelines
):

    yaml_dump_mock = mocker.patch("yaml.dump", return_value="Result YAML")
    mocked_context = fake_load_context.return_value
    catalog_data_sets = {
        "iris_data": CSVDataSet("test.csv"),
        "intermediate": MemoryDataset(),
        "parameters": MemoryDataset(),
        "params:data_ratio": MemoryDataset(),
    }

    mocked_context.catalog = DataCatalog(data_sets=catalog_data_sets)
    mocker.patch.object(
        mock_pipelines[PIPELINE_NAME],
        "data_sets",
        return_value=catalog_data_sets.keys(),
    )

    result = CliRunner().invoke(
        fake_project_cli,
        ["catalog", "resolve"],
        obj=fake_metadata,
    )

    assert not result.exit_code
    assert yaml_dump_mock.call_count == 1

    # 'parameters' and 'params:data_ratio' should not appear in the output
    output = yaml_dump_mock.call_args[0][0]

    assert "parameters" not in output.keys()
    assert "params:data_ratio" not in output.keys()
Review comment (Member) on lines +592 to +593: To demonstrate the right datasets are included in output it might be worth adding two more asserts for iris_data and intermediate.

Reply (Author): Added in 3dec9f4