Merge branch 'main' into refactor-pattern-logic-catalog-cli
ElenaKhaustova committed Aug 12, 2024
2 parents 20d2b91 + d5916a1 commit 13e6da9
Showing 8 changed files with 30 additions and 81 deletions.
1 change: 1 addition & 0 deletions RELEASE.md
@@ -1,6 +1,7 @@
# Upcoming Release 0.19.8

## Major features and improvements
* Made default run entrypoint in `__main__.py` work in interactive environments such as IPython and Databricks.

## Bug fixes and other changes
* Moved `_find_run_command()` and `_find_run_command_in_plugins()` from `__main__.py` in the project template to the framework itself.
86 changes: 14 additions & 72 deletions docs/source/deployment/databricks/databricks_deployment_workflow.md
@@ -36,9 +36,8 @@ The sequence of steps described in this section is as follows:
2. [Install Kedro and the Databricks CLI in a new virtual environment](#install-kedro-and-the-databricks-cli-in-a-new-virtual-environment)
3. [Authenticate the Databricks CLI](#authenticate-the-databricks-cli)
4. [Create a new Kedro project](#create-a-new-kedro-project)
5. [Create an entry point for Databricks](#create-an-entry-point-for-databricks)
6. [Package your project](#package-your-project)
7. [Upload project data and configuration to DBFS](#upload-project-data-and-configuration-to-dbfs)
5. [Package your project](#package-your-project)
6. [Upload project data and configuration to DBFS](#upload-project-data-and-configuration-to-dbfs)

### Note your Databricks username and host

@@ -99,64 +98,6 @@ This command creates a new Kedro project using the `databricks-iris` starter template.
If you are not using the `databricks-iris` starter to create a Kedro project, **and** you are working with a version of Kedro **earlier than 0.19.0**, then you should [disable file-based logging](https://docs.kedro.org/en/0.18.14/logging/logging.html#disable-file-based-logging) to prevent Kedro from attempting to write to the read-only file system.
```

### Create an entry point for Databricks

The default entry point of a Kedro project uses a Click command line interface (CLI), which is not compatible with Databricks. To run your project as a Databricks job, you must define a new entry point specifically for use on Databricks.

The `databricks-iris` starter has this entry point pre-built, so there is no extra work to do here, but generally you must **create an entry point manually for your own projects using the following steps**:

1. **Create an entry point script**: Create a new file in `<project_root>/src/iris_databricks` named `databricks_run.py`. Copy the following code to this file:

```python
import argparse
import logging

from kedro.framework.project import configure_project
from kedro.framework.session import KedroSession


def main():
parser = argparse.ArgumentParser()
parser.add_argument("--env", dest="env", type=str)
parser.add_argument("--conf-source", dest="conf_source", type=str)
parser.add_argument("--package-name", dest="package_name", type=str)

args = parser.parse_args()
env = args.env
conf_source = args.conf_source
package_name = args.package_name

# https://kb.databricks.com/notebooks/cmd-c-on-object-id-p0.html
logging.getLogger("py4j.java_gateway").setLevel(logging.ERROR)
logging.getLogger("py4j.py4j.clientserver").setLevel(logging.ERROR)

configure_project(package_name)
with KedroSession.create(env=env, conf_source=conf_source) as session:
session.run()


if __name__ == "__main__":
main()
```

2. **Define a new entry point**: Open `<project_root>/pyproject.toml` in a text editor or IDE and add a new line in the `[project.scripts]` section, so that it becomes:

```toml
[project.scripts]
databricks_run = "<package_name>.databricks_run:main"
```

Remember to replace `<package_name>` with the correct package name for your project.

This process adds an entry point to your project which can be used to run it on Databricks.

```{note}
Because you are no longer using the default entry point for Kedro, you will not be able to run your project with the options it usually provides. Instead, the `databricks_run` entry point in the above code and in the `databricks-iris` starter contains a simple implementation of three options:
- `--package-name` (required): the package name (defined in `pyproject.toml`) of your packaged project.
- `--env`: specifies a [Kedro configuration environment](../../configuration/configuration_basics.md#configuration-environments) to load for your run.
- `--conf-source`: specifies the location of the `conf/` directory to use with your Kedro project.
```
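
For illustration, a hypothetical local invocation of this entry point with those options might look like the following; the package name and paths are placeholders, not prescribed values:

```bash
# Illustrative only: substitute your own package name and conf location.
databricks_run --package-name iris_databricks --conf-source /dbfs/FileStore/iris-databricks/conf
```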

### Package your project

To package your Kedro project for deployment on Databricks, you must create a Wheel (`.whl`) file, which is a binary distribution of your project. In the root directory of your Kedro project, run the following command:
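
A minimal sketch, assuming the standard Kedro CLI packaging command:

```bash
# Builds a .whl distribution of the project into the dist/ directory.
kedro package
```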
@@ -182,14 +123,14 @@ There are several ways to upload data to DBFS: you can use the [DBFS API](https:
- **Upload your project's data and config**: at the command line in your local environment, use the following Databricks CLI commands to upload your project's locally stored data and configuration to DBFS:

```bash
databricks fs cp --recursive <project_root>/data/ dbfs:/FileStore/iris-databricks/data
databricks fs cp --recursive <project_root>/conf/ dbfs:/FileStore/iris-databricks/conf
databricks fs cp --recursive <project_root>/data/ dbfs:/FileStore/iris_databricks/data
databricks fs cp --recursive <project_root>/conf/ dbfs:/FileStore/iris_databricks/conf
```

The `--recursive` flag ensures that the entire folder and its contents are uploaded. You can list the contents of the destination folder in DBFS using the following command:

```bash
databricks fs ls dbfs:/FileStore/iris-databricks/data
databricks fs ls dbfs:/FileStore/iris_databricks/data
```

You should see the contents of the project's `data/` directory printed to your terminal:
@@ -205,6 +146,10 @@ You should see the contents of the project's `data/` directory printed to your terminal:
08_reporting
```

```{note}
If you are not using the `databricks-iris` starter to create a Kedro project, make sure your catalog entries point to DBFS storage (see the example below).
```
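
A hypothetical catalog entry pointing at DBFS might look like this; the dataset name, type and path are illustrative, not prescribed:

```yaml
# Illustrative sketch: adjust the dataset name, type and filepath to your project.
example_iris_data:
  type: pandas.CSVDataset
  filepath: /dbfs/FileStore/iris_databricks/data/01_raw/iris.csv
```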

## Deploy and run your Kedro project using the workspace UI

To run your packaged project on Databricks, login to your Databricks account and perform the following steps in the workspace:
@@ -235,9 +180,6 @@ Configure the job cluster with the following settings:
- In the `name` field enter `kedro_deployment_demo`.
- Select the radio button for `Single node`.
- Select the runtime `13.3 LTS` in the `Databricks runtime version` field.
- In the `Advanced options` section, under the `Spark` tab, locate the `Environment variables` field. Add the following line:
`KEDRO_LOGGING_CONFIG="/dbfs/FileStore/iris-databricks/conf/logging.yml"`
Here, ensure you specify the correct path to your custom logging configuration. This step is crucial because the default Kedro logging configuration incorporates the rich library, which is incompatible with Databricks jobs. In the `databricks-iris` Kedro starter, the `rich` handler in `logging.yml` is altered to a `console` handler for compatibility. For additional information about logging configurations, refer to the [Kedro Logging Manual](https://docs.kedro.org/en/stable/logging/index.html).
- Leave all other settings with their default values in place.

The final configuration for the job cluster should look the same as the following:
@@ -250,14 +192,14 @@ Configure the job with the following settings:

- Enter `iris-databricks` in the `Name` field.
- In the dropdown menu for the `Type` field, select `Python wheel`.
- In the `Package name` field, enter `iris_databricks`. This is the name of your package as defined in your project's `src/setup.py` file.
- In the `Entry Point` field, enter `databricks_run`. This is the name of the [entry point](#create-an-entry-point-for-databricks) to run your package from.
- In the `Package name` field, enter `iris_databricks`. This is the name of your package as defined in your project's `pyproject.toml` file.
- In the `Entry Point` field, enter `iris-databricks`. This is the name of the entry point, defined in your project's `pyproject.toml` file, used to run your package.
- Ensure the job cluster you created in step two is selected in the dropdown menu for the `Cluster` field.
- In the `Dependent libraries` field, click `Add` and upload [your project's `.whl` file](#package-your-project), making sure that the radio buttons for `Upload` and `Python Whl` are selected for the `Library Source` and `Library Type` fields.
- In the `Parameters` field, enter the following list of runtime options:
- In the `Parameters` field, enter the following runtime option:

```bash
["--conf-source", "/dbfs/FileStore/iris-databricks/conf", "--package-name", "iris_databricks"]
["--conf-source", "/dbfs/FileStore/iris_databricks/conf"]
```
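
These parameters are handed to the project's entry point; assuming it wraps the standard `kedro run` command, a roughly equivalent local invocation (illustrative only) would be:

```bash
# Illustrative equivalent of the job parameters above.
kedro run --conf-source /dbfs/FileStore/iris_databricks/conf
```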

The final configuration for your job should look the same as the following:
Expand All @@ -278,7 +220,7 @@ The following things happen when you run your job:

- The job cluster is provisioned and started (job status: `Pending`).
- The packaged Kedro project and all its dependencies are installed (job status: `Pending`).
- The packaged Kedro project is run from the specified `databricks_run` entry point (job status: `In Progress`).
- The packaged Kedro project is run from the specified `iris-databricks` entry point (job status: `In Progress`).
- The packaged code finishes executing and the job cluster is stopped (job status: `Succeeded`).

A run will take roughly six to seven minutes.
Binary file modified docs/source/meta/images/databricks_configure_job_cluster.png
Binary file modified docs/source/meta/images/databricks_configure_new_job.png
@@ -1,6 +1,8 @@
"""{{ cookiecutter.project_name }} file for ensuring the package is executable
as `{{ cookiecutter.repo_name }}` and `python -m {{ cookiecutter.python_package }}`
"""

import sys
from pathlib import Path

from kedro.framework.cli.utils import find_run_command
Expand All @@ -10,6 +12,10 @@
def main(*args, **kwargs):
package_name = Path(__file__).parent.name
configure_project(package_name)

interactive = hasattr(sys, 'ps1')
kwargs["standalone_mode"] = not interactive

run = find_run_command(package_name)
run(*args, **kwargs)
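
A brief sketch of why this helps, assuming Click's default behaviour: with `standalone_mode=True`, Click calls `sys.exit()` once the command finishes, which would terminate an interactive interpreter such as IPython or a Databricks notebook, so the entry point disables it when an interactive prompt is detected.

```python
# Minimal illustrative sketch: sys.ps1 exists only while an interactive prompt is
# active, so it can be used to decide whether Click should be allowed to call
# sys.exit() after the run command completes.
import sys


def should_use_standalone_mode() -> bool:
    interactive = hasattr(sys, "ps1")  # True inside IPython / Databricks notebooks
    return not interactive


print(should_use_standalone_mode())  # True when run as a script, False in IPython
```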

9 changes: 2 additions & 7 deletions kedro/framework/cli/cli.py
@@ -5,7 +5,6 @@
from __future__ import annotations

import importlib
import logging
import sys
import traceback
from collections import defaultdict
@@ -39,9 +38,6 @@
v{version}
"""

logger = logging.getLogger(__name__)
logger.addHandler(logging.StreamHandler(sys.stderr))


@click.group(context_settings=CONTEXT_SETTINGS, name="Kedro")
@click.version_option(version, "--version", "-V", help="Show version and exit")
@@ -208,13 +204,12 @@ def main(
click.echo(message)
click.echo(hint)
sys.exit(exc.code)
except Exception as error:
logger.error(f"An error has occurred: {error}")
except Exception:
self._cli_hook_manager.hook.after_command_run(
project_metadata=self._metadata, command_args=args, exit_code=1
)
hook_called = True
sys.exit(1)
raise
finally:
if not hook_called:
self._cli_hook_manager.hook.after_command_run(
@@ -1,6 +1,7 @@
"""{{ cookiecutter.project_name }} file for ensuring the package is executable
as `{{ cookiecutter.repo_name }}` and `python -m {{ cookiecutter.python_package }}`
"""
import sys
from pathlib import Path

from kedro.framework.cli.utils import find_run_command
Expand All @@ -10,6 +11,10 @@
def main(*args, **kwargs):
package_name = Path(__file__).parent.name
configure_project(package_name)

interactive = hasattr(sys, 'ps1')
kwargs["standalone_mode"] = not interactive

run = find_run_command(package_name)
run(*args, **kwargs)

4 changes: 2 additions & 2 deletions tests/framework/cli/test_cli.py
@@ -520,7 +520,7 @@ def test_main_hook_exception_handling(self, fake_metadata):
project_metadata=kedro_cli._metadata, command_args=[], exit_code=1
)

assert "An error has occurred: Test Exception" in result.output
assert result.exit_code == 1

@patch("sys.exit")
def test_main_hook_finally_block(self, fake_metadata):
@@ -535,7 +535,7 @@ def test_main_hook_finally_block(self, fake_metadata):
project_metadata=kedro_cli._metadata, command_args=[], exit_code=0
)

assert "An error has occurred:" not in result.output
assert result.exit_code == 0


@mark.usefixtures("chdir_to_dummy_project")
