From 5ad77752bfe4e549fe5043afa3066050f6486b23 Mon Sep 17 00:00:00 2001
From: Dmitry Sorokin <40151847+DimedS@users.noreply.github.com>
Date: Tue, 15 Oct 2024 13:48:45 +0100
Subject: [PATCH] Revise Kedro project structure docs (#4208)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* Update project structure docs

---------

Signed-off-by: Dmitry Sorokin
Signed-off-by: Dmitry Sorokin <40151847+DimedS@users.noreply.github.com>
Co-authored-by: Juan Luis Cano Rodríguez
---
 docs/source/get_started/kedro_concepts.md | 67 ++++++++++++++++++-----
 1 file changed, 52 insertions(+), 15 deletions(-)

diff --git a/docs/source/get_started/kedro_concepts.md b/docs/source/get_started/kedro_concepts.md
index ffe602a7e2..44f54ac4d8 100644
--- a/docs/source/get_started/kedro_concepts.md
+++ b/docs/source/get_started/kedro_concepts.md
@@ -63,20 +63,53 @@ The Kedro Data Catalog is the registry of all data sources that the project can
 
 One of the main advantages of working with Kedro projects is that they follow a default template that makes collaboration straightforward. Kedro uses semantic naming to set up a default project with specific folders to store datasets, notebooks, configuration and source code. We advise you to retain the default Kedro project structure to make it easy to share your projects with other Kedro users, although you can adapt the folder structure if you need to.
 
-The default Kedro project structure is as follows:
+Starting from Kedro 0.19, when you create a new project with `kedro new`, you can customise the structure by selecting which tools to include. Depending on your choices, the resulting structure may vary. Below, we outline the default project structure when all tools are selected and give an example with no tools selected.
+
+### Default Kedro project structure (all tools selected)
+
+If you select all tools during project creation, your project structure will look like this:
+
+```
+project-dir          # Parent directory of the template
+├── conf             # Project configuration files
+├── data             # Local project data (not committed to version control)
+├── docs             # Project documentation
+├── notebooks        # Project-related Jupyter notebooks (can be used for experimental code before moving the code to src)
+├── src              # Project source code
+├── tests            # Folder containing unit and integration tests
+├── .gitignore       # Hidden file that prevents staging of unnecessary files to `git`
+├── pyproject.toml   # Identifies the project root and contains configuration information
+├── README.md        # Project README
+├── requirements.txt # Project dependencies file
+```
+
+### Example Kedro project structure (no tools selected)
+
+If you select no tools, the resulting structure will be simpler:
 
 ```
-project-dir         # Parent directory of the template
-├── .gitignore      # Hidden file that prevents staging of unnecessary files to `git`
-├── conf            # Project configuration files
-├── data            # Local project data (not committed to version control)
-├── docs            # Project documentation
-├── notebooks       # Project-related Jupyter notebooks (can be used for experimental code before moving the code to src)
-├── pyproject.toml  # Identifies the project root and contains configuration information
-├── README.md       # Project README
-└── src             # Project source code
+project-dir          # Parent directory of the template
+├── conf             # Project configuration files
+├── notebooks        # Project-related Jupyter notebooks (can be used for experimental code before moving the code to src)
+├── src              # Project source code
+├── .gitignore       # Hidden file that prevents staging of unnecessary files to `git`
+├── pyproject.toml   # Identifies the project root and contains configuration information
+├── README.md        # Project README
+├── requirements.txt # Project dependencies file
 ```
 
+### Tool selection and resulting structure
+
+During `kedro new`, you can select which [tools to include in your project](../starters/new_project_tools.md). Each tool adds specific files or folders to the project structure:
+
+- **Lint (Ruff)**: Modifies the `pyproject.toml` file to include Ruff configuration settings for linting. It sets up `ruff` under `[tool.ruff]`, defines options like line length, selected rules, and ignored rules, and includes `ruff` as an optional `dev` dependency.
+- **Test (Pytest)**: Adds a `tests` folder for storing unit and integration tests, helping to maintain code quality and ensuring that changes in the codebase do not introduce bugs. For more information about testing in Kedro, visit the [Automated Testing Guide](../development/automated_testing.md).
+- **Log**: Allows specific logging configurations by including a `logging.yml` file inside the `conf` folder. For more information about logging customisation in Kedro, visit the [Logging Customisation Guide](../logging/index.md).
+- **Docs (Sphinx)**: Adds a `docs` folder with a Sphinx documentation setup. This folder is typically used to generate technical documentation for the project.
+- **Data Folder**: Adds a `data` folder structure for managing project data. The `data` folder contains multiple subfolders to store project data. We recommend you put raw data into `raw` and move processed data to other subfolders, as outlined [in this data engineering article](https://towardsdatascience.com/the-importance-of-layered-thinking-in-data-engineering-a09f685edc71).
+- **PySpark**: Adds PySpark-specific configuration files.
+- **Kedro-Viz**: Adds Kedro's native visualisation tool with [experiment tracking setup.](https://docs.kedro.org/projects/kedro-viz/en/stable/experiment_tracking.html)
+
 ### `conf`
 
 The `conf` folder contains two subfolders for storing configuration information: `base` and `local`.
@@ -88,7 +121,7 @@ Use the `base` subfolder for project-specific settings to share across different
 The folder contains three files for the example, but you can add others as you require:
 
 - `catalog.yml` - [Configures the Data Catalog](../data/data_catalog.md#use-the-data-catalog-within-kedro-configuration) with the file paths and load/save configuration needed for different datasets
-- `logging.yml` - Uses Python's default [`logging`](https://docs.python.org/3/library/logging.html) library to set up logging
+- `logging.yml` - Uses Python's default [`logging`](https://docs.python.org/3/library/logging.html) library to set up logging (only added if the Log tool is selected).
 - `parameters.yml` - Allows you to define parameters for machine learning experiments, for example, train/test split and the number of iterations
 
 #### `conf/local`
@@ -99,10 +132,14 @@ Use the `local` subfolder for **settings that should not be shared**, such as ac
 
 By default, Kedro creates one file, `credentials.yml`, in `conf/local`.
 
-### `data`
-
-The `data` folder contains multiple subfolders to store project data. We recommend you put raw data into `raw` and move processed data to other subfolders according to the [commonly accepted data engineering convention](https://towardsdatascience.com/the-importance-of-layered-thinking-in-data-engineering-a09f685edc71).
-
 ### `src`
 
 This subfolder contains the project's source code.
+
+### Customising your project structure
+
+While the default Kedro structure is recommended for collaboration and standardisation, it is possible to adapt the folder structure if necessary. This flexibility allows you to tailor the project to your needs while maintaining a consistent and recognisable structure.
+
+The only technical requirement when organising code is that the `pipeline_registry.py` and `settings.py` files must remain in the `/src/` directory, where they are created by default.
+
+The `pipeline_registry.py` file must include a `register_pipelines()` function that returns a `dict[str, Pipeline]`, which maps pipeline names to their corresponding `Pipeline` objects.
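
The placement requirement described in the patch can be checked mechanically after a project has been reorganised. The sketch below is not part of Kedro or of this commit: `find_missing_required_files` and `REQUIRED_FILES` are hypothetical names, and the recursive search assumes the two files may sit in a package folder somewhere beneath `src/`.

```python
from pathlib import Path

# File names the docs say must stay under `src/`.
REQUIRED_FILES = ("pipeline_registry.py", "settings.py")


def find_missing_required_files(project_root: Path) -> list[str]:
    """Return the required file names not found anywhere under `src/`.

    Hypothetical helper for illustration; it is not a Kedro API.
    """
    src_dir = project_root / "src"
    if not src_dir.is_dir():
        # Without a `src/` folder, every required file counts as missing.
        return list(REQUIRED_FILES)

    missing = []
    for name in REQUIRED_FILES:
        # rglob searches recursively, so src/<package>/<name> also matches.
        if not any(src_dir.rglob(name)):
            missing.append(name)
    return missing
```

Calling `find_missing_required_files(Path("project-dir"))` returns an empty list when both files are present. The check covers file placement only and does not verify that `register_pipelines()` exists or what it returns.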