Commit 5f0227d

Merge branch 'main' into chore/hierarchical-ruff-config
noklam authored Aug 17, 2023
2 parents d49be70 + 16dd1df commit 5f0227d
Showing 77 changed files with 274 additions and 254 deletions.
6 changes: 6 additions & 0 deletions RELEASE.md
@@ -29,6 +29,12 @@
## Breaking changes to the API

## Upcoming deprecations for Kedro 0.19.0
* Renamed abstract dataset classes, in accordance with the [Kedro lexicon](https://github.com/kedro-org/kedro/wiki/Kedro-documentation-style-guide#kedro-lexicon). Dataset classes ending with "DataSet" are deprecated and will be removed in 0.19.0. Note that all of the classes below are also importable from `kedro.io`; the Location column lists only the module where each class is defined.

| Type | Deprecated Alias | Location |
| -------------------------- | -------------------------- | --------------- |
| `AbstractDataset` | `AbstractDataSet` | `kedro.io.core` |
| `AbstractVersionedDataset` | `AbstractVersionedDataSet` | `kedro.io.core` |
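
  As an illustrative sketch (not part of the release notes themselves), both spellings resolve during the deprecation window:

  ```python
  # Preferred import from 0.18.12 onwards
  from kedro.io import AbstractDataset

  # Deprecated alias; still importable until its removal in 0.19.0
  from kedro.io import AbstractDataSet
  ```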

# Release 0.18.12

2 changes: 1 addition & 1 deletion docs/source/data/data_catalog.md
@@ -783,7 +783,7 @@ gear = cars["gear"].values
The following steps happened behind the scenes when `load` was called:

- The value `cars` was located in the Data Catalog
-- The corresponding `AbstractDataSet` object was retrieved
+- The corresponding `AbstractDataset` object was retrieved
- The `load` method of this dataset was called
- This `load` method delegated the loading to the underlying pandas `read_csv` function
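
In outline, those steps are roughly equivalent to the following sketch (`catalog` is the surrounding `DataCatalog` instance; `_get_dataset` is a private helper, named here only to illustrate the lookup):

```python
# Rough equivalent of cars = catalog.load("cars")
dataset = catalog._get_dataset("cars")  # locate "cars" in the Data Catalog
cars = dataset.load()                   # public load() wraps the dataset's _load()
# ...which, for a CSV-backed dataset, delegates to pandas.read_csv under the hood
```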

18 changes: 9 additions & 9 deletions docs/source/data/kedro_io.md
@@ -1,7 +1,7 @@
# Kedro IO


-In this tutorial, we cover advanced uses of [the Kedro IO module](/kedro.io) to understand the underlying implementation. The relevant API documentation is [kedro.io.AbstractDataSet](/kedro.io.AbstractDataSet) and [kedro.io.DataSetError](/kedro.io.DataSetError).
+In this tutorial, we cover advanced uses of [the Kedro IO module](/kedro.io) to understand the underlying implementation. The relevant API documentation is [kedro.io.AbstractDataset](/kedro.io.AbstractDataset) and [kedro.io.DataSetError](/kedro.io.DataSetError).

## Error handling

@@ -21,9 +21,9 @@ except DataSetError:
```
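
The hunk above shows only the tail of the snippet; a self-contained sketch of the same error-handling pattern (the empty catalog is illustrative) is:

```python
from kedro.io import DataCatalog, DataSetError

io = DataCatalog(data_sets={})

try:
    scooters = io.load("scooters")
except DataSetError:
    print("Dataset 'scooters' is not registered in the catalog")
```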


-## AbstractDataSet
+## AbstractDataset

-To understand what is going on behind the scenes, you should study the [AbstractDataSet interface](/kedro.io.AbstractDataSet). `AbstractDataSet` is the underlying interface that all datasets extend. It requires subclasses to override the `_load` and `_save` methods, and provides `load` and `save` methods that enrich the corresponding private methods with uniform error handling. It also requires subclasses to override `_describe`, which is used to log internal information about instances of your custom `AbstractDataSet` implementation.
+To understand what is going on behind the scenes, you should study the [AbstractDataset interface](/kedro.io.AbstractDataset). `AbstractDataset` is the underlying interface that all datasets extend. It requires subclasses to override the `_load` and `_save` methods, and provides `load` and `save` methods that enrich the corresponding private methods with uniform error handling. It also requires subclasses to override `_describe`, which is used to log internal information about instances of your custom `AbstractDataset` implementation.
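
As a minimal illustration (a hypothetical in-memory dataset, not part of the Kedro codebase), a conforming subclass can be as small as:

```python
from typing import Any, Dict

from kedro.io import AbstractDataset


class InMemoryTextDataSet(AbstractDataset):
    """Keeps a single string in memory; purely for illustration."""

    def __init__(self, data: str = ""):
        self._data = data

    def _load(self) -> str:
        return self._data

    def _save(self, data: str) -> None:
        self._data = data

    def _describe(self) -> Dict[str, Any]:
        return dict(data_length=len(self._data))
```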

If you have a dataset called `parts`, you can make direct calls to it like so:

@@ -33,13 +33,13 @@ parts_df = parts.load()

We recommend using a `DataCatalog` instead (for more details, see [the `DataCatalog` documentation](../data/data_catalog.md)) as it has been designed to make all datasets available to project members.
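
For example, a sketch of registering the same dataset in a catalog (names are illustrative):

```python
from kedro.io import DataCatalog

io = DataCatalog(data_sets={"parts": parts})
parts_df = io.load("parts")  # same data as parts.load(), but shared via the catalog
```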

-For contributors, if you would like to submit a new dataset, you must extend the `AbstractDataSet`. For a complete guide, please read [the section on custom datasets](../extend_kedro/custom_datasets.md).
+For contributors, if you would like to submit a new dataset, you must extend the `AbstractDataset`. For a complete guide, please read [the section on custom datasets](../extend_kedro/custom_datasets.md).


## Versioning

To enable versioning, you need to update the `catalog.yml` config file and set the `versioned` attribute to `true` for the given dataset. If this is a custom dataset, the implementation must also:
-1. extend `kedro.io.core.AbstractVersionedDataSet` AND
+1. extend `kedro.io.core.AbstractVersionedDataset` AND
2. add `version` namedtuple as an argument to its `__init__` method AND
3. call `super().__init__()` with positional arguments `filepath`, `version`, and, optionally, with `glob` and `exists` functions if it uses a non-local filesystem (see [kedro_datasets.pandas.CSVDataSet](/kedro_datasets.pandas.CSVDataSet) as an example) AND
4. modify its `_describe`, `_load` and `_save` methods respectively to support versioning (see [`kedro_datasets.pandas.CSVDataSet`](/kedro_datasets.pandas.CSVDataSet) for an example implementation)
@@ -55,10 +55,10 @@ from pathlib import Path, PurePosixPath

import pandas as pd

-from kedro.io import AbstractVersionedDataSet
+from kedro.io import AbstractVersionedDataset


-class MyOwnDataSet(AbstractVersionedDataSet):
+class MyOwnDataSet(AbstractVersionedDataset):
    def __init__(self, filepath, version, param1, param2=True):
        super().__init__(PurePosixPath(filepath), version)
        self._param1 = param1
@@ -314,7 +314,7 @@ Here is an exhaustive list of the arguments supported by `PartitionedDataSet`:
| Argument | Required | Supported types | Description |
| ----------------- | ------------------------------ | ------------------------------------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `path` | Yes | `str` | Path to the folder containing partitioned data. If path starts with the protocol (e.g., `s3://`) then the corresponding `fsspec` concrete filesystem implementation will be used. If protocol is not specified, local filesystem will be used |
-| `dataset` | Yes | `str`, `Type[AbstractDataSet]`, `Dict[str, Any]` | Underlying dataset definition, for more details see the section below |
+| `dataset` | Yes | `str`, `Type[AbstractDataset]`, `Dict[str, Any]` | Underlying dataset definition, for more details see the section below |
| `credentials` | No | `Dict[str, Any]` | Protocol-specific options that will be passed to the `fsspec.filesystem` call, for more details see the section below |
| `load_args` | No | `Dict[str, Any]` | Keyword arguments to be passed into `find()` method of the corresponding filesystem implementation |
| `filepath_arg` | No | `str` (defaults to `filepath`) | Argument name of the underlying dataset initializer that will contain a path to an individual partition |
@@ -326,7 +326,7 @@ Dataset definition should be passed into the `dataset` argument of the `Partitio

##### Shorthand notation

-Requires you only to specify a class of the underlying dataset either as a string (e.g. `pandas.CSVDataSet` or a fully qualified class path like `kedro_datasets.pandas.CSVDataSet`) or as a class object that is a subclass of the [AbstractDataSet](/kedro.io.AbstractDataSet).
+Requires you only to specify a class of the underlying dataset either as a string (e.g. `pandas.CSVDataSet` or a fully qualified class path like `kedro_datasets.pandas.CSVDataSet`) or as a class object that is a subclass of the [AbstractDataset](/kedro.io.AbstractDataset).
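
For instance, both shorthand forms in a hedged sketch (the S3 path is hypothetical):

```python
from kedro.io import PartitionedDataSet
from kedro_datasets.pandas import CSVDataSet

# Class given as a string
parts = PartitionedDataSet(path="s3://my-bucket/partitions/", dataset="pandas.CSVDataSet")

# Class given as an object
parts = PartitionedDataSet(path="s3://my-bucket/partitions/", dataset=CSVDataSet)
```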

##### Full notation

4 changes: 2 additions & 2 deletions docs/source/deployment/dask.md
@@ -44,14 +44,14 @@ from kedro.framework.hooks.manager import (
_register_hooks_setuptools,
)
from kedro.framework.project import settings
-from kedro.io import AbstractDataSet, DataCatalog
+from kedro.io import AbstractDataset, DataCatalog
from kedro.pipeline import Pipeline
from kedro.pipeline.node import Node
from kedro.runner import AbstractRunner, run_node
from pluggy import PluginManager


-class _DaskDataSet(AbstractDataSet):
+class _DaskDataSet(AbstractDataset):
"""``_DaskDataSet`` publishes/gets named datasets to/from the Dask
scheduler."""

36 changes: 18 additions & 18 deletions docs/source/extend_kedro/custom_datasets.md
@@ -24,13 +24,13 @@ Consult the [Pillow documentation](https://pillow.readthedocs.io/en/stable/insta

## The anatomy of a dataset

-At the minimum, a valid Kedro dataset needs to subclass the base [AbstractDataSet](/kedro.io.AbstractDataSet) and provide an implementation for the following abstract methods:
+At the minimum, a valid Kedro dataset needs to subclass the base [AbstractDataset](/kedro.io.AbstractDataset) and provide an implementation for the following abstract methods:

* `_load`
* `_save`
* `_describe`

-`AbstractDataSet` is generically typed with an input data type for saving data, and an output data type for loading data.
+`AbstractDataset` is generically typed with an input data type for saving data, and an output data type for loading data.
This typing is optional, however, and defaults to the `Any` type.

Here is an example skeleton for `ImageDataSet`:
@@ -43,10 +43,10 @@ from typing import Any, Dict

import numpy as np

-from kedro.io import AbstractDataSet
+from kedro.io import AbstractDataset


-class ImageDataSet(AbstractDataSet[np.ndarray, np.ndarray]):
+class ImageDataSet(AbstractDataset[np.ndarray, np.ndarray]):
"""``ImageDataSet`` loads / save image data from a given filepath as `numpy` array using Pillow.
Example:
@@ -108,11 +108,11 @@ import fsspec
import numpy as np
from PIL import Image

-from kedro.io import AbstractDataSet
+from kedro.io import AbstractDataset
from kedro.io.core import get_filepath_str, get_protocol_and_path


-class ImageDataSet(AbstractDataSet[np.ndarray, np.ndarray]):
+class ImageDataSet(AbstractDataset[np.ndarray, np.ndarray]):
    def __init__(self, filepath: str):
        """Creates a new instance of ImageDataSet to load / save image data for a given filepath.
@@ -169,7 +169,7 @@ Similarly, we can implement the `_save` method as follows:


```python
-class ImageDataSet(AbstractDataSet[np.ndarray, np.ndarray]):
+class ImageDataSet(AbstractDataset[np.ndarray, np.ndarray]):
    def _save(self, data: np.ndarray) -> None:
        """Saves image data to the specified filepath."""
        # using get_filepath_str ensures that the protocol and path are appended correctly for different filesystems
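        # The diff elides the rest of this method; a hedged completion based on
        # the surrounding text (these exact lines are an assumption, not this file's code):
        save_path = get_filepath_str(self._filepath, self._protocol)
        with self._fs.open(save_path, mode="wb") as f:
            Image.fromarray(data).save(f)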
@@ -193,7 +193,7 @@ You can open the file to verify that the data was written back correctly.
The `_describe` method is used for printing purposes. The convention in Kedro is for the method to return a dictionary describing the attributes of the dataset.

```python
-class ImageDataSet(AbstractDataSet[np.ndarray, np.ndarray]):
+class ImageDataSet(AbstractDataset[np.ndarray, np.ndarray]):
    def _describe(self) -> Dict[str, Any]:
        """Returns a dict that describes the attributes of the dataset."""
        return dict(filepath=self._filepath, protocol=self._protocol)
@@ -214,11 +214,11 @@ import fsspec
import numpy as np
from PIL import Image

-from kedro.io import AbstractDataSet
+from kedro.io import AbstractDataset
from kedro.io.core import get_filepath_str, get_protocol_and_path


-class ImageDataSet(AbstractDataSet[np.ndarray, np.ndarray]):
+class ImageDataSet(AbstractDataset[np.ndarray, np.ndarray]):
"""``ImageDataSet`` loads / save image data from a given filepath as `numpy` array using Pillow.
Example:
@@ -301,7 +301,7 @@ $ ls -la data/01_raw/pokemon-images-and-types/images/images/*.png | wc -l
Versioning doesn't work with `PartitionedDataSet`. You can't use both of them at the same time.
```
To add [Versioning](../data/kedro_io.md#versioning) support to the new dataset we need to extend the
-[AbstractVersionedDataSet](/kedro.io.AbstractVersionedDataSet) to:
+[AbstractVersionedDataset](/kedro.io.AbstractVersionedDataset) to:

* Accept a `version` keyword argument as part of the constructor
* Adapt the `_save` and `_load` method to use the versioned data path obtained from `_get_save_path` and `_get_load_path` respectively
@@ -320,11 +320,11 @@ import fsspec
import numpy as np
from PIL import Image

-from kedro.io import AbstractVersionedDataSet
+from kedro.io import AbstractVersionedDataset
from kedro.io.core import get_filepath_str, get_protocol_and_path, Version


-class ImageDataSet(AbstractVersionedDataSet[np.ndarray, np.ndarray]):
+class ImageDataSet(AbstractVersionedDataset[np.ndarray, np.ndarray]):
"""``ImageDataSet`` loads / save image data from a given filepath as `numpy` array using Pillow.
Example:
@@ -391,14 +391,14 @@ The difference between the original `ImageDataSet` and the versioned `ImageDataS
import numpy as np
from PIL import Image

-from kedro.io import AbstractDataSet
-from kedro.io import AbstractDataset
-from kedro.io.core import get_filepath_str, get_protocol_and_path
+from kedro.io import AbstractVersionedDataSet
+from kedro.io import AbstractVersionedDataset
+from kedro.io.core import get_filepath_str, get_protocol_and_path, Version


-class ImageDataSet(AbstractDataSet[np.ndarray, np.ndarray]):
+class ImageDataSet(AbstractVersionedDataSet[np.ndarray, np.ndarray]):
-class ImageDataSet(AbstractDataset[np.ndarray, np.ndarray]):
+class ImageDataSet(AbstractVersionedDataset[np.ndarray, np.ndarray]):
"""``ImageDataSet`` loads / save image data from a given filepath as `numpy` array using Pillow.

Example:
@@ -537,7 +537,7 @@ These parameters are then passed to the dataset constructor so you can use them 
import fsspec
-class ImageDataSet(AbstractVersionedDataSet):
+class ImageDataSet(AbstractVersionedDataset):
    def __init__(
        self,
        filepath: str,
2 changes: 1 addition & 1 deletion docs/source/extend_kedro/plugins.md
@@ -196,7 +196,7 @@ When you are ready to submit your code:
## Supported Kedro plugins

- [Kedro-Datasets](https://github.com/kedro-org/kedro-plugins/tree/main/kedro-datasets), a collection of all of Kedro's data connectors. These data
-connectors are implementations of the `AbstractDataSet`
+connectors are implementations of the `AbstractDataset`
- [Kedro-Docker](https://github.com/kedro-org/kedro-plugins/tree/main/kedro-docker), a tool for packaging and shipping Kedro projects within containers
- [Kedro-Airflow](https://github.com/kedro-org/kedro-plugins/tree/main/kedro-airflow), a tool for converting your Kedro project into an Airflow project
- [Kedro-Viz](https://github.com/kedro-org/kedro-viz), a tool for visualising your Kedro pipelines
4 changes: 2 additions & 2 deletions docs/source/kedro.io.rst
@@ -11,8 +11,8 @@ kedro.io
:toctree:
:template: autosummary/class.rst

-   kedro.io.AbstractDataSet
-   kedro.io.AbstractVersionedDataSet
+   kedro.io.AbstractDataset
+   kedro.io.AbstractVersionedDataset
   kedro.io.CachedDataSet
   kedro.io.CachedDataset
   kedro.io.DataCatalog
18 changes: 9 additions & 9 deletions docs/source/nodes_and_pipelines/nodes.md
@@ -287,18 +287,18 @@ def report_accuracy(y_pred: pd.Series, y_test: pd.Series):
</details>


-The `ChunkWiseDataset` is a variant of the `pandas.CSVDataset` where the main change is to the `_save` method that appends data instead of overwriting it. You need to create a file `src/<package_name>/chunkwise.py` and put this class inside it. Below is an example of the `ChunkWiseCSVDataset` implementation:
+The `ChunkWiseCSVDataset` is a variant of the `pandas.CSVDataSet` where the main change is to the `_save` method that appends data instead of overwriting it. You need to create a file `src/<package_name>/chunkwise.py` and put this class inside it. Below is an example of the `ChunkWiseCSVDataset` implementation:

```python
import pandas as pd

from kedro.io.core import (
get_filepath_str,
)
-from kedro.extras.datasets.pandas import CSVDataset
+from kedro.extras.datasets.pandas import CSVDataSet


-class ChunkWiseCSVDataset(CSVDataset):
+class ChunkWiseCSVDataset(CSVDataSet):
"""``ChunkWiseCSVDataset`` loads/saves data from/to a CSV file using an underlying
filesystem. It uses pandas to handle the CSV file.
"""
@@ -319,20 +319,20 @@ After that, you need to update the `catalog.yml` to use this new dataset.

```diff
+ y_pred:
-+ type: <package_name>.chunkwise.ChunkWiseCSVDataSet
++ type: <package_name>.chunkwise.ChunkWiseCSVDataset
+ filepath: data/07_model_output/y_pred.csv
```

-With these changes, when you run `kedro run` in your terminal, you should see `y_pred`` being saved multiple times in the logs as the generator lazily processes and saves the data in smaller chunks.
+With these changes, when you run `kedro run` in your terminal, you should see `y_pred` being saved multiple times in the logs as the generator lazily processes and saves the data in smaller chunks.

```
...
INFO Loading data from 'y_train' (MemoryDataset)... data_catalog.py:475
INFO Running node: make_predictions: make_predictions([X_train,X_test,y_train]) -> [y_pred] node.py:331
-INFO Saving data to 'y_pred' (ChunkWiseCSVDataSet)... data_catalog.py:514
-INFO Saving data to 'y_pred' (ChunkWiseCSVDataSet)... data_catalog.py:514
-INFO Saving data to 'y_pred' (ChunkWiseCSVDataSet)... data_catalog.py:514
+INFO Saving data to 'y_pred' (ChunkWiseCSVDataset)... data_catalog.py:514
+INFO Saving data to 'y_pred' (ChunkWiseCSVDataset)... data_catalog.py:514
+INFO Saving data to 'y_pred' (ChunkWiseCSVDataset)... data_catalog.py:514
INFO Completed 2 out of 3 tasks sequential_runner.py:85
-INFO Loading data from 'y_pred' (ChunkWiseCSVDataSet)... data_catalog.py:475
+INFO Loading data from 'y_pred' (ChunkWiseCSVDataset)... data_catalog.py:475
... runner.py:105
```
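
The hunk above elides the `_save` override itself; a minimal sketch of an append-mode save, assuming a local filesystem and pandas' `to_csv` with `mode="a"` (an assumption, not this file's actual code), could look like:

```python
import os

import pandas as pd

from kedro.extras.datasets.pandas import CSVDataSet
from kedro.io.core import get_filepath_str


class ChunkWiseCSVDataset(CSVDataSet):
    """Appends each chunk to the target CSV instead of overwriting it (sketch)."""

    def _save(self, data: pd.DataFrame) -> None:
        save_path = get_filepath_str(self._filepath, self._protocol)
        # Write the header only on first write; append on subsequent chunks
        if os.path.exists(save_path):
            data.to_csv(save_path, mode="a", index=False, header=False)
        else:
            data.to_csv(save_path, index=False)
```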