[Docs]Document clarifying notes about the data lifecycle (#5922)
* add information about deleting raw data in data_management.rst

Signed-off-by: Alex Wu <c.alexwu@gmail.com>

* fix example code error

Signed-off-by: Alex Wu <c.alexwu@gmail.com>

* delete example code task decorator arguments

Signed-off-by: Alex Wu <c.alexwu@gmail.com>

* adjust the location of own datastores related information

Signed-off-by: Alex Wu <c.alexwu@gmail.com>

---------

Signed-off-by: Alex Wu <c.alexwu@gmail.com>
popojk authored Oct 31, 2024
1 parent 2b0af2b commit 6b81b16
Showing 1 changed file with 23 additions and 18 deletions.
41 changes: 23 additions & 18 deletions docs/user_guide/concepts/main_concepts/data_management.rst
@@ -159,17 +159,6 @@ Between Tasks
 
 .. image:: https://raw.githubusercontent.com/flyteorg/static-resources/9cb3d56d7f3b88622749b41ff7ad2d3ebce92726/flyte/concepts/data_movement/flyte_data_transfer.png
 
-
-Bringing in Your Own Datastores for Raw Data
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-Flytekit has a pluggable data persistence layer.
-This is driven by PROTOCOL.
-For example, it is theoretically possible to use S3 ``s3://`` for metadata and GCS ``gcs://`` for raw data. It is also possible to create your own protocol ``my_fs://``, to change how data is stored and accessed.
-But for Metadata, the data should be accessible to Flyte control plane.
-
-Data persistence is also pluggable. By default, it supports all major blob stores and uses an interface defined in Flytestdlib.
-
 Practical Example
 ~~~~~~~~~~~~~~~~~
 
@@ -180,19 +169,18 @@ The first task reads a file from the object store, shuffles the data, saves to l
 .. code-block:: python
 
     @task()
-    def task_remove_column(input_file: FlyteFile, column_name: str) -> FlyteFile:
+    def task_read_and_shuffle_file(input_file: FlyteFile) -> FlyteFile:
         """
-        Reads the input file as a DataFrame, removes a specified column, and outputs it as a new file.
+        Reads the input file as a DataFrame, shuffles the rows, and writes the shuffled DataFrame to a new file.
         """
         input_file.download()
         df = pd.read_csv(input_file.path)
 
-        # remove column
-        if column_name in df.columns:
-            df = df.drop(columns=[column_name])
+        # Shuffle the DataFrame rows
+        shuffled_df = df.sample(frac=1).reset_index(drop=True)
 
-        output_file_path = "data_finished.csv"
-        df.to_csv(output_file_path, index=False)
+        output_file_path = "data_shuffle.csv"
+        shuffled_df.to_csv(output_file_path, index=False)
 
         return FlyteFile(output_file_path)
     ...
@@ -241,3 +229,20 @@ First task output metadata:
 Second task input metadata:
 
 .. image:: https://raw.githubusercontent.com/flyteorg/static-resources/9cb3d56d7f3b88622749b41ff7ad2d3ebce92726/flyte/concepts/data_movement/flyte_data_movement_example_input.png
+
+Bringing in Your Own Datastores for Raw Data
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Flytekit has a pluggable data persistence layer.
+The backend is selected by the protocol of the data URI.
+For example, it is possible in principle to use S3 (``s3://``) for metadata and GCS (``gcs://``) for raw data. You can also create your own protocol, such as ``my_fs://``, to change how data is stored and accessed.
+Metadata, however, must remain accessible to the Flyte control plane.
+
+Data persistence is also pluggable. By default, it supports all major blob stores and uses an interface defined in Flytestdlib.
+
+Deleting Raw Data in Your Own Datastores
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Flyte does not provide a built-in way to delete raw data stored in external datastores such as ``S3`` or ``GCS``. Instead, you can manage deletion by configuring a lifecycle policy in your datastore service.
+
+If caching is enabled for a Flyte ``task``, set ``max-cache-age`` to a value shorter than the retention period of the datastore's lifecycle policy. Otherwise, a cache hit may reference raw data that has already been deleted.
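
Editor's note on the new "Deleting Raw Data" section: the relationship between the cache age and the lifecycle policy can be checked mechanically. The following is a minimal sketch, not part of this commit. It assumes an S3-style lifecycle rule that expires objects after a whole number of days. The helper names, the ``raw-data/`` prefix, and the 7-day and 30-day values are hypothetical.

```python
from datetime import timedelta


def lifecycle_rule(prefix: str, expire_after: timedelta) -> dict:
    """Build one S3-style lifecycle rule that expires objects under `prefix`."""
    if expire_after < timedelta(days=1):
        raise ValueError("lifecycle expiration is expressed in whole days")
    return {
        "ID": f"expire-{prefix.strip('/') or 'all'}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        # Fractions of a day are truncated here.
        "Expiration": {"Days": expire_after.days},
    }


def cache_outlives_data(max_cache_age: timedelta, rule: dict) -> bool:
    """Return True if a cached result could outlive the raw data it refers to."""
    retention = timedelta(days=rule["Expiration"]["Days"])
    return max_cache_age >= retention


# Hypothetical values: raw data expires after 30 days, cache entries after 7.
rule = lifecycle_rule("raw-data/", timedelta(days=30))
max_cache_age = timedelta(days=7)

if cache_outlives_data(max_cache_age, rule):
    raise SystemExit("max-cache-age must be shorter than the lifecycle retention")
print("cache age is safely shorter than the lifecycle retention")
```

The rule dictionary follows the shape that S3 lifecycle configurations use, so with boto3 it could be passed as one entry of ``LifecycleConfiguration={"Rules": [rule]}`` to ``put_bucket_lifecycle_configuration``. Other datastores, GCS included, use a different rule format.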
