Allow external writes to a Result, bypassing .write() #4597

orf · 2021-05-28T12:43:13Z

Currently the Result interface is a convenient way to write values to permanent storage whilst keeping your code agnostic to the actual location via templating and result subclassing. It's one of Prefect's best features. However, it is somewhat inflexible and inefficient when working with large amounts of data, and I think it could be improved with some simple extensions.

Currently the only way to write data to a Result is to use the write() interface. If we imagine we have a very large pd.DataFrame that we want to persist to a Result in S3, then a simplified version of the current implementation in Prefect looks like this:

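s3_location = ...
bytes_io = io.BytesIO()
# The whole serialized dataframe is held in memory here
pandas_dataframe.to_csv(bytes_io)
# Rewind, otherwise the upload starts reading at the end of the buffer
bytes_io.seek(0)
s3_client.upload_fileobj(bytes_io, s3_location)
result.location = s3_location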

This isn't ideal. We keep the large dataframe in memory whilst we serialize it to a bytearray, which we then upload via boto3. A faster and more efficient flow would be this:

s3_location = f's3://{bucket}/{s3_key}'
pandas_dataframe.to_csv(s3_location)
result.location = s3_location

This skips the serializer entirely and lets pandas stream the dataframe into storage without needing to buffer the entire thing in memory. When writing to filesystems it can take advantage of memory mapping or other optimizations not present with the simple "serialize to a bytearray and call write()". The same applies for Dask dataframes.
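To make the difference concrete, here is a minimal sketch of the two write paths. It uses the standard library's csv module as a stand-in for pandas, and the names rows, write_buffered and write_streamed are made up for illustration. The buffered version builds the full serialized payload in memory before writing it, while the streamed version hands the destination to the writer so rows go out as they are produced.

```python
import csv
import io
import os
import tempfile


def rows(n):
    """Yield n small rows lazily; a stand-in for a large dataframe."""
    for i in range(n):
        yield [i, i * i]


def write_buffered(path, n):
    """Serialize everything into memory first, then write it in one go."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows(n))
    payload = buffer.getvalue()  # the full serialized copy lives in memory here
    with open(path, "w", newline="") as f:
        f.write(payload)
    return len(payload)


def write_streamed(path, n):
    """Give the writer the destination so rows are written as they are produced."""
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows(n))


with tempfile.TemporaryDirectory() as tmp:
    buffered_path = os.path.join(tmp, "buffered.csv")
    streamed_path = os.path.join(tmp, "streamed.csv")
    buffered_size = write_buffered(buffered_path, 1000)
    write_streamed(streamed_path, 1000)
    # Both paths produce identical files; only peak memory use differs.
    with open(buffered_path, newline="") as a, open(streamed_path, newline="") as b:
        same_output = a.read() == b.read()
```

The output is the same either way; the point is that the second path never needs the intermediate payload, which is the property that writing straight to a path or URI gives you.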

But this is annoying to do with the current interface, as a lot of the logic around creating a path to the storage location is locked away in .write(). Here's how you might do it right now:

if isinstance(result, LocalResult):
    path = os.path.join(result.dir, result.location)
elif isinstance(result, S3Result):
    path = f"s3://{result.bucket}/{result.location}"
pandas_dataframe.to_csv(path)

This isn't great! It breaks the result abstraction completely. Concretely, this is what I want to be able to do:

@task
def create_and_save_dataframe():
    frame = make_random_dataframe()
    result_object = prefect.context.result
    # get_uri() is the proposed method; it does not exist today
    output_uri = result_object.get_uri()
    frame.to_csv(output_uri)
    return result_object

with Flow("test") as flow:
    create_and_save_dataframe()

And I'd like this to work locally (i.e. on a filesystem) or in production (using object storage), without needing to adapt the code. The result of the flow is the output location where the dataframe can be found, not the dataframe itself.

To put it succinctly: it would be nice to be able to retrieve a URI representing a result in a completely agnostic way, so that code in my tasks can write data to this location without being forced to use the Result.write() method directly or indirectly.
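As a rough sketch of what such a lookup could look like, the snippet below resolves a URI per result type without any isinstance checks in task code. Everything in it is an assumption for illustration: get_uri is the hypothetical method proposed here, written as a free function, and LocalResultStandIn and S3ResultStandIn are stand-in classes that only mimic the dir, bucket and location attributes used above. They are not Prefect's classes.

```python
import os
from dataclasses import dataclass
from functools import singledispatch


# Stand-in classes (NOT Prefect's): they only carry the attributes
# referenced in this post.
@dataclass
class LocalResultStandIn:
    dir: str
    location: str


@dataclass
class S3ResultStandIn:
    bucket: str
    location: str


@singledispatch
def get_uri(result):
    """Hypothetical helper: return a URI that a task can write to directly."""
    raise NotImplementedError(f"no URI scheme for {type(result).__name__}")


@get_uri.register
def _(result: LocalResultStandIn):
    # Local results resolve to a plain filesystem path.
    return os.path.join(result.dir, result.location)


@get_uri.register
def _(result: S3ResultStandIn):
    # Object storage results resolve to an s3:// URI.
    return f"s3://{result.bucket}/{result.location}"


local_uri = get_uri(LocalResultStandIn(dir="/tmp/results", location="frame.csv"))
s3_uri = get_uri(S3ResultStandIn(bucket="my-bucket", location="frame.csv"))
```

In a real implementation this would more naturally live on each Result subclass, next to the path-building logic that .write() already has, so that task code only ever sees the URI.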
