Currently the `Result` interface is a convenient way to write values to permanent storage whilst keeping your code agnostic to the actual location via templating and result subclassing. It's one of Prefect's best features. However, it is somewhat inflexible and inefficient when working with large amounts of data, and I think it could be improved with some simple extensions.
Currently the only way to write data to a `Result` is to use the `write()` interface. If we imagine we have a very large `pd.DataFrame` that we want to persist to a `Result` in S3, then a simplified version of the current implementation in Prefect looks like this:
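Roughly, and only as a sketch, the write path has the shape below. The classes are hypothetical stand-ins (not Prefect's real `S3Result` or a real boto3 client), `pickle` is assumed as the serializer, and a list of dicts stands in for the large dataframe:

```python
import io
import pickle


class InMemoryS3Client:
    """Hypothetical stand-in for an S3 client; keeps uploads in a dict."""

    def __init__(self):
        self.objects = {}

    def upload_fileobj(self, fileobj, Bucket, Key):
        # Nothing leaves the process; the bytes are simply stored under (Bucket, Key).
        self.objects[(Bucket, Key)] = fileobj.read()


class BufferedS3Result:
    """Illustrative S3-backed result that serializes first and uploads second."""

    def __init__(self, bucket, location, client):
        self.bucket = bucket
        self.location = location  # template, e.g. "{flow_name}/{task_name}.pickle"
        self.client = client

    def write(self, value, **context):
        # 1. Render the templated location into a concrete key.
        key = self.location.format(**context)
        # 2. Serialize the whole value into one bytes object.
        #    The original value and its serialized copy are both in memory here.
        payload = pickle.dumps(value)
        # 3. Wrap the bytes in a file-like object and upload it.
        self.client.upload_fileobj(io.BytesIO(payload), Bucket=self.bucket, Key=key)
        return key


client = InMemoryS3Client()
result = BufferedS3Result("my-bucket", "{flow_name}/{task_name}.pickle", client)
rows = [{"id": i, "value": i * 2} for i in range(1000)]  # stand-in for a large DataFrame
key = result.write(rows, flow_name="etl", task_name="load")
```

Step 2 is the part that hurts with large data: nothing is uploaded until the full serialized copy exists.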
This isn't ideal. We keep the large dataframe in memory whilst we serialize it to a bytearray, which we then upload via boto3. A faster and more efficient flow would be this:
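With pandas this would typically be a single call that hands a path or URI to a writer such as `DataFrame.to_csv` or `DataFrame.to_parquet`. To keep the illustration dependency-free, the sketch below shows the same idea with only the standard library: rows go straight to the destination as they are produced, and no full serialized copy is ever built. The helper names and the local temporary path are assumptions for the example.

```python
import csv
import os
import tempfile


def iter_rows(n):
    """Yield rows one at a time, standing in for a dataset too large to copy."""
    for i in range(n):
        yield {"id": i, "value": i * 2}


def stream_to_location(rows, path):
    """Write rows directly to the destination and return how many were written."""
    count = 0
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["id", "value"])
        writer.writeheader()
        for row in rows:
            # Each row is written as it arrives; only one row is held at a time.
            writer.writerow(row)
            count += 1
    return count


path = os.path.join(tempfile.mkdtemp(), "load.csv")
written = stream_to_location(iter_rows(1000), path)
```

The important difference from the buffered version is that the writer needs to be told where to write, which is exactly the piece of information the result keeps to itself.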
This skips the serializer entirely and lets pandas stream the dataframe into storage without needing to buffer the entire thing in memory. When writing to filesystems it can take advantage of memory mapping or other optimizations not present with the simple "serialize to a bytearray and call write()". The same applies for Dask dataframes.
But this is annoying to do with the current interface, as a lot of the logic around creating a path to the storage location is locked away in `.write()`. Here's how you might do it right now:
This isn't great! It breaks the result abstraction completely. Concretely, this is what I want to be able to do:
And I'd like this to work locally (i.e. a filesystem) or in production (using object storage), without needing to adapt the code. The result of the flow is the output location where the dataframe can be found, not the dataframe itself.
To put it succinctly: it would be nice to be able to retrieve a URI representing a result in a completely agnostic way, so that code in my tasks can write data to this location without being forced to use the `Result.write()` method directly or indirectly.
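To make the request concrete, here is a minimal sketch of the contrast. Everything in it is hypothetical: the `Sketch*` classes are stand-ins rather than Prefect's result classes, and `uri()` is the proposed method, not something that exists today. `workaround_today` shows the kind of backend-specific code a task currently needs, and `output_uri` shows the backend-agnostic shape being asked for.

```python
class SketchResult:
    """Hypothetical base class: a templated location plus a backend-specific URI."""

    def __init__(self, location):
        self.location = location  # template, e.g. "{flow_name}/{task_name}.parquet"

    def uri(self, **context):
        raise NotImplementedError


class SketchLocalResult(SketchResult):
    def __init__(self, directory, location):
        super().__init__(location)
        self.directory = directory.rstrip("/")

    def uri(self, **context):
        return f"file://{self.directory}/{self.location.format(**context)}"


class SketchS3Result(SketchResult):
    def __init__(self, bucket, location):
        super().__init__(location)
        self.bucket = bucket

    def uri(self, **context):
        return f"s3://{self.bucket}/{self.location.format(**context)}"


def workaround_today(result, **context):
    # Breaks the abstraction: the task must know the result is S3-backed
    # and rebuild the URI from the result's internals.
    key = result.location.format(**context)
    return f"s3://{result.bucket}/{key}"


def output_uri(result, **context):
    # Proposed shape: the task only asks the result where to write.
    return result.uri(**context)


context = {"flow_name": "etl", "task_name": "load"}
template = "{flow_name}/{task_name}.parquet"
local = SketchLocalResult("/data/results", template)
remote = SketchS3Result("my-bucket", template)

local_uri = output_uri(local, **context)    # "file:///data/results/etl/load.parquet"
remote_uri = output_uri(remote, **context)  # "s3://my-bucket/etl/load.parquet"
```

A task could then pass the returned URI to whatever writer it likes and return the URI itself as its result. Swapping `local` for `remote` changes nothing in the task code, whereas `workaround_today` fails as soon as the result is not S3-backed.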