Add Passing Data Between Assets Guide #23598
Warning: This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite. This stack of pull requests is managed by Graphite.
@@ -0,0 +1,122 @@
---
@jamiedemaria Could I bother you to review this for accuracy?
It also creates a new Component called CodeExample which lets you embed and highlight code blocks.
The length, format, and code examples make sense to me. Minor comments in-line, and one major suggestion: add a fourth example of assets that don't pass (or process) data at all. I suggest this fourth example because I think it would resonate with teams coming from Airflow, and it "fills out" the mental model for how assets can be used to implement data pipelines.
docs/docs-next/docs/guides/data-assets/passing-data-between-assets.md
In Dagster, assets are the building blocks of your data pipeline and it's common to want to pass data between them. This guide will help you understand how to pass data between assets.

There are three ways of passing data between assets:
I think it would be useful to include a fourth "fake" case which is "you do not pass data between assets because your pipeline is not processing data directly"
This would be something like:
from dagster import asset


@asset
def people():
    """Call the Lambda function that loads people."""


@asset
def birds():
    """Call the Lambda function that loads birds."""


# deps declares the dependency without passing any data in memory.
@asset(deps=[people, birds])
def people_and_birds():
    """Call the stored procedure that concatenates people and birds."""
…sets.md Co-authored-by: Sean Lopp <lopp@elementl.com>
…sets.md Co-authored-by: Sean Lopp <lopp@elementl.com>
Merging this to keep things moving; feel free to continue reviewing, however.
I really like this breakdown into the different approaches. I agree with lopp that the fourth case would be really useful.
One thing I'm curious about is framing this as "passing data" vs. storing and loading assets. I think "passing data" is probably what a new Dagster user would be looking to figure out how to do, but it also implies that the data is transitory. I think the final example does a good job of communicating "the output of each asset should be the actual data asset you want stored, not an intermediate state". Maybe a paragraph at the beginning that sets up that framework for the reader would be helpful.
Also left a couple small copy-edit things I noticed while reading
This example works for local development, but in a production environment
each step would execute in a separate environment and would not have access to the same file system. Consider a cloud-hosted environment for production purposes.
"Consider a cloud hosted environment for production purposes"
nit: is this referring to cloud hosted storage? using "environment" to refer to two different things (where the computation is performed and where the database is hosted) confused me a bit
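To make the distinction in this thread concrete, here is a minimal sketch of passing data through explicit storage, written in plain Python with no Dagster decorators. Everything in it is an assumption for illustration: the `orders` table, the function names, and the temporary SQLite file are hypothetical and are not taken from the guide's example file. The point is that the storage location, not the compute environment, is what both steps must be able to reach.

```python
import os
import sqlite3
import tempfile

# Hypothetical storage location. A local file only works when every step
# sees the same file system; in production this would be shared storage.
DB_PATH = os.path.join(tempfile.mkdtemp(), "example.db")


def upstream_step(db_path: str) -> None:
    """Write rows to storage instead of returning them."""
    conn = sqlite3.connect(db_path)
    with conn:  # commits the transaction on success
        conn.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER, amount REAL)")
        conn.execute("DELETE FROM orders")
        conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 20.0)])
    conn.close()


def downstream_step(db_path: str) -> float:
    """Read what the upstream step stored; nothing is passed in memory."""
    conn = sqlite3.connect(db_path)
    (total,) = conn.execute("SELECT SUM(amount) FROM orders").fetchone()
    conn.close()
    return total


upstream_step(DB_PATH)
total = downstream_step(DB_PATH)
```

If the two steps ran on separate machines, `DB_PATH` would have to point at storage both can reach, which is one reading of the "cloud-hosted" sentence quoted above.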
2. **Output**: Writing data to the configured storage location.

For a deeper understanding of IO Managers, check out the [Understanding IO Managers](/concepts/io-managers) guide.
nit: IO -> I/O (couple other places where this applies as well)
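For readers of this thread who have not used I/O managers, the following toy class sketches the idea of one object owning both the write and the read. It is a plain-Python stand-in, not Dagster's class: the name `PickleFileIOManager`, the method signatures, and the pickle-file layout are all assumptions made for illustration, and the read side is inferred from the "Output" item quoted above.

```python
import os
import pickle
import tempfile


class PickleFileIOManager:
    """Toy stand-in for an I/O manager; not Dagster's actual class."""

    def __init__(self, base_dir: str) -> None:
        self.base_dir = base_dir

    def _path(self, asset_name: str) -> str:
        return os.path.join(self.base_dir, f"{asset_name}.pickle")

    def handle_output(self, asset_name: str, obj) -> None:
        # Output: write the value to the configured storage location.
        with open(self._path(asset_name), "wb") as f:
            pickle.dump(obj, f)

    def load_input(self, asset_name: str):
        # Input: read the stored value back for a downstream step.
        with open(self._path(asset_name), "rb") as f:
            return pickle.load(f)


def people() -> list:
    """Hypothetical upstream step: returns its data."""
    return ["ada", "grace"]


def people_count(people: list) -> int:
    """Hypothetical downstream step: receives the data as an argument."""
    return len(people)


io_manager = PickleFileIOManager(tempfile.mkdtemp())

# The steps never touch storage themselves; the manager sits between them.
io_manager.handle_output("people", people())
count = people_count(io_manager.load_input("people"))
```

The design point is that swapping the storage backend means swapping the manager, while `people` and `people_count` stay unchanged.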
<CodeExample filePath="guides/data-assets/passing-data-assets/passing-data-rewrite-assets.py" language="python" title="Avoid Passing Data Between Assets" />

This approach still handles passing data explicitly, but no longer does it across assets,
I think there's a word missing in this sentence
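As a rough illustration of what the quoted sentence seems to describe, the sketch below keeps the data passing inside a single unit of work. It uses plain Python, and the helper names (`download_files`, `unzip_files`, `load_data`) and their toy return values are hypothetical; they are not the contents of `passing-data-rewrite-assets.py`.

```python
def download_files() -> str:
    """Hypothetical helper: pretend to fetch a raw payload."""
    return "raw,data"


def unzip_files(raw: str) -> list:
    """Hypothetical helper: pretend to unpack the payload."""
    return raw.split(",")


def load_data(parts: list) -> dict:
    """Hypothetical helper: pretend to load the unpacked parts."""
    return {"rows": len(parts)}


def my_dataset() -> dict:
    """Stands in for one asset: data moves between ordinary function
    calls here, so nothing has to be stored between separate assets."""
    raw = download_files()
    parts = unzip_files(raw)
    return load_data(parts)


result = my_dataset()
```

Only the final return value of `my_dataset` would be the stored asset; the intermediate values stay in memory.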
To follow the steps in this guide, you'll need:

- A basic understanding of Dagster concepts such as assets and resources
Something I really appreciate when I read other docs sites is when they link to a page that has the information I need to get a basic understanding (so maybe the concept pages for us?). Then I can open it and be like "I've read this, I'm good", or if it's something I know nothing about, I am given the resource I need to learn it.
Added an initial guide for quick feedback.
Added a collapsible pre-req block.
Walks through three different ways of passing data between assets. I don't know how I feel about the last one.
Also created a new component called CodeExample, which lets you embed and highlight code blocks.
I have taken inspiration on how to write How To Guides from here: https://diataxis.fr/how-to-guides/
Outstanding Questions
How To
?