Introduction to Apache Airflow for DLME
Introduction to what airflow is and how it is being used for DLME
- Enable/Disable DAG: A DAG will not run (even manually) unless enabled
- DAG name & tags: Clicking on the label will display the DAG
- Runs: Displayed in Successful/Running/Failed order. Clicking each will display a list of dag runs
- Schedule: How the DAG is scheduled, using cron syntax or a special preset (e.g. @yearly, @once)
- Last Run: Links to the DAG view for the most recent DAG run
- Play: Manually trigger the DAG
- Reload: Refresh the DAG definition
- Delete: Delete the DAG
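To make the Schedule column easier to read, here is a minimal sketch of how the common presets relate to five-field cron expressions. The lookup table mirrors the presets listed in the Airflow documentation; the function name `describe_schedule` is hypothetical and this is not how Airflow resolves schedules internally.

```python
from typing import Optional

# Cron equivalents of commonly documented Airflow schedule presets.
# "@once" has no cron form: the DAG is scheduled a single time.
CRON_PRESETS = {
    "@once": None,
    "@hourly": "0 * * * *",
    "@daily": "0 0 * * *",
    "@weekly": "0 0 * * 0",
    "@monthly": "0 0 1 * *",
    "@yearly": "0 0 1 1 *",
}


def describe_schedule(schedule: str) -> Optional[str]:
    """Return the cron expression behind a preset or a raw cron string.

    Returns None for "@once", which runs a single time rather than on a
    repeating cron schedule.
    """
    if schedule in CRON_PRESETS:
        return CRON_PRESETS[schedule]
    fields = schedule.split()
    if len(fields) != 5:
        raise ValueError(
            f"not a known preset or five-field cron expression: {schedule!r}"
        )
    return " ".join(fields)
```

For example, `describe_schedule("@yearly")` gives the expression for midnight on 1 January, while a raw cron string is passed through after a basic field-count check.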
NOTE: This is the default display when navigating into a DAG. As a DAG grows in complexity, the task display in the tree view can become hard to follow, though the grid view of DAG runs can be helpful when debugging.
This DAG view is generally more appealing and easier to understand. It displays and updates as a DAG runs, so the visual representation of where a particular DAG run is in the task list can be very helpful.
Here we see a simple DAG with five (5) tasks:
- configure_git
- validate_metadata_folder
- clone_metadata
- pull_metadata
- finished_pulling
The graph display makes it clear that validate_metadata_folder runs after configure_git and results in a branch between clone_metadata and pull_metadata. The final task, finished_pulling, is a DummyOperator: a placeholder task used for control flow.
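A DAG with this shape could be written roughly as follows. This is a sketch under assumptions, not the DLME source: it targets Airflow 2.x import paths, assumes the branch is decided by whether a git checkout already exists, and uses a hypothetical `METADATA_DIR`, a hypothetical DAG id, and placeholder `echo` commands in place of the real git steps.

```python
from datetime import datetime
from pathlib import Path

# Hypothetical location of the metadata checkout; not taken from the DLME code.
METADATA_DIR = Path("/tmp/metadata")


def choose_metadata_task(metadata_dir: Path = METADATA_DIR) -> str:
    """Return the task_id to follow: pull if a checkout exists, else clone."""
    if (metadata_dir / ".git").is_dir():
        return "pull_metadata"
    return "clone_metadata"


def build_metadata_dag():
    """Build the five-task DAG (requires Airflow 2.x to be installed)."""
    # Imported here so choose_metadata_task can be tested without Airflow.
    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.operators.dummy import DummyOperator
    from airflow.operators.python import BranchPythonOperator

    with DAG(
        dag_id="example_pull_metadata",  # hypothetical DAG name
        schedule_interval="@once",
        start_date=datetime(2021, 1, 1),
        catchup=False,
    ) as dag:
        configure_git = BashOperator(
            task_id="configure_git",
            bash_command="echo 'configure git here'",  # placeholder command
        )
        # The returned task_id is followed; the other branch is skipped.
        validate_metadata_folder = BranchPythonOperator(
            task_id="validate_metadata_folder",
            python_callable=choose_metadata_task,
        )
        clone_metadata = BashOperator(
            task_id="clone_metadata",
            bash_command="echo 'git clone here'",  # placeholder command
        )
        pull_metadata = BashOperator(
            task_id="pull_metadata",
            bash_command="echo 'git pull here'",  # placeholder command
        )
        finished_pulling = DummyOperator(
            task_id="finished_pulling",
            # Assumed rule: run as long as no upstream task failed,
            # even though one of the two branches is skipped.
            trigger_rule="none_failed",
        )

        configure_git >> validate_metadata_folder
        validate_metadata_folder >> [clone_metadata, pull_metadata]
        [clone_metadata, pull_metadata] >> finished_pulling
    return dag
```

In a real DAG file, the result would be assigned at module level (for example `dag = build_metadata_dag()`) so the scheduler can discover it. On newer Airflow releases `DummyOperator` has been superseded by `EmptyOperator`.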
The border color of the tasks in this display is important, and a key is provided at the top of the display. Here we see that configure_git, validate_metadata_folder, clone_metadata, and finished_pulling each have a dark green border, indicating SUCCESS. The pull_metadata task has a pink border, indicating SKIPPED.
This indicates that:
- configure_git ran and completed with a SUCCESS state.
- validate_metadata_folder then ran and completed with a SUCCESS state. It also returned a value that forced triggering of clone_metadata and skipping of pull_metadata.
- finished_pulling captured the flow between clone_metadata and pull_metadata and ended in a SUCCESS state.
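The states above can be reproduced with a small plain-Python model. It is a simplified illustration rather than Airflow's scheduler logic: the function names are hypothetical, and it assumes the join task uses a rule like Airflow's `none_failed` trigger rule, since under the default `all_success` rule a skipped upstream task would cause the join to be skipped as well.

```python
from typing import Dict, List

SUCCESS = "success"
SKIPPED = "skipped"
FAILED = "failed"
UPSTREAM_FAILED = "upstream_failed"

BRANCHES = ("clone_metadata", "pull_metadata")


def branch_states(chosen: str) -> Dict[str, str]:
    """State of each branch task: the chosen one runs, the other is skipped."""
    if chosen not in BRANCHES:
        raise ValueError(f"unknown branch: {chosen!r}")
    return {task: SUCCESS if task == chosen else SKIPPED for task in BRANCHES}


def join_state(upstream: List[str], trigger_rule: str = "all_success") -> str:
    """Simplified outcome of a join task, assuming it succeeds if it runs."""
    if any(state in (FAILED, UPSTREAM_FAILED) for state in upstream):
        return UPSTREAM_FAILED
    if trigger_rule == "none_failed":
        # Skipped upstream tasks do not block the join.
        return SUCCESS
    if trigger_rule == "all_success":
        # Any skipped upstream task causes the join to be skipped too.
        return SKIPPED if SKIPPED in upstream else SUCCESS
    raise ValueError(f"unsupported trigger rule: {trigger_rule!r}")


states = branch_states("clone_metadata")
states["finished_pulling"] = join_state(list(states.values()), "none_failed")
```

With `clone_metadata` chosen, `states` marks `pull_metadata` as skipped and `finished_pulling` as successful, matching the border colors described above.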
When displaying a DAG run, hovering over a task will display information about the task run:
Clicking on a task in graph view opens a modal to dive deeper into a task run instance. The most helpful features are:
- Log: Any log output from the individual task instance. This is very helpful because individual task logs are not lost in a full DAG or application log.
- Run: Run an individual task.
Here we see a DAG run display for a DAG that includes TaskGroups. TaskGroups represent a reusable task structure that can be included in many DAGs without code duplication.
In this case, the validate_metadata tasks from above are included within this DAG as a TaskGroup.
TaskGroups are a very useful tool for organizing complex sets of tasks, and for generating tasks, which makes writing DAGs easier. Here we see a complex set of 20 harvest tasks that were generated based on the provider configuration and run in parallel.
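Generating tasks inside a TaskGroup could look something like the sketch below. It assumes Airflow 2.x; the provider names, the DAG id, the `harvest_` naming scheme, and the placeholder `echo` command are all hypothetical stand-ins for the real provider configuration and harvest logic.

```python
from datetime import datetime
from typing import List

# Hypothetical provider names standing in for the real provider configuration.
PROVIDERS = ["provider_a", "provider_b", "provider_c"]


def harvest_task_ids(providers: List[str]) -> List[str]:
    """Derive one task_id per configured provider."""
    return [f"harvest_{name}" for name in providers]


def build_harvest_dag(providers: List[str] = PROVIDERS):
    """Build a DAG with one generated harvest task per provider."""
    # Imported here so harvest_task_ids can be tested without Airflow.
    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.utils.task_group import TaskGroup

    with DAG(
        dag_id="example_harvest",  # hypothetical DAG name
        schedule_interval=None,  # manual triggering only
        start_date=datetime(2021, 1, 1),
        catchup=False,
    ) as dag:
        with TaskGroup(group_id="harvest"):
            # No dependencies are set between these tasks, so the
            # scheduler is free to run them in parallel.
            for task_id in harvest_task_ids(providers):
                BashOperator(
                    task_id=task_id,
                    bash_command=f"echo 'harvest for {task_id}'",  # placeholder
                )
    return dag
```

Adding a provider to the configuration then adds a task to the group without any change to the DAG code itself.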