Introduction to Apache Airflow for DLME

Aaron Collier edited this page Sep 5, 2021 · 9 revisions

Apache Airflow for DLME

An introduction to what Airflow is and how it is used for DLME.

Resources

  • This repository
  • Airflow documentation
  • Airflow AWS Deployment

Terms

DAG Dashboard

DLME Airflow dashboard

Important features

  • Enable/Disable DAG: A DAG will not run (even manually) unless enabled
  • DAG name & tags: Clicking on the label will display the DAG
  • Runs: Displayed in Successful/Running/Failed order. Clicking each will display a list of DAG runs
  • Schedule: How this DAG is scheduled, using cron syntax or special presets (e.g. @yearly, @once)
  • Last Run: Links to DAG view from the most recent DAG run
  • Play: Manually trigger the DAG
  • Reload: Refresh the DAG definition
  • Delete: Delete the DAG
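
As a rough illustration of the Schedule column, the sketch below lists the cron expressions that the Airflow documentation gives for its common presets and splits an expression into named fields. The PRESETS table and the describe_schedule helper are hypothetical names used for illustration only; they are not part of Airflow or of the DLME repository.

```python
# Cron equivalents of common Airflow schedule presets, as listed in the
# Airflow documentation. "@once" has no cron equivalent: the DAG is
# scheduled exactly one time.
PRESETS = {
    "@once": None,
    "@hourly": "0 * * * *",
    "@daily": "0 0 * * *",
    "@weekly": "0 0 * * 0",
    "@monthly": "0 0 1 * *",
    "@yearly": "0 0 1 1 *",
}

CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def describe_schedule(schedule):
    """Return a dict naming each cron field, or None for a one-off schedule.

    Hypothetical helper for reading the Schedule column; not an Airflow API.
    """
    # Presets are translated to cron; anything else is treated as raw cron.
    cron = PRESETS.get(schedule, schedule)
    if cron is None:
        return None
    parts = cron.split()
    if len(parts) != 5:
        raise ValueError(f"expected 5 cron fields, got {len(parts)}: {cron!r}")
    return dict(zip(CRON_FIELDS, parts))


if __name__ == "__main__":
    print(describe_schedule("@yearly"))
    print(describe_schedule("30 2 * * 1"))
```

Read this way, a DAG scheduled with @yearly is due at midnight on January 1, while @once produces a single scheduled run.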

Default DAG Display - Tree View

NOTE: This is the default display when navigating into a DAG. As a DAG grows in complexity, the task display can become hard to follow in the tree view, though the grid view of DAG runs can be helpful when debugging.

DAG Display - Default, Tree View

DAG Display - Graph View

This DAG view is generally more appealing and easier to understand. It updates as a DAG runs, so the visual representation of where a particular DAG run is in the task list can be very helpful.

Here we see a simple DAG with five (5) tasks:

  • configure_git
  • validate_metadata_folder
  • clone_metadata
  • pull_metadata
  • finished_pulling

The graph display makes it clear that validate_metadata_folder runs after configure_git and results in a branch between clone_metadata and pull_metadata. The final task, finished_pulling, is a DummyOperator - a placeholder task used for control flow.

DAG Display - Graph View

The border color of the tasks in this display is important, and a key is provided at the top of the display. Here we see that configure_git, validate_metadata_folder, clone_metadata, and finished_pulling each have a dark green border indicating SUCCESS. The pull_metadata task has a pink border, indicating SKIPPED.

This indicates that:

  1. configure_git ran and completed with a SUCCESS state.
  2. validate_metadata_folder then ran and completed with a SUCCESS state. It also returned a value that forced triggering of clone_metadata and skipping of pull_metadata.
  3. finished_pulling joined the clone_metadata and pull_metadata branches and ended in a SUCCESS state.
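
The run described above can be modelled in plain Python. This is a minimal sketch and not the DLME DAG itself: it assumes the branch decision depends on whether a local metadata folder already exists, which the page does not state. In Airflow, a branching task is typically a BranchPythonOperator whose callable returns the task_id to follow; the function below has that shape. The simulate_run helper is hypothetical.

```python
import os


def validate_metadata_folder(path):
    """Branch decision: return the task_id that should run next.

    Assumption (not confirmed by the source): clone when there is no
    local metadata folder yet, otherwise pull into the existing one.
    """
    return "pull_metadata" if os.path.isdir(path) else "clone_metadata"


def simulate_run(path):
    """Return the final state of every task for one simulated DAG run."""
    states = {
        "configure_git": "success",
        "validate_metadata_folder": "success",
    }
    chosen = validate_metadata_folder(path)
    # Only the chosen branch runs; the other branch is marked skipped.
    for task_id in ("clone_metadata", "pull_metadata"):
        states[task_id] = "success" if task_id == chosen else "skipped"
    # The join task succeeds even though one upstream task was skipped.
    # In Airflow this needs a trigger rule that tolerates skipped
    # upstream tasks; which rule the real DAG uses is an assumption.
    states["finished_pulling"] = "success"
    return states


if __name__ == "__main__":
    # Hypothetical path, used only for illustration.
    for task_id, state in simulate_run("/tmp/no-such-metadata-folder").items():
        print(f"{task_id}: {state}")
```

When the folder is missing, the sketch reproduces the states shown in the graph view: every task ends in success except pull_metadata, which is skipped.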