Apache Airflow

Apache Airflow is a platform for programmatically creating, scheduling and monitoring workflows. Airflow represents each workflow as a directed acyclic graph (DAG) of tasks.

Note: At the time of writing, Airflow was in incubator status, which means it had not yet been fully endorsed by the Apache Software Foundation. It has since graduated to a top-level Apache project (January 2019).

A directed acyclic graph is a construct of nodes and connectors (also called “edges”) in which every connector has a direction and no path along the connectors leads back to the node where it started. Because there are no cycles, the nodes can always be arranged in an order in which every connector points forward. Directed trees are a simple type of DAG.
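As a rough illustration of the "no cycles" property, the sketch below uses Python's standard-library graphlib module (Python 3.9 or later) to order a small task graph and to detect a cycle. It does not use Airflow; the task names and the helper functions execution_order and is_dag are made up for this example.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical task graph: each key maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"transform"},
}


def execution_order(graph):
    """Return one valid ordering of the nodes; raises CycleError on a cycle."""
    return list(TopologicalSorter(graph).static_order())


def is_dag(graph):
    """Return True if the dependency graph contains no cycle."""
    try:
        execution_order(graph)
        return True
    except CycleError:
        return False


order = execution_order(dependencies)
print(order)                                # "extract" always comes first
print(is_dag(dependencies))                 # True
print(is_dag({"a": {"b"}, "b": {"a"}}))     # False: a and b depend on each other
```

The relative order of "load" and "report" is not fixed, since neither depends on the other; only the dependency constraints are guaranteed.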

In an Airflow workflow, the output of one task can serve as the input of another. An ETL process fits this model and can be described as a DAG: the output of each step is used as the input of the next step, and you cannot loop back to a previous step.
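To make the "output of one step is the input of the next" idea concrete, here is a minimal plain-Python sketch of an extract, transform, load chain. It is not Airflow code; the sample data, the function names and the in-memory "warehouse" list are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical source data standing in for a real data source.
RAW_CSV = "name,amount\nalice,10\nbob,32\n"


def extract(raw_text):
    """Parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows):
    """Upper-case the names and convert the amounts to integers."""
    return [{"name": r["name"].upper(), "amount": int(r["amount"])} for r in rows]


def load(rows, warehouse):
    """Append the rows to an in-memory 'warehouse' and return the row count."""
    warehouse.extend(rows)
    return len(rows)


warehouse = []
rows = extract(RAW_CSV)          # step 1: no upstream dependency
clean = transform(rows)          # step 2: consumes the output of step 1
loaded = load(clean, warehouse)  # step 3: consumes the output of step 2
print(loaded, warehouse)
```

Each step only looks forward: nothing in load feeds back into extract, which is what keeps the process acyclic.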

Defining workflows in code provides easier maintenance, testing and versioning.

How is Apache Airflow Different?

Airflow is not a data streaming platform. Tasks represent data movement, but they do not move data themselves. For the same reason, Airflow is not an interactive ETL tool.

An Airflow workflow is a Python script that defines an Airflow DAG object. This object can then be used in Python to code the ETL process. Airflow uses Jinja templating, which provides built-in parameters and macros for use in task definitions. (Jinja is a templating language for Python, modeled after Django templates.)
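The idea behind templated parameters can be sketched without Airflow or Jinja installed. The snippet below is a simplified stand-in, not Jinja itself: it only replaces {{ name }} placeholders with values from a dictionary. The placeholder name ds mirrors Airflow's built-in macro for the run date in YYYY-MM-DD form; the render helper and the load_sales.py script are hypothetical.

```python
import re
from datetime import date


def render(template, context):
    """Replace each {{ name }} placeholder with the matching context value."""
    def substitute(match):
        return str(context[match.group(1)])

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


# Hypothetical command that a task might run once per scheduled day.
command_template = "python load_sales.py --day {{ ds }}"

# In Airflow the scheduler supplies the context; here it is built by hand.
context = {"ds": date(2024, 1, 15).isoformat()}

command = render(command_template, context)
print(command)  # python load_sales.py --day 2024-01-15
```

The benefit of templating is that the same task definition can be reused for every run, with the date filled in at run time rather than hard-coded.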

Apache Airflow is a generic data toolbox that supports custom plugins. These plugins can add features, interact with different data storage platforms (e.g. Amazon Redshift, MySQL), and handle more complex interactions with data and metadata.

Airflow integrates with Amazon Web Services (AWS) and Google Cloud Platform (GCP), including BigQuery, and has built-in connections to these services. AWS and GCP hooks and operators are available for Airflow, and additional integrations may become available as Airflow matures.

Apache Airflow ETL Steps

Building an ETL process with Airflow typically involves the following steps:

  1. Create and test a pipeline in Airflow with the necessary tasks and templates.
  2. Code and test the data movement (such as extract or load) needed for the pipeline.
  3. Merge the pipeline and data movement code into a code repository with a master scheduler.
  4. Modify the scheduler to trigger the Airflow pipeline and its data movement code at the times needed.
  5. Repeat steps 1-4 for each ETL process.

Airflow UI - DAG View and Task Duration

DAG view:

Airflow DAG View

Image source: https://airflow.incubator.apache.org/ui.html

Task duration:

Airflow Task Duration View

Image source: https://airflow.incubator.apache.org/ui.html

Learn More

To get started with Apache Airflow, see the official documentation and the tutorial.
