Apache Airflow Demystify - 1 - basics

Understanding Airflow


What is Airflow?

Apache Airflow is a popular open-source orchestration tool for authoring, scheduling, and monitoring data pipelines/workflows.

It has several hallmarks of the modern data stack:

  • Workflow as code: you define workflows in Python, so they can be managed under proper version control.

  • Horizontal scalability: tasks can be distributed across multiple worker nodes.

Comparison with cron

Don't get me wrong, I love cron! But compared with cron, Airflow's advantages are summarized in the table below.

Feature         | Cron                               | Airflow
----------------|------------------------------------|--------
User Interface  | Limited, often command line-based  | Powerful web-based UI for managing workflows
Scalability     | Limited, runs on a single machine  | Horizontally scalable; can distribute tasks across multiple worker nodes
Task Management | Limited, basic task scheduling     | Sophisticated task scheduling with dependency management
Resources       | Low resource requirements          | High resource requirements
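
To make the comparison concrete, here is a minimal sketch (jumping slightly ahead; DAGs are covered below) of how a familiar crontab schedule maps onto Airflow. The dag_id and bash command are made up for illustration; the cron expression "0 0 * * *" (daily at midnight) is accepted as-is:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# The cron expression "0 0 * * *" (run daily at midnight) can be used
# directly as an Airflow schedule; "@daily" is an equivalent preset.
with DAG(
    dag_id="cron_style_schedule",        # hypothetical example DAG
    start_date=datetime(2023, 5, 16),
    schedule_interval="0 0 * * *",       # same syntax as a crontab entry
) as dag:
    BashOperator(task_id="say_hello", bash_command="echo hello")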

Enough of the introduction; let's dive into the details.

Airflow Main Components

The overall architecture of how a data engineer interacts with Airflow is illustrated in the diagram below.

You, as a data engineer, write DAGs in Python (covered later; think of them as code for now) to wire up your workflow, and you interact via the CLI or the webserver UI to author and schedule it.

Airflow has five main components, described in the table below.

Component         | Description
------------------|------------
Webserver         | Provides a user interface for data engineers to monitor and control workflows.
Scheduler         | Triggers tasks and manages dependencies between tasks ("dependencies" is just a fancy word for relationships and execution order in this context).
Executor          | Executes tasks on worker nodes and takes care of task distribution, resource allocation, and task queues.
Worker            | Executes tasks assigned to it by the executor and communicates with the executor to receive tasks and report status.
Metadata Database | Stores information about workflows, tasks, and their dependencies. Airflow ships with SQLite by default; Postgres is the usual production choice (tuple indexing for the win)!
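
To build intuition for how the scheduler, executor, and workers relate, here is a toy model in plain Python. This is not Airflow's actual implementation, just a sketch of the division of labor: the scheduler decides what is ready, the executor places it on a queue, and workers pull tasks and report back.

import queue
import threading

task_queue = queue.Queue()

def scheduler(tasks):
    # scheduler: decide which tasks are ready and hand them to the executor,
    # which (in this toy model) is just the shared queue
    for t in tasks:
        task_queue.put(t)

def worker(worker_id):
    # worker: pull tasks off the queue, run them, and report status
    while True:
        try:
            t = task_queue.get(timeout=1)
        except queue.Empty:
            return
        print(f"worker {worker_id} running {t}")
        task_queue.task_done()

scheduler(["extract", "transform", "load"])
threads = [threading.Thread(target=worker, args=(i,)) for i in range(2)]
for th in threads:
    th.start()
for th in threads:
    th.join()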

What is a DAG?

A directed acyclic graph (DAG) is a concept from the graph data structure. In this section, let's understand:

  • what is a graph?

  • how does a graph relate to Airflow?

  • what is a DAG in Airflow?

Graph concept in Airflow

To describe an arbitrary graph, you use nodes and edges. Mathematically, a graph is written as shown below

$$G=\left(V,E\right)$$

where G is the graph, V the set of vertices (nodes), and E the set of edges.

In the context of Airflow, nodes and edges are realized as tasks and dependencies.

$$\begin{align} node &= task\\ edge &= dependency \end{align}$$

A DAG is a special form of graph whose idea is prevalent in the modern data stack, e.g. Airflow, PySpark, and dbt (lineage graphs). A DAG consists of vertices and edges where every edge points from one vertex to another (directed) and no path ever loops back on itself (acyclic).
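
To see what "acyclic" buys you, here is a minimal sketch in plain Python (not Airflow code): tasks as nodes, dependencies as directed edges, and a topological sort (Kahn's algorithm) that succeeds only when the graph has no cycles. The task names are made up:

from collections import deque

# node -> list of nodes it points to (its downstream tasks)
edges = {
    "extract": ["transform"],
    "transform": ["load"],
    "load": [],
}

def topological_order(graph):
    # count incoming edges for every node
    in_degree = {node: 0 for node in graph}
    for downstream in graph.values():
        for node in downstream:
            in_degree[node] += 1
    # start from nodes with no dependencies
    ready = deque(n for n, d in in_degree.items() if d == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for downstream in graph[node]:
            in_degree[downstream] -= 1
            if in_degree[downstream] == 0:
                ready.append(downstream)
    if len(order) != len(graph):
        # some nodes were never freed up: there must be a cycle
        raise ValueError("graph contains a cycle, so it is not a DAG")
    return order

print(topological_order(edges))   # ['extract', 'transform', 'load']

This ordering is exactly what a scheduler needs: a valid sequence in which every task runs only after its upstream dependencies.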

A DAG is extremely useful for describing workflows; a real-world use case is illustrated in the figure below.

(Figure: Directed Acyclic Graph (DAG) Overview & Use Cases, via Hazelcast)

DAG concept in Airflow

As we discussed earlier, in Airflow each node is a task and each edge is a dependency, as illustrated in the figure below.

Each task is carried out by a single operator, such as a BashOperator to run a bash command or a PythonOperator to run a Python function (a PythonOperator sketch follows the snippet discussion below). In code, it looks something like this:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    'owner': 'adamzhang',
    'retries': 5,                          # retry a failed task up to 5 times
    'retry_delay': timedelta(minutes=2),   # wait 2 minutes between retries
}

with DAG(
    dag_id="our_first_dag_v3",
    default_args=default_args,
    description="this is our first dag we write",
    start_date=datetime(2023, 5, 16),
    schedule_interval='@daily',
) as dag:
    # set up our first node
    task1 = BashOperator(
        task_id='first_task',
        bash_command="echo hello dag",
    )

    task2 = BashOperator(
        task_id='second_task',
        bash_command="sleep 5",
    )

    task3 = BashOperator(
        task_id='third_task',
        bash_command="date",
    )

    # set up dependencies -- explicitly:
    # task1.set_downstream(task2)
    # task2.set_downstream(task3)

    # or, equivalently, with the bitshift operator:
    task1 >> task2
    task2 >> task3

From the code snippet above, you can see that:

  • each task has been defined with a BashOperator

  • dependencies have been defined with the bitshift operator (>>)
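
Since PythonOperator was mentioned earlier, here is a minimal sketch of what a Python-based task would look like if added inside the same with DAG(...) block above. The function greet and the task_id are made up for illustration:

from airflow.operators.python import PythonOperator

def greet():
    # hypothetical callable; any plain Python function works
    print("hello from python")

task4 = PythonOperator(
    task_id='python_task',
    python_callable=greet,       # the function the task will execute
)

task3 >> task4                   # run after third_task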

Say you want to run your DAG daily: each DAG run then has an execution date, and each task within that run becomes a task instance, as illustrated in the figure below.

The DAG code you have written is a blueprint (think of a class), and on each scheduled execution date an instance of the DAG is spun up and run.
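
A loose analogy in plain Python (not the Airflow API) may help here: the DAG file plays the role of a class, and each DAG run is an instance carrying its own execution date. The class name is made up:

from datetime import date

class DailyPipeline:
    # the "blueprint": defined once in code
    def __init__(self, execution_date: date):
        # each run is an instance with its own execution date
        self.execution_date = execution_date

run_may_16 = DailyPipeline(date(2023, 5, 16))   # dag run for 2023-05-16
run_may_17 = DailyPipeline(date(2023, 5, 17))   # dag run for 2023-05-17
print(run_may_16.execution_date, run_may_17.execution_date)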

Summary

In this article, we covered

  • what Airflow is and how it differs from cron

  • the main components of Airflow

  • the DAG concept in the context of Airflow

  • what tasks, dependencies, DAG runs, execution dates, and task instances are