Kubernetes Airflow – Local Development Setup

Apache Airflow is currently one of the most popular task orchestration tools available. Especially for data warehouses and their data transformation processes, Airflow helps to schedule tasks in a defined order. Developers model their data transformation steps as a Directed Acyclic Graph (DAG), written in Python. The Celery Executor is the most popular choice if you are running Airflow in a containerised or “bare-metal” environment. However, this “traditional” executor has scaling limitations: it supports only a single worker type, and it scales based on the number of tasks running on the cluster instead of the actual resource needs of the individual tasks. The newer Kubernetes Executor solves this issue by relying on the capabilities that Kubernetes provides natively. In this tutorial, I show how you can set up Airflow in a Kubernetes cluster, with a particular focus on a local development setup.

Requirements

For this tutorial, I assume that you are familiar with:

  • the basics of Apache Airflow (DAGs, operators, executors)
  • Docker and a local Kubernetes cluster (e.g. Docker Desktop, minikube or kind)
  • Helm as the package manager for Kubernetes

Airflow Setup

For this tutorial, we are using the community Airflow Helm chart to set Airflow up. We will work with the folder structure below; I will explain it step by step later:

my-airflow/
├── dags/
│   └── my_dag.py
├── helm/
│   ├── Chart.yaml
│   └── values.yaml
├── plugins/
│   └── my_plugin.py
├── Dockerfile
├── requirements.txt
└── skaffold.yaml

The folder structure is similar to the usual Airflow structure with its dags/ and plugins/ folders. However, there is now also a helm/ folder that contains the Airflow package for our Kubernetes cluster. For that, I created my own Helm chart that pulls in the community Helm chart as a dependency. The big benefit of this approach is that the chart version can be updated easily and that we only have to override the few values we actually care about instead of dealing with the large number of input parameters:

# helm/Chart.yaml
apiVersion: "v2"
name: "my-airflow"
description: "Helm chart for my-airflow"
type: application
version: "1.0.0"
appVersion: "7.14.1"
dependencies:
  - name: airflow
    version: 7.14.1
    repository: "https://airflow-helm.github.io/charts"

The Chart.yaml file declares the Airflow chart as a dependency, and its default parameters are overwritten in the values.yaml file:

# helm/values.yaml
airflow:
  airflow:
    executor: KubernetesExecutor
  workers:
    enabled: false
  flower:
    enabled: false
  redis:
    enabled: false
  logs:
    persistence:
      enabled: true

As the executor, we are using the KubernetesExecutor, which allows us to spawn our tasks across different worker groups in our Kubernetes cluster and to use the scaling capabilities that Kubernetes supports natively. In addition, the workers can be disabled because the Airflow tasks no longer run on dedicated Airflow workers but on the worker nodes that are already set up for our cluster. We will talk later about an exemplary setup of such worker nodes. The monitoring tool Flower is also not required with the Kubernetes Executor because the tasks and their pods can now be monitored with native Kubernetes tools, such as Prometheus. The scheduling of tasks via a Redis queue across the nodes is replaced by the native Kubernetes pod scheduling, which allows us to disable this component as well.

In total, we see that Kubernetes can natively take over many parts that would otherwise be handled by extra Airflow components. This simplifies the overall architecture because you can use the same tools that you already use for your other services.
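
Because every task now runs in its own pod, resource requests and limits can also be defined per task via the operator's executor_config. The following is only a sketch with a hypothetical DAG, task and values; the keys follow the KubernetesExecutor of Airflow 1.10:

# dags/resource_demo_dag.py - sketch: per-task pod resources with the KubernetesExecutor
# (hypothetical DAG and task names, adjust to your own workload)
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

dag = DAG(
    dag_id="resource_demo",
    start_date=datetime(2020, 1, 1),
    schedule_interval=None,
)

heavy_transform = PythonOperator(
    task_id="heavy_transform",
    python_callable=lambda: print("heavy work"),  # placeholder for the real transformation
    executor_config={
        "KubernetesExecutor": {
            "request_cpu": "500m",
            "request_memory": "1Gi",
            "limit_cpu": "1",
            "limit_memory": "2Gi",
        }
    },
    dag=dag,
)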

We can already run this setup in our cluster by executing:

helm dep up ./helm                       # fetch the Airflow chart dependency
helm upgrade --install airflow ./helm    # install/upgrade the release

After that, Airflow should be up and running and we should be able to access the webserver. To do so, we first need to forward the service to our local machine:

kubectl port-forward svc/airflow-web 8080:8080

Now you can already reach your Airflow installation at http://localhost:8080.
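
If the webserver is not reachable yet, it usually helps to check whether all Airflow pods have started successfully (the exact pod names depend on your release name and chart version):

kubectl get pods     # webserver, scheduler and database pods should reach the "Running" state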

Local Development

After setting Airflow up, we also want to develop locally on our machines. As a developer, I am used to live-reload functionality for my code, but how can we achieve this for our Kubernetes Airflow installation?

One answer is the tool Skaffold. Similar to docker-compose, it provides the capability of syncing files into containers, which enables developers to sync their code changes directly into the running Kubernetes pods without restarting them. For that, Skaffold only requires a skaffold.yaml file:

# skaffold.yaml
apiVersion: skaffold/v2beta1
kind: Config
build:
  artifacts:
    - image: airflow
      context: ./
      sync:
        manual:
          - src: "dags/**/*.py"
            dest: dags
            strip: dags/
          - src: "plugins/**/*.py"
            dest: plugins
            strip: plugins/
  local:
    useDockerCLI: true
deploy:
  helm:
    releases:
      - name: airflow
        chartPath: helm
        skipBuildDependencies: true
        values:
          airflow.airflow.image: airflow
        setValueTemplates:
          airflow.dags.persistence.enabled: false
          airflow.logs.persistence.enabled: true
          airflow.airflow.config.AIRFLOW__KUBERNETES__DAGS_IN_IMAGE: " True"
          airflow.airflow.config.GUNICORN_CMD_ARGS: "--log-level DEBUG"
        imageStrategy:
          helm: {}
portForward:
  - resourceType: service
    resourceName: airflow-web
    port: 8080
    localPort: 8080

This file does three things:

  • Building a Docker image locally using the Dockerfile in the root folder (including a rebuild after changes to the Dockerfile)
  • Syncing Python files inside the dags/ and plugins/ folders directly into the Airflow webserver and scheduler pods (see the example DAG below)
  • Overwriting Airflow parameters so that the DAGs are picked up from the Docker image; otherwise, the file sync of Skaffold would not work
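
To try the sync out, you can place a minimal DAG like the following sketch into the dags/ folder (the DAG and its task are placeholders); once Skaffold is running, any change to the file is copied into the running pods shortly after saving:

# dags/my_dag.py - minimal example DAG, adjust to your own tasks
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def print_hello():
    print("Hello from Airflow on Kubernetes!")


dag = DAG(
    dag_id="my_dag",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
)

hello_task = PythonOperator(
    task_id="hello_task",
    python_callable=print_hello,
    dag=dag,
)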

An exemplary Dockerfile could look like this:

# Dockerfile, only used for local development
FROM apache/airflow:1.10.12-python3.6

USER root

RUN apt-get update \
    && apt-get install -y gcc \
    && rm -rf /var/lib/apt/lists/*

USER airflow

# install Python requirements (remove if not required)
COPY requirements.txt .
RUN pip3 install --user -r requirements.txt
COPY dags dags

Once you have installed Skaffold following the official installation instructions, you can deploy the whole setup in your local Kubernetes cluster via

skaffold dev --port-forward

With that, you have deployed an Airflow development setup that should make every developer happy. You should now be able to access the webserver at http://localhost:8080.

Conclusion

This blog post described an easy Airflow setup on your local Kubernetes cluster. With this approach, you can develop your code easily thanks to the code sync capabilities, and you work in an environment that is very similar to your production environment.

I hope this tutorial helps you and simplifies the development of your Airflow DAGs a little bit. I also uploaded the whole code on Github: https://github.com/CapChrisCap/kubernetes-airflow-setup
If you have any additional questions, feel free to leave a comment or contact me 🙂

Outlook

In this blog post, we mainly covered the local development setup. Thanks to Kubernetes, the production setup does not look much different. However, the Airflow Helm chart also has some sweet features that you should definitely have a look at:

  • AWS S3 log storage (see the sketch below)
  • Releasing DAGs and plugins via git sync
  • Setting up Airflow configuration via Kubernetes Secrets
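
To give an idea of the first point: since the chart accepts Airflow settings under airflow.config as environment-variable-style keys (we already used this mechanism in the skaffold.yaml above), remote logging to S3 can be sketched roughly like this for Airflow 1.10; the bucket and connection id are placeholders:

# helm/values.yaml (excerpt) - remote task logs in S3; bucket and connection id are placeholders
airflow:
  airflow:
    config:
      AIRFLOW__CORE__REMOTE_LOGGING: "True"
      AIRFLOW__CORE__REMOTE_BASE_LOG_FOLDER: "s3://my-airflow-logs/logs"
      AIRFLOW__CORE__REMOTE_LOG_CONN_ID: "my_aws_conn"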

For more, I would highly recommend looking deeper into the Airflow Helm chart.

Regarding multiple worker nodes, I can only recommend using Terraform modules if you are provisioning the infrastructure yourself. On AWS, you can e.g. use the official EKS module, where you can easily specify multiple worker groups and instance types:

module "my-cluster" {
  source          = "terraform-aws-modules/eks/aws"
  cluster_name    = "my-cluster"
  cluster_version = "1.17"
  subnets         = ["subnet-abcde012", "subnet-bcde012a", "subnet-fghi345a"]
  vpc_id          = "vpc-1234556abcdef"

  worker_groups = [
    {
      instance_type = "m4.large"
      asg_max_size  = 5
    }
  ]
}
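
Back in the DAGs, a task can then be pinned to such a worker type via the executor_config of the KubernetesExecutor, for example by using the instance-type node label. Again, this is only a sketch with a hypothetical DAG and task; the label value has to match your worker group:

# dags/pinned_demo_dag.py - sketch: pin a task to the m4.large worker group
# (hypothetical DAG and task names; the node label value has to match your nodes)
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

dag = DAG(
    dag_id="pinned_demo",
    start_date=datetime(2020, 1, 1),
    schedule_interval=None,
)

pinned_task = PythonOperator(
    task_id="pinned_task",
    python_callable=lambda: print("running on an m4.large node"),  # placeholder
    executor_config={
        "KubernetesExecutor": {
            "node_selectors": {"beta.kubernetes.io/instance-type": "m4.large"},
        }
    },
    dag=dag,
)

With that, heavy transformation tasks can run on dedicated instance types while the rest of Airflow stays lightweight.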
