Apache Airflow is currently one of the most popular task orchestration tools available. Especially for data warehouses and their data transformation processes, Airflow helps to schedule tasks in a defined order. Developers model their transformation steps as a Directed Acyclic Graph (DAG), written in Python. The Celery Executor is the most popular choice when running Airflow in a containerised or “bare-metal” environment. However, this “traditional” executor does have scaling limitations: it supports only a single worker type, and it scales based on the number of tasks running on the cluster rather than on the actual resource needs of the individual tasks. The newer Kubernetes Executor solves this issue by relying on capabilities that Kubernetes provides natively. In this tutorial, I show how you can set up Airflow in a Kubernetes cluster, with a particular focus on a local development setup.
Requirements
For this tutorial, I assume that you are familiar with:
- Kubernetes
- Helm 3 as a package manager for Kubernetes
- General Airflow Architecture
Airflow Setup
For this tutorial, we are using the official Airflow Helm chart to set Airflow up. We will work with the folder structure below; I will explain it step by step:
```
my-airflow/
├── dags/
│   └── my_dag.py
├── helm/
│   ├── Chart.yaml
│   └── values.yaml
├── plugins/
│   └── my_dag.py
├── Dockerfile
├── requirements.txt
└── skaffold.yaml
```
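The `dags/my_dag.py` file is just a placeholder for your own workflows. As a minimal sketch (the DAG id, schedule, and task below are purely illustrative and not part of the original setup), it could look like this:

```python
# dags/my_dag.py - minimal illustrative DAG (names and schedule are placeholders)
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator  # Airflow 1.10.x import path

with DAG(
    dag_id="my_dag",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # a single dummy task so that the DAG shows up in the webserver
    hello = BashOperator(
        task_id="say_hello",
        bash_command="echo 'Hello from Airflow on Kubernetes!'",
    )
```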
The folder structure is similar to the usual Airflow structure with its `dags/` and `plugins/` folders. However, there is now also a `helm/` folder that contains the Airflow package for our Kubernetes cluster. For that, I created my own Helm chart that wraps the official Helm chart as a dependency. The big benefit of this approach is that the values of the official chart can be overridden easily, which hides the complexity of its large number of input parameters:
```yaml
# helm/Chart.yaml
apiVersion: "v2"
name: "my-airflow"
description: "Helm chart for my-airflow"
type: application
version: "1.0.0"
appVersion: "7.14.1"
dependencies:
  - name: airflow
    version: 7.14.1
    repository: "https://airflow-helm.github.io/charts"
```
The `Chart.yaml` file declares the Airflow chart as a dependency, and its default parameters are overridden in the `values.yaml` file:
```yaml
airflow:
  airflow:
    executor: KubernetesExecutor
  workers:
    enabled: false
  flower:
    enabled: false
  redis:
    enabled: false
  logs:
    persistence:
      enabled: true
```
As an executor, we are using the `KubernetesExecutor`, which allows us to spawn our tasks across different worker groups in our Kubernetes cluster, relying on the scaling capabilities that Kubernetes supports natively. The Airflow workers can be disabled because the tasks no longer run on dedicated Airflow workers but on the worker nodes that are already set up for our cluster; we will look at an exemplary setup of such worker nodes later. The monitoring tool Flower is also not required with the Kubernetes Executor, because the task pods and their nodes can be monitored with native Kubernetes tooling such as Prometheus. Finally, scheduling tasks across nodes via a Redis queue is replaced by native Kubernetes pod scheduling, which allows us to disable Redis as well.
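To make this concrete, here is a minimal sketch of how a single task could request its own resources and target a dedicated worker group via `executor_config`. The configuration keys follow the Airflow 1.10-style KubernetesExecutor settings (matching the image used later in this tutorial); the node label `workload-type: heavy` as well as the DAG and task names are purely hypothetical, so double-check the keys against your Airflow version:

```python
# dags/heavy_transform_dag.py - illustrative only; adapt resources and labels to your cluster
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.10.x import path


def transform():
    print("running a heavier transformation step")


with DAG(
    dag_id="heavy_transform_dag",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
) as dag:
    heavy_task = PythonOperator(
        task_id="heavy_transform",
        python_callable=transform,
        # each task runs in its own pod, so it can declare its own resource
        # requests and node selector (assumed Airflow 1.10 executor_config keys)
        executor_config={
            "KubernetesExecutor": {
                "request_cpu": "1",
                "request_memory": "2Gi",
                "limit_memory": "2Gi",
                "node_selectors": {"workload-type": "heavy"},
            }
        },
    )
```

This way, resource-hungry tasks can be scheduled onto a dedicated node group while lightweight tasks share smaller nodes.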
Overall, we see that Kubernetes natively takes over many responsibilities that would otherwise be handled by extra Airflow components. This simplifies your overall architecture because you can use the same tools that you already use for your other services.
You can already run this setup in your cluster by executing:
```bash
helm dep up ./helm   # install airflow dependency
helm upgrade --install airflow ./helm
```
After that, Airflow should be up and running and we should be able to access the webserver. For that, we first need to forward the service to our own machine:
```bash
kubectl port-forward svc/airflow-web 8080:8080
```
Now you can already reach your Airflow installation at http://localhost:8080.
Local Development
After setting up Airflow, we also want to develop locally on our machines. As a developer, I am used to live-reload functionality for my code, but how can we achieve this for our Kubernetes Airflow installation?
One answer is the tool Skaffold. Similar to docker-compose, it can sync files into containers, which enables developers to push their code changes directly into the running Kubernetes pods without restarting them. For that, Skaffold only requires a `skaffold.yaml` file:
```yaml
# skaffold.yaml
apiVersion: skaffold/v2beta1
kind: Config
build:
  artifacts:
    - image: airflow
      context: ./
      sync:
        manual:
          - src: "dags/**/*.py"
            dest: dags
            strip: dags/
          - src: "plugins/**/*.py"
            dest: plugins
            strip: plugins/
  local:
    useDockerCLI: true
deploy:
  helm:
    releases:
      - name: airflow
        chartPath: helm
        skipBuildDependencies: true
        values:
          airflow.airflow.image: airflow
        setValueTemplates:
          airflow.dags.persistence.enabled: false
          airflow.logs.persistence.enabled: true
          airflow.airflow.config.AIRFLOW__KUBERNETES__DAGS_IN_IMAGE: " True"
          airflow.airflow.config.GUNICORN_CMD_ARGS: "--log-level DEBUG"
        imageStrategy:
          helm: {}
portForward:
  - resourceType: service
    resourceName: airflow-web
    port: 8080
    localPort: 8080
```
This file does three things:
- Building a Docker image locally using the `Dockerfile` in the root folder (including rebuilds after changes to the Dockerfile)
- Syncing Python files inside the `dags/` and `plugins/` folders directly into the Airflow webserver and scheduler pods
- Overwriting Airflow parameters so that the DAGs are found inside the Docker image; otherwise, the Skaffold file sync would not work
An exemplary `Dockerfile` could look like this:
```dockerfile
# Dockerfile, only used for local development
FROM apache/airflow:1.10.12-python3.6

USER root
RUN apt-get update \
    && apt-get install -y gcc \
    && rm -rf /var/lib/apt/lists/*
USER airflow

# install Python requirements (omit if not required)
COPY requirements.txt .
RUN pip3 install --user -r requirements.txt

COPY dags dags
```
Once you have installed Skaffold following the official installation instructions, you can deploy the setup to your local Kubernetes cluster via:

```bash
skaffold dev --port-forward
```
With that, you have deployed an Airflow development setup that should make every developer happy. You should now be able to access the webserver at http://localhost:8080.
Conclusion
This blog post described an easy Airflow setup on your local Kubernetes cluster. With this approach, you can develop your code easily thanks to the code-sync capabilities, and you get the advantage of an environment that is very similar to your production environment.
I hope this tutorial helped you and simplifies the development of your Airflow DAGs a little bit. I also uploaded the whole code to GitHub: https://github.com/CapChrisCap/kubernetes-airflow-setup
If you have any additional questions, feel free to leave a comment or contact me 🙂
Outlook
In this blog post, we mainly covered the local development setup. Thanks to Kubernetes, the production setup does not look much different. However, the Airflow Helm chart also has some sweet features that you should definitely have a look at:
- AWS S3 Log Storage
- DAGs and plugins release via git sync
- Airflow configurations setup via Kubernetes Secrets
For more, I would highly recommend looking deeper into the Airflow Helm chart.
Regarding multiple worker nodes, I can only recommend using Terraform modules for provisioning the infrastructure. On AWS, you can, for example, use the official EKS module, where you can easily specify multiple worker types:
module "my-cluster" { source = "terraform-aws-modules/eks/aws" cluster_name = "my-cluster" cluster_version = "1.17" subnets = ["subnet-abcde012", "subnet-bcde012a", "subnet-fghi345a"] vpc_id = "vpc-1234556abcdef" worker_groups = [ { instance_type = "m4.large" asg_max_size = 5 } ] }