Airflow on Kubernetes with Helm

In the cloud-native world, Kubernetes and Helm are becoming the de facto standards for deploying, managing, and scaling applications. As a data engineer, you may often find yourself needing to manage complex data pipelines. This is where Apache Airflow comes in. Airflow is a robust open-source platform that lets you programmatically author, schedule, and monitor workflows. By combining the power of Airflow with the resilience and scalability of Kubernetes, we can create a highly reliable data pipeline management system.

Why Deploy Airflow on Kubernetes?

Source: https://blog.locale.ai/we-were-all-using-airflow-wrong-and-now-its-fixed/

Deploying Airflow on Kubernetes has several advantages over other deployment methods. Traditionally, Airflow is deployed on virtual machines or bare-metal servers. As your data processing needs grow, you need to manually handle the scaling, which can be tedious and error-prone.

On the other hand, Kubernetes is an open-source platform for automating the deployment, scaling, and management of containerized applications. It provides a more efficient and seamless way to deploy, scale, and manage Airflow.

Benefits of Executors

Airflow comes with several types of executors, each with its own advantages. When deploying Airflow on Kubernetes, the Kubernetes executor brings significant benefits. The Kubernetes executor creates a new pod for every task instance, which means each task runs in isolation and uses resources optimally. You don’t have to worry about one task affecting another due to shared resources.

Source: https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/executor/kubernetes.html

This level of isolation makes debugging simpler. If a task fails, you can examine the pod’s logs and status without worrying about other tasks’ interference. Scaling becomes a breeze with the Kubernetes executor. It scales up when there are many tasks to run and scales down when there are fewer tasks. You only use the resources you need, leading to cost efficiency.
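If you want to see this behavior for yourself once the deployment described below is running, you can simply watch the airflow namespace while a DAG run is in progress; each task instance shows up as its own short-lived pod. A minimal check, assuming you install the chart into the airflow namespace as in the steps below:

# Watch pods in the airflow namespace during a DAG run. With the Kubernetes
# executor, every task instance runs in its own short-lived pod, created on
# demand and cleaned up when the task finishes.
kubectl get pods -n airflow --watch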

Deploying Airflow on Kubernetes with Helm

Helm is a package manager for Kubernetes that simplifies application deployment. It uses a packaging format called charts: a Helm chart is a collection of files that describe a related set of Kubernetes resources. In this tutorial, I will install and run Airflow on Google Kubernetes Engine (GKE).
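If you do not already have a GKE cluster, the sketch below shows one way to create a small Autopilot cluster and point kubectl at it; the cluster name airflow-cluster and the region us-central1 are placeholders, not values from this tutorial:

# Create a GKE Autopilot cluster (name and region are examples).
gcloud container clusters create-auto airflow-cluster --region us-central1

# Fetch credentials so kubectl (and Helm) target this cluster.
gcloud container clusters get-credentials airflow-cluster --region us-central1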

To deploy Airflow on Kubernetes using the official Airflow Helm chart, follow these steps:

Note: the community also maintains its own Helm chart, but this article uses only the official Helm chart from Airflow.

1. Install Helm

Depending on your operating system, you can find different installation instructions in the official Helm documentation.
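For example, on Linux or macOS you can use Helm’s official installer script or Homebrew; this is just one possible route, so check the documentation for your platform:

# Option 1: Helm’s official installer script
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh

# Option 2: Homebrew (macOS/Linux)
brew install helm

# Verify the installation
helm version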

2. Add the Helm chart repository for Airflow

Use the command below to add the official Airflow chart repository:

helm repo add apache-airflow https://airflow.apache.org

3. Update your Helm repository

Use the command below to make sure you have the latest version of the chart:

helm repo update

4. Customize your installation and install Airflow

The Helm chart comes with default values that might not fit your needs. You can override these values with a custom YAML file:

# values.yaml  
# Airflow executor  
executor: "KubernetesExecutor"

Use the command below to create namespace airflow:

kubectl create namespace airflow && kubectl config set-context --current --namespace=airflow

Use the command below to install Airflow in namespace airflow:

helm upgrade --install airflow apache-airflow/airflow --namespace airflow -f values.yaml

Once the installation completes, you should see results like this:

-> kubectl get pods,svc -n airflow                        
  
NAME                                     READY   STATUS    RESTARTS   AGE  
pod/airflow-postgresql-0                 1/1     Running   0          1d  
pod/airflow-scheduler-8598d7458f-2bw44   3/3     Running   0          1d17h  
pod/airflow-statsd-665cc8554c-6jqc4      1/1     Running   0          1d  
pod/airflow-triggerer-0                  3/3     Running   0          1d17h  
pod/airflow-webserver-77cd74fb86-2xvhv   1/1     Running   0          1d17h  
  
NAME                            TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)             AGE  
service/airflow-postgresql      ClusterIP   10.72.3.250    <none>        5432/TCP            1d  
service/airflow-postgresql-hl   ClusterIP   None           <none>        5432/TCP            1d  
service/airflow-statsd          ClusterIP   10.72.5.171    <none>        9125/UDP,9102/TCP   1d  
service/airflow-triggerer       ClusterIP   None           <none>        8794/TCP            1d  
service/airflow-webserver       ClusterIP   10.72.10.125   <none>        8080/TCP            1d

Now, you can access the Airflow UI using kubectl port-forward:

kubectl port-forward svc/airflow-webserver 8080:8080 -n airflow

The web server can now be accessed on localhost:8080. The default credentials are username admin and password admin.

Airflow UI

Automatically pull Airflow DAGs from a private GitHub repository with the git-sync feature

In the dynamic world of data engineering and workflow automation, staying agile and organized is essential. Imagine having your DAGs always up-to-date, seamlessly accessible by your team, and securely stored in a version-controlled environment. That’s where the magic of automatically pulling Airflow DAGs from a private GitHub repository comes into play.

1. Creating a private git repository and setting up the connection

To synchronize your Airflow DAGs, you need a code repository to host them. You are free to choose any Git hosting service, but this guide uses GitHub.

First, create a private repository to hold your DAGs:

airflow-on-k8s  
  └──airflow  
      └──dags  
          └── example_bash_operator.py

After this, create a deploy key for your repository to enable SSH access. You can generate an SSH key pair using ssh-keygen as shown below:

-> ssh-keygen -t rsa -b 4096 -C "your-mail@gmail.com"  

Generating public/private rsa key pair.  
Enter file in which to save the key (/Users/hungnguyen/.ssh/id_rsa): airflow_ssh_key  
Enter passphrase (empty for no passphrase):  
Enter same passphrase again:  
Your identification has been saved in airflow_ssh_key  
Your public key has been saved in airflow_ssh_key.pub

Now that you have generated the key pair (airflow_ssh_key and airflow_ssh_key.pub), go to your GitHub repository, open Settings, and find ‘Deploy keys’:

To add a new deploy key, you need the contents of the public key you just generated, which is stored in airflow_ssh_key.pub. Copy it and paste it into GitHub to create your deploy key:
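For example, you can print the public key and copy the full output:

# Print the public key; paste the output into GitHub → Settings → Deploy keys.
cat airflow_ssh_key.pub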

After creating the deploy key, you need to create a Kubernetes secret inside your cluster. To do that, create a YAML file and apply it with kubectl:

# airflow-ssh-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: airflow-ssh-secret
  namespace: airflow
data:
  gitSshKey: <contents of running 'base64 airflow_ssh_key'> # modify here

Then apply it:

kubectl apply -f airflow-ssh-secret.yaml
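If you would rather not paste the base64 string by hand, the commands below are an equivalent shortcut, assuming the key files are in your current directory:

# Encode the private key for the gitSshKey field above
# (GNU coreutils; on macOS use 'base64 -i airflow_ssh_key' instead).
base64 -w 0 airflow_ssh_key

# Or skip the YAML file entirely and let kubectl build the same secret:
kubectl create secret generic airflow-ssh-secret \
  --from-file=gitSshKey=airflow_ssh_key \
  -n airflow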

After applying the secret, you can verify it from the command line:

kubectl describe secret airflow-ssh-secret -n airflow

2. Editing the Airflow Helm values file to configure the GitSync feature

Now that you have created a git repository with a deploy key and a Kubernetes secret using kubectl CLI, it’s time to edit the YAML file that is used to configure the Airflow deployment.

# values.yaml
dags:
  gitSync:
    enabled: true
    repo: git@github.com:hungngph/airflow-on-k8s.git
    branch: main
    subPath: "airflow/dags"
    sshKeySecret: airflow-ssh-secret

To apply the changes, just run the command:

helm upgrade airflow apache-airflow/airflow --namespace airflow --values values.yaml

This command will deploy Airflow using the configuration settings inside the values.yaml file.

And that’s it! Every time you change a DAG locally and push the commit to your Git repository, git-sync will automatically pull the changes into the Airflow DAGs folder.
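If new DAGs do not appear, the git-sync sidecar logs are the first place to look. A quick check, assuming the chart’s default sidecar container name git-sync:

# Tail the git-sync sidecar in the scheduler pod to confirm the repository
# is being cloned and refreshed.
kubectl logs deploy/airflow-scheduler -c git-sync -n airflow --tail=50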

Integrate Google Cloud Storage for remote logging

Airflow tasks on Kubernetes run in pods, which are transient and can be created or terminated on demand. If remote logging isn’t set up, you risk being unable to view the logs of your tasks, or losing them entirely, once the pods are terminated. Setting up remote logging ensures the logs persist beyond the lifespan of the individual pods.

1. Creating a GCP service account

Create a service account and grant it the Storage Object Admin role. Then, generate a JSON key named “k8s-services-airflow-sc.json” from this service account.
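A sketch of these steps with the gcloud CLI; the service account name airflow-logs and the PROJECT_ID placeholder are examples, and the GCS bucket must already exist and match the remote_base_log_folder configured below:

# Create a service account for Airflow remote logging (the name is an example).
gcloud iam service-accounts create airflow-logs \
  --display-name="Airflow remote logging"

# Grant it the Storage Object Admin role on the project.
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:airflow-logs@PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/storage.objectAdmin"

# Download a JSON key for the service account.
gcloud iam service-accounts keys create k8s-services-airflow-sc.json \
  --iam-account=airflow-logs@PROJECT_ID.iam.gserviceaccount.com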

Next, create a Kubernetes secret from the JSON key (k8s-services-airflow-sc.json) with this command:

kubectl create secret generic sc-key --from-file=key.json=/<path>/<to>/<sc>/k8s-services-airflow-sc.json -n airflow

2. Editing the Airflow Helm values file to configure the remote logging feature

# values.yaml  
# Environment variables for all airflow containers  
env:  
  - name: GOOGLE_APPLICATION_CREDENTIALS  
    value: "/opt/airflow/secrets/key.json"  
  
# Airflow scheduler settings  
scheduler:  
  extraVolumeMounts:  
    - name: google-cloud-key  
      mountPath: /opt/airflow/secrets  
  extraVolumes:  
    - name: google-cloud-key  
      secret:  
        secretName: sc-key  
  
# Airflow webserver settings  
webserver:  
  extraVolumeMounts:  
    - name: google-cloud-key  
      mountPath: /opt/airflow/secrets  
  extraVolumes:  
    - name: google-cloud-key  
      secret:  
        secretName: sc-key

# Airflow triggerer settings  
triggerer:  
  extraVolumeMounts:  
    - name: google-cloud-key  
      mountPath: /opt/airflow/secrets  
  extraVolumes:  
    - name: google-cloud-key  
      secret:  
        secretName: sc-key  
  
config:  
  logging:  
    remote_logging: 'True'  
    remote_base_log_folder: 'gs://airflow/logs/'  
    remote_log_conn_id: 'sc-key'  
    google_key_path:  "/opt/airflow/secrets/key.json"  

Note that with the KubernetesExecutor, task logs are uploaded by the task pods themselves, so you will likely also need to add the same extraVolumes and extraVolumeMounts under the chart’s workers section, which configures the worker pod template. To apply the changes, just run the command:

helm upgrade airflow apache-airflow/airflow --namespace airflow --values values.yaml

This command will deploy Airflow using the configuration settings inside the values.yaml file.

Now, you can view remote logs from GCS in the Airflow UI:
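You can also confirm from the command line that log files are landing in the bucket configured in remote_base_log_folder (gs://airflow/logs/ in the values.yaml above):

# List task log files written to the remote log folder.
gsutil ls -r gs://airflow/logs/ | head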

Additionally, you can look at this page for things to consider when running this Airflow Helm chart in a production environment.

Conclusion

Deploying Airflow on Kubernetes with Helm is a powerful combination that brings scalability, resilience, and efficiency to your data pipelines. It leverages the best of both worlds: the workflow management capabilities of Airflow and the container orchestration capabilities of Kubernetes. With the added benefits of using the Kubernetes executor, you can rest assured that your data pipelines will be robust and reliable.
