DataOps is set to change the traditional data development mindset. It is no longer a nice-to-have; data projects can fail if DataOps principles are not adopted. A local development environment for data work can be extremely complex, costly, and time-consuming to build, especially if we want to simulate a production-like environment. Developers not only have to simulate the environment but also have to take on data factory responsibilities.
When is it useful to have a production-like environment locally? Today’s orchestrators are spreading widely mainly because they can scale automatically depending on load and implementation. Testing these capabilities using the same production code locally requires a rock-solid, easily configurable development environment. And this is especially true for Airflow.
In this blog post, I’m going to show you how we create such an environment at Infinite Lambda.
Let’s talk about production environments
Apache Airflow is still one of the most widely used orchestration tools within the data industry. Nowadays, developer teams tend to use one flexible, elastic system to deploy their Airflow instances on the cloud. Of course, I’m talking about Kubernetes.
Kubernetes offers everything that your Airflow instance requires to scale according to the scheduled tasks in an easily configurable and safe way. On its own, Airflow is a flexible orchestrator tool that allows you to create impressively complex workflows. When such a tool is paired with Kubernetes, your imagination is the limit to what you can do with different executors and config options.
Yet, even when you have already set up a fully functioning, proper production environment, you still need to create an almost perfect copy of this production environment. Your options include doing everything again from scratch and creating another cluster in the same way you did for the production cluster. This obviously has many downsides. If you choose this solution, you will need to pay for two clusters even though one of them is solely used for development. Moreover, the developers will have to take turns testing their DAGs.
Let’s also not forget that you need to either implement an automated CI/CD pipeline to deploy to the cluster or do it manually. If you have a really amazing DevOps team, they can implement a solution that enables the automatic creation of a new cluster for every feature branch.
None of these solutions is cost- or time-efficient. You need to give team members an environment where their work cannot affect the rest of the organisation. At the same time, you want to do this quickly and with as few resources as possible.
This is when Minikube comes to the rescue.
Why can Minikube be the perfect dev environment for you?
If you want to avoid having to pay for a dev cluster, worrying about storing the dev images, and asking the DevOps team to build a deployment pipeline solely for development purposes, you need Minikube.
Minikube is a fully functioning Kubernetes cluster. The difference is that by default it has only one node, although it can now be configured to create a two-node cluster during the init process. At the time of this article, the two-node option is still experimental.
The default two CPUs and two gigabytes of memory might not be enough for you. Don’t worry, you can configure this as well during the init process like this: minikube start --memory 5120 --cpus=4.
If you have Docker Desktop installed, you have to raise the resource limits there first; otherwise you cannot increase the default memory, for example. The Minikube documentation explains how to configure the multi-node option.
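As a minimal sketch of the start-up options above, the helper below assembles a minikube start command from the desired resources. The function name and its defaults are our own illustration, not part of the repository; only the --memory, --cpus and --nodes flags come from Minikube itself.

```python
import shlex


def minikube_start_command(memory_mb=5120, cpus=4, nodes=1):
    """Return the argument list for starting Minikube with custom resources.

    Hypothetical helper for illustration; the defaults mirror the example
    command in the text.
    """
    if memory_mb <= 0 or cpus <= 0 or nodes <= 0:
        raise ValueError("memory_mb, cpus and nodes must be positive")
    command = ["minikube", "start", f"--memory={memory_mb}", f"--cpus={cpus}"]
    if nodes > 1:
        # Multi-node clusters are requested with the --nodes flag.
        command.append(f"--nodes={nodes}")
    return command


if __name__ == "__main__":
    # Print the command so it can be reviewed before running it in a shell.
    print(shlex.join(minikube_start_command()))
```

Printing the command rather than executing it keeps the sketch safe to run on a machine where Minikube is not installed yet.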
Here are some of the benefits:
- Supports the latest Kubernetes release (+6 previous minor versions)
- Cross-platform (Linux, macOS, Windows)
- Deploy as a VM, a container, or on bare-metal
- Multiple container runtimes (CRI-O, containerd, docker)
- Docker API endpoint for blazing-fast image pushes
- Advanced features such as LoadBalancer, filesystem mounts, and FeatureGates
- Addons for easily installed Kubernetes applications
Minikube allows you to push/pull images from a private cloud registry (GCR, ECR, etc). Alternatively, with a few commands, you can set up your local registry and use it during the development to build as many images as you want and store them for free.
The add-ons help you make the development process more efficient. For example, the minikube dashboard command opens a local web page that shows you useful information about your cluster.
It might not be as complex or sophisticated as Grafana or Prometheus but you can draw insights into your deployment without taking extra steps by simply using the add-ons. You can still deploy Grafana or Prometheus to Minikube the same way you would on a production cluster.
First, let us check the Minikube logs to see what is happening inside your cluster. To do that, use the minikube logs -f command. To check what is going on with any of your pods, you can use kubectl describe pod <pod-name>. If your webserver or scheduler is not working and you need more information, you can get the logs with kubectl logs <pod-name> -c <container> -f, where <container> is web or scheduler.
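If you find yourself typing these commands often, a small wrapper can help. The sketch below only builds the kubectl argument lists; the function names and the default namespace are assumptions made for illustration.

```python
def describe_pod_command(pod_name, namespace="default"):
    """Build the kubectl command that describes a single pod."""
    return ["kubectl", "describe", "pod", pod_name, "--namespace", namespace]


def follow_logs_command(pod_name, container=None, namespace="default"):
    """Build the kubectl command that streams the logs of a pod.

    `container` selects one container inside the pod, for example the
    webserver or the scheduler container.
    """
    command = ["kubectl", "logs", pod_name, "--namespace", namespace, "--follow"]
    if container:
        command += ["--container", container]
    return command


# Hypothetical usage (requires kubectl and a running cluster):
# import subprocess
# subprocess.run(follow_logs_command("<pod-name>", container="scheduler"), check=True)
```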
In order to get the logs from your DAGs to make sure you can debug your code, you can choose among three options.
At Infinite Lambda, we use a tool called Lens which is the perfect Kubernetes IDE. It provides a free platform that is easy to use, helping you monitor your cluster and interact with it in a comfortable way. If you work with Kubernetes clusters, Lens is a must-have tool.
The second option is how this deployment operates by default: the repository uses the local logging option. In the background, we use NFS and volumes to make this work.
Last, but not least, since you are already using a cloud provider and possibly storing the logs in an S3 or GCS bucket, there is a way to create a new folder for the dev-logs and store everything there.
For different logging options, don’t forget to check the airflow.cfg in the docker/base image and modify it according to the chosen method.
1.) Log reading with Lens
Just select your pod and then click on the first icon from the left in the top right corner.
2.) Persistent Volume Logging
Kubernetes offers different kinds of Persistent Volumes and each of them is good for a specific purpose. There are different access modes available for different types of volumes. This time, we are going to use the NFS volume, which allows us ReadWriteMany access. For this logging option, you will find a separate folder in the repository, called local-logging.
There are some yml files but don’t worry; just follow the steps inside the README file and you will not have any problems. For this option, what we will do first is deploy a Helm chart with an NFS Server Provisioner.
The next step is to deploy a Persistent Volume Claim. With this solution, we provide an NFS Storage Class to our Kubernetes application that supports all three access modes for our persistent volume, which makes it really flexible. If you use the KubernetesPodOperator, modify your DAG file based on the provided pod_operator_dag.py example.
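To give an idea of what such a claim looks like, here is a sketch that generates a Persistent Volume Claim manifest with ReadWriteMany access. The claim name, storage class name and size are placeholders we made up; the actual values live in the yml files of the local-logging folder.

```python
import json


def nfs_log_claim(name="airflow-logs", storage_class="nfs", size="1Gi"):
    """Return a PersistentVolumeClaim manifest as a plain dictionary.

    All default values are hypothetical placeholders.
    """
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "storageClassName": storage_class,
            # ReadWriteMany lets several pods mount the volume for writing.
            "accessModes": ["ReadWriteMany"],
            "resources": {"requests": {"storage": size}},
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON manifests as well as YAML ones.
    print(json.dumps(nfs_log_claim(), indent=2))
```

The printed JSON can be piped into kubectl apply -f - once the NFS provisioner is running.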
3.) Remote logging option
For the remote logging option, you first need to modify the airflow.cfg file: set the remote_logging variable to True and set the value for remote_base_log_folder. Once you have deployed the Airflow application to the cluster, log in to the Airflow UI and create the log connection.
- For AWS
conn id: MyLogConn (as set in the base image airflow.cfg)
conn type: S3
host: bucket name
- For GCP
conn id: MyLogConn (as set in the base image airflow.cfg)
conn type: Google Cloud Platform
project id: GCP project name (depends on the project)
Keyfile JSON: the service_account.json (upload the actual JSON file)
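The airflow.cfg change for remote logging can also be scripted. The sketch below is one possible way to do it with Python's configparser, assuming the settings live in a [logging] section (older Airflow versions keep them under [core], so adjust the section argument accordingly). The bucket path in the usage example is hypothetical.

```python
import configparser
import io


def enable_remote_logging(cfg_text, base_log_folder,
                          conn_id="MyLogConn", section="logging"):
    """Return airflow.cfg text with the remote logging options switched on.

    Note: configparser drops comments when it rewrites the file.
    """
    # Interpolation is disabled so that literal '%' characters survive.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    if not parser.has_section(section):
        parser.add_section(section)
    parser.set(section, "remote_logging", "True")
    parser.set(section, "remote_base_log_folder", base_log_folder)
    parser.set(section, "remote_log_conn_id", conn_id)
    output = io.StringIO()
    parser.write(output)
    return output.getvalue()


if __name__ == "__main__":
    original = "[logging]\nremote_logging = False\n"
    # "s3://example-bucket/dev-logs" is a made-up location for illustration.
    print(enable_remote_logging(original, "s3://example-bucket/dev-logs"))
```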
Environment Variables for Cloud Access
If your script is using any cloud resources, such as reading/writing into buckets and/or reading secrets from the secrets store, you need to provide the credentials to these platforms.
- For GCP
There is an add-on available to help you set up GCP auth. Here is the link: https://minikube.sigs.k8s.io/docs/handbook/addons/gcp-auth/
- For AWS
You need to provide the AWS keys to your pods. To do this, we are going to set the credential keys as environment variables during the deployment process. These will exist in the helm/files/secrets/airflow folder as AWS_SECRET_ID and AWS_SECRET_KEY. After this, we can use them inside our DAG files and pass them to the KubernetesPodOperators. You can always check the example DAG in the airflow_dags folder and use it as a template.
import os

# somewhere before the PodOperator definition
secret_id = os.getenv("AWS_ACCESS_KEY_ID")
secret_key = os.getenv("AWS_SECRET_ACCESS_KEY")
# an extra argument for the PodOperator
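One way to hand these values over to a pod is the env_vars argument of the KubernetesPodOperator. The following is a sketch under that assumption, not the exact code from the example DAG; the helper name aws_env_vars is ours.

```python
import os

AWS_VARIABLES = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def aws_env_vars(environ=None):
    """Collect the AWS credentials from the environment as a dictionary.

    Raises RuntimeError when a variable is missing or empty, so the problem
    surfaces when the DAG is parsed rather than inside the pod.
    """
    environ = os.environ if environ is None else environ
    missing = [name for name in AWS_VARIABLES if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing AWS credentials: " + ", ".join(missing))
    return {name: environ[name] for name in AWS_VARIABLES}


# Hypothetical usage inside a DAG file (requires Airflow and its Kubernetes
# provider); the other operator arguments are omitted here:
# task = KubernetesPodOperator(..., env_vars=aws_env_vars())
```

Failing early on a missing key is a design choice; returning None values, as os.getenv does, would postpone the error until the task runs.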
Let’s get started
Here are the prerequisites:
1.) Install kubectl
To install this tool properly, follow the steps in the official guide. Alternatively, on macOS, simply type:
brew install kubectl
2.) Install helm
To install this tool properly, follow the steps in the official guide. Alternatively, on macOS, simply type:
brew install helm
3.) Install Minikube
To install Minikube on your machine, follow the steps in the official guide. Alternatively, on macOS, simply type:
brew install minikube
4.) Get the Airflow on Minikube repository
To get the template please visit the following git repository and clone the project.
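Before moving on, it can be worth confirming that the three command-line tools are actually on your PATH. A minimal sketch of such a check follows; the function name is our own.

```python
import shutil

REQUIRED_TOOLS = ("kubectl", "helm", "minikube")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on the PATH.

    `which` can be swapped out, which makes the check easy to test.
    """
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All prerequisites are installed.")
```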
How to use it with Airflow DAGs?
As you will see after cloning the repo, there is a detailed, almost step-by-step guide to help with the deployment. All the Airflow-related requirements and dependencies can be found in the base folder as a separate Docker image. If you don’t want to change the default Airflow parameters, you can go straight to the docker/dag folder, where your custom environment is going to be hosted. Whether you have a Docker image, DAG descriptor .py files, or both, this is where they belong, since the folder follows a standard Airflow structure.
Following the README’s steps, you can go ahead with the deployment. Alternatively, if you want to do this easily and quickly, you can use MiniDeployer, a simple CLI tool designed to make your deployment more comfortable. A few manual steps remain, but using MiniDeployer reduces the number of possible errors.
There is a wide range of tools that you can use to create a local Kubernetes environment. Yet, Minikube is among the most widely used ones because of what it offers you out of the box and the fact that it can be configured in many different ways.
You may be new to Kubernetes and Apache Airflow and want to experiment with them, or you may actually need a good development environment to test your code in a safe and cost-effective way. Either way, this post and the repository are a good place to start.
Get in touch to leave us feedback; we’d love to hear from you.