Provisioning a Kubernetes cluster on Amazon EKS — Part II

Alexandre Furlan
10 min read · Feb 22, 2023


AWS EKS, Terraform, and Airflow

Photo by Miguel Ausejo on Unsplash

1 — Introduction

In part I of this article, we covered the fundamentals of Kubernetes: how the project and its architecture came about, some of its core components, and how to interact with a local cluster. At that time, we created a local cluster with the k3d tool and used the kubectl CLI (command-line interface) to explore and manage it. Although very illustrative, in professional settings cloud services are frequently used to host the Kubernetes cluster. In this part, we will continue our learning and bring the experience closer to real environments: we will use Infrastructure as Code (IaC) to provision a Kubernetes cluster on AWS and then deploy Apache Airflow on it.

2 — Infrastructure as Code

The use of IaC for provisioning infrastructure on cloud providers is a widespread practice in Data Engineering. The main cloud providers have their own offerings, such as AWS CloudFormation, GCP Deployment Manager, and Azure Resource Manager. While all of them serve their purpose well, open-source alternatives have gained a lot of visibility in recent years, largely because they can provision infrastructure on any cloud provider; for this reason, they are also called cloud-agnostic tools. Terraform is one of the main representatives of this group. Maintained by HashiCorp, it uses a declarative language to describe infrastructure. It is not the intention of this text to go deep into Terraform; for those not yet familiar with the tool, the documentation is quite extensive and straightforward. You can also read Gabriel Bonifacio’s “IaC com Terraform — Start and Hands-on” (PT-BR).

Provisioning some cloud services can be a bit complex due to the large number of interdependent resources. To ease the deployment process, the Terraform ecosystem provides more robust building blocks called modules. A Terraform module is a set of configuration files grouped to perform a specific task. EKS, VPC, and RDS are some examples of AWS services that already have community-maintained Terraform modules.

Suppose you want to deploy an RDS instance on AWS. Done by hand, this involves declaring not only the database instance itself but also supporting resources such as subnet groups, parameter groups, and security groups. Using the aws-rds module, you can achieve the same result with far fewer .tf files. The EKS module is no different. In the next section, we will give an overview of the Terraform files used in this project; for more details, see the GitHub repository.

2.1 — Terraform files for infrastructure

Note that vpc_id and subnets are passed as variables: they already exist in the AWS account used for this exercise. It is quite common to deploy a dedicated VPC for the cluster; however, in this project we had already reached the VPC limit for the AWS region, so we took the values of vpc_id and subnets from the AWS console. The variable cluster_version is the Kubernetes version on AWS, in this case 1.21.

Figure 1: eks-cluster.tf

In the worker-group configuration, root_volume_type is set to general purpose (gp2) storage, and we use two machines of type t2.medium (2 vCPUs and 4 GiB of memory). The t2 family consists of general-purpose AWS instances and fits the purposes of this case very well.
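Since Figure 1 is a screenshot, here is a minimal sketch of what eks-cluster.tf looks like in spirit. It assumes the community terraform-aws-modules/eks/aws module with the 17.x worker_groups syntax; the variable names are illustrative and the exact arguments depend on the module version you pin.

```hcl
# Minimal sketch of eks-cluster.tf (illustrative, not the exact file from the repo)
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 17.0"                      # assumption: a 17.x release supporting Kubernetes 1.21

  cluster_name    = var.cluster_name
  cluster_version = var.cluster_version    # "1.21"

  vpc_id  = var.vpc_id                     # pre-existing VPC, passed as a variable
  subnets = var.subnets                    # pre-existing subnets, passed as a variable

  worker_groups = [
    {
      instance_type        = "t2.medium"   # 2 vCPUs, 4 GiB
      asg_desired_capacity = 2             # two worker nodes
      root_volume_type     = "gp2"         # general purpose SSD
    }
  ]
}
```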

Another important file is storage.tf (see Figure 2). It is used to provision the bronze and silver zone buckets of our infrastructure.

Figure 2: storage.tf

Note that the bucket declaration has a slightly different structure. This is because the bucket_names and bucket_function variables are defined as lists:

bucket_names = ["bronze", "silver"]
bucket_function = ["lake", "logs"]

When we index bucket_function at 0, we are only provisioning the buckets with the lake function (i.e., belonging to the data lake). For the purposes of this project we will not provision the log buckets, although doing so is strongly recommended; Airflow logs, for example, could be stored there. The bucket names are generated from the bucket_names list: note that count.index is passed in the bucket argument, iterating over the indices of the list, as sketched below. To access the other files, visit the repository.
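Figure 2 is also a screenshot, so here is a hedged sketch of the corresponding storage.tf logic; the resource name and the bucket naming scheme are assumptions, but the count/count.index pattern is the one described above.

```hcl
# Sketch of storage.tf (names and naming scheme are illustrative)
variable "bucket_names" {
  default = ["bronze", "silver"]
}

variable "bucket_function" {
  default = ["lake", "logs"]
}

resource "aws_s3_bucket" "lake" {
  count = length(var.bucket_names)

  # Only the "lake" function (index 0) is provisioned here,
  # producing e.g. "bronze-lake" and "silver-lake".
  bucket = "${var.bucket_names[count.index]}-${var.bucket_function[0]}"
}
```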

2.2 — Building the infrastructure

Once we have all the Terraform files, we can proceed in the usual way with terraform init. In this step, Terraform initializes the providers and downloads all dependencies of the EKS module from its “terraform-aws-modules/eks/aws” source. As shown in Figure 3, the init is successful.

Figure 3: terraform init outputs.

Then comes terraform validate,

Figure 4: terraform validate outputs.

and then terraform plan. In the plan step, we pass a .tfvars file as an argument (see Figure 5). The .tfvars files are very useful when we have different environments, for example homologation (staging) and production: we can use a homolog.tfvars with a more modest structure and a production.tfvars describing the more robust infrastructure of the production environment.
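For reference, the command sequence used in this section looks like the following; the .tfvars file name simply follows the homologation example above and may differ in your project.

```bash
terraform init                              # download providers and the EKS module
terraform validate                          # check the configuration for errors
terraform plan -var-file="homolog.tfvars"   # preview the changes for this environment
terraform apply -var-file="homolog.tfvars"  # create the resources after reviewing the plan
```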

Figure 5: Beginning of terraform plan output

Figure 6: End of terraform plan output

Once we have reviewed and approved the execution plan (see Figures 5 and 6), we can create the resources with terraform apply (see Figure 7).

Figure 7: End of terraform apply output

Perfect! The infrastructure (buckets and Kubernetes cluster) has been deployed successfully. The provisioned resources can be seen in the AWS console.

Figure 8: EKS and S3 AWS console

As declared in the eks-cluster.tf file, two worker nodes were created.

Figure 9: EKS cluster nodes

The infrastructure of our project has been deployed correctly, and we now have an active Kubernetes cluster to work with. In this AWS account there is only one EKS cluster, as shown in the image below.

Figure 10: Listing EKS clusters from AWS account

We now need to associate our local machine with the EKS cluster. In other words, we need to add this cluster's parameters to the ~/.kube/config file on the local machine, which is done via the AWS CLI as shown in Figure 11. We can also say that a new context has been added to the machine.

Figure 11: Adding a new context to my local machine
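Figures 10 and 11 show the AWS CLI steps; the commands below are the standard ones, with the region and cluster name as placeholders you should replace with your own values.

```bash
# List the EKS clusters available in the account/region (Figure 10)
aws eks list-clusters --region <your-region>

# Merge the cluster's credentials and context into ~/.kube/config (Figure 11)
aws eks update-kubeconfig --name <your-cluster-name> --region <your-region>
```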

With the context added to the kubeconfig, we can use kubectl to manage our cluster. Note that the nodes match the ones listed in the AWS console.

Figure 12: Nodes of cluster

The VERSION column (v1.21) reflects the cluster_version variable in eks-cluster.tf. The command kubectl describe nodes can also be used to inspect the nodes in more detail.
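The checks mentioned above boil down to two standard kubectl commands:

```bash
# List the worker nodes and the Kubernetes version they run (Figure 12)
kubectl get nodes

# Detailed per-node information: capacity, labels, conditions, allocated resources
kubectl describe nodes
```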

At this point, we have the cluster ready for use. Let's deploy one of the most widely used tools in Data Engineering: Apache Airflow.

4 — Deploying Airflow

For convenience, I created an alias: alias k=kubectl. For all intents and purposes, k and kubectl are the same thing. The first step is to create a namespace dedicated to Airflow, as shown in Figure 13.

Figure 13: Creating namespace for Airflow

We see that the airflow namespace has been created successfully. As nothing has been deployed in this cluster yet, the namespace is empty at this point.

Figure 14: Listing all resources in Airflow namespace
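The two steps shown in Figures 13 and 14 correspond to the commands below, using the k alias mentioned above.

```bash
alias k=kubectl

# Create a namespace dedicated to Airflow (Figure 13)
k create namespace airflow

# Nothing has been deployed yet, so the namespace is empty (Figure 14)
k get all -n airflow
```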

To define, install, and upgrade Kubernetes applications, we frequently use Helm, a package manager for Kubernetes (fittingly, “Kubernetes” is Greek for “helmsman”). A Helm chart is a package of one or more Kubernetes YAML manifests that lets the user deploy an application and all of its dependencies more easily. The Apache Foundation provides an official Helm chart for Airflow. To use it, we first need to add the repository,

Figure 15: Add Airflow repository

and then save the chart's default values to a values.yaml file, which we will customize for Airflow.

Figure 16: Saving the default values file as values.yaml
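Figures 15 and 16 correspond to the standard Helm commands for the official Airflow chart:

```bash
# Add the official Apache Airflow chart repository (Figure 15)
helm repo add apache-airflow https://airflow.apache.org
helm repo update

# Dump the chart's default values into a local file we can customize (Figure 16)
helm show values apache-airflow/airflow > values.yaml
```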

Note that we have downloaded the values file in its default form; we need to customize it according to our needs. From now on, I ask you to follow this text with the project repository at hand; I will always refer to line numbers in the k8/airflow/values.yml file. The first change is the executor (line 230): by default it is CeleryExecutor, but we want KubernetesExecutor. The next step is to set some environment variables to enable remote logging (see line 239). In the variable AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER we pass the S3 URI of the location that will store the Airflow logs (see line 243). The last environment variable to set is the connection ID (AIRFLOW__LOGGING__REMOTE_LOG_CONN_ID), which we will call aws_conn.
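The excerpt below sketches these first changes. The names are the standard Airflow configuration environment variables; the S3 URI is an illustrative placeholder, and line numbers vary between chart versions.

```yaml
# values.yaml (excerpt): executor and remote logging
executor: "KubernetesExecutor"

env:
  - name: "AIRFLOW__LOGGING__REMOTE_LOGGING"
    value: "True"
  - name: "AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER"
    value: "s3://<your-logs-bucket>/airflow-logs"   # placeholder URI
  - name: "AIRFLOW__LOGGING__REMOTE_LOG_CONN_ID"
    value: "aws_conn"
```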

The next thing we'll change is the fernetKey. Airflow uses the Fernet key to encrypt sensitive data, such as connection passwords, stored in its metadata database. To generate one, we can write a small Python script; the cryptography library already provides this type of key.

Figure 17: Generating a Fernet key
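A minimal version of that snippet, using the cryptography package:

```python
# generate_fernet_key.py (requires: pip install cryptography)
from cryptography.fernet import Fernet

# Copy the printed key into the fernetKey field of values.yaml
print(Fernet.generate_key().decode())
```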

The next set of changes is in the defaultUser block of the Airflow webserver section (line 814). The default values will be changed as needed, except for the password, which we will change on our first access to Airflow. Moving on, we go to the service block on line 914. The default value for type is ClusterIP; however, that is only appropriate for a local deployment, where the service is exposed through a port-forward on our own machine. That is not the case here: we want to expose this service to the internet so that we can access it through a URL. For this, we change the value to LoadBalancer.

The next change is in the Redis block. The enabled field is set to true by default, but we do not need it here: since we set the KubernetesExecutor, Redis (used as the Celery broker) is not required. The last, and very important, change is in the dags block (see line 1554), more specifically in gitSync (see line 1569). We set enabled to true: the Airflow DAGs will live in a Git repository, so we want to keep the cluster synchronized with it. On line 1576 we change the repository to the one that will contain the DAGs and then set the branch we are working on, in this case the main branch.

In this case, I am using a public repository; however, if I had private repositories (and this is strongly recommended), I would need to create secrets as shown in lines 1593 and 1594. I would also need to uncomment line 1597 and reference the secrets stored in the cluster. For the purposes of this case, I will not use private keys.
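Putting the webserver, Redis, and gitSync changes together, the relevant parts of values.yaml look roughly like this; the repository URL is a placeholder and the exact key paths may shift between chart versions.

```yaml
# values.yaml (excerpt): webserver, Redis and DAG synchronization
webserver:
  defaultUser:
    enabled: true
    username: admin            # adjust as needed; reset the password on first login
  service:
    type: LoadBalancer         # expose the webserver externally instead of ClusterIP

redis:
  enabled: false               # not needed with the KubernetesExecutor

dags:
  gitSync:
    enabled: true
    repo: https://github.com/<your-user>/<your-dags-repo>.git   # placeholder
    branch: main
    # For private repositories, reference a Kubernetes secret here
    # (e.g. credentialsSecret or sshKeySecret), as discussed above.
```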

At this point, the values file is configured and we can deploy. The deployment is done with the Helm CLI, which works like this,

helm install <release name> <repo>/<chart> -f <values file> -n <namespace>
Figure 18: Deploying Airflow with helm install
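A concrete instance of that command, assuming a release called airflow, our customized values file, and the namespace created earlier:

```bash
helm install airflow apache-airflow/airflow -f values.yaml -n airflow
```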

The deployment seems to have gone through without problems. Let's take a look at the application pods.

Figure 19: Showing the Airflow Pods
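The listing in Figure 19 comes from a standard kubectl call, using the namespace created earlier:

```bash
kubectl get pods -n airflow
```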

Perfect! All pods are running without problems. To access Airflow, we need the URL that was generated during the deployment; remember that this is possible because we set the service type to LoadBalancer in the values file. The external URL belongs to a Kubernetes Service and can be retrieved with the command below.
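As with the other checks, this is a standard kubectl call; the namespace name follows this article's setup.

```bash
# The webserver Service of type LoadBalancer gets an external hostname (Figure 20)
kubectl get svc -n airflow
```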

Figure 20: Output of services in Airflow
Figure 21: Airflow webserver

Remember that it is good practice to change the password on first access. To do this, go to the Security tab > List Users > Show Record > Reset Password. The next step is to add the aws_conn connection (the value of AIRFLOW__LOGGING__REMOTE_LOG_CONN_ID). This connection will allow Airflow to save its logs in AWS. This is done under Admin > Connections > Add a new connection.

Figure 23: Adding AWS connection

To establish the connection to our AWS account, we need to fill in the login and password fields with the aws_access_key_id and aws_secret_access_key values, respectively.

Figure 24: AWS connection added
Figure 25: S3 UI

5 — Conclusions

In Part II of our case, we got hands-on with provisioning a Kubernetes cluster on AWS using Terraform. We started with a short introduction to Infrastructure as Code, then talked about Terraform and saw how to deploy a Kubernetes cluster on EKS, checking the details of the cluster through the AWS console. We then talked a little about Helm and downloaded the official Airflow Helm chart, changed the values file to meet the needs of this case, and deployed the tool through the Helm CLI. Finally, we accessed the Airflow webserver and configured the AWS connection.


Alexandre Furlan

Physicist and Data Engineer. I intend to use this space to write about Data Science, Big Data, programming, and technology. Let's talk data!