Automating performance marketing workflows at Miro with AWS EKS

Nikita Gudkov · Miro Engineering · Jun 14, 2023

In this post, find out why and how Miro’s performance marketing team improved efficiency by moving its workloads from one AWS service to another.

Why we left AWS Lambda behind

One of the primary objectives of Miro’s performance marketing team, where I work as an engineer, is to acquire new users and paid plan subscribers through paid advertising, brand marketing, and affiliate advertising. Because these tasks involve a lot of manual work, we strive to automate as much as possible. As a result, we had been using AWS Lambda to execute multiple daily tasks.

Recently, however, we’ve had to consider moving from AWS Lambda to another service that could withstand our heavy workloads. The main bottleneck was that Lambda functions time out after just 15 minutes. While some of our jobs don’t need to process much data, others rely on data from our data warehouse, which can exceed tens of millions of rows. Processing that amount of data and sending the result over HTTP is time-consuming.

Our goal was to be able to run scheduled jobs and create them programmatically. After weighing the alternatives to AWS Lambda, we considered two other native AWS solutions: Elastic Container Service (ECS) and Elastic Kubernetes Service (EKS).

Pros and cons of AWS ECS

Amazon ECS tasks let us run Docker images and schedule runs using Amazon CloudWatch.

Pros:

  • Logging: Writes logs to CloudWatch natively, out of the box.
  • Easy migration: Scheduling is done with CloudWatch rules (the same mechanism we already used for Lambda).
  • Cost-effective: ECS tasks can be cost-effective for smaller projects, as users only pay for the resources they use, such as CPU and memory.

Cons:

  • Complex scheduling: Many settings need to be provided during CloudWatch rule creation (cluster, task definition, VPC settings, etc.). If we want to create rules programmatically, we will have to store references to those resources.
  • Limited customization: ECS tasks offer fewer customization options than other containerization technologies, such as Kubernetes.
  • Limited deployment options: ECS tasks are limited to AWS infrastructure, which may not be suitable for all use cases.
  • Housekeeping: We already have some applications deployed on EKS, so it would be good to have a standard way of dealing with deployments.

Pros and cons of AWS EKS

Amazon EKS lets us run Docker images on a schedule using Kubernetes CronJobs.

Pros:

  • Interoperability: EKS is fully compatible with Kubernetes, which means that you can use any Kubernetes tools, plugins, or applications with EKS.
  • Customization: EKS offers extensive customization options, allowing users to fine-tune their deployments to fit specific use cases.
  • Portability: When moving from EKS to on-premises Kubernetes, you don’t need to change much in your deployments.

Cons:

  • Logging: Does not write logs to CloudWatch natively (needs additional config).
  • Cost: EKS can be more expensive compared to ECS tasks, as users have to pay for the Kubernetes control plane in addition to the resources they use.

Choosing and using AWS EKS

After experimenting with AWS ECS, we decided to go with AWS EKS instead, since it’s the de facto company-wide engineering standard (although ECS also works perfectly well for this task).

Before going further, we reviewed the prerequisites for accessing the Kubernetes API on AWS EKS. We needed:

  • A running EKS cluster
  • An AWS IAM role with the necessary permissions to access the Kubernetes API

Luckily, we already had these in Terraform and could proceed to the next step.

Authenticating to the cluster

To authenticate, we can use any Kubernetes API client we want. Since we work with Kotlin, I picked the official Java client library.
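
To follow along, you need the client library on the classpath. A minimal Gradle (Kotlin DSL) sketch, assuming you pick a client-java version that matches your cluster (the version below is a placeholder):

dependencies {
    // Official Kubernetes Java client; replace the placeholder with your version
    implementation("io.kubernetes:client-java:<version>")
}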

The API is quite straightforward and easy to follow. We create a client that is used for making API calls:

val client = ClientBuilder.defaultClient()
io.kubernetes.client.openapi.Configuration.setDefaultApiClient(client)

But how does authentication in the client work? Looking at the definition of the defaultClient method, we see that it uses the standard builder under the hood:

public static ApiClient defaultClient() throws IOException {
    return ClientBuilder.standard().build();
}

From the docs:

Creates a builder which is pre-configured in the following way:
  • If $KUBECONFIG is defined, use that config file.
  • If $HOME/.kube/config can be found, use that.
  • If the in-cluster service account can be found, assume in-cluster config.
  • Default to localhost:8080 as a last resort.

There are several ways to provide this configuration inside the Docker container. Let’s explore two of them: generating a config file and pointing the KUBECONFIG environment variable at its path, or placing the config at $HOME/.kube/config.
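
For illustration, here is a sketch of how the same client could be built from an explicit kubeconfig path with the official Java client; the path used here is an assumption and depends on where your config ends up:

import io.kubernetes.client.util.ClientBuilder
import io.kubernetes.client.util.KubeConfig
import java.io.FileReader

// Load a kubeconfig from an explicit (hypothetical) path instead of relying on the defaults
val explicitClient = ClientBuilder
    .kubeconfig(KubeConfig.loadKubeConfig(FileReader("/root/.kube/config")))
    .build()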

Note: We have our service deployed in AWS Elastic Beanstalk. If you deploy your service in EKS itself, you can use RBAC authorization in AWS instead.

We need this config at runtime, so it has to be inside the container either way. That makes option 2 more attractive: we can generate the config at $HOME/.kube/config right in the Dockerfile.

The Dockerfile

I run my jobs as Java jars built by Gradle, so I need an intermediate container for building the jar. Then I install the tools needed to generate the .kube/config file: awscli and kubectl.

I need to provide an access key ID, secret access key, and AWS region to successfully authenticate my user via awscli.

A cluster name is needed to set the EKS cluster that the API will use as the default one.

#This is a temporary container for building executable jar file
FROM gradle:7.6.0-jdk17-focal as builder
USER root
WORKDIR /builder
ADD . /builder
RUN gradle build

#This is a main container to run our application
FROM amazoncorretto:17.0.7-alpine3.14
WORKDIR /app
EXPOSE 8080
COPY --from=builder /builder/build/libs/app.jar .

RUN apk --no-cache add curl

#Installing awscli
RUN apk add --no-cache \
        python3 \
        py3-pip \
    && pip3 install --upgrade pip \
    && pip3 install --no-cache-dir \
        awscli \
    && rm -rf /var/cache/apk/*

#Installing kubectl
RUN curl -LO https://storage.googleapis.com/kubernetes-release/release/$(curl -s https://storage.googleapis.com/kubernetes-release/release/stable.txt)/bin/linux/amd64/kubectl
RUN install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl

ARG AWS_ACCESS_KEY_ID
ARG AWS_SECRET_ACCESS_KEY
ARG AWS_REGION
ARG K8S_CLUSTER_NAME

#Configuring aws user to be able to get auth token
RUN aws configure set default.region ${AWS_REGION}
RUN aws configure set aws_access_key_id ${AWS_ACCESS_KEY_ID}
RUN aws configure set aws_secret_access_key ${AWS_SECRET_ACCESS_KEY}

#Configuring cluster config
RUN aws eks update-kubeconfig --region ${AWS_REGION} --name ${K8S_CLUSTER_NAME}

#Running our application
CMD ["java", "-jar", "app.jar"]

The docker command to build the image looks like this when I execute it from the GitHub Actions workflow:

docker build \
  --build-arg AWS_ACCESS_KEY_ID=${{ secrets.AWS_ACCESS_KEY_ID }} \
  --build-arg AWS_SECRET_ACCESS_KEY=${{ secrets.AWS_SECRET_ACCESS_KEY }} \
  --build-arg AWS_REGION=${{ env.AWS_REGION }} \
  --build-arg K8S_CLUSTER_NAME=${{ env.K8S_CLUSTER_NAME }} \
  -t ${{ env.IMAGE_TAG }} .

In this case, I take AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY from GitHub Actions secrets, while AWS_REGION, K8S_CLUSTER_NAME, and IMAGE_TAG come from environment variables I define in the GitHub Actions workflow YAML file.

Coding time

Let's define a small data class that will hold all parameters we want our scheduled job to have.

package com.miro.marketing.model

data class SchedulingJobDTO(
    val name: String,
    val appId: String,
    val enabled: Boolean,
    val scheduleExpression: String,
    val inputParameters: String
)

Below you can see a Spring service that fetches, creates, and updates Kubernetes cron jobs.


package com.miro.marketing.service

import com.miro.marketing.model.SchedulingJobDTO
import io.kubernetes.client.openapi.apis.BatchV1Api
import io.kubernetes.client.openapi.models.*
import org.springframework.beans.factory.annotation.Value
import org.springframework.stereotype.Service

@Service
class KubernetesApiService {

    @Value("\${ecr.container.registry}")
    private val containerRegistry: String? = null

    @Value("\${eks.cluster.namespace}")
    private val clusterNamespace: String? = null

    // List all cron jobs labeled with the given appId, across namespaces
    fun listCronJobs(appId: String): List<SchedulingJobDTO> {
        val list = BatchV1Api().listCronJobForAllNamespaces(
            null, null, null, "app=$appId", null, null, null, null, null, false
        )
        return list.items.map {
            schedulingJobDTO(it)
        }
    }

    fun getCronJob(jobName: String): SchedulingJobDTO {
        val job = BatchV1Api().readNamespacedCronJob(jobName, clusterNamespace, "true")
        return schedulingJobDTO(job)
    }

    fun createCronJob(job: SchedulingJobDTO) {
        BatchV1Api().createNamespacedCronJob(clusterNamespace, cronJob(job), "true", null, null, null)
    }

    fun updateCronJob(job: SchedulingJobDTO) {
        BatchV1Api().replaceNamespacedCronJob(job.name, clusterNamespace, cronJob(job), "true", null, null, null)
    }

    // Build a Kubernetes CronJob object from our DTO
    private fun cronJob(job: SchedulingJobDTO): V1CronJob {
        val body = V1CronJob()
        body.metadata = V1ObjectMeta().name(job.name).labels(mapOf("app" to job.appId))
        val container = V1Container().name(job.name).image("$containerRegistry/${job.appId}:latest")
            .env(listOf(V1EnvVar().name("input").value(job.inputParameters)))
        val jobTemplateSpec = V1JobTemplateSpec().spec(
            V1JobSpec().template(
                V1PodTemplateSpec().spec(
                    V1PodSpec().containers(
                        listOf(container)
                    ).restartPolicy("OnFailure")
                )
            )
        )
        body.spec(
            V1CronJobSpec().jobTemplate(jobTemplateSpec).schedule(job.scheduleExpression).suspend(!job.enabled)
        )
        return body
    }

    // Map a Kubernetes CronJob back to our DTO
    private fun schedulingJobDTO(job: V1CronJob) = SchedulingJobDTO(
        name = job.spec!!.jobTemplate.spec!!.template.spec!!.containers[0].name,
        appId = job.metadata!!.labels!!["app"]!!,
        enabled = !job.spec!!.suspend!!,
        scheduleExpression = job.spec!!.schedule,
        inputParameters = job.spec!!.jobTemplate.spec!!.template.spec!!.containers[0].env!![0].value!!
    )
}

And a small controller to test the API.

package com.miro.marketing.controller

import com.miro.marketing.model.SchedulingJobDTO
import com.miro.marketing.service.KubernetesApiService
import io.kubernetes.client.openapi.models.*
import org.springframework.web.bind.annotation.*

@RestController
@RequestMapping("/api/scheduler")
class SchedulerController(val kubernetesApiService: KubernetesApiService) {

    @GetMapping("/")
    fun cronJobs(@RequestParam appId: String): List<SchedulingJobDTO> {
        return kubernetesApiService.listCronJobs(appId)
    }

    @GetMapping("/job/")
    fun getCronJob(@RequestParam jobName: String): SchedulingJobDTO {
        return kubernetesApiService.getCronJob(jobName)
    }

    @PostMapping("/create")
    fun createCronJob(@RequestBody rule: SchedulingJobDTO) {
        kubernetesApiService.createCronJob(rule)
    }

    @PostMapping("/update")
    fun updateCronJob(@RequestBody rule: SchedulingJobDTO) {
        kubernetesApiService.updateCronJob(rule)
    }
}

Testing time

Let’s test the /create endpoint to see if Kubernetes authentication works.

curl --location 'http://localhost:8080/api/scheduler/create' \
--header 'Content-Type: application/json' \
--data '{
    "name": "test-scheduled-job",
    "appId": "test-app-id",
    "enabled": true,
    "scheduleExpression": "2 * * * *",
    "inputParameters": "{\"dryRun\": \"true\"}"
}'

And we get a 200 HTTP response from our controller. Now it’s time to check the cluster:

kubectl get cronjobs

We see that our job has been created!

Bonus pitfall

While developing, I used a Spring bean to instantiate ClientBuilder.defaultClient, which produces a singleton instance that is accessible throughout the application. However, I was unaware that the authentication token is fixed inside that client, which caused my endpoints to start throwing 401 HTTP errors after 14 minutes. Upon inspecting the official library, I found a TODO statement indicating that the authors intend to cache the token and obtain a new one on a 401 error, behavior I had assumed was already the default.

https://github.com/kubernetes-client/java/blob/52404ee5f68f4576ecc2da499e825241577e93ef/util/src/main/java/io/kubernetes/client/util/KubeConfig.java#L319

So I had to cache the token myself, since acquiring a new one takes around 1.5 seconds for me.

import io.kubernetes.client.openapi.ApiClient
import io.kubernetes.client.util.ClientBuilder
import java.time.LocalDateTime

object K8sClient {

    private const val CACHE_TTL = 10L // in minutes

    @Volatile
    private var lastTokenDate: LocalDateTime? = null
    private lateinit var client: ApiClient

    fun apiClient(): ApiClient {
        // Rebuild the client (and thus refresh the token) once the TTL has expired
        if (lastTokenDate == null || lastTokenDate!!.plusMinutes(CACHE_TTL).isBefore(LocalDateTime.now())) {
            synchronized(this) {
                if (lastTokenDate == null || lastTokenDate!!.plusMinutes(CACHE_TTL).isBefore(LocalDateTime.now())) {
                    client = ClientBuilder.standard().build()
                    lastTokenDate = LocalDateTime.now()
                }
            }
        }
        return client
    }
}

This is a basic wrapper around your ApiClient that builds a fresh instance if 10 minutes have elapsed since the last one was created.

Subsequently, you need to pass this apiClient to all API calls in the code and remove the setDefaultApiClient(client) call:

BatchV1Api(apiClient())
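
For example, the listCronJobs method from the service above could look like this after the change (a sketch; the other methods change in the same way):

fun listCronJobs(appId: String): List<SchedulingJobDTO> {
    // Use the cached client, whose token is at most CACHE_TTL minutes old
    val api = BatchV1Api(K8sClient.apiClient())
    val list = api.listCronJobForAllNamespaces(
        null, null, null, "app=$appId", null, null, null, null, null, false
    )
    return list.items.map { schedulingJobDTO(it) }
}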

Concluding thoughts

In the end, we have a working Docker container that is able to create, update, and list all scheduled jobs in Kubernetes. We can even create a cron job via our container that will create more such containers! But we are not crazy and won’t do that…today.

Do you see any flaws in this solution? Could you suggest an interesting alternative? Or improve the Dockerfile? Chime in with a comment below!

Are you interested in joining our team? Then check out our open positions.
