Highly demanding cloud computing using AWS Batch & ECS

wlv · Published in Akur8 · 7 min read · Jun 22, 2023

Introduction

In this article, we will present how we at Akur8 solved a common task scheduling problem using AWS Batch and ECS Fargate.

About AKUR8

AKUR8 is revolutionizing insurance pricing with Transparent AI, boosting insurers’ pricing capabilities with unprecedented speed and accuracy across the pricing process without compromising on auditability or control.

Akur8's modular pricing platform automates technical and commercial premium modeling. It empowers insurers to compute adjusted and accurate rates in line with their commercial strategy while materially impacting their business and maintaining absolute control of the models created, as required by regulators worldwide.

Use case

Let’s start with a simplified diagram of the AKUR8 platform in its early days:

The need to run relatively long computations quickly arose, and we did not want to use the API server for this work: users’ interactions with the platform should not be degraded by large, ongoing computations. Consequently, like many other software companies, we needed to run asynchronous jobs, in our case potentially 300+ in parallel.

These jobs can be relatively simple, but they can also be very complex and resource-demanding. For instance, depending on the context and the user database, a single job can consume up to 500 GB of memory and can take anywhere from a few seconds to several hours to run.

We could not keep large machines running permanently because it would be far too expensive. Instead, we searched for a service that runs machines on demand, charging only for actual use. Since our stack was already mostly on AWS, we decided to use AWS Batch.

AWS Batch Definition:

AWS Batch helps you to run batch computing workloads on the AWS Cloud. Batch computing is a common way for developers, scientists, and engineers to access large amounts of compute resources. AWS Batch removes the undifferentiated heavy lifting of configuring and managing the required infrastructure, similar to traditional batch computing software. This service can efficiently provision resources in response to jobs submitted in order to eliminate capacity constraints, reduce compute costs, and deliver results quickly.

Source: What Is AWS Batch?

Jobs definition

Once we decided to use AWS Batch, we needed to define what would constitute a job.
Each job is defined as a sequence of smaller unitary tasks, which is important for parallelization and code reuse.

Currently, we do not use multi-threading to speed up processing. Instead, when possible, we split a large task into parallelizable tasks and use AWS Batch’s ability to spawn multiple machines to run them in parallel (cf. Task 2 & Task 3). One of the main advantages of this approach is a simpler code base, since we don’t have to manage multi-threading constraints.
Also, if everything ran on a single machine using multi-threading, the machine would have to be sized for the most resource-demanding task; if one of the tasks lasted a long time, this could become very costly. The AWS Batch approach helped optimize machine costs, since each task has different CPU and memory needs. The code of each unitary task is packaged into a single Docker image that is used by every job.
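
To make this concrete, here is a minimal sketch of what submitting one unitary task with its own resource requirements could look like with the AWS SDK for Java v2. It is an illustration only: the queue name, job definition name, and values are hypothetical, not our actual configuration.

    import software.amazon.awssdk.services.batch.BatchClient;
    import software.amazon.awssdk.services.batch.model.ContainerOverrides;
    import software.amazon.awssdk.services.batch.model.ResourceRequirement;
    import software.amazon.awssdk.services.batch.model.ResourceType;
    import software.amazon.awssdk.services.batch.model.SubmitJobRequest;

    public class TaskSubmitter {
        public static void main(String[] args) {
            try (BatchClient batch = BatchClient.create()) {
                SubmitJobRequest request = SubmitJobRequest.builder()
                        .jobName("job-123-task-2")      // hypothetical task name
                        .jobQueue("computation-queue")  // hypothetical queue
                        .jobDefinition("unitary-task")  // one Docker image shared by all tasks
                        .containerOverrides(ContainerOverrides.builder()
                                // Each task requests only the CPU and memory it needs.
                                .resourceRequirements(
                                        ResourceRequirement.builder()
                                                .type(ResourceType.VCPU).value("4").build(),
                                        ResourceRequirement.builder()
                                                .type(ResourceType.MEMORY).value("16384").build())
                                .build())
                        .build();
                System.out.println("Submitted job " + batch.submitJob(request).jobId());
            }
        }
    }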

We modeled the job as a structure that holds the sequencing and state of each task. From there, we had to find a solution to schedule tasks 2 and 3 when task 1 terminates. We could have made each task responsible for updating its own state and scheduling the next tasks. However, since tasks are subject to multiple kinds of failures, such as “Out Of Memory” errors, this can quickly become troublesome: if task 1 fails, the job is left in an inconsistent state.
Therefore, we decided to add a new service to our architecture where we would centralize the scheduling logic and mutations of the job structure. We called it the task scheduler.
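
For illustration, the job structure could be modeled along the following lines. This is a simplified sketch with made-up names, not Akur8’s actual model.

    import java.util.List;

    // Simplified sketch of a job holding the sequencing and state of its tasks.
    enum TaskStatus { PENDING, RUNNING, SUCCEEDED, FAILED }

    record Task(String id, TaskStatus status, List<String> dependsOn) {}

    record Job(String id, List<Task> tasks) {
        // A task can be scheduled once every task it depends on has succeeded.
        boolean isSchedulable(Task task) {
            return task.status() == TaskStatus.PENDING
                    && tasks.stream()
                            .filter(t -> task.dependsOn().contains(t.id()))
                            .allMatch(t -> t.status() == TaskStatus.SUCCEEDED);
        }
    }

With a model like this, when the scheduler learns that task 1 succeeded, it simply looks for tasks whose dependencies are now all satisfied (here, tasks 2 and 3) and submits them.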

Task scheduler

The mission of the task scheduler is to track the status of each job’s tasks and schedule them in the correct order. Below is the previous diagram updated with the scheduler service.

The service databases are separated from the application databases.
We designed the service to be event-driven, and currently have 9 different events.

This pattern made debugging and concurrency handling easier (cf. the following part on service redundancy). The only way to send an event to the task scheduler is through a single HTTP endpoint. The event is then pushed to a Redis queue on which one thread of the task scheduler listens.
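
A stripped-down sketch of that entry point, assuming Spring MVC for the endpoint and Redisson for the Redis queue (class and queue names are made up for the example):

    import org.redisson.api.RBlockingQueue;
    import org.redisson.api.RedissonClient;
    import org.springframework.web.bind.annotation.PostMapping;
    import org.springframework.web.bind.annotation.RequestBody;
    import org.springframework.web.bind.annotation.RestController;

    @RestController
    class SchedulerEventController {
        private final RBlockingQueue<String> events;

        SchedulerEventController(RedissonClient redisson) {
            // Single entry point: every event goes through the same Redis queue.
            this.events = redisson.getBlockingQueue("task-scheduler-events");
        }

        @PostMapping("/events")
        void publish(@RequestBody String eventJson) {
            events.offer(eventJson);
        }
    }

On the consuming side, one scheduler thread blocks on the same queue with events.take() and dispatches each event to its processing strategy.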

Below is an example of the event lifecycle of a job:

Each event has its own processing strategy, which mutates the job state and publishes the next events.
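
In code, this can be expressed as a simple strategy interface; the sketch below is illustrative, with hypothetical names:

    import java.util.List;
    import java.util.Map;

    // One strategy per event type: it mutates the job state and
    // returns the follow-up events to publish.
    interface EventHandler {
        List<String> handle(String jobId, String payload);
    }

    class EventDispatcher {
        private final Map<String, EventHandler> handlers;

        EventDispatcher(Map<String, EventHandler> handlers) {
            this.handlers = handlers;
        }

        List<String> dispatch(String eventType, String jobId, String payload) {
            return handlers.get(eventType).handle(jobId, payload);
        }
    }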

Service redundancy

Redundancy is a mandatory constraint when deploying a service in production. It means at least two instances of the service must always be running, and therefore two services are concurrently dequeuing events. That concurrency has to be handled properly, otherwise it can quickly lead to job state corruption: this happens if two events related to the same job are processed at the same time and write the job state simultaneously. To deal with that, we made use of Redis distributed locks, via the excellent Java implementation provided by Redisson.
With a distributed lock on the job identifier, we ensured that only one event related to that job would be processed at a time, guaranteeing the consistency of the job state.
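
In practice the pattern is short; here is a sketch using Redisson’s RLock (simplified, without lease times or error handling):

    import org.redisson.api.RLock;
    import org.redisson.api.RedissonClient;

    class JobEventProcessor {
        private final RedissonClient redisson;

        JobEventProcessor(RedissonClient redisson) {
            this.redisson = redisson;
        }

        void process(String jobId, Runnable mutation) {
            // One lock per job: events for different jobs still run in parallel.
            RLock lock = redisson.getLock("job-lock:" + jobId);
            lock.lock();
            try {
                mutation.run(); // safely read and write the job state
            } finally {
                lock.unlock();
            }
        }
    }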

However, we must be cautious as it has some caveats that are well explained in the Redis documentation. Make sure it suits your use case before using it.

Problems we faced

  1. AWS Batch was slow to start a task
    We estimated that AWS Batch took about 3 minutes to start a task. This included finding the machine, booting it if necessary, loading the Docker image, and the time for the process to initialize. This is not a big deal when the task is part of a job that takes much longer to compute, but we had jobs we wanted to run in a few seconds.
    Hence, to improve the starting time, we reserved a machine on ECS that only processes fast, low-resource tasks. It is the responsibility of the task scheduler to submit each task to ECS or AWS Batch depending on its nature.
  2. Properly dimension the computing environment so the databases are not overloaded.
    When using AWS Batch, you must define at least one compute environment. For each environment, you must define the maximum amount of resources that can be used.
    An issue we had was setting that value too high: too many jobs ran in parallel, which overloaded our database.
    So, we had to fine-tune the resources allocated to the compute environment to control the maximum number of tasks running in parallel and ensure the database would survive the load.
  3. Unexpected failures
    AWS Batch would fail to run the task with the error “Essential container exited” before the machine even got started.
    This was slightly problematic due to our design: the task scheduler had no way of knowing that the task had failed, because failure events were normally sent by the task itself.
    So, we had to add a polling system within the task scheduler that occasionally checks the status of tasks whose state has not been updated for a certain period of time (a sketch of such a poller follows this list).
    Today we’re working on a solution that would make use of AWS SNS/SQS to react instantly to this kind of error, which Spring Cloud made easy to set up.
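
As referenced in point 3 above, here is a rough sketch of such a poller, assuming Spring’s @Scheduled and the AWS SDK for Java v2; findStaleBatchJobIds and publishFailureEvent are hypothetical helpers, not our actual code:

    import java.util.List;
    import org.springframework.scheduling.annotation.Scheduled;
    import org.springframework.stereotype.Component;
    import software.amazon.awssdk.services.batch.BatchClient;
    import software.amazon.awssdk.services.batch.model.DescribeJobsRequest;
    import software.amazon.awssdk.services.batch.model.JobDetail;
    import software.amazon.awssdk.services.batch.model.JobStatus;

    @Component
    class StaleTaskPoller {
        private final BatchClient batch = BatchClient.create();

        @Scheduled(fixedDelay = 60_000) // check every minute
        void reconcileStaleTasks() {
            // Hypothetical lookup: tasks whose state has not changed recently.
            List<String> staleJobIds = findStaleBatchJobIds();
            if (staleJobIds.isEmpty()) {
                return;
            }
            batch.describeJobs(DescribeJobsRequest.builder().jobs(staleJobIds).build())
                    .jobs()
                    .stream()
                    .filter(job -> job.status() == JobStatus.FAILED)
                    .forEach(this::publishFailureEvent);
        }

        private List<String> findStaleBatchJobIds() { return List.of(); }

        private void publishFailureEvent(JobDetail job) {
            // Push a failure event onto the scheduler's Redis queue.
        }
    }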

Conclusion

AWS Batch is a great way to run jobs demanding large resources at minimal cost, without having any machines to manage.
The only drawback is the starting time. Therefore, our team is considering native compilation of the JVM application to improve it, and potentially splitting our single Docker image.
