Running scheduled jobs on AWS Elastic Beanstalk

Rafael Gaino
Motorway Engineering
5 min read · Sep 30, 2021

Motorway started on Heroku, as many start-ups do. It enabled us to move really fast in the beginning with a small team. Heroku abstracts away much of the complexity of a more bare-bones hosting solution such as AWS, but these abstractions come at a price: it's a trade-off between simplicity and control over the details of your platform.

Since 2020 we have been migrating our microservices from Heroku to Elastic Beanstalk, which provides a similar paradigm for running our apps and required little to no change to our code base.

But many of the features we took for granted on Heroku were not so trivial to implement with Elastic Beanstalk. One of them was running scheduled jobs.

Some of our scheduled jobs have tricky requirements. The most important were:

1. Exactly-once guarantee: our jobs must execute once and only once. This sounds trivial, but in reality it's not.

2. Precision: our jobs need to be triggered bang on time, and we can't wait for a new instance to be initialised. For some specific jobs, we were looking for a precision of under one second.

3. Time zone support: our jobs need to run in a specific time zone, not UTC.

Those requirements ruled out solutions like CloudWatch Scheduled Events, which has no time zone support. On top of that, AWS itself admits that events can be triggered more than once. Those guarantees were not good enough for us.

The solutions that didn't work

Search for "scheduled jobs on Elastic Beanstalk" and the results are underwhelming. Even AWS's official documentation on the topic has problems. It recommends creating a cron file on your environment, but that means your job will trigger at the same time on every instance, which doesn't work in a scalable environment: you don't want several copies of your job spawned simultaneously.

There are, of course, ways to ensure only one instance of your job executes. This Stack Overflow discussion, for example, recommends identifying a leader instance using the leader_only flag in the deploy config file and creating the cron file only on that instance. But that's absolutely the wrong way to do it, because the leader only exists during the deploy phase. Afterwards there is no such thing as a leader instance, and if a scale-in event happens, your "leader" might be terminated and your scheduled job will silently stop running until the next deploy.

You could instead let all instances trigger the scheduled job and elect a leader at run time, using a lock mechanism (with Redis, for example) or an internal Elastic Beanstalk script. But there's another problem with cron files that defeats this approach: during a scale-in or deploy event, your instance might be removed from the Auto Scaling group and then terminated, yet its cron tasks will keep firing right up until the moment of termination, which means your job might be killed while it's running.

The solution that works

Given all these issues, we opted for an approach with two main pieces: a microservice that triggers scheduled jobs, and a library that ensures jobs can finish in peace, without fear of sudden termination.

The scheduler microservice

The main reason for a new microservice whose sole responsibility is triggering scheduled jobs via HTTP calls is to make sure the instance chosen to run a job is not marked for termination due to a scale-in or deploy event. When one of these events happens, Elastic Beanstalk first removes the instance from the Auto Scaling group and drains all connections from it. This means that when the scheduler asks an application to run a job, we can be sure the receiving instance is not scheduled for termination.

It also gives us an easy, centralised way of configuring our jobs. We used the node-schedule package, which has a cron-like syntax and time zone support out of the box. Having our own microservice also gives us the option of building an interface on top of it, so we can adjust, pause or force-trigger jobs without code changes or having to SSH into an instance for a manual trigger.
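
To give a concrete picture, here's a minimal sketch of what a scheduler entry can look like. The cron rule, the internal endpoint and the axios HTTP client are illustrative choices, not our exact setup:

```js
const schedule = require('node-schedule');
const axios = require('axios'); // illustrative HTTP client choice

// node-schedule accepts a cron-like rule plus an IANA time zone, so
// "07:30 on weekdays" means 07:30 in London even if the instance runs in UTC.
schedule.scheduleJob({ rule: '30 7 * * 1-5', tz: 'Europe/London' }, async () => {
  try {
    // Trigger the job on the target microservice over HTTP.
    const res = await axios.post('http://reports-service.internal/jobs/daily-report');
    console.log(`daily-report triggered: ${res.status}`);
  } catch (err) {
    console.error(`daily-report trigger failed: ${err.message}`);
  }
});
```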

Our scheduler microservice is itself subject to scale and deploy events, and because of rolling deploys it can have more than one active instance at any moment. So we used a lock mechanism with Redis to ensure that even if multiple instances fire the same schedule, only one will acquire the lock and actually make the request.
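
The locking itself is the standard atomic SET with NX pattern. A minimal sketch with ioredis (key names and TTL are illustrative):

```js
const Redis = require('ioredis');
const redis = new Redis(process.env.REDIS_URL);

// SET with NX + PX is atomic: of all the scheduler instances that fire at
// the same moment, exactly one gets 'OK' back; the TTL makes sure the lock
// expires even if the holder dies before releasing it.
async function acquireTriggerLock(jobName, ttlMs = 60000) {
  const result = await redis.set(`job-lock:${jobName}`, process.pid, 'PX', ttlMs, 'NX');
  return result === 'OK';
}

// Called from inside each scheduled callback.
async function triggerOnce(jobName, trigger) {
  if (await acquireTriggerLock(jobName)) {
    await trigger(); // only the lock holder makes the HTTP request
  }
}
```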

We then added a new endpoint to all our microservices, via our internal API framework library, that receives the request, spawns a new process to execute the script (we use Node.js and don't want to block the main thread) and logs the outcome.
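
As a sketch of what such an endpoint can look like (assuming Express here; our internal framework differs, and the route and script layout are made up):

```js
const express = require('express');
const { fork } = require('child_process');

const app = express();

app.post('/jobs/:name', (req, res) => {
  // A real implementation must validate the job name against a whitelist
  // to avoid path traversal.
  // fork() runs the script in a separate Node.js process, so a heavy job
  // never blocks the service's main event loop.
  const job = fork(`${__dirname}/jobs/${req.params.name}.js`);

  job.on('exit', (code) => {
    console.log(`job ${req.params.name} finished with exit code ${code}`);
  });

  // Acknowledge immediately; the job keeps running in the background.
  res.status(202).json({ started: req.params.name });
});

app.listen(3000);
```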

Termination protection

But even if we request the job before an instance is marked for termination, we can't guarantee it won't be marked immediately after our request is accepted; if the job doesn't finish quickly, the instance might be terminated before the job completes.

To eliminate this possibility, we took advantage of EC2 Auto Scaling lifecycle hooks. Whenever an instance is scheduled for termination, it is first removed from the Auto Scaling group; then, if a lifecycle hook is configured, the termination is paused and a notification is sent asking whether it may proceed.
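
Setting up the hook is a one-off step. Roughly, with the AWS SDK for JavaScript (the group name, ARNs and timeout below are placeholders):

```js
const AWS = require('aws-sdk');
const autoscaling = new AWS.AutoScaling({ region: 'eu-west-1' });

autoscaling.putLifecycleHook({
  AutoScalingGroupName: 'my-beanstalk-asg',
  LifecycleHookName: 'wait-for-running-jobs',
  LifecycleTransition: 'autoscaling:EC2_INSTANCE_TERMINATING',
  // Pause the termination for up to 15 minutes while jobs wrap up.
  HeartbeatTimeout: 900,
  // If nobody answers in time, let the termination proceed anyway.
  DefaultResult: 'CONTINUE',
  NotificationTargetARN: 'arn:aws:sqs:eu-west-1:123456789012:termination-notices',
  RoleARN: 'arn:aws:iam::123456789012:role/asg-lifecycle-role',
}).promise()
  .then(() => console.log('lifecycle hook installed'));
```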

When our applications start a task, they first acquire an instance termination lock. If an instance holds an active lock and our system is notified that EC2 wants to terminate it, we simply withhold permission until the task is done. Once the task finishes, we know the instance won't receive any new requests, because it has already been removed from the Auto Scaling group, so we're clear to allow the termination.
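
Put together, the gatekeeper logic looks roughly like this. The lock-key scheme, polling interval and lock check are illustrative, and in practice the notification arrives via the queue configured on the hook:

```js
const AWS = require('aws-sdk');
const Redis = require('ioredis');

const autoscaling = new AWS.AutoScaling({ region: 'eu-west-1' });
const redis = new Redis(process.env.REDIS_URL);

// Hypothetical lock check: applications SET this key while a task is running.
const isTaskLockHeld = async (instanceId) =>
  (await redis.exists(`termination-lock:${instanceId}`)) === 1;

async function handleTerminationNotice({ instanceId, lifecycleActionToken }) {
  while (await isTaskLockHeld(instanceId)) {
    // Tell Auto Scaling we still need time, then check again shortly.
    await autoscaling.recordLifecycleActionHeartbeat({
      AutoScalingGroupName: 'my-beanstalk-asg',
      LifecycleHookName: 'wait-for-running-jobs',
      LifecycleActionToken: lifecycleActionToken,
    }).promise();
    await new Promise((resolve) => setTimeout(resolve, 30000));
  }

  // No task is running (or it just finished): allow the termination.
  await autoscaling.completeLifecycleAction({
    AutoScalingGroupName: 'my-beanstalk-asg',
    LifecycleHookName: 'wait-for-running-jobs',
    LifecycleActionToken: lifecycleActionToken,
    LifecycleActionResult: 'CONTINUE',
  }).promise();
}
```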

In conclusion

As I said at the beginning, it's a trade-off between simplicity and control. This is by no means the only possible solution, but it ticks all the boxes on our requirements list and gives us the option of expanding it in the future.

Our scheduler microservice can be extended not only to trigger jobs in other applications but also to trigger automatic communications or any other task that must happen on a schedule.

Our instance termination protection can now be used to protect any task, such as one-off scripts that take a long time to complete.
