Orchestrating scheduled data batch jobs on AWS

Uli Strötz
door2door Engineering
3 min read · Apr 11, 2017

For one of our latest products, we are analyzing various public transport data and combining it with movement data from different sources. It is a classic ETL process, which, in our case, consists of three batch jobs: Analyzer, Aggregator, and Deployer.

The problem we are trying to solve can be summarized in these two requirements:

  • Orchestrate the three batch jobs in a way that their interdependencies are respected
  • Schedule the three batch jobs to run once a day

The following describes our approach to the problem.

AWS offers a variety of options for creating such a data processing pipeline. Our first idea was to create three AWS Lambda functions and orchestrate them with AWS Step Functions. AWS Step Functions is super useful for visualising your processing steps and orchestrating them accordingly. The major downside was the extensively discussed 5-minute execution timeout of Lambda.

We ended up using the new AWS Batch to run our three applications. AWS Batch enables developers, scientists, and engineers to easily and efficiently run hundreds of thousands of batch computing jobs on AWS. We pushed all three application images to a repository in the EC2 Container Registry (ECR), set up an AWS Batch compute environment and job queue for the entire project, and created a job definition for each of the three batch jobs. With that, we could execute all three of them from the AWS web console or the CLI. The big advantage of AWS Batch is that it lets you specify job dependencies by job ID. This solves our first requirement.
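To illustrate the dependency mechanism: when submitting a job, you can pass the upstream job's ID in the dependsOn parameter of Batch's submit_job call, and AWS Batch will hold the new job until that upstream job succeeds. A minimal sketch with boto3 (the queue and job-definition names here are made up, not our actual configuration):

```python
# Placeholder queue name -- substitute your own.
JOB_QUEUE = "etl-job-queue"

def submit_job_kwargs(name, job_definition, depends_on=None):
    """Build the keyword arguments for AWS Batch's submit_job call.

    When depends_on holds one or more job IDs, AWS Batch starts this
    job only after all of those jobs have finished successfully.
    """
    kwargs = {
        "jobName": name,
        "jobQueue": JOB_QUEUE,
        "jobDefinition": job_definition,
    }
    if depends_on:
        kwargs["dependsOn"] = [{"jobId": job_id} for job_id in depends_on]
    return kwargs

def run_once():
    import boto3  # AWS SDK for Python
    batch = boto3.client("batch")
    analyzer = batch.submit_job(**submit_job_kwargs("analyzer", "analyzer-def"))
    # The aggregator waits for the analyzer job to succeed before starting.
    batch.submit_job(**submit_job_kwargs(
        "aggregator", "aggregator-def", depends_on=[analyzer["jobId"]]))
```

Chaining all three jobs this way gives a strictly sequential pipeline without any polling logic on our side.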

The last missing piece was to schedule them to run automatically once a day. We thought AWS would handle this for us too, so we wouldn't have to worry about it. The product is advertised as:

AWS Batch plans, schedules, and executes your batch computing workloads …

So we expected a configurable cron to schedule the jobs, but we couldn't find any way to schedule Batch jobs. The only information the docs provided was a short paragraph describing job scheduling.

Reaching out to the AWS Support center to ask how to run these scheduled jobs only gave us the answer that we could set up a Lambda function to schedule and orchestrate them ourselves. We wanted to avoid having another piece of code to maintain. Nevertheless, this was the only feasible option to make them work together, so we kept it as simple as possible in a basic Python script.
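A sketch of what such a Lambda script can look like (queue and job-definition names are assumptions, not our actual configuration): the handler submits the three jobs in order, wiring each job's dependsOn to the previous job's ID.

```python
def submit_chain(submit, job_queue, job_definitions):
    """Submit Batch jobs in order, each depending on the previous one.

    `submit` is any callable with the signature of boto3's
    batch.submit_job; it is injected to keep the chaining logic testable.
    """
    previous_id = None
    job_ids = []
    for name in job_definitions:
        kwargs = {"jobName": name, "jobQueue": job_queue, "jobDefinition": name}
        if previous_id is not None:
            # AWS Batch holds this job until the previous one succeeds.
            kwargs["dependsOn"] = [{"jobId": previous_id}]
        previous_id = submit(**kwargs)["jobId"]
        job_ids.append(previous_id)
    return job_ids

def handler(event, context):
    """Lambda entry point, triggered once a day by a scheduled event."""
    import boto3  # bundled with the Lambda Python runtime
    batch = boto3.client("batch")
    return {"jobIds": submit_chain(
        batch.submit_job,
        "etl-job-queue",                          # assumed queue name
        ["analyzer", "aggregator", "deployer"],   # assumed definition names
    )}
```

Because the dependency handling lives in AWS Batch, the function returns immediately after submitting; it never waits for the jobs to finish, so it stays well within Lambda's timeout.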

Lambda offers very simple scheduled events (via CloudWatch). That's the feature we were expecting from AWS Batch to make it a complete service. Nevertheless, the simple ability to specify a dependent job when submitting a new one makes the service quite useful for our use case. We are looking forward to more features and more regions for AWS Batch.
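Wiring up the schedule then amounts to a CloudWatch Events rule with a cron expression targeting the Lambda function. A sketch (the rule name, hour, and function ARN are placeholders):

```python
def daily_schedule_expression(hour_utc, minute=0):
    """CloudWatch Events cron expression that fires once a day (UTC).

    Note the six-field format, with '?' required in the day-of-week slot
    when day-of-month is specified.
    """
    return f"cron({minute} {hour_utc} * * ? *)"

def create_daily_trigger(lambda_arn, hour_utc=3):
    import boto3  # AWS SDK for Python
    events = boto3.client("events")
    events.put_rule(
        Name="daily-etl-trigger",  # placeholder rule name
        ScheduleExpression=daily_schedule_expression(hour_utc),
        State="ENABLED",
    )
    # Point the rule at the Lambda function. The function additionally
    # needs a resource policy allowing events.amazonaws.com to invoke it.
    events.put_targets(
        Rule="daily-etl-trigger",
        Targets=[{"Id": "1", "Arn": lambda_arn}],
    )
```

The same rule can of course be created through the console or CloudFormation; the script form just shows which pieces are involved.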

The complete architecture to orchestrate our scheduled data batch jobs on a daily basis with Lambda is illustrated here:

The above setup fulfills our two requirements: the jobs run sequentially thanks to job dependencies in AWS Batch, and AWS Lambda lets us trigger the whole process once a day.
