How to make a parallelized cron job in AWS, for the laziest programmers
I had an assignment: copy our operational database to our reporting database every night, then send a status message to Slack. I could use any tool I deemed fit. What would be the laziest?
Step 1: Put your configuration and secrets (e.g. database hostname and password) in AWS Parameter Store. It takes a few clicks, if they’re not already there.
Step 2: Make a script using the language of your choice. It should use the AWS SDK to get the configuration from Parameter Store. It should get the list of database table names to copy from an environment variable. Any error should be reported on stderr and end the script with exit(1), because errors must not pass silently.
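Here is a minimal sketch of what such a script could look like in Python, assuming boto3 as the SDK. The comma-separated table format, the parameter prefix, and the `copy_table` callback are my own placeholders for illustration, not something the steps above prescribe.

```python
import sys
import traceback


def parse_tables(raw):
    """Split a comma-separated list of table names, ignoring blanks."""
    return [name.strip() for name in raw.split(",") if name.strip()]


def load_config(prefix):
    """Fetch every parameter under `prefix` from Parameter Store as a dict."""
    import boto3  # imported lazily so the helpers here work without the SDK

    ssm = boto3.client("ssm")
    config = {}
    paginator = ssm.get_paginator("get_parameters_by_path")
    for page in paginator.paginate(Path=prefix, Recursive=True, WithDecryption=True):
        for param in page["Parameters"]:
            # Strip the prefix so "/reporting/db_host" becomes "db_host".
            config[param["Name"][len(prefix):]] = param["Value"]
    return config


def run(tables, config, copy_table):
    """Copy each table; return a process exit code (0 = ok, 1 = failed)."""
    try:
        for table in tables:
            copy_table(config, table)  # hypothetical callback: your copy logic
    except Exception:
        traceback.print_exc(file=sys.stderr)  # errors must not pass silently
        return 1
    return 0
```

The script's entry point would then be something like `sys.exit(run(parse_tables(os.environ["TABLES"]), load_config("/reporting/"), copy_table))`, where `TABLES` and `/reporting/` are assumed names and `copy_table` is whatever function does your actual copying.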
Step 3: Put the script in GitHub.
Step 4: Create a CodeBuild job that checks out the git repo and runs a buildspec.yml. The buildspec is about 10 lines: install dependencies, then run the script. Use the standard CodeBuild-maintained Docker image. You can easily configure the job to run in a VPC.
Step 5: Create a Step Function as shown in the image below. It executes a fixed number of CodeBuild jobs in parallel (using “.sync” so they are awaited until completion), and a list of table names is passed to each job via an environment variable override.
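To make Step 5 concrete, here is a sketch that generates a state machine definition of that shape from Python. The project name, the `TABLES` variable, and the state names other than “Notify Success” and “Notify Failed” are assumptions, and the two notify states are left as `Pass` placeholders rather than real notification tasks.

```python
import json


def build_definition(project_name, table_groups):
    """Return an Amazon States Language definition as a plain dict."""
    branches = []
    for index, tables in enumerate(table_groups):
        state_name = f"Load {index + 1}"
        branches.append({
            "StartAt": state_name,
            "States": {
                state_name: {
                    "Type": "Task",
                    # ".sync" makes Step Functions wait until the build finishes.
                    "Resource": "arn:aws:states:::codebuild:startBuild.sync",
                    "Parameters": {
                        "ProjectName": project_name,
                        "EnvironmentVariablesOverride": [{
                            "Name": "TABLES",  # assumed variable name
                            "Type": "PLAINTEXT",
                            "Value": ",".join(tables),
                        }],
                    },
                    "End": True,
                }
            },
        })
    return {
        "StartAt": "Load Tables",
        "States": {
            "Load Tables": {
                "Type": "Parallel",
                "Branches": branches,
                "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "Notify Failed"}],
                "Next": "Notify Success",
            },
            # Placeholders: swap these for the real notification tasks.
            "Notify Success": {"Type": "Pass", "End": True},
            "Notify Failed": {"Type": "Pass", "Next": "Load Failed"},
            "Load Failed": {"Type": "Fail"},
        },
    }
```

Calling `json.dumps(build_definition("nightly-load", groups), indent=2)` gives you text you can paste into the Step Functions console, with one branch per group of tables.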
Step 6: Curse at AWS Chatbot for being so useless that it can’t send arbitrary SNS messages to Slack. The “Notify Success” and “Notify Failed” states can’t just send to an SNS topic; nope, you have to create a custom Lambda just to send a plain text message to a Slack channel. Sigh.
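For what it's worth, that custom Lambda can be tiny. Below is a sketch that posts to a Slack incoming webhook using only the standard library. The `SLACK_WEBHOOK_URL` environment variable and the `message` key in the event are assumptions of mine; you could just as well read the webhook URL from Parameter Store.

```python
import json
import os
import urllib.request


def build_request(webhook_url, text):
    """Build the JSON POST that a Slack incoming webhook expects."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def handler(event, context):
    """Lambda entry point; event["message"] is an assumed input shape."""
    request = build_request(os.environ["SLACK_WEBHOOK_URL"], event["message"])
    # urlopen raises on HTTP errors, so a failed post fails the Lambda too.
    with urllib.request.urlopen(request, timeout=10) as response:
        return {"status": response.status}
```

The “Notify Success” and “Notify Failed” states would each invoke this with a different `message`.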
Step 7: Schedule the Step Function with CloudWatch Events (now Amazon EventBridge) and a cron expression, just like you would a Lambda.
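If you prefer code to clicks, the schedule could be created roughly like this with boto3. The rule name, target id, and ARNs are placeholders. Note that AWS cron expressions have six fields (minutes, hours, day of month, month, day of week, year) and are evaluated in UTC.

```python
import json


def schedule_kwargs(rule_name, hour_utc, state_machine_arn, role_arn):
    """Build the arguments for put_rule and put_targets."""
    rule = {
        "Name": rule_name,
        # Every day at hour_utc:00 UTC; "?" means "no specific day of week".
        "ScheduleExpression": f"cron(0 {hour_utc} * * ? *)",
        "State": "ENABLED",
    }
    targets = {
        "Rule": rule_name,
        "Targets": [{
            "Id": "nightly-load",  # hypothetical target id
            "Arn": state_machine_arn,
            "RoleArn": role_arn,  # role allowed to call states:StartExecution
            "Input": json.dumps({}),
        }],
    }
    return rule, targets


def create_schedule(rule_name, hour_utc, state_machine_arn, role_arn):
    """Create (or update) the rule and point it at the state machine."""
    import boto3  # imported lazily so schedule_kwargs works without the SDK

    events = boto3.client("events")
    rule, targets = schedule_kwargs(rule_name, hour_utc, state_machine_arn, role_arn)
    events.put_rule(**rule)
    events.put_targets(**targets)
```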
Step 8: You’re done! Enjoy starting the step function and watching the flow chart change colors in real time as jobs finish. You can click on any of the launched jobs to see them in the CodeBuild console, and see their log output. You can even search all jobs’ combined stdout/stderr with CloudWatch Logs Insights.
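Our load takes 45 minutes, using 5 parallel jobs. In the future, we might invoke an additional “Transform” step function after the load phase is done. We may start the step function by passing the current date as its execution name, to ensure only one invocation happens per day (for Standard workflows, StartExecution is idempotent: repeating a call with the same name and input while the execution is still running returns that same execution, and otherwise reusing a name is rejected with ExecutionAlreadyExists).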
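A sketch of the date-as-name idea, assuming boto3 and a Standard workflow. The `load-` prefix is an arbitrary choice of mine.

```python
import datetime
import json


def execution_name(day):
    """One name per calendar day, e.g. load-2024-01-15."""
    return f"load-{day.isoformat()}"


def start_daily_execution(state_machine_arn, day):
    """Start today's run; return None if this day's name was already used."""
    import boto3  # imported lazily so execution_name works without the SDK

    sfn = boto3.client("stepfunctions")
    try:
        return sfn.start_execution(
            stateMachineArn=state_machine_arn,
            name=execution_name(day),
            input=json.dumps({}),
        )
    except sfn.exceptions.ExecutionAlreadyExists:
        # Raised when the name was used by a finished run or with other input.
        return None
```

You would call it as `start_daily_execution(arn, datetime.date.today())`; a second call on the same day either returns the still-running execution or gets `None` back.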
Another really great thing about step functions is they can drive a Lambda toward success. For example, if you have 1000 (or 1 billion) records to scan, you could process them in id order in small batches, until the Lambda returns “no more records”. The only mutable state you’d need is a single variable in the step function, “last_id”, to handle the iteration. You no longer have to worry about potential error cases with SQS or Lambdas trying to re-invoke themselves to keep a chain going. Basically, step functions are easy, cheap, and reliable — a winning combination.
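To illustrate the batching pattern, here is a sketch of such a Lambda handler plus a small local loop that stands in for the step function (in the real thing, that loop would be a Task state followed by a Choice state on `done`). The in-memory `RECORD_IDS` list, the batch size, and the `done` flag name are all assumptions for the sake of a runnable example.

```python
RECORD_IDS = list(range(1, 11))  # stand-in for a real table with ids 1..10
BATCH_SIZE = 4


def fetch_batch(last_id, limit):
    """Stand-in for: SELECT id FROM t WHERE id > last_id ORDER BY id LIMIT limit."""
    return [i for i in RECORD_IDS if i > last_id][:limit]


def handler(event, context=None):
    """Process one batch and report where the next call should resume."""
    last_id = event.get("last_id", 0)
    batch = fetch_batch(last_id, BATCH_SIZE)
    for record_id in batch:
        pass  # process the record here
    if batch:
        last_id = batch[-1]
    # A short (or empty) batch means there are no more records.
    return {"last_id": last_id, "done": len(batch) < BATCH_SIZE}


def drive(handler_fn):
    """Local simulation of the step function loop around the Lambda."""
    state = {"last_id": 0}
    calls = 0
    while True:
        state = handler_fn(state)
        calls += 1
        if state["done"]:
            return state["last_id"], calls
```

Because the only state carried between calls is `last_id`, a retried or repeated invocation just picks up from the same place.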
We use CloudWatch Logs Insights to search all the scripts’ stdout and list how long each table took to copy. We can rebalance the table assignments if needed.
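Rebalancing can be done by hand, but one possible way to automate it is a greedy "longest table first" assignment, sketched below. This is a generic heuristic, not necessarily what we run, and the table names and durations in the usage note are made up.

```python
import heapq


def rebalance(durations, job_count):
    """Assign tables to jobs, always giving the next-longest table
    to the job with the smallest total so far."""
    heap = [(0.0, index) for index in range(job_count)]  # (total seconds, job)
    heapq.heapify(heap)
    groups = [[] for _ in range(job_count)]
    # Longest first; ties broken by name so the result is deterministic.
    ordered = sorted(durations.items(), key=lambda item: (-item[1], item[0]))
    for table, seconds in ordered:
        total, index = heapq.heappop(heap)
        groups[index].append(table)
        heapq.heappush(heap, (total + seconds, index))
    return groups
```

For example, `rebalance({"orders": 600, "events": 500, "users": 300, "items": 200, "tags": 100}, 2)` puts `orders`, `items`, and `tags` in one job (900 seconds) and `events` and `users` in the other (800 seconds).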
The weakest link here is GitHub. If it’s down, your job can’t start. The next weak link is the CodeBuild-provided Docker image. We rely on that not breaking. I’d estimate the reliability of this solution is 99%, and we can do better. And last I checked, CodeBuild is 3x the cost of an EC2 c4 instance.
To improve reliability, have CodeBuild launch a custom Docker image from ECR, with no source code download. It’s super easy: a vanilla image is fine and no agent is required.
To save money, consider AWS Batch. Like CodeBuild, it is easy to configure, will launch your Docker image into a VPC environment of your choice, can be orchestrated by Step Functions, and sends its logs to CloudWatch. There is no extra charge: you only pay for EC2 usage, and you pick exactly what instance type(s) will run. I don’t see a reason to use Fargate other than faster startup; it’s more expensive and no easier to configure.