How We Used AWS SQS to Optimize Our Integration Jobs

Christopher Malloy
Published in 7Factor Software
Jul 28, 2021

When building applications, it’s not often that you have to be concerned with memory optimization and the size of incoming data. So when the challenge arises, it’s exciting, but it can also be a bit daunting simply for lack of practice. Once you understand the resource needs of the app, or at least the fact that you have to account for them, you can begin architecting a solution for the problem at hand. I’d like to walk you through a story of how we ran into such a resource problem, how we solved it, and how our implementation ultimately played out. The journey takes us from an unstable ETL job that hit frequent out-of-memory errors and took over two hours to run, to a stable 20-minute process.

At one of our healthcare clients, we needed to build a process that would ingest a CSV file, perform some operations on the data, and send the processed data to a third-party data source. Essentially, ETL. As we did our initial build-out, we were testing with CSV files that had anywhere from a few dozen to a few hundred rows. No big deal. We were chugging along writing the code without a care in the world. The simplified deployment of this job was a Docker container hosted on AWS ECS behind a load balancer, intended to be a hot-and-running, API-based job. Whenever the job needed to be kicked off, an API route was called and the job would do its thing.
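For illustration, the original design looked roughly like the sketch below. This is a minimal reconstruction, not the client’s actual code: the web framework, route, and destination URL are all assumptions on my part. The detail that matters is that the entire CSV is read and pushed to the destination inside a single API request.

```python
# Minimal sketch of the original API-triggered design (illustrative only).
import csv
import io

import requests
from flask import Flask, request

app = Flask(__name__)

# Hypothetical third-party endpoint that receives processed transactions.
DESTINATION_URL = "https://thirdparty.example.com/payroll/transactions"


def transform(row):
    # Placeholder for the real per-row processing logic.
    return row


@app.route("/jobs/payroll", methods=["POST"])
def run_payroll_job():
    # The whole CSV is read into memory at once -- fine for a few hundred
    # rows, but a problem for production-sized files.
    text = io.TextIOWrapper(request.files["file"].stream, encoding="utf-8")
    rows = list(csv.DictReader(text))

    # Each processed row is uploaded one transaction at a time.
    for row in rows:
        requests.post(DESTINATION_URL, json=transform(row), timeout=30)

    return {"processed": len(rows)}, 200
```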

As we began our first test runs with the files we would be using in production, we started to notice 503 errors coming back from the API. This had not happened a single time during months of local testing and only started once the job was deployed to ECS, so we were puzzled. After some investigation, we discovered that the ECS container agent was reporting an out-of-memory exception and spinning down the container. This meant the request to the container would never complete, and the load balancer would return a 503 Service Unavailable.

This is obviously bad, but the reasons go deeper than the obvious “it’s broken.” It meant that only a subset of the data would get loaded to the third-party destination, and the jobs were not designed to re-run without re-sending data. Re-running would result in duplicate entries in the target data source, and since this was related to the payroll division of the company, that meant some employees’ paychecks were now incorrectly doubled. A single file could cover tens of thousands of employees and hundreds of thousands of dollars for the company. While I’m sure the employees would have been happy to receive double pay, this was obviously a massively destructive bug in the operation that we needed to resolve ASAP.

We certainly needed to work on memory optimization in the job itself, but here I want to focus on how we solved the duplicate entry problem. Given the load these containers would be under, there was always the possibility that one of them would fall over. Furthermore, since we had to upload one transaction at a time, there was always a chance the request/response cycle would break partway through, causing a transaction to fail to load.

While we would like to say we could code these failure modes away, hope is not a good development strategy. The solution was to store the data to be sent in an intermediary holding location and then forward it to its destination from there. Ideally, multiple containers could process the data in that holding area, but we would have to ensure that no two containers would ever pick up and send the same transaction; otherwise we would be right back at the duplicate entry problem.

So those are our requirements; now to pick the tooling. A MySQL database, or maybe even a file storage location, may seem like the obvious first solution, but neither is ideal: both are better suited to permanent storage, and what we really need here is just a temporary holding location. Not to mention we need a way to keep a record of processed transactions, and that is usually implemented with a lock table in a database somewhere, which is cumbersome: a whole table has to be built and maintained, and it is another integration point that could fail. So what would solve all of these problems at once?

Enter Simple Queue Service, or SQS. A queue is the perfect solution to our problem. It is by design a temporary storage solution that can hold messages up to 256 KB in size. And it addresses the duplicate transaction problem by the nature of how queue consumption works: each instance of the job pulls from the queue, and while a message is in flight it is hidden from the other consumers by the visibility timeout. It is effectively locked at that point. From here, two things can happen: either the instance successfully processes the message and deletes it permanently from the queue, or it fails for whatever reason and the message becomes visible in the queue again.
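To make this concrete, here is a rough sketch of the producer side with boto3. The queue URL, region, and message shape are placeholders rather than our real configuration; the point is simply that each processed row becomes its own small message, comfortably under the 256 KB limit.

```python
# Sketch of the producer side: each CSV row becomes one SQS message.
import csv
import json

import boto3

sqs = boto3.client("sqs", region_name="us-east-1")

# Hypothetical queue URL.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/payroll-transactions"


def enqueue_transactions(csv_path):
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            sqs.send_message(
                QueueUrl=QUEUE_URL,
                MessageBody=json.dumps(row),
            )
```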

That failure behavior is very special and is the crux of why a queue is so valuable: it gives us a built-in retry mechanism at the infrastructure level. The job itself is totally unaware that other copies of itself are running alongside it. It is also totally unaware of how many messages it will end up processing, or which ones. It doesn’t matter. Each instance simply needs to exist and churn away, pulling any messages it finds, until there are none left to process, at which point the job containers go back to lying dormant, periodically asking SQS whether there are any more messages in the queue. This solution is bulletproof against containers crashing out for whatever reason, as well as against the final destination becoming unreachable.
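Here is a hedged sketch of the worker loop each container might run, again with illustrative names: long polling keeps idle containers cheap, the visibility timeout is what “locks” an in-flight message, and only a successful upload deletes the message for good.

```python
# Sketch of a worker loop. Failed messages are simply not deleted, so they
# reappear after the visibility timeout and another worker retries them.
import json

import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/payroll-transactions"  # hypothetical


def send_to_destination(transaction):
    # Placeholder for the real third-party upload.
    ...


def poll_forever():
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=20,       # long polling: idle workers just wait
            VisibilityTimeout=120,    # hide in-flight messages from other workers
        )
        for msg in resp.get("Messages", []):
            try:
                send_to_destination(json.loads(msg["Body"]))
            except Exception:
                # Leave the message alone; it becomes visible again after
                # the visibility timeout and will be retried.
                continue
            # Only a successful send removes the message permanently.
            sqs.delete_message(
                QueueUrl=QUEUE_URL,
                ReceiptHandle=msg["ReceiptHandle"],
            )
```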

Besides drastically stabilizing the system, this solution reduced our average runtime from two hours to 20 minutes. And when we only have one day to send all of the payroll data over to the client so they can have checks cut and ready for the next day, every minute counts.

The conclusion: a queue is often a fantastic tool for decoupling your architecture, reducing bottlenecks, points of failure, and dependencies. Just as you might write an abstraction at the software level, a queue acts almost like an abstraction at the infrastructure level. The next time you have microservices communicating with other data sources or ingest points in a system, consider spreading the load with SQS.
