Background Job Processing at Scale Using AWS Lambda and SQS

Dhaval Chaudhary
Fasal Engineering
Apr 23, 2021 · 6 min read

[Cover image source: Amazon AWS]

At Fasal we run multiple background jobs that process millions of data points coming from thousands of IoT devices deployed on farms. These jobs handle micro-climate prediction, disease prediction, pest prediction, generating and sending alerts, generating and sending weekly reports, and so on.

These are mission-critical jobs for us, and we can’t afford for them to go wrong at any time. We have always built systems that are fault-tolerant, scalable, reliable, and have high throughput. When we started in early 2018, we implemented a worker-based approach in which workers ran background jobs in coordination with the other workers running in the system. This approach worked for us for some time, but with scale we faced certain challenges —

  1. Uneven work distribution among workers.
  2. No separation of environments for multiple jobs running on the same worker machine, so if one job’s worker goes down, the other jobs running on that machine are impacted.
  3. Limited parallelism. Every job has to wait for its turn to execute.
  4. A high cost burden, because the workers run all the time.
  5. If a single job changes, the whole worker node has to be redeployed; there is no way to update a single job individually.
  6. Monitoring was tough, and so on…

So we decided to move away from this approach and build something that gives us separation of concerns, parallelism, easy monitoring, and cost-effectiveness.

The New Approach

In the new approach, we divide the background jobs into 3 core components —

  1. A scheduler
  2. A headless compute worker and a queue
  3. A DB connection manager

Let’s talk more about each of these components.

[Figure: Overview of the full system]

Scheduler

The only job of the scheduler is to generate events and push them to SQS.

A scheduler can be implemented in any programming language with any framework of your choice, but you should take the following into consideration —

  1. Timezone support: Scheduling should be easy, with timezone support, and machine-independent. For example, say you want to send a daily report of the app to all users at 6 AM: scheduling jobs that run at 6 AM in a particular timezone (for every supported timezone) should be easy no matter where the scheduler itself is deployed. Our scheduler runs in us-east-1 (N. Virginia) but can still schedule a job to fire at exactly 6 AM IST (Asia/Kolkata). A minimal sketch follows this list.
  2. Job-specific filters: The scheduler should do all the filtering before generating and pushing events to SQS. In the example above, we want the report to be sent only to paid customers, so the scheduler should filter the paid-user events out of all events and push only those to SQS. The headless workers should simply take each event and process it as it is sent; all business-logic filtering should be handled by the scheduler.
  3. Job scheduling monitoring: You should have proper monitoring in place so you know whether a particular scheduling task completed as expected. A proper logging mechanism helps here, but a dedicated way to check completion is a good feature to have. For example, if you have 6 AM reporting tasks for 5 different timezones, you want to be able to see whether all 5 tasks completed successfully.
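
As an illustration of points 1 and 2, here is a minimal sketch of such a scheduler in TypeScript (our workers are Node-based). It assumes node-cron for timezone-aware scheduling and AWS SDK v3 for SQS; the queue URL, the getPaidUsers() helper, and the event shape are hypothetical placeholders, not our actual implementation.

```typescript
import cron from "node-cron";
import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs";

const sqs = new SQSClient({ region: "us-east-1" });

// Hypothetical queue URL; replace with your own.
const QUEUE_URL =
  "https://sqs.us-east-1.amazonaws.com/123456789012/daily-report-jobs";

// Stub for illustration: in practice this would query your DB and
// apply the business-logic filter (paid customers only).
async function getPaidUsers(timezone: string): Promise<{ id: string }[]> {
  return [];
}

// Fires at 6 AM in the target timezone, no matter which timezone the
// scheduler machine itself runs in (e.g. us-east-1).
cron.schedule(
  "0 6 * * *",
  async () => {
    const paidUsers = await getPaidUsers("Asia/Kolkata");
    for (const user of paidUsers) {
      await sqs.send(
        new SendMessageCommand({
          QueueUrl: QUEUE_URL,
          MessageBody: JSON.stringify({
            job: "daily-report",
            userId: user.id,
            timezone: "Asia/Kolkata",
          }),
        })
      );
    }
  },
  { timezone: "Asia/Kolkata" }
);
```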

A headless compute worker and a queue (AWS Lambda and SQS in our case)

This is the main part, where all your business logic resides: how each event will be processed and all of the data-processing logic.

This Lambda worker is triggered as soon as an event is available in a particular SQS queue. We use the serverless framework to implement our headless compute workers. The worker can be written in any language or framework of your choice; for language support, check your cloud provider’s documentation.
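
As a rough illustration (not our actual worker), a Node/TypeScript handler for such an SQS-triggered Lambda can look like the sketch below; the event shape matches the hypothetical scheduler event above.

```typescript
import type { SQSEvent } from "aws-lambda";

// Minimal SQS-triggered Lambda: one invocation receives a batch of
// messages that the scheduler pushed to the queue.
export const handler = async (event: SQSEvent): Promise<void> => {
  for (const record of event.Records) {
    const job = JSON.parse(record.body) as {
      job: string;
      userId: string;
      timezone: string;
    };

    // All the business logic for the event lives here, e.g. build and
    // send the daily report for job.userId.
    console.log(`Processing ${job.job} for user ${job.userId}`);
  }
};
```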

Take care of the key considerations below if you are going with the AWS Lambda and SQS approach —

  1. Timeout and VisibilityTimeout: Both values depend entirely on your business logic, but having a fair idea of them up front is necessary. AWS Lambda can’t support long-running tasks, so headless workers should be short-running jobs. (The queue-level settings are sketched after this list.)
  2. Good batching strategy: A Lambda can batch-process events from SQS, but a batch of messages succeeds or fails together, so it’s good to have a failure-handling mechanism. We use https://www.npmjs.com/package/@middy/sqs-partial-batch-failure in our Node-based workers (see the sketch after this list). It lets us re-queue only the failed events and remove all processed events, so event duplication never happens. AWS Lambda also supports batch windows of up to 5 minutes for functions with Amazon SQS as an event source, which helps with cost optimization when your events are not time-bound or time-sensitive.
  3. Reserved concurrency: This number controls the parallelism of your workers and lets you reserve a number of Lambda executions for a particular task, because each AWS account has a limited concurrency quota (1,000 to start with). Lambda workers will be throttled if parallel executions go beyond that; you can reach out to AWS support to increase the quota for your account. This setting is also useful when you don’t want downstream services like your DB or API server to get flooded: Lambda will scale out rapidly if events are available in the queue and there is no restriction on concurrency.
  4. Memory size: This value depends entirely on the kind of worker you are writing. You can always optimize your cost by adjusting the memory size. Sometimes a higher value has a lower cost attached to it because the execution finishes much faster, so choose carefully.
  5. Max Receive Count: This configuration decides how many times an event is re-queued if it fails or is throttled while processing. If the receive count exceeds this number, the message is automatically sent to a dead-letter queue, where you can process it manually later if you want.
  6. Message Retention Period: This configuration sets the TTL of messages in the queue.
  7. Monitoring: Monitoring becomes crucial when so many Lambda functions are executing in parallel. We use a tool called Lumigo, which helps us monitor metrics like failures, the full event-processing timeline, cold starts, maximum invoked functions, failure rate, time taken by functions, latency of external services, and so on.
  8. Logging: We use Elasticsearch (ELK) for our logging needs and push all logs to Elasticsearch, where they are indexed and can be queried. For shipping logs to Elasticsearch we have another worker that is triggered by CloudWatch. You can follow a similar approach and ship your logs wherever you want.
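
To make point 2 concrete, here is a sketch of the partial-batch-failure pattern with @middy/sqs-partial-batch-failure, following the middleware’s documented usage (the exact API may differ between versions): the handler returns one settled promise per record, and the middleware ensures that only the rejected records are retried.

```typescript
import middy from "@middy/core";
import sqsPartialBatchFailure from "@middy/sqs-partial-batch-failure";
import type { SQSEvent } from "aws-lambda";

// Per-message business logic; throwing here marks the record as failed.
async function processRecord(body: string): Promise<void> {
  // ...
}

const baseHandler = (event: SQSEvent) =>
  // One settled promise per record: fulfilled records count as processed,
  // rejected records are left for SQS to redeliver.
  Promise.allSettled(event.Records.map((r) => processRecord(r.body)));

export const handler = middy(baseHandler).use(sqsPartialBatchFailure());
```

Points 1, 5, and 6 all live in the queue configuration. The sketch below shows the same settings via AWS SDK v3; in practice you would more likely declare them in your serverless framework config, and the names, ARN, and values here are placeholders only.

```typescript
import { SQSClient, CreateQueueCommand } from "@aws-sdk/client-sqs";

const sqsAdmin = new SQSClient({ region: "us-east-1" });

export async function createJobQueue(): Promise<void> {
  await sqsAdmin.send(
    new CreateQueueCommand({
      QueueName: "daily-report-jobs",
      Attributes: {
        // Keep this comfortably above the Lambda timeout so in-flight
        // messages are not redelivered while still being processed.
        VisibilityTimeout: "120",
        // TTL of messages in the queue (1 day, in seconds).
        MessageRetentionPeriod: "86400",
        // After 3 failed receives, move the message to the dead-letter queue.
        RedrivePolicy: JSON.stringify({
          deadLetterTargetArn:
            "arn:aws:sqs:us-east-1:123456789012:daily-report-jobs-dlq",
          maxReceiveCount: "3",
        }),
      },
    })
  );
}
```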

DB Connection Manager

The idea of Lambda is fascinating until you flood your downstream services, like your DB, with lots of connections or API requests. To take care of this issue, we had to design a reliable DB connection manager. The job of the connection manager is to protect your DB from having too many open connections from Lambda workers: all the Lambda functions query the DB via this connection manager, and it maintains the connection pool for the database. In our case, we did not want to build our own connection-management utility, so we went ahead and used the service provided by Atlas Realm. This protects our DB from being flooded with open connections.
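
Since we rely on Atlas Realm, the code below is not what we run; it is only a sketch of a common, complementary mitigation: caching a MongoDB client in module scope so that warm Lambda invocations reuse one connection instead of opening a new one per invocation. The URI, database name, and pool size are placeholders.

```typescript
import { MongoClient } from "mongodb";

// Kept in module scope: survives across warm invocations of the same
// Lambda execution environment, so we don't reconnect on every event.
let cachedClient: MongoClient | null = null;

export async function getDb() {
  if (!cachedClient) {
    // Placeholder URI; a tiny pool is enough because one Lambda
    // environment handles only one invocation at a time.
    cachedClient = await MongoClient.connect(process.env.MONGODB_URI!, {
      maxPoolSize: 1,
    });
  }
  return cachedClient.db("reports"); // placeholder database name
}
```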

Advantages of This Approach

  1. Scheduling and filtering are totally separate from processing.
  2. The parallelism achieved is way beyond what you can get with a worker-based approach.
  3. Separation of concerns: each event is processed individually and has a separate environment to run in, with the resources (CPU and RAM) it needs.
  4. Each worker can be deployed individually.
  5. Retry mechanism and failure handling with the dead-letter queue.
  6. Monitoring will notify you of even a single event failure.
  7. Huge cost benefits.
