Using AWS SQS over Kinesis for Asynchronous Data Processing

Parag Kadu
Published in VMware 360
Sep 3, 2021 · 4 min read

VMware Workspace ONE Intelligence is a SaaS-based service that provides reporting, analytics, data visualization, and workflow orchestration for VMware's suite of End User Computing products. It receives data from a variety of sources, including our UEM (Unified Endpoint Management) platform (VMware Workspace ONE), mobile SDKs, Windows agents, and Workspace ONE Trust Network partners. The Workspace ONE Intelligence platform comprises numerous microservices and Lambda functions in addition to a number of AWS-managed services. We have a highly distributed team and deployments in regions around the world.

One of the common use cases on our platform is to consume a high volume of analytics data and reliably process it asynchronously, sometimes by multiple consumers. Not losing data is critical. This article describes why we chose AWS SQS over Kinesis for this.

Amazon Kinesis is built for real-time processing of streaming big data. It provides record ordering and the ability to read and/or replay records, with partitioning across multiple consumers.

Amazon Simple Queue Service (SQS) offers a reliable way to move data between distributed application components and helps build applications in which messages are processed independently (with message-level ack/fail semantics).

At first glance, Kinesis looked like a perfect match for our use case. But when we dug a little deeper, we found certain drawbacks with Kinesis.

Kinesis Shortcomings

Kinesis does not support auto-scaling; it is up to the application developer to track shard usage and re-shard the stream when necessary. Kinesis also requires complicated producer/consumer libraries and a DynamoDB table for consumer state management, increasing development, deployment, and maintenance costs. But the main pain point, and a dealbreaker for us, was Kinesis checkpointing.

Kinesis checkpointing is the mechanism that marks a position in the stream and tracks how far the consumer has read. The Kinesis Client Library (KCL) accomplishes this by storing consumer metadata in a DynamoDB table. A Kinesis checkpoint is sequential, which works well for synchronous processing of records. But when it comes to asynchronous processing, checkpointing doesn't allow records to be independently marked as succeeded or failed. For example, suppose an application receives three records with sequence numbers 1, 2, 3, and record 3 is processed before 1 and 2. We can't selectively mark record 3 as completed: checkpointing at 3 marks everything before it as complete as well.
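
To make the limitation concrete, here is a minimal sketch in Python. The checkpoint function is a stand-in that models the KCL's sequential semantics, not the real KCL API:

```python
# Models KCL-style sequential checkpointing; checkpoint() here is an
# illustration of the semantics, not the real KCL API.
checkpointed = 0  # highest sequence number durably marked complete

def checkpoint(seq):
    """Persist 'everything up to and including seq is processed'."""
    global checkpointed
    checkpointed = seq

completed = {3}  # record 3 finished first; records 1 and 2 are still in flight

# The only safe checkpoint is the highest contiguous prefix of completed
# records. Since 1 and 2 are unfinished, record 3's completion cannot be
# recorded: checkpoint(3) would wrongly mark 1 and 2 as done too.
safe = 0
while safe + 1 in completed:
    safe += 1
checkpoint(safe)  # checkpoints 0, i.e. nothing; record 3 is re-read on restart
```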

Using SQS

SQS is simple to use. In contrast to Kinesis, we do not need any special libraries to read from or write to an SQS queue. It scales to handle a large volume of messages without user intervention, and it allows us to process messages asynchronously and delete them independently.
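
For illustration, the core consume loop is just a few boto3 calls. The queue URL and handler below are placeholders, not our actual service code:

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-west-2.amazonaws.com/123456789012/analytics-queue"  # placeholder

def handle(body):
    ...  # application-specific processing

while True:
    # Long-poll for up to 10 messages at a time.
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,
    )
    for msg in resp.get("Messages", []):
        try:
            handle(msg["Body"])
            # Success: acknowledge by deleting this one message, independently
            # of any other message in flight.
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
        except Exception:
            # Failure: leave the message alone; it becomes visible again after
            # the visibility timeout and is redelivered.
            pass
```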

But SQS does not support multiple consumers reading the same messages from the same queue. To support multiple consumers, we implemented the following solution.

Multiple Consumers

We use Firehose to write batched records from the data ingestion source to S3 in a desired format such as JSON. The S3 bucket is configured to send object-created event notifications to SNS. The reason to use SNS rather than SQS directly is to fan out the notification to multiple microservices. Each application receives the S3 notification via an SQS queue subscribed to the SNS topic, as sketched in the diagram and code below.

SQS Multiple Consumers
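
Here is a sketch of how the pieces could be wired with boto3. The bucket, topic, and queue names are made up, and the SNS topic policy and SQS queue policy that authorize S3-to-SNS and SNS-to-SQS delivery are omitted for brevity:

```python
import boto3

s3 = boto3.client("s3")
sns = boto3.client("sns")

TOPIC_ARN = "arn:aws:sns:us-west-2:123456789012:analytics-object-created"  # placeholder
QUEUE_ARN = "arn:aws:sqs:us-west-2:123456789012:consumer-a-queue"          # placeholder

# 1. The S3 bucket publishes object-created events to the SNS topic.
s3.put_bucket_notification_configuration(
    Bucket="analytics-firehose-bucket",  # placeholder
    NotificationConfiguration={
        "TopicConfigurations": [
            {"TopicArn": TOPIC_ARN, "Events": ["s3:ObjectCreated:*"]}
        ]
    },
)

# 2. Each consuming microservice subscribes its own SQS queue to the topic,
#    so a single S3 event fans out to every consumer.
sns.subscribe(
    TopicArn=TOPIC_ARN,
    Protocol="sqs",
    Endpoint=QUEUE_ARN,
    Attributes={"RawMessageDelivery": "true"},  # deliver the S3 event unwrapped
)
```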

Reliability

To ensure that no record is lost, we track every record for completion; a message is deleted only after all of its records have been processed successfully. We created an S3 SQS Processor library to manage SQS messages, download and clean up S3 files, track in-flight messages and records, and dynamically extend the visibility timeout.
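
The library itself is internal, but its core idea can be sketched as follows. The record handler, thread structure, and timeout values are placeholders, not the library's actual implementation:

```python
import threading
import boto3

sqs = boto3.client("sqs")

def process_record(record):
    ...  # placeholder for application business logic

def process_message(queue_url, msg, records):
    """Delete the SQS message only after every record in it succeeds,
    extending the visibility timeout while records are still in flight."""
    pending = set(range(len(records)))
    done = threading.Event()

    def heartbeat():
        # Keep pushing the visibility timeout out while work remains, so the
        # message is not redelivered to another consumer mid-processing.
        while not done.wait(timeout=60):
            sqs.change_message_visibility(
                QueueUrl=queue_url,
                ReceiptHandle=msg["ReceiptHandle"],
                VisibilityTimeout=120,
            )

    threading.Thread(target=heartbeat, daemon=True).start()
    try:
        for i, record in enumerate(records):
            process_record(record)
            pending.discard(i)  # track per-record completion
    except Exception:
        pass  # a record failed; fall through with it still pending
    finally:
        done.set()

    if not pending:
        # All records succeeded: safe to delete (ack) the whole message.
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
    # Otherwise leave the message; it reappears after the visibility timeout.
```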

Let us look at both the success and failure scenarios. In the sequence diagrams below, the S3 SQS Processor is the common library we created, and the Record Processor represents the application's business-logic processing, which can involve multiple processors.

In the first diagram, all four records from message 1 complete successfully, and then message 1 is deleted.

Success Case

In the second diagram, record 3 fails, and message 1 is retried after the visibility timeout expires. This also means the three successful records will be re-processed, so the application must be idempotent.

Failure Case (Retry — Record 3 failed)
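
One common way to get that idempotency is to deduplicate on a stable record ID before applying side effects. A minimal sketch, assuming records carry a unique id field and that the seen-set would be backed by a durable store in practice:

```python
processed_ids = set()  # in practice a durable store, e.g. a database table

def apply_business_logic(record):
    ...  # placeholder for the actual side effects

def process_record_idempotently(record):
    record_id = record["id"]  # assumes each record has a stable unique ID
    if record_id in processed_ids:
        return  # already handled on an earlier delivery; skip side effects
    apply_business_logic(record)
    processed_ids.add(record_id)
```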

Such granular record-completion tracking and retrying is not possible with Kinesis stream checkpointing.

Conclusion

With SQS we have a scalable, reliable, and comparatively less complex solution for our data-pipeline needs. We have been using it successfully for years across multiple microservices. And to be clear, we still use Kinesis for cases where replay, ordering, and partitioning are requirements.
