Removing a Bottleneck with SNS/SQS Fan-Out

Laurent Desiree
Sainsbury’s Tech Engineering
5 min read · Aug 26, 2022

Managing customers' online orders requires processing a stream of events. With each order generating at least five different events, and 100K orders processed daily, the total number of events processed during a day can reach upwards of 600K. Add to this the requirements that messages be replayable and queued, that events have multiple consumers, and that data be replicated across multiple Availability Zones, and an AWS-managed ActiveMQ bus can seem like the right option to fulfil them.

And a pandemic later…

At the height of the pandemic, we saw, like many in the industry, an increase in online orders. With order numbers beyond projections, we were often nervous about how our message broker would cope with the volume of events and how it would perform during an outage. Losing messages during an outage would add insult to injury. And volume is not the only constraint we have to work with: the proposition is evolving, meaning more events to process and more pressure on our rusting message bus.

What if we do nothing?

The worst option, obviously, and not one we were going to take. Unfortunately, we had a preview of what to expect if the message broker failed: during one outage, we watched the message bus crumble. Because of the way ActiveMQ enqueues messages in broker storage, recovering from a failure (restarting the broker and reloading all of its messages) can take up to a day, putting us at risk of delivering orders without taking payment for them or, worse, losing customers' orders. It was time to take action before it was too late.

We drove the bus, we drive our decisions.

The team was in charge of choosing the technology to replace the crumbling message broker, and there was no shortage of options. Replacing a message broker is not a decision taken lightly, so before settling on a solution, the team went through due diligence to ensure a great future for the platform. Kafka, EventBridge, RabbitMQ, Kinesis Streams, … we decided to go with AWS SNS topics and SQS queues. Although each available solution had good reasons to be chosen, we decided to implement a fan-out pattern, with many SQS queues subscribing to a single SNS topic. The events we process are routed to a queue by a filter policy driven by a message attribute.
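As a rough sketch of that wiring with boto3 (the topic, queue, and attribute names here are illustrative, not our actual production values), a queue subscribes to the topic with a filter policy, and producers publish once with a message attribute that drives the routing:

```python
import json


def build_filter_policy(*event_types: str) -> dict:
    """SNS filter policy: deliver only messages whose 'event_type'
    message attribute matches one of the given values."""
    return {"event_type": list(event_types)}


def wire_fanout() -> None:
    # boto3 imported here so the pure helper above works without AWS;
    # calling this function requires configured AWS credentials.
    import boto3

    sns = boto3.client("sns")
    sqs = boto3.client("sqs")

    topic_arn = sns.create_topic(Name="order-events")["TopicArn"]
    queue_url = sqs.create_queue(QueueName="payment-events")["QueueUrl"]
    queue_arn = sqs.get_queue_attributes(
        QueueUrl=queue_url, AttributeNames=["QueueArn"]
    )["Attributes"]["QueueArn"]

    # Subscribe the queue to the topic; the filter policy means this
    # consumer only receives the events it cares about. (A queue policy
    # granting SNS sqs:SendMessage is also needed; omitted here.)
    sns.subscribe(
        TopicArn=topic_arn,
        Protocol="sqs",
        Endpoint=queue_arn,
        Attributes={
            "FilterPolicy": json.dumps(build_filter_policy("ORDER_PAID")),
            "RawMessageDelivery": "true",
        },
    )

    # Producers publish once; SNS fans out to every matching queue.
    sns.publish(
        TopicArn=topic_arn,
        Message=json.dumps({"orderId": "123"}),
        MessageAttributes={
            "event_type": {"DataType": "String", "StringValue": "ORDER_PAID"}
        },
    )
```

With one subscription per consumer, adding a new consumer is just a new queue and a new filter policy; the producer side does not change.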

Why choose SNS/SQS?

With this approach, many of the blockers faced with ActiveMQ become a thing of the past.

  • Unlimited storage
  • Up to 120,000 in-flight messages per queue
  • Reduced cost
  • Good AWS native integration
  • Independently scalable
  • Integration with API Gateway
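The independent scalability comes from each service draining its own queue at its own pace. A minimal long-polling consumer sketch (the queue URL and handler here are hypothetical, not our actual service code):

```python
import json


def parse_order_event(body: str) -> dict:
    """Decode a raw-delivered SQS message body into an event dict."""
    return json.loads(body)


def consume(queue_url: str, handler) -> None:
    # boto3 imported here so the pure helper above works without AWS;
    # running this loop requires configured AWS credentials.
    import boto3

    sqs = boto3.client("sqs")
    while True:
        # Long polling (WaitTimeSeconds) cuts down empty receives and cost.
        resp = sqs.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=20,
        )
        for msg in resp.get("Messages", []):
            handler(parse_order_event(msg["Body"]))
            # Delete only after successful handling; otherwise the message
            # becomes visible again after the visibility timeout and is
            # retried, which is what makes the queue replay-friendly.
            sqs.delete_message(
                QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"]
            )
```

Scaling a slow consumer is then simply a matter of running more instances of this loop against the same queue, without touching the producer or any other consumer.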

Switch from AMQ to SNS/SQS

SNS/SQS Fanout

The decision was made, and the message broker was to be replaced, but this would require some changes in our services. We decided as a team to implement it as a mob. The changes span a large scope, from infrastructure to several code bases, making this the perfect exercise for knowledge sharing with the newer members of the team. So we mobbed on the migration, sharing documentation and best practices along the way. The objective was to make sure anyone could pick up the migration at any point, and that anyone involved would be able to pass on the same knowledge if such a migration were to happen again.

All this is a lot of theory, but what about the result?

To measure our improvement, we needed a benchmark. Gatling has been part of our tooling from the start, and with it we can run a load test with the production volume of orders in our staging environment and study the results.

From the AWS SNS specifications, we expected quicker responses from our service, comfortably within the 8-second SLA agreed with the external producer… and that is what we got.

Gatling benchmark with production volume.

Here we can see in our message broker benchmark against 150K orders that one order request failed, taking more than 13 seconds to respond, and 99% of the requests responded in 2 seconds or less.

Gatling test result for SNS/SQS implementation with production volume.

Here we can see in our test for SNS/SQS implementation against 150K orders that all order requests were processed in less than 1.53 seconds, without any failed requests, and 99% of requests responded in 67ms or less.

Another useful tool at our disposal is Grafana. With it, we can see how the volume of orders was processed over time in both our API Gateway and our consumer, the Customer Order Service.

Benchmark dashboard. Top panel: API Gateway processed order. Bottom panel: Customer Order Service consumed orders.

From this panel, we can see how long it took to process the same volume of orders before and after the change.

Post-implementation dashboard. Top panel: API Gateway processed order. Bottom panel: Customer Order Service consumed orders.

The results validate our hypotheses and give us confidence that the change will deliver better performance.

Before and After

The real test remains in production. A deployment plan, a production release, and a good night's sleep later, here it is.

Production dashboard before SNS/SQS. Top panel: API Gateway processed order. Bottom panel: Customer Order Service consumed orders.
Production dashboard after SNS/SQS. Top panel: API Gateway processed order. Bottom panel: Customer Order Service consumed orders.

The surgeons did an excellent job on this dashboard's plastic surgery. Not only did our API Gateway processing rate improve, but we are also consuming messages at a higher rate, making better use of the resources initially allocated to our consumers.

What's next?

There is always room for improvement. Using SNS means we can simplify the integration with the order producer and replace our API Gateway service with an AWS API Gateway, without having to worry about scaling any more.
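One way such a direct integration can look is an API Gateway method publishing straight to SNS through an AWS service integration. The OpenAPI fragment below is a hypothetical sketch only; region, account IDs, ARNs, and the IAM role are placeholders, not our actual configuration:

```yaml
# Hypothetical fragment: API Gateway method publishing directly to SNS,
# removing the need for a custom gateway service in between.
paths:
  /orders:
    post:
      x-amazon-apigateway-integration:
        type: aws
        httpMethod: POST
        # Service integration URI targeting the SNS Publish action.
        uri: arn:aws:apigateway:eu-west-1:sns:action/Publish
        # Execution role allowing API Gateway to call sns:Publish.
        credentials: arn:aws:iam::123456789012:role/apigw-sns-publish
        requestParameters:
          # The SNS API expects form-encoded parameters.
          integration.request.header.Content-Type: "'application/x-www-form-urlencoded'"
        requestTemplates:
          application/json: >-
            Action=Publish&TopicArn=$util.urlEncode('arn:aws:sns:eu-west-1:123456789012:order-events')&Message=$util.urlEncode($input.body)
        responses:
          default:
            statusCode: "200"
      responses:
        "200":
          description: Event accepted by SNS
```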
