SQS throughput over HTTPS with Elixir

Bob Stockdale
Zappos Engineering
Jan 15, 2016 · 4 min read

[cross-posted from http://coderstocks.blogspot.com/]

We have a project that aggregates client request latency and error information. The client reports are passed over SQS to our Elixir-based processing system, which aggregates the data and provides detailed information about latency and errors across clients.

When we first turned the system on, in shadow testing mode, the processing server was able to handle the client traffic, about 50 messages/second with 10 SQS worker processes.

During Black Friday, still in shadow testing mode, there was an increase in messages to approximately 150 messages/second, which caused the processing server to fall behind.

Initially we tried increasing the number of SQS workers to handle the increase in message traffic. This had no effect on the number of messages being processed and eventually caused connection errors to SQS.

Using ex_top, we were able to introspect the Erlang VM and see that ssl_manager’s message queue was hovering in the 200s — every SQS worker was funneling its TLS traffic through that single process, which appeared to be the bottleneck. As a stopgap, we switched our SQS communication from HTTPS to HTTP. After making the configuration change, we were able to drop down to 2 SQS worker processes, which easily handled the 150 messages/second of traffic.
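For the curious, the signal ex_top surfaced can also be read directly with Process.info/2. A minimal sketch, run from an IEx shell; it assumes the :ssl application's manager process is registered under its usual :ssl_manager name (older OTP releases), and falls back to nil otherwise:

```elixir
# Read the message-queue depth of OTP's ssl_manager process -- the same
# number ex_top showed hovering in the 200s. Guard against the process
# not being registered (e.g. :ssl not started, or a newer OTP release
# that renamed it).
{:ok, _} = Application.ensure_all_started(:ssl)

queue_len =
  case Process.whereis(:ssl_manager) do
    nil ->
      nil

    pid ->
      {:message_queue_len, len} = Process.info(pid, :message_queue_len)
      len
  end

IO.inspect(queue_len, label: "ssl_manager message_queue_len")
```

A steadily growing (or persistently large) message_queue_len on a singleton process like this is a classic sign that every other process is serializing work through it.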

The graph below shows the messages being published and delivered via SNS, which pipes into our SQS queues (one for east and one for west).

These next graphs show the number of messages that were sent and received for each queue. The large spike marks the point at which we changed from HTTPS to HTTP for SQS communication. Notice how the number of messages received was consistently lower than the number sent until the change. At the time of the change, the number received immediately caught up to the number sent.

The graph below shows the number of available messages in each of the queues, ready to be picked up by a worker process. Notice how it flatlines at zero after we switched from HTTPS to HTTP.

This final graph shows the CPU usage on each of the processing servers. Before the change, CPU usage hovered between 50% and 75%. After the change, CPU usage stayed consistently between 13% and 15%.

Not being satisfied with “we must use HTTP for high-throughput SQS applications,” I decided to dig a bit deeper and track down a proper solution. After digging through the docs, I discovered that the AWS library we’re using, ex_aws, has a configuration setting that lets users pass options through to HTTPoison, the underlying HTTP client library. Hackney, the library HTTPoison is built on, provides a socket pooling setting so that socket connections can be re-used instead of performing a fresh TLS handshake per request. Adding this configuration to your app tells hackney to use the default socket pool:

config :ex_aws, :httpoison_opts,
  hackney: [pool: :default]
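If the default pool's limits don't fit your workload, hackney also supports named pools with their own sizing. The pool name and numbers below are illustrative values, not our production configuration:

```elixir
# config/config.exs -- route ex_aws's HTTP calls through a named hackney
# pool. :sqs_pool and the numbers below are illustrative, not our setup.
config :ex_aws, :httpoison_opts,
  hackney: [pool: :sqs_pool]
```

The named pool is then started under your application's supervision tree with its limits (checkout timeout in milliseconds, and the maximum number of sockets the pool keeps open):

```elixir
# In your application's start/2 callback:
children = [
  :hackney_pool.child_spec(:sqs_pool, timeout: 15_000, max_connections: 20)
]
```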

So what are the results of making this change?

With 10 workers polling the SQS queue over HTTPS without a socket pool, you can see that CPU usage is through the roof and the queue took roughly 18 minutes to completely drain.

With only 2 workers polling the SQS queue over HTTPS with a socket pool, CPU usage is low and the queue is drained within a few seconds.
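For reference, the shape of a polling worker is roughly the GenServer loop below. This is a sketch, not our production code: the module name, the fetch_fun/handler callbacks, and the back-off interval are all illustrative. In a real deployment, fetch_fun would wrap the ex_aws SQS receive call, which now travels over the pooled HTTPS connections:

```elixir
defmodule SQSWorker do
  @moduledoc """
  Illustrative polling worker. `fetch_fun.()` returns a list of messages
  (in production, a long-poll against SQS); `handler.(msg)` processes one.
  """
  use GenServer

  def start_link(fetch_fun, handler) do
    GenServer.start_link(__MODULE__, {fetch_fun, handler})
  end

  @impl true
  def init(state) do
    # Kick off the poll loop as soon as the worker starts.
    send(self(), :poll)
    {:ok, state}
  end

  @impl true
  def handle_info(:poll, {fetch_fun, handler} = state) do
    case fetch_fun.() do
      [] ->
        # Nothing available; back off briefly. Against the real SQS API,
        # long polling (wait_time_seconds) plays this role instead.
        Process.sleep(100)

      messages ->
        Enum.each(messages, handler)
    end

    send(self(), :poll)
    {:noreply, state}
  end
end
```

Running several of these under a supervisor gives you the N-worker setup described above; with pooled sockets, two were enough.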

Happy Coding!
