Polling reliably at scale using DLQs

How we bid goodbye to long waiting times during order execution.

Tarun Batra
Jul 16 · 5 min read

Execution of computer programs is blazingly fast. Yet it is sometimes necessary to deliberately delay execution. Some such use cases are:

  • scheduling a task for some time in the future;
  • retrying a failed task with a backoff strategy.

At smallcase, we place orders to buy or sell equities on clients’ behalf. When an order is placed, its execution status is not immediately known. We need to poll the partner broker to learn the status of the order on the exchange.

Intraday traders will tell you that stock markets are time-sensitive. Any platform that caters to them needs to be fast and deterministic. Polling for order status needs to happen in a timely manner, at regular intervals, because an endless loading screen is never a pleasant sight.

This blog explains the reliability issues we faced with our legacy system for scheduling polling and how we fixed them.

Delay mechanisms:

Redis Keyspace Notifications

Redis lets you set a key with a time-to-live (TTL) using the SETEX command; the key below expires after 10 seconds:

SETEX TESTKEY 10 "TESTVALUE"

In parallel, Redis provides keyspace notifications for data-changing events, which clients can subscribe to and use to trigger delayed tasks. Key expiry events can be subscribed to using:

SUBSCRIBE __keyevent@0__:expired
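
Putting the two together, a task can be delayed by setting a key whose TTL equals the desired delay and reacting to its expiry event. Below is a minimal sketch assuming the ioredis Node.js client, that keyspace notifications are enabled for expiry events (notify-keyspace-events "Ex"), and a hypothetical pollOrderStatus function:

const Redis = require('ioredis');

const setter = new Redis();    // connection used to set keys with a TTL
const listener = new Redis();  // dedicated connection for the subscription

// Schedule a poll by setting a key that expires after delaySeconds.
async function scheduleOrderPoll(orderId, delaySeconds) {
  await setter.setex(`poll:${orderId}`, delaySeconds, '1');
}

// React to key expiry events on DB 0 and trigger the delayed task.
listener.subscribe('__keyevent@0__:expired');
listener.on('message', (channel, key) => {
  if (key.startsWith('poll:')) {
    pollOrderStatus(key.slice('poll:'.length)); // hypothetical polling function
  }
});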

Pros:

  1. Arbitrary delays. Delays of arbitrary length are possible without any extra setup: changing the TTL parameter of the SETEX command is enough.

Cons:

  1. Unreliable. Keyspace notifications can become highly unreliable as the number of keys increases. This is clearly mentioned in the docs:

If no command targets the key constantly, and there are many keys with a TTL associated, there can be a significant delay between the time the key time to live drops to zero, and the time the expired event is generated.

This worked fine for us in the beginning. But as order volumes grew, a lot of clients started reporting really long waits before they could see the status of their orders. On debugging, we saw delays to the tune of 40x, and this was the deal-breaker for us.

In-memory timers

setTimeout(() => {
  console.log('This executes after 10 seconds');
}, 10 * 1000);

Pros:

  1. Reliability. Timers in the code are quite reliable; deviations of only a few milliseconds are observed.

Cons:

  1. Unscalable. In-memory timers do not work well with horizontal scaling. A timer always fires on the instance that set it in the first place, irrespective of how traffic is distributed across instances.

Timers worked well to mask the problem until we had the bandwidth to think it through and fix it for good. We needed a permanent, reliable solution that would scale.

Dead Letter Queues

In very basic terms, a message with an expiry is put into a queue. The queue is instructed to take some action if a message expires without being read. If expired messages are pushed to a different queue that is actively consumed by the target application, this simulates delayed delivery of the message. The following diagram illustrates the concept:

Polling delay mechanism

We chose RabbitMQ, which implements AMQP 0-9-1, for our implementation, owing to the maturity of the protocol and its community.
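
To make the pattern concrete, here is a minimal sketch assuming the amqplib Node.js client; the broker URL, queue names, and pollOrderStatus function are illustrative, not our actual setup:

const amqp = require('amqplib');

async function setup() {
  const conn = await amqp.connect('amqp://localhost');
  const ch = await conn.createChannel();

  // Queue that the application actively consumes.
  await ch.assertQueue('order-status-polls', { durable: true });

  // "Wait" queue: messages sit here unconsumed until they expire and are
  // then dead-lettered to the polling queue above.
  await ch.assertQueue('order-status-polls.wait', {
    durable: true,
    arguments: {
      'x-dead-letter-exchange': '',                      // default exchange
      'x-dead-letter-routing-key': 'order-status-polls', // target queue
    },
  });

  // Worker-pattern consumption: the broker load-balances across instances.
  ch.consume('order-status-polls', (msg) => {
    const { orderId } = JSON.parse(msg.content.toString());
    pollOrderStatus(orderId); // hypothetical polling function
    ch.ack(msg);
  });

  return ch;
}

// Publishing with a per-message TTL simulates a delayed delivery.
function schedulePoll(ch, orderId, delayMs) {
  ch.sendToQueue('order-status-polls.wait',
    Buffer.from(JSON.stringify({ orderId })),
    { expiration: String(delayMs) });
}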

Pros:

  1. Scalable. Unlike Redis events, message delivery in MQs can be configured to use the worker pattern, which load-balances across the consumers (the subscribing instances of the application) when delivering messages.
  2. Reliability. Message queues are expected to be highly reliable in the delivery of messages. In our testing with RabbitMQ, the results were close to those of in-memory timers.
Source: RabbitMQ docs

Cons:

  1. Inflexible delays. Messages are dead-lettered in queue order, so one wait queue effectively supports a single, fixed delay; supporting a different delay means provisioning another queue.

The chart below shows the average delay observed in resolving the orders we handled over the past year, and how it relates to the delay mechanism in use at the time.

Graph showing consistency in the polling scheduled using RabbitMQ DLQs

It is clear that the polling delay with Redis started out large but consistent, then quickly grew out of proportion as volumes increased. This was quickly fixed using timers, but like all quick fixes, it had a short life too. Finally, we revamped our systems and moved to DLQs.

It is evident from the graph that DLQs brought consistency to the system.

We traded flexibility for robustness and reliability by using DLQs to manage the delays in our polling for order status. If you’ve solved a related problem in any other way, do let us know in the comments.


Originally published at https://blog.smallcase.com on July 16, 2019.
