Solving long running external requests with messaging queues

Lars Gröber
Inheaden
Oct 7, 2022 · 5 min read

Performance in web based applications can be defined by many factors, one of which is response times from a user’s action to the request being successfully handled on the server. On one of our services, we noticed long response times for certain requests. This blog post shows how we investigated and solved the underlying issue.

Investigation

We noticed that requests to register a new account and to send a password reset token had response times of >10s. While it might be expected that creating a new user can take longer, especially the first one, sending a simple password reset mail should not take that long.

Investigating with our APM tool (see Figure 1), we saw that sending the actual mail took north of 15 seconds! Apparently our email provider itself had very high response times.

Figure 1: APM trace of requesting a password reset token

The problem

When requesting a new password reset token, we would run through the following sequence of steps synchronously:

  1. Retrieve the user by email and verify that they can request a new token
  2. Create a new random password reset token and hash it (that’s the gap between the first two database calls and the third)
  3. Store the new hashed token in the database
  4. Call the email provider to send a transactional email containing the reset token

Figure 2: Original architecture

An easy solution might be to call the email provider asynchronously and return to the user directly. While this might be a good solution elsewhere, our service runs in a serverless environment where containers may be paused, restarted, or stopped at any time (between requests). Since the request would complete before we could ask for the email to be sent, we cannot safely assume that the async work would ever finish.
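A naive fire-and-forget version of that idea can be sketched as follows (stand-in names; the sleep models the provider latency). It shows why this is unsafe when the process does not reliably outlive the request:

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

var sent atomic.Bool

// stand-in for the slow email provider call
func sendResetMail() {
	time.Sleep(100 * time.Millisecond)
	sent.Store(true)
}

func main() {
	go sendResetMail() // fire and forget, as a naive async fix would do

	// The handler returns almost immediately; in a serverless runtime the
	// container may now be paused or stopped, and the goroutine with it.
	time.Sleep(10 * time.Millisecond)
	fmt.Println("mail sent:", sent.Load()) // → mail sent: false
}
```

Here the 10ms sleep plays the role of the request ending: the goroutine has not finished its work by then, and nothing guarantees it ever will.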

Queues to the rescue

One way to handle async workloads, other than opening a new thread on the current service instance, is to use queues. Queues come in many shapes and use different technologies (well-known examples are Kafka and RabbitMQ) and can help solve challenges in distributed systems whenever the work is inherently asynchronous or can be completed by a background job.

A simple implementation of a queue is a FIFO list of jobs where a task is removed from the list as soon as a worker picks it up, so that no task is done twice. That is at-most-once delivery; if you instead remove a task only after completion and add a timeout for retries, you get at-least-once delivery. For email delivery we want the latter: sending an email twice is not as bad as not sending it at all.
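The at-least-once behaviour can be sketched with a toy in-memory queue (purely illustrative, not how NATS is implemented): a job is only removed on acknowledgement, so a failed attempt is simply fetched again.

```go
package main

import "fmt"

// A toy at-least-once queue: a job stays in the queue until it is
// acknowledged, so a failed attempt is retried on the next fetch.
type queue struct{ jobs []string }

// fetch returns the next job without removing it.
func (q *queue) fetch() (string, bool) {
	if len(q.jobs) == 0 {
		return "", false
	}
	return q.jobs[0], true
}

// ack removes the job, and is only called after successful processing.
func (q *queue) ack() { q.jobs = q.jobs[1:] }

func main() {
	q := &queue{jobs: []string{"mail-to-jane"}}
	attempts := 0
	for {
		job, ok := q.fetch()
		if !ok {
			break
		}
		attempts++
		if attempts == 1 {
			fmt.Println("attempt 1 failed, job stays queued:", job)
			continue // no ack → redelivered on the next fetch
		}
		fmt.Println("attempt 2 succeeded, acking:", job)
		q.ack()
	}
	fmt.Println("attempts:", attempts) // → attempts: 2
}
```

The flip side, visible even in this toy: a worker that processes the job but crashes before acking causes a second delivery, which is exactly the trade-off accepted for email.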

Figure 3 shows a simple overview of the new architecture. While one service handles the incoming request from the user, another service picks up the job/message from the queue and the original request completes quickly.

Figure 3: New architecture

We chose NATS (https://nats.io/) as the underlying technology for our queue. While NATS has many features, we only use its persistent streams here (via JetStream). Scaleway offers a great managed service for NATS (currently in private beta) — see Scaleway — Messaging and Queueing: https://www.scaleway.com/en/betas/#mandq-messaging-and-queueing.

Implementation

Since we develop our backend systems in Golang, we used the official Go library for NATS (https://github.com/nats-io/nats.go).

You have to create a stream before you can use it. For ease of maintainability we created it in code (all code snippets ignore error handling for readability):

// nc is the NATS client
js, _ := nc.JetStream()
js.AddStream(&nats.StreamConfig{
	Name:      emailNotificationKey,
	Retention: nats.WorkQueuePolicy,
	MaxAge:    1 * time.Hour,
	MaxMsgs:   1000,
})

WorkQueuePolicy does exactly what we want: messages are removed once they are acknowledged (i.e. marked as completed). We cap the number of messages at 1000, and they are automatically removed after one hour.
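For completeness, the producer side is then just a publish to the stream from the request handler. A sketch — the payload type and its fields are whatever your consumer expects, so the names here are hypothetical:

// payload is a hypothetical email.SendMail value filled in by the handler
data, _ := json.Marshal(payload)

// Publish returns as soon as JetStream has persisted the message,
// not when the email is actually sent — this is what keeps the
// user-facing request fast.
js.Publish(emailNotificationKey, data)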

Then you need a consumer of the stream. Here we opted for a pull consumer so that the service can decide for itself when it is ready for new messages:

sub, _ := js.PullSubscribe(emailNotificationKey, "email-notifications",
	// no more than 20 unacknowledged messages at a time
	nats.MaxAckPending(20),
	// try each message at most 3 times
	nats.MaxDeliver(3),
	// wait 60 seconds before trying again
	nats.AckWait(60*time.Second),
	nats.AckExplicit())

Those options act as a kind of rate limiting: given that one call to the email service takes 15 seconds, we will never have more than 20 requests in flight at a time and never send more than 60/15 * 20 = 80 a minute. We set AckExplicit so that we can decide ourselves when a message has been successfully handled.

Now you can call sub.Fetch to receive a batch of up to batchSize messages whenever they are available.

for {
	msgs, _ := sub.Fetch(batchSize, nats.MaxWait(batchMaxWait))

	for _, msg := range msgs {
		var data email.SendMail

		json.Unmarshal(msg.Data, &data)

		err := s.sendMail(data)
		if err != nil {
			// don't ack here, so that this message can be retried
			continue
		}

		msg.Ack()
	}
}

Results

After implementing the changes, we saw greatly improved response times for the requests mentioned earlier. See Figure 4 for a trace of the same request as in Figure 1 after the introduction of the queue. We added a new span for publishing to NATS, which turns out to be very fast.

Figure 4: Trace after implementation of the queuing system

Looking ahead

While this is a nice proof-of-concept for the given problem, we still have several issues that need to be addressed before using this in any production environment:

  • Calling sub.Fetch inside the serverless container: This can lead to the same issues as starting a new thread. Scaleway is working on a messaging → serverless trigger, though.
  • Double sending: what if the request to the email provider times out or takes longer than the 60 seconds of AckWait? Then we could send up to three emails to the same user.
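One common mitigation for double sending is an idempotency check keyed on a message ID. A minimal in-memory sketch — in production the "seen" set would live in a shared store such as the database, not in process memory:

```go
package main

import "fmt"

// A hypothetical idempotency guard: remember which message IDs have
// already been handled, so a redelivered message is skipped instead
// of producing a second email.
type deduper struct{ seen map[string]bool }

// firstDelivery reports whether this message ID is new, and marks it seen.
func (d *deduper) firstDelivery(msgID string) bool {
	if d.seen[msgID] {
		return false
	}
	d.seen[msgID] = true
	return true
}

func main() {
	d := &deduper{seen: map[string]bool{}}
	sent := 0
	// "reset-42" arrives twice, simulating a redelivery after AckWait
	for _, id := range []string{"reset-42", "reset-42", "reset-43"} {
		if d.firstDelivery(id) {
			sent++ // send the mail only on first delivery
		}
	}
	fmt.Println("mails sent:", sent) // → mails sent: 2
}
```

This turns at-least-once delivery into effectively-once processing, at the cost of tracking handled IDs for at least as long as a redelivery can occur.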

Thank you for reading!

Found this post useful? Kindly hit the 👏 button below to show how much you liked this post!

We are a fast-growing tech startup headquartered in Darmstadt, Germany. Founded in 2017 by 4 co-founders, we now have a team of 20+ experts in Information Technology (IT) and Digital Product creation. As Europe’s first Tech Angel, Inheaden supports startups and small businesses by providing them with the tech strategy, assets, and maintenance they need to thrive in today’s digital era.
