A tale of calling an external service

Nicolas Grisey Demengel
Apr 28 · 16 min read

Interacting with an external service is one of the many recurrent needs in software. It might seem like an easy task considering all the tooling we have nowadays, but once you start thinking about what may go wrong, it takes on a whole new dimension.

This is a story about how we dealt with that need at Malt.

Note: while the term “external service” may suggest that the service is provided by another organization, it doesn’t have to be: what is said here is equally valid for communication between services within our own organization. It just happens that I found “external service” convenient to illustrate the point of this article.

The problem we wanted to solve

Whatever the interaction with the service technically looks like — be it a synchronous HTTP call or asynchronous messaging — making it resilient so as to ensure eventual success takes a lot of effort.

Let’s consider an HTTP API:

  • the remote service may return errors from time to time, independently of our request
  • worse: it may answer with an error while having actually processed the request successfully
  • it may take too much time to answer
  • it may be down and may not answer at all
  • it may impose all kinds of limitations on us: rate limiting is the most common one (how many calls we can make during a given period), but it may also be concurrency limiting (how many calls we can make simultaneously)
  • we may have sent an erroneous request, which the service will reject

Let’s now consider the context of a call:

  • the call may be done while synchronously processing a request sent to us (by another service or a user, think HTTP)
  • the call may be done while asynchronously processing a message sent by some other piece of software (think AMQP)
  • the call may be done as part of a batch job (think cron or Quartz)
  • etc.

That context has direct implications for what we can do to cope with the previously listed problems, as we’ll see in the following section.

So we wanted to find a solution that would make calling an external service reliable, the goal being to implement that solution once and then use it wherever needed.

An additional requirement: call deduplication

When an external service is called in response to a variety of internal events, it may happen that several events are caught in a very short period, each triggering identical calls to the external service. Most of the time this happens in scenarios where we somehow synchronize data with the service: several related actions and effects on our side each emit a different event that we listen to in order to push up-to-date data to the service.

This happens frequently enough that we wanted a means to debounce such calls so as to deduplicate them.

Discussing a solution

A naive approach

The most basic action we can take to address random errors is to directly retry calling the service. But we can’t systematically use this approach, nor retry for long when a user request is waiting for a response.

Regardless of the context of the call, directly retrying won’t always do: should the service be unhealthy, retrying won’t help it recover, and therefore won’t get us an answer either (so even if we don’t care about that service’s health, we’d better be good citizens and stop calling it).

Blindly retrying is also unlikely to solve a rate limiting violation issue (though it could work for some calls “at the limit”).

Emergence of a solution

What we could do instead of just retrying is:

  • for synchronous cases, have a fallback: perhaps the call isn’t that important right now; or, if we just want to fetch some data for the UI, maybe we can present stale (cached) data or deliver the fresh data later (via polling, SSE…)
  • have a “better” retry strategy: re-schedule a failing call for later, and increase the retry delay until the call succeeds
  • anticipate rate limiting issues by making sure we never reach the limit to begin with (which means we must implement rate limiting on our side)
  • set up a circuit breaker so that we don’t even attempt to call a service that’s unhealthy

Refining our needs

The last three points all have something in common: the call will be (re)scheduled for later at least once, and we have no idea when it will eventually succeed, if it ever does. This in turn means that:

  • we need some way for this mechanism to survive crashes and restarts of our application, unless we’re fine with frequently losing calls into a black hole (which I doubt, given the current discussion)
  • there’s absolutely no guarantee that calls will be made in the relative order they were initially scheduled, which may break certain requests when the state of the remote system matters, or when one request depends on the result of another
  • we need to find out when to stop retrying, and most importantly what to do with the failed commands

To solve the first point, we need some sort of persistent queue of commands to execute.

Please note that I will talk about “commands” from now on instead of just “calls”, since a call to a service most of the time requires some custom code to be executed. A command therefore represents that small unit of code in charge of executing the call.

We’ll see later how we at Malt managed to either solve the second point or compensate for it in most cases.

As for the third point, we would need some sort of quarantine in which to put commands after a given number of failed tries, for later investigation. That probably means we also need some sort of UI, monitoring, and alerting to visualize failed commands and act accordingly.

Finally, what we were aiming for

A picture being worth a thousand words, let’s draw the desired system:

Building a solution, one step at a time

This rather long section details the process and decisions that led to our solution, for the record. You may directly jump to the section “Our solution” if you only care about the result.

Let’s summarize our needs: persistence for the commands to be executed, some ordering guarantees, rate limiting, circuit breaking, retries, a quarantine, and good monitoring and visualization.

First, let’s get rid of the €1M question: is there an existing tool that could cover all requirements so that we can simply get back to work? Perhaps, but if there is one, we failed to find it.
Now, there are all kinds of tools that may help us build a solution.

Persistence

Persisting a queue of commands to execute could be done by using a message queue, which could also allow us to plug in our retry logic. Let’s just model our commands as messages and let the consumers either acknowledge their consumption when the call succeeded, or requeue them when the call fails for whatever reason.
To tell the whole story, we’ve been using RabbitMQ for years at Malt to transmit messages (see this 2017 post [FR]), and we do have a retry mechanism with a (capped) exponential delay policy. We even have a quarantine in which to put messages that couldn’t be consumed after a certain number of retries (though the quarantine is a recent addition, made at the same time as what’s described next).

That being said, adding more complexity in that area didn’t feel practical nor safe in the long term. And RabbitMQ isn’t really meant to store things: ideally, all messages should be consumed quickly. Finally, RabbitMQ’s management tools don’t give us anything to visualize failures, retries, and so on.

Maybe Kafka would have been a better base. But we just didn’t know enough about it yet to make any kind of decision. We wanted to solve the problem at hand without having to introduce a complex brick that we didn’t know well.

Anyway, given that we were clearly going to implement the execution logic ourselves, all we really needed was plain old boring persistence. Fortunately there’s a tool that we already had and knew well enough: PostgreSQL. And as is often the case, having a DBMS with ACID properties would prove to be of great help.

Programming language

At Malt, we mostly code for the JVM, and as such this is the target for our solution: the commands to execute are Java/Kotlin code. As with most of our work nowadays, we chose to develop the solution in Kotlin.

Methodology

Now that we had settled the most constraining choices, we started developing the solution the TDD way, adding one requirement at a time and making further decisions as needed.

General design

The first design decision was about how to represent commands, requests for execution, and the quarantine. Here are the main decisions taken and the final state they’ve led to:

  • A command is a piece of code that can be found by name, and executed by providing it a (Java) Map of arguments. Since we’re using Spring, a command is a Spring component, and the Spring context acts as a command registry.
  • A command execution request consists of a command name and arguments, e.g:
    (name=pushTransactionForInvoice, args={invoiceId, transactionId})
  • A command is associated with a “logical” queue, each queue corresponding to an external service. That “logical” queue is therefore the unit at which calls are tracked and rate limiting and circuit breaking are applied.
  • Once scheduled, a command execution request is stored as a task in a PostgreSQL table (the queue table), with the following core details:
    task ID, scheduled execution date/time, “logical” queue name, command name, command arguments, command weight (w.r.t rate limiting, see the “algorithm” section)
  • Additional details are stored as well, related to the execution of the command: task state (PENDING or LOCKED), number of execution tries so far, and the latest execution error if any. We’ll detail those later, in the “algorithm” section.
  • Finally, a list of “next tasks” is stored, which are commands to be executed only once the current task has been successfully executed. More details on this later.
  • The arguments of a command and the list of next tasks to execute are stored as JSONB, to account for their variable nature.
  • The quarantine is represented as a second table, storing almost the same details about a task, plus the time of its last execution attempt.

In the end the model is pretty simple and will allow us to perform simple and cheap queries (for instance, there’s no need to JOIN anything).
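
For illustration, a task could be represented roughly like this in Kotlin; the names below are a sketch of the model just described, not our actual schema or classes:

import java.time.Instant

// Sketch of the task model described above; names are illustrative, not the real schema.
enum class TaskState { PENDING, LOCKED }

// A "next task" is simply another command execution request to run after this one succeeds.
data class CommandExecutionRequest(
    val commandName: String,
    val args: Map<String, Any?>
)

data class Task(
    val id: Long,                                        // task ID
    val scheduledAt: Instant,                            // scheduled execution date/time
    val queueName: String,                               // "logical" queue, i.e. the external service
    val commandName: String,                             // used to look the command up in the registry
    val commandArgs: Map<String, Any?>,                  // stored as JSONB
    val weight: Int = 1,                                 // permits requested from the rate limiter
    val state: TaskState = TaskState.PENDING,            // execution state
    val tries: Int = 0,                                  // number of execution attempts so far
    val lastError: String? = null,                       // latest execution error, if any
    val nextTasks: List<CommandExecutionRequest> = emptyList() // stored as JSONB as well
)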

Rate limiting

Most services nowadays implement rate limiting using the well-documented leaky bucket algorithm or a variation of it. As such, it’s easy to find an existing implementation of it (Guava, resilience4j) or to re-implement it.

But there’s a catch: given that several of our services might call the same external service, and given that most of our services have several instances running anyway, we can’t simply use in-memory rate limiting, or the various instances will compete without knowing it and we’ll eventually violate the limit. That means we need some kind of distributed rate limiting.

Fortunately, our research led us to ratelimitj, which does exactly that, using Redis. Since we happen to use Redis a lot already, this suited us perfectly. Ratelimitj ensures that call statistics are atomically fetched and stored within Redis, so that all our services have a single view of whether they may call the external service or not. Also, the implementation uses a sliding-window strategy that smooths the number of calls over time, which prevents letting lots of calls through at once and then blocking all subsequent ones for a while.
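
To give an idea of the sliding-window behaviour itself, here is a deliberately naive, in-memory sketch; this is not ratelimitj’s API, and the real limiter keeps the equivalent counters in Redis so that every instance shares them:

import java.time.Duration
import java.time.Instant
import java.util.ArrayDeque

// Naive, single-process sliding-window limiter, only to illustrate the idea;
// the actual solution delegates this to ratelimitj, backed by Redis.
class SlidingWindowRateLimiter(
    private val maxPermits: Int,
    private val window: Duration
) {
    private val acquisitions = ArrayDeque<Instant>()

    @Synchronized
    fun tryAcquire(permits: Int = 1, now: Instant = Instant.now()): Boolean {
        // Drop acquisitions that fell out of the window.
        while (acquisitions.isNotEmpty() && acquisitions.peekFirst().isBefore(now.minus(window))) {
            acquisitions.pollFirst()
        }
        if (acquisitions.size + permits > maxPermits) return false
        repeat(permits) { acquisitions.addLast(now) }
        return true
    }
}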

Circuit breaking

When possible, it’s better to avoid repeatedly calling a service that’s unhealthy: it won’t help it recover, and we’ll waste resources retrying.
A circuit breaker monitors the health of the service on our side to decide whether we can keep calling it. Should the service be unhealthy, the circuit opens (i.e. stops the traffic), then probes the health of the service from time to time until it’s healthy again, so that the circuit can close and let the traffic flow again.
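
As a minimal, single-process sketch of that behaviour (our actual circuit breaker, described below, shares its call statistics through Redis):

import java.time.Duration
import java.time.Instant

// Minimal circuit-breaker state machine, for illustration only.
class CircuitBreaker(
    private val failureThreshold: Int,
    private val openDuration: Duration
) {
    private enum class State { CLOSED, OPEN, HALF_OPEN }

    private var state = State.CLOSED
    private var consecutiveFailures = 0
    private var openedAt: Instant = Instant.EPOCH

    @Synchronized
    fun allowsCall(now: Instant = Instant.now()): Boolean = when (state) {
        State.CLOSED, State.HALF_OPEN -> true
        State.OPEN ->
            if (Duration.between(openedAt, now) >= openDuration) {
                state = State.HALF_OPEN // let a probe call through
                true
            } else {
                false
            }
    }

    @Synchronized
    fun recordSuccess() {
        consecutiveFailures = 0
        state = State.CLOSED
    }

    @Synchronized
    fun recordFailure(now: Instant = Instant.now()) {
        consecutiveFailures++
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN
            openedAt = now
        }
    }
}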

Once again, there are existing solutions for that. We used Hystrix in the past, and now that the project has been discontinued, resilience4j seems to be the right choice.

But one more consideration kept us from using it: resilience4j only considers the process it runs in. Since several instances of our services may call a given external service, it seems wasteful to let each of those instances determine by itself that the external service is down after some time, when they could determine it more quickly by sharing their call statistics. Why not use a distributed circuit breaker?

While not strictly required, we decided to have a quick try ourselves, since we couldn’t find an existing solution. Long story short, we quickly built a distributed circuit breaker inspired by ratelimitj. Call statistics are shared using Redis, and it works like a charm. We plan to open-source it, but that will be the subject of another article ;-)

Metrics, traces, and monitoring

Our need for metrics was quite simple: for each queue and kind of command, we wanted to follow the number of commands being scheduled and know how many of them would succeed, or fail and retry, or eventually be moved into quarantine. This would allow us to understand the traffic and tune parameters if needed.
Given that we’re using Spring Boot for our Java/Kotlin applications, there was no decision to make here: we would just use Micrometer as usual to publish gauges with the appropriate tags, and then follow those metrics in Datadog, which is a (good) monitoring SaaS we happen to use.
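
For illustration, the instrumentation boils down to something like this; the metric and tag names here are made up for the example, not the ones we actually publish:

import io.micrometer.core.instrument.Gauge
import io.micrometer.core.instrument.MeterRegistry

// Illustrative Micrometer instrumentation; metric and tag names are invented for this sketch.
class CommandQueueMetrics(private val registry: MeterRegistry) {

    // Counter incremented each time a command execution request is scheduled.
    fun recordScheduled(queue: String, command: String) =
        registry.counter("command_queue.scheduled", "queue", queue, "command", command).increment()

    // Gauge exposing the current size of the quarantine for a given queue,
    // which is also what the alerting described below is based on.
    fun registerQuarantineGauge(queue: String, quarantineSize: () -> Number) {
        Gauge.builder("command_queue.quarantine.size", quarantineSize)
            .tag("queue", queue)
            .register(registry)
    }
}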

To better understand what would happen for debugging purposes, we also wanted to have detailed traces of all commands scheduled, executed, and so on. Once again, this was a no-brainer: we would log structured traces and use Datadog to browse those logs, as usual.

Finally, we wanted to be alerted when commands were moved to quarantine. Monitoring the quarantine gauge mentioned above would be enough: Datadog would send us emails, Slack notifications, etc. Additionally, we decided to log an error when a task is moved to quarantine, which is reported on our Sentry dashboard as well.

Visualization

While it’s nice to have logs, metrics, and alerts, we also wanted a means to visualize scheduled or quarantined tasks at any time, with their number of tries and any error, as well as actions to act on those tasks (remove them, reschedule them, etc.).

So basically we’re talking about views querying our two tables. We judged that it wouldn’t cost too much to write a simple REST API and a Vue.js frontend for that, as we already build most of our features and tools that way.

Our solution

Without further ado, here is what our solution looks like.

Usage

One may configure a new “command queue” for a given service as follows:
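
As a sketch of the kind of options such a configuration carries (the CommandQueueOptions type and its fields below are illustrative, not the real API):

import java.time.Duration
import org.springframework.context.annotation.Bean
import org.springframework.context.annotation.Configuration

// Hypothetical option holder, only to show what a "logical" queue configuration
// typically carries; the real API differs.
data class CommandQueueOptions(
    val queueName: String,                 // one queue per external service
    val rateLimitPermits: Int,             // max calls allowed per window
    val rateLimitWindow: Duration,         // the rate-limiting window
    val initialRetryDelay: Duration,       // first reschedule delay
    val retryDelayMultiplier: Double,      // exponential back-off factor
    val maxTriesBeforeQuarantine: Int,     // after that, the task is quarantined
    val pollingPeriod: Duration            // how often the queue table is polled
)

@Configuration
class BillingServiceQueueConfiguration {

    @Bean
    fun billingServiceQueueOptions() = CommandQueueOptions(
        queueName = "billing-service",
        rateLimitPermits = 50,
        rateLimitWindow = Duration.ofMinutes(1),
        initialRetryDelay = Duration.ofSeconds(30),
        retryDelayMultiplier = 2.0,
        maxTriesBeforeQuarantine = 10,
        pollingPeriod = Duration.ofSeconds(1)
    )
}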

You may find all available queue options here.

While a command execution request could have been defined as a simple object containing the name of the command to execute and a (Java) Map of arguments, we’ve decided to rely on the type system a bit more.
Here’s how to declare a command that can be executed and a “specification” that will be used to request an execution of that command:
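
The sketch below gives the general idea, reusing the pushTransactionForInvoice example from earlier; the Command and CommandSpecification shapes shown here are simplified stand-ins, not the real types:

import org.springframework.stereotype.Component

// Hypothetical client for the external service, for the sake of the example.
interface BillingServiceClient {
    fun pushTransaction(invoiceId: String, transactionId: String)
}

// Simplified stand-ins for the real base types.
abstract class CommandSpecification

interface Command<S : CommandSpecification> {
    val name: String
    fun toArgs(specification: S): Map<String, Any?>
    fun fromArgs(args: Map<String, Any?>): S
    fun execute(specification: S)
}

// The specification carries typed arguments plus the command name constant.
data class PushTransactionForInvoiceSpecification(
    val invoiceId: String,
    val transactionId: String
) : CommandSpecification() {
    companion object {
        const val COMMAND_NAME = "pushTransactionForInvoice"
    }
}

// The command is a Spring component, so the Spring context acts as the registry.
@Component
class PushTransactionForInvoiceCommand(
    private val billingServiceClient: BillingServiceClient
) : Command<PushTransactionForInvoiceSpecification> {

    override val name = PushTransactionForInvoiceSpecification.COMMAND_NAME

    // Only the command knows how to turn typed arguments into a Map and back.
    override fun toArgs(specification: PushTransactionForInvoiceSpecification) = mapOf(
        "invoiceId" to specification.invoiceId,
        "transactionId" to specification.transactionId
    )

    override fun fromArgs(args: Map<String, Any?>) = PushTransactionForInvoiceSpecification(
        invoiceId = args["invoiceId"] as String,
        transactionId = args["transactionId"] as String
    )

    override fun execute(specification: PushTransactionForInvoiceSpecification) {
        billingServiceClient.pushTransaction(specification.invoiceId, specification.transactionId)
    }
}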

You may find all available options for command specifications here.

And here is how client code can schedule an execution of that command:
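
Roughly, scheduling looks like this; the CommandScheduler interface below is a stand-in with assumed method signatures, reusing the specification sketched above:

import org.springframework.stereotype.Service

// Minimal stand-in for the real CommandScheduler; both methods are assumptions
// about its shape, not its actual signatures.
interface CommandScheduler {
    fun schedule(specification: CommandSpecification)
    fun scheduleInSequence(first: CommandSpecification, vararg next: CommandSpecification)
}

@Service
class InvoiceEventListener(private val commandScheduler: CommandScheduler) {

    fun onTransactionRecorded(invoiceId: String, transactionId: String) {
        // The client code only manipulates a typed specification.
        commandScheduler.schedule(
            PushTransactionForInvoiceSpecification(invoiceId, transactionId)
        )
    }
}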

Though it may look like an obscure and useless ceremony at first, it actually gives us two advantages:

  1. Only the command declaration code is responsible for converting typed arguments into a Map of primitive types and vice versa, not the client code.
  2. The command name is encoded within the specification type (see the COMMAND_NAME constant), so the client code doesn’t have to know about it.

Those two points allow for more safety, and a cleaner experience for the client code.

And if you’re wondering about this COMMAND_NAME field: we require all CommandSpecification subclasses to have one, as if Java/Kotlin allowed us to impose an interface on the class itself (enforced by a combination of a runtime check and unit tests performing code analysis). That command name can be obtained from any Command or CommandSpecification class or instance via a dedicated function. This is how the CommandScheduler and the rest of the logic can access it.
While not abusing them, we occasionally use that sort of trick at Malt: we’ve already written about a similar technique for our events sent over RabbitMQ (French content).

The algorithm

Given all that’s been said until now, one could fear that we have written a lot of code to implement the solution. But it’s in fact quite short, the algorithm itself being composed of around 20 tiny functions adding up to ~300 lines of Kotlin code.

You can see the core logic here (this is the actual code):
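
For readers skimming, the outline below gives the general shape of that logic, building on the earlier sketches; it is a simplification written for this article, not the production code:

import java.time.Duration
import java.time.Instant

// Rough outline of the queue-processing loop; all collaborators below are
// stand-ins, and the real code handles locking, concurrency and observability
// with more care.
class QueueProcessor(
    private val options: CommandQueueOptions,
    private val tasks: TaskRepository,
    private val commands: CommandRegistry,
    private val rateLimiter: SlidingWindowRateLimiter,
    private val circuitBreaker: CircuitBreaker
) {
    // Called at most once per polling period, per "logical" queue.
    fun processDueTasks(now: Instant = Instant.now()) {
        if (!circuitBreaker.allowsCall(now)) return // service looks unhealthy: don't even try

        for (task in tasks.findDuePendingTasks(options.queueName, now)) {
            if (!rateLimiter.tryAcquire(task.weight, now)) return // out of permits for this window
            if (!tasks.tryLock(task)) continue                    // another instance picked it up

            try {
                commands.byName(task.commandName).executeWithArgs(task.commandArgs)
                circuitBreaker.recordSuccess()
                tasks.markSucceeded(task)              // remove it and schedule its "next tasks"
            } catch (e: Exception) {
                circuitBreaker.recordFailure(now)
                if (task.tries + 1 >= options.maxTriesBeforeQuarantine) {
                    tasks.quarantine(task, e)          // give up and alert
                } else {
                    tasks.reschedule(task, retryDelay(task.tries + 1), e) // exponential back-off
                }
            }
        }
    }

    private fun retryDelay(tries: Int): Duration {
        val factor = Math.pow(options.retryDelayMultiplier, (tries - 1).toDouble())
        return Duration.ofMillis((options.initialRetryDelay.toMillis() * factor).toLong())
    }
}

// Stand-in collaborators, just enough to make the outline self-contained.
interface TaskRepository {
    fun findDuePendingTasks(queueName: String, now: Instant): List<Task>
    fun tryLock(task: Task): Boolean
    fun markSucceeded(task: Task)
    fun reschedule(task: Task, delay: Duration, error: Exception)
    fun quarantine(task: Task, error: Exception)
}

interface CommandRegistry {
    fun byName(name: String): ExecutableCommand
}

interface ExecutableCommand {
    fun executeWithArgs(args: Map<String, Any?>)
}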

What if my command needs to perform several calls?

I’m glad you asked. It may legitimately happen that, in order to fulfill some logical need, we have to perform several calls in a row to a service. In such a case, either all the calls succeed, or the whole operation is a failure. This is why CommandSpecifications may be given a weight, which is the number of permits requested from the rate limiter.
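
For example, a specification whose command performs two calls in a row could declare a weight of 2 (illustrative, reusing the CommandSpecification stand-in from the usage sketch above):

// Illustrative: this command performs two calls, so it requests two permits at once.
data class SyncInvoiceAndTransactionsSpecification(
    val invoiceId: String
) : CommandSpecification() {
    companion object {
        const val COMMAND_NAME = "syncInvoiceAndTransactions"
    }
    val weight: Int = 2 // number of permits requested from the rate limiter
}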

What about task deduplication?

Task deduplication happens when a command execution request is scheduled, not at consumption time. Should an identical task already be pending, the new one is skipped.

That logic works because when a task is scheduled, its first execution time is set to a later time, according to the configuration of the queue. The time during which a scheduled task is pending is the window during which duplication is prevented.
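
In other words, scheduling boils down to something like this sketch, reusing the Task model from the design section (an in-memory list stands in for the queue table here):

import java.time.Duration
import java.time.Instant

// Simplified sketch of scheduling with deduplication: if an identical task is
// already pending, the new request is dropped; otherwise the first execution is
// deliberately delayed, which is what opens the deduplication window.
fun scheduleWithDeduplication(
    pendingTasks: MutableList<Task>,
    request: CommandExecutionRequest,
    queueName: String,
    firstExecutionDelay: Duration,
    now: Instant = Instant.now()
) {
    val duplicate = pendingTasks.any {
        it.state == TaskState.PENDING &&
            it.queueName == queueName &&
            it.commandName == request.commandName &&
            it.commandArgs == request.args
    }
    if (duplicate) return

    pendingTasks += Task(
        id = 0L, // assigned by the database in the real implementation
        scheduledAt = now.plus(firstExecutionDelay),
        queueName = queueName,
        commandName = request.commandName,
        commandArgs = request.args
    )
}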

How are retries handled?

Until the maximum number of retries is reached, failed tasks are rescheduled with an (exponentially) increasing delay.

And what about task ordering?

The queue only guarantees that tasks will be executed in the order of their scheduled execution time (or concurrently). This means that once a task fails and is rescheduled, an initially later task may be executed first.

If that causes the task to fail, it will be rescheduled as well, which means that dependent tasks might still eventually be executed in order and succeed, provided they don’t hit the quarantine limit.

But one should not count on that, which is why there is a means to define a logical sequence of tasks to execute. In the following example, all but the first task will be attached to the first one as “next tasks”:
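
A sketch of what that can look like, reusing the stand-ins from the usage section; the scheduleInSequence method is an assumption about the API, and PushInvoiceSpecification is made up for the example:

import org.springframework.stereotype.Service

// Made-up specification for the first command of the sequence.
data class PushInvoiceSpecification(val invoiceId: String) : CommandSpecification() {
    companion object {
        const val COMMAND_NAME = "pushInvoice"
    }
}

@Service
class InvoiceSynchronizer(private val commandScheduler: CommandScheduler) {

    fun onInvoiceValidated(invoiceId: String, transactionIds: List<String>) {
        // The first specification becomes the stored task; the following ones
        // are attached to it as "next tasks" and only run once it has succeeded.
        commandScheduler.scheduleInSequence(
            PushInvoiceSpecification(invoiceId),
            *transactionIds
                .map { PushTransactionForInvoiceSpecification(invoiceId, it) }
                .toTypedArray()
        )
    }
}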

The whole picture

Here is the whole flow again with slightly more details, to be compared with the previous code. Yes, we’ve achieved it!

Visualization

Here are some screenshots of our tasks visualization UI:

A command may be part of a logical sequence of commands to execute (see “+1 next tasks”)

More details can be obtained by expanding a row

A color scale highlights retries; the full stack trace of an error is available by expanding the row

Monitoring & Alerting

Our monitoring tools have been configured so as to have all the relevant details and to alert us when anything goes wrong.

On the one hand, detailed logs and metrics allow us to know precisely what is happening.

On the other hand, we’re alerted when something goes really bad. Note that retries don’t trigger alerts, only tasks moved to quarantine or an open circuit do.

Limitations of our solution

It may be that one day our design becomes inadequate for our usage, but we’re far from that point. For now, here is our usage of the tool, considering all “logical” queues:

Regardless of the usage, polling a PostgreSQL table might not seem like a wonderful idea either. But that polling happens at most once per second, per “logical” queue.

Speaking of queues, they’re all managed within the same tables, which could become a performance issue with heavier usage. Should that happen, we could dedicate tables to each queue, at the price of a bit more configuration when adding a queue (we would then need to create the tables).

A more practical limitation of our system is that it only handles synchronous operations for now, but we could adapt it so as to receive the result of a command execution in an asynchronous way.

Conclusion

I’ve described our solution for calling an external service here in great detail (though still not all the details ;-) ), and I hope you’ve found it interesting.

It surely was a fun exercise to develop, but fortunately it has also delivered the expected benefits: the cost of calling an external service in a resilient way is much lower for us today. All one has to do is write the lines of code shown in the usage example above.

In case you really enjoyed it and were wondering: it isn’t open-sourced and it won’t be, in all likelihood. Partly because we’ve tuned it to our needs, and partly because it may well not handle loads of a different magnitude. All in all, it could become a management nightmare for us to open-source it now. But our distributed circuit-breaking logic will be open-sourced, stay tuned!

That being said, we still thought it could be interesting to share, and that’s what this article is for. Should you have any questions, don’t hesitate to ask them here!
