How we built our service mesh with Consul, RxJava & Zipkin

Michael de Jong
Oct 25 · 9 min read

Let’s go back to 2015

Back in 2015 the microservices hype was really kicking off. Back then consisted of 3 separate services:

  • rest-api : our monolith backend which exposed a RESTful API
  • web-frontier : our user-facing service serving up an AngularJS app
  • oauth : our authentication and authorisation backend

The concept of microservices promised us more flexibility: scaling up individual components of our backend, the ability to try out different programming languages, libraries, and approaches, while also limiting the blast radius of outages in case of failures. But it would also mean a far more complex infrastructure, having to run, monitor, and automate general management of a multitude of services, as well as way of getting services to talk to each other somehow. After a short period of prototyping we ultimately settled on the approach covered in this post.


In 2015 we opted to use Consul to manage Service Discovery, and handle the configuration for our (micro)services in one centralized place. Back then there weren’t that many options for Service Discovery, and seemingly every platform and tool had a different approach. It’s worth noting that this was before Kubernetes, Docker swarm, etc really took off, and it really wasn’t clear which one was going to “win” in this space. We also didn’t really want to buy into a specific tool, and potentially end up depending on a tool that was losing community support in the long-term. Instead what we wanted was to pick the best approach for us, and find the best tool to support that approach. For us this was Consul.

Consul makes Service Discovery very easy. It even exposes a DNS service to your cluster which your own services can use to find each other. In this setup you’d simply query that DNS service to get a location (IP address + port) for the service you’re trying to contact. However, for reasons that will become more apparent later on, we decided to go a slightly different route.

What were our options?

Back then there were basically 3 options we could pick from: Relying on DNS to locate other services, using a load-balancer with dynamic routing to direct traffic to a specific instance, or using a sidecar container which magically handles the network for you (typically these came along with a container orchestration tool).

It’s important to note, that each service either registers itself somewhere on start-up, or that it’s automatically registered by the container orchestration tool when the service is started.

Relying on DNS

With this option, you’re using a custom DNS service to keep an actively updated view of where every instance of every service is running. Services can either lookup addresses with a DNS client, or simply configure their runtime to use the DNS service as the default nameserver. In that case if the front-end service wanted to retrieve information about a specific user from the backend, the front-end service would simply ask therest-api service by sending a HTTP request to http://rest-api.local.consul/users/1 or something similar.

Using a load-balancer

One alternative is using a centralized load-balancer of which every service knows the location, to redirect requests to one of the instances of the targeted service. The configuration of the load-balancer is actively updated as services get registered and deregistered from the cluster. Using this approach you’d simply send a HTTP request like and the load-balancer would send it to an instance of rest-api.

Using a the sidecar-pattern to handle networking

This one felt a bit like black magic, and leverages container isolation. If your service runs in a container, with this pattern your service is intentionally kept isolated from the rest of the network. Instead it can only talk to one other container called the sidecar. This container receives HTTP requests from your service, and then directs them through the cluster, relaying back responses it receives.

Yeah, that’s no bueno…

We didn’t really like any of these options. Primarily because all of these options try to abstract away all the details and all of the control from your service. Your service has no control over which specific instance of the service gets its request, only the guarantee that some instance will receive it. Where that instance is located, or what the state of that instance is, is entirely hidden from the service, while there’s no real reason to do so.

At the same time we were inspired by Netflix’s approach to microservices using Hysterix. Hysterix is a library that allows services to services to talk to each other in a distributed environment with fault-tolerance built in. Requests get automatically retried on failures, and if downstream services get overloaded or become unresponsive, circuit-breakers prevent requests from being sent to the affected services, allowing them to recover.

We really liked this approach but it also takes a considered approach to communicating this way. Not only do you need to define what your service should do if a request cannot be executed within a timely fashion, but also how it can find and talk to several other instances autonomously.

Our approach

Our services register themselves with a local Consul agent when they start up, and automatically deregister themselves when they terminate. In addition to this a service exposes its health over a HTTP endpoint, which it instructs Consul to check periodically. To set this up, we use our own open-sourced library called Consultant.

When we want to make a request to another service we do the following:

Example of how our setup would recover from a downstream request that fails, and is subsequently retried on another instance.

To achieve all of this we wrote our own wrapper around async-http-client which uses Consultant to talk to Consul, and uses RxJava internally for sending off the request and retrying the request if it couldn’t send it to the selected instance of the targeted service. This allows us to perform HTTP requests in a familiar way, while performing all of the service locating, and retrying on a lower level. We already tend to write tiny clients for most of our microservices in our most predominantly used programming language in the backend, so this makes it really easy to call any services from almost any other service.

Simplified example of how we’d describe a tiny client for getting users.
Simplified example of how we can now fetch users over the network.

But we can do better…

This only retries requests if they failed with a 503 Service Unavailable status code. What if we want to retry on an arbitrary error code, or an exception being thrown because of some entity could not be found, or networking issues throwing IOExceptions? For this we implemented an ExponentialBackoff class which will reattempt whatever Single or Observable stream it is attached to, when a problem occurs. On which errors it retries, as well as how often and how fast can all be configured.

In a distributed environment that means that you’ve drastically improved your chances of successfully executing a request, even if the network is having an off day. But this brings us to another problem: big fan-outs with retried requests.


When you take the microservices route, you’ll quickly end up running a big variety of services — all preferably with multiple instances. But when you grow the number of services and instances over time, you’ll eventually end up with a situation where a user request will arrive in your brokering service — sometimes also called a BFF — , which then needs to contact a variety of services, sometimes even in a nested fashion. We’ll discuss the details on how we do this at in another post. But as you can imagine it’s important that you build some kind of logging mechanism for internal requests belonging to the same user request. If not you’ll never know why a certain request is timing out, or returns a generic 500 Server Error .

To deal with this in our stack, we chose Zipkin a couple of years ago. This means that when a user request enters our infrastructure, it’s assigned a unique trace ID. This trace ID is passed along through every internal request to downstream services. Every internal client, and internal service, registers their outgoing and incoming requests with that trace ID to Zipkin. This allows Zipkin to know which internal requests belong to which user request, and produces an easy to read interactive graph, allowing you to easily pinpoint any issues for any given user request.

Although you can search for a specific trace in Zipkin’s UI fairly easy, we also expose the trace ID in a HTTP header when answering the user request. That way it’s easier to look that specific trace up, if you come across an anomaly in the logs, or in your own browser.

We have since also started tracing operations on our Elasticsearch cluster as well as operations on Redis instances. This can give a very detailed view of what actually happens when a user request is processed. In turn this may also lead you to the understanding that some fan-outs have simply gotten too big, and may need some optimization.

Example of a Zipkin trace showing how the user request fanned out to multiple internal requests between our microservices. It’s also showing which requests/operations were slow relatively speaking.


We hope this gave you a nice peek into some of the core principles that shaped our stack and infrastructure over the years. Although the bulk of this work was done in 2015 and 2016, much of it is still very relevant, and we see no reason to change anything.

If we had to build from scratch today, we might not pick the exact same tools, but the principles we followed are still sound and we’d probably follow those again. Engineering

Finding the best job for you with technology

Michael de Jong

Written by Engineering

Finding the best job for you with technology

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade