Let’s go back to 2015
Back in 2015 the microservices hype was really kicking off. Back then Magnet.me consisted of 3 separate services:
rest-api: our monolith backend which exposed a RESTful API
web-frontier: our user-facing service serving up an AngularJS app
oauth: our authentication and authorisation backend
The concept of microservices promised us more flexibility: scaling up individual components of our backend, the ability to try out different programming languages, libraries, and approaches, while also limiting the blast radius of outages in case of failures. But it would also mean a far more complex infrastructure: having to run, monitor, and automate the general management of a multitude of services, as well as finding a way for those services to talk to each other. After a short period of prototyping we ultimately settled on the approach covered in this post.
In 2015 we opted to use Consul to manage Service Discovery, and to handle the configuration for our (micro)services in one centralized place. Back then there weren’t that many options for Service Discovery, and seemingly every platform and tool had a different approach. It’s worth noting that this was before Kubernetes, Docker Swarm, etc. really took off, and it really wasn’t clear which one was going to “win” in this space. We also didn’t want to buy into a specific tool, and potentially end up depending on a tool that was losing community support in the long term. Instead, what we wanted was to pick the best approach for us, and find the best tool to support that approach. For us this was Consul.
Consul makes Service Discovery very easy. It even exposes a DNS service to your cluster which your own services can use to find each other. In this setup you’d simply query that DNS service to get a location (IP address + port) for the service you’re trying to contact. However, for reasons that will become more apparent later on, we decided to go a slightly different route.
What were our options?
Back then there were basically 3 options we could pick from: relying on DNS to locate other services, using a load-balancer with dynamic routing to direct traffic to a specific instance, or using a sidecar container which magically handles the network for you (typically these came along with a container orchestration tool).
It’s important to note that in each case a service either registers itself somewhere on start-up, or is automatically registered by the container orchestration tool when the service is started.
Relying on DNS
With this option, you’re using a custom DNS service to keep an actively updated view of where every instance of every service is running. Services can either look up addresses with a DNS client, or simply configure their runtime to use the DNS service as the default nameserver. In that case, if the front-end service wanted to retrieve information about a specific user from the backend, it would simply ask the rest-api service by sending an HTTP request to http://rest-api.local.consul/users/1 or something similar.
Using a load-balancer
One alternative is using a centralized load-balancer, of which every service knows the location, to redirect requests to one of the instances of the targeted service. The configuration of the load-balancer is actively updated as services get registered and deregistered from the cluster. Using this approach you’d simply send an HTTP request like http://rest-api.services.magnet.me/users/1 and the load-balancer would send it to an instance of the targeted service.
Using the sidecar pattern to handle networking
This one felt a bit like black magic, and leverages container isolation. If your service runs in a container, with this pattern your service is intentionally kept isolated from the rest of the network. Instead it can only talk to one other container called the sidecar. This container receives HTTP requests from your service, and then directs them through the cluster, relaying back responses it receives.
Yeah, that’s no bueno…
We didn’t really like any of these options. Primarily because all of these options try to abstract away all the details and all of the control from your service. Your service has no control over which specific instance of the service gets its request, only the guarantee that some instance will receive it. Where that instance is located, or what the state of that instance is, is entirely hidden from the service, while there’s no real reason to do so.
At the same time we were inspired by Netflix’s approach to microservices using Hystrix. Hystrix is a library that allows services to talk to each other in a distributed environment with fault-tolerance built in. Requests get automatically retried on failures, and if downstream services get overloaded or become unresponsive, circuit-breakers prevent requests from being sent to the affected services, allowing them to recover.
We really liked this approach, but communicating this way requires some deliberate design. Not only do you need to define what your service should do if a request cannot be executed in a timely fashion, but also how it can find and talk to several other instances autonomously.
Our services register themselves with a local Consul agent when they start up, and automatically deregister themselves when they terminate. In addition to this, a service exposes its health over an HTTP endpoint, which it instructs Consul to check periodically. To set this up, we use our own open-sourced library called Consultant.
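To make the registration step a bit more concrete, here is a minimal sketch of what registering and deregistering against a local Consul agent could look like over Consul's HTTP API (`PUT /v1/agent/service/register` and `PUT /v1/agent/service/deregister/<id>`). This is not Consultant's API: the class `ConsulRegistration`, its methods, the agent address `http://localhost:8500`, the service ID `rest-api-1`, the port, the `/health` path and the 10 second check interval are all assumptions made for illustration.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

final class ConsulRegistration {

    // Builds the JSON body for PUT /v1/agent/service/register, including an
    // HTTP health check that the agent polls periodically.
    // No JSON escaping is done here, so this assumes plain names and paths.
    static String registrationBody(String name, String id, int port, String healthPath) {
        return String.format(
            "{\"Name\":\"%s\",\"ID\":\"%s\",\"Port\":%d,"
                + "\"Check\":{\"HTTP\":\"http://localhost:%d%s\",\"Interval\":\"10s\"}}",
            name, id, port, port, healthPath);
    }

    static HttpRequest registerRequest(String agentUrl, String body) {
        return HttpRequest.newBuilder(URI.create(agentUrl + "/v1/agent/service/register"))
            .header("Content-Type", "application/json")
            .PUT(HttpRequest.BodyPublishers.ofString(body))
            .build();
    }

    static HttpRequest deregisterRequest(String agentUrl, String id) {
        return HttpRequest.newBuilder(URI.create(agentUrl + "/v1/agent/service/deregister/" + id))
            .PUT(HttpRequest.BodyPublishers.noBody())
            .build();
    }

    public static void main(String[] args) throws Exception {
        String agent = "http://localhost:8500"; // assumed address of the local agent
        HttpClient client = HttpClient.newHttpClient();

        // Register on start-up.
        String body = registrationBody("rest-api", "rest-api-1", 8080, "/health");
        HttpResponse<String> response =
            client.send(registerRequest(agent, body), HttpResponse.BodyHandlers.ofString());
        System.out.println("Consul agent answered with status " + response.statusCode());

        // Deregister when the JVM terminates.
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            try {
                client.send(deregisterRequest(agent, "rest-api-1"),
                    HttpResponse.BodyHandlers.discarding());
            } catch (Exception e) {
                // Best effort only: the failing health check will eventually flag the instance.
            }
        }));
    }
}
```

Running `main` requires a Consul agent listening on the assumed address. A library like Consultant would typically hide these calls behind a small builder, so services don't have to deal with the HTTP details themselves.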
When we want to make a request to another service we do the following:
- We use Consultant to ask Consul to give us a list of instances that the local Consul agent is aware of.
- Since Consul keeps track of the health of a service for us, we ask it to filter out any unhealthy or unresponsive instances, and to sort the list of instances by network distance (from closest to furthest away).
- We then randomly select and remove one instance from the list (heavily biased towards closer instances). This has a bit of a round-robin effect, ensuring every instance gets a similar workload, while still avoiding instances which would take longer to talk to.
- We then send off our request to that instance. And if that request is successful, that’s basically the end of the story.
- However, if we receive a 503 Service Unavailable, we know that this instance is no longer accessible or capable of handling requests. We then select another instance from that same list (excluding instances we already tried), and resend the request to that instance. It’s important to note that this is only possible because we have the ability to locate all these instances, and we have control over which instance we contact. If we didn’t, there would be a significant chance we’d be resending the request to an instance that already failed to handle it.
- We repeat this last step until either the request succeeds, a pre-specified timeout is exceeded, or we have tried all of the listed instances. In the latter two cases, it’s up to the service performing the request to determine what happens. In most cases the error is propagated back to the upstream service which made the request to our service, which in turn may decide to retry the request by itself.
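The steps above can be sketched in plain Java as follows. This is a minimal sketch under stated assumptions, not the actual client: the class `FailoverSketch`, the methods `pickBiased` and `send`, and the `ToIntFunction<String>` standing in for "send the request to this instance and return the status code" are all hypothetical. The list passed in is assumed to already contain only healthy instances, sorted nearest-first (Consul's HTTP API can return such a list via `GET /v1/health/service/<name>?passing&near=_agent`). The bias towards closer instances is implemented here by taking the smaller of two uniform random indices, which is just one simple way to do it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.Random;
import java.util.function.ToIntFunction;

final class FailoverSketch {

    // Picks and removes one instance from a list that is sorted nearest-first.
    // Taking the smaller of two uniform draws biases the choice towards the front.
    static String pickBiased(List<String> candidates, Random random) {
        int a = random.nextInt(candidates.size());
        int b = random.nextInt(candidates.size());
        return candidates.remove(Math.min(a, b));
    }

    // Tries instances one by one until one answers with something other than 503,
    // the deadline passes, or every instance has been tried exactly once.
    // Returns the status code of the first non-503 answer, or empty if we gave up.
    static Optional<Integer> send(List<String> healthyNearestFirst,
                                  ToIntFunction<String> sendTo,
                                  long timeoutMillis,
                                  Random random) {
        List<String> remaining = new ArrayList<>(healthyNearestFirst);
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (!remaining.isEmpty() && System.nanoTime() - deadline < 0) {
            String instance = pickBiased(remaining, random); // never picked twice
            int status = sendTo.applyAsInt(instance);
            if (status != 503) {
                return Optional.of(status);
            }
        }
        return Optional.empty(); // the caller decides what happens next
    }

    public static void main(String[] args) {
        List<String> instances = List.of("10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080");
        // Fake transport: only the last instance is able to serve requests.
        ToIntFunction<String> fake = instance -> instance.startsWith("10.0.0.3") ? 200 : 503;
        System.out.println(send(instances, fake, 1_000, new Random()));
    }
}
```

Because each picked instance is removed from the working copy of the list, a retry can never land on an instance that already answered with a 503 for this request, which is exactly the guarantee a DNS name or a central load-balancer cannot give the caller.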
To achieve all of this we wrote our own wrapper around async-http-client which uses Consultant to talk to Consul, and uses RxJava internally for sending off the request and retrying it if it couldn’t be delivered to the selected instance of the targeted service. This allows us to perform HTTP requests in a familiar way, while all of the service locating and retrying happens on a lower level. We already tend to write tiny clients for most of our microservices in the programming language we use most in the backend, so this makes it really easy to call any service from almost any other service.
But we can do better…
This only retries requests if they failed with a 503 Service Unavailable status code. What if we want to retry on an arbitrary error code, on an exception being thrown because some entity could not be found, or on networking issues throwing IOExceptions? For this we implemented an ExponentialBackoff class which will reattempt whatever Single or Observable stream it is attached to when a problem occurs. Which errors it retries on, as well as how often and how fast, can all be configured.
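To illustrate the idea behind retrying with exponential backoff, here is a minimal blocking sketch in plain Java. It is not the ExponentialBackoff class described above, which works on RxJava streams; the class `BackoffSketch`, the methods `delayMillis` and `retry`, and all parameter names are hypothetical. The delay doubles with every failed attempt up to a cap, and a predicate decides which errors are worth retrying.

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.function.Predicate;

final class BackoffSketch {

    // Delay before the retry that follows failed attempt number `attempt` (1-based):
    // base * 2^(attempt - 1), capped at maxMillis.
    static long delayMillis(int attempt, long baseMillis, long maxMillis) {
        long delay = baseMillis << Math.min(attempt - 1, 30);
        return Math.min(delay, maxMillis);
    }

    // Runs the action, retrying on errors accepted by `retryable` until it
    // succeeds or maxAttempts is reached. The last error is rethrown.
    static <T> T retry(Callable<T> action, Predicate<Exception> retryable,
                       int maxAttempts, long baseMillis, long maxMillis) throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                if (attempt >= maxAttempts || !retryable.test(e)) {
                    throw e;
                }
                Thread.sleep(delayMillis(attempt, baseMillis, maxMillis));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Fake action: fails twice with a networking error, then succeeds.
        String result = retry(() -> {
            if (++calls[0] < 3) {
                throw new IOException("connection reset");
            }
            return "ok";
        }, e -> e instanceof IOException, 5, 10, 1_000);
        System.out.println(result + " after " + calls[0] + " calls");
    }
}
```

A production version would usually add some random jitter to the delay, so that many clients retrying at the same time don't hit a recovering service in lockstep.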
In a distributed environment that means that you’ve drastically improved your chances of successfully executing a request, even if the network is having an off day. But this brings us to another problem: big fan-outs with retried requests.
When you take the microservices route, you’ll quickly end up running a big variety of services, all preferably with multiple instances. But as the number of services and instances grows over time, you’ll eventually end up in a situation where a user request arrives at your brokering service (sometimes also called a BFF, or backend for frontend), which then needs to contact a variety of services, sometimes even in a nested fashion. We’ll discuss the details of how we do this at Magnet.me in another post. But as you can imagine, it’s important that you build some kind of logging mechanism for internal requests belonging to the same user request. If not, you’ll never know why a certain request is timing out, or returns a generic 500 Server Error.
To deal with this in our stack, we chose Zipkin a couple of years ago. This means that when a user request enters our infrastructure, it’s assigned a unique trace ID. This trace ID is passed along through every internal request to downstream services. Every internal client, and internal service, registers their outgoing and incoming requests with that trace ID to Zipkin. This allows Zipkin to know which internal requests belong to which user request, and produces an easy to read interactive graph, allowing you to easily pinpoint any issues for any given user request.
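As an illustration of how a trace ID can travel along with internal requests, here is a minimal sketch that derives the headers for an outgoing request from the headers of the incoming one. It uses the B3 header names that Zipkin instrumentation commonly propagates (`X-B3-TraceId`, `X-B3-SpanId`, `X-B3-ParentSpanId`). The class `TraceHeadersSketch` and its methods are hypothetical, and in practice a Zipkin instrumentation library would normally handle this (including reporting the spans to Zipkin) for you.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

final class TraceHeadersSketch {

    // A random 64-bit identifier rendered as 16 lower-case hex characters.
    static String newId() {
        return String.format("%016x", ThreadLocalRandom.current().nextLong());
    }

    // Headers for an outgoing internal request: the trace ID is kept (or created
    // if this is where the user request enters), every hop gets a fresh span ID,
    // and the caller's span is recorded as the parent.
    // Header lookup is case-sensitive here for brevity; real HTTP headers are not.
    static Map<String, String> downstreamHeaders(Map<String, String> incoming) {
        String traceId = incoming.getOrDefault("X-B3-TraceId", newId());
        Map<String, String> outgoing = new LinkedHashMap<>();
        outgoing.put("X-B3-TraceId", traceId);
        outgoing.put("X-B3-SpanId", newId());
        String parentSpanId = incoming.get("X-B3-SpanId");
        if (parentSpanId != null) {
            outgoing.put("X-B3-ParentSpanId", parentSpanId);
        }
        return outgoing;
    }

    public static void main(String[] args) {
        // The edge service starts a trace, a downstream call continues it.
        Map<String, String> atEdge = downstreamHeaders(Map.of());
        Map<String, String> nested = downstreamHeaders(atEdge);
        System.out.println(atEdge);
        System.out.println(nested);
    }
}
```

Both printed maps share the same trace ID, which is what lets a tracing backend stitch all internal requests of one user request into a single graph.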
Although you can search for a specific trace in Zipkin’s UI fairly easily, we also expose the trace ID in an HTTP header when answering the user request. That way it’s easier to look that specific trace up if you come across an anomaly in the logs, or in your own browser.
We have since also started tracing operations on our Elasticsearch cluster as well as operations on Redis instances. This can give a very detailed view of what actually happens when a user request is processed. In turn this may also lead you to the understanding that some fan-outs have simply gotten too big, and may need some optimization.
We hope this gave you a nice peek into some of the core principles that shaped our stack and infrastructure over the years. Although the bulk of this work was done in 2015 and 2016, much of it is still very relevant, and we see no reason to change anything.
If we had to build Magnet.me from scratch today, we might not pick the exact same tools, but the principles we followed are still sound and we’d probably follow those again.