A Long Journey into a Microservice World
This blog post accompanies a talk titled “Scaling Microservices in Go” presented at HighLoad++ on 31st October 2014 in Moscow, which followed Hailo’s journey from a monolithic architecture to a Go based microservice platform. If you’d like to read more about Hailo’s approach to technology, check out more posts on their developer blog where this post is also available.
Hailo, like many startups, started small; small enough that our offices were below deck on a boat in central London — the HMS President.
Working on a boat as a small focused team, we built out our original apps and APIs using tried and tested technologies, including Java, PHP, MySQL and Redis, all running on Amazon’s EC2 platform. We built two PHP APIs (one for our customers, and one for our drivers) and a Java backend which did the heavy lifting — real time position tracking, and geospatial searching.
After we launched in London, and then Dublin, we expanded from one continent to two, and then three; launching first in North America, and then in Asia. This posed a number of challenges — the main one being locality of customer data.
At this point we were running our infrastructure in one AWS region; if a customer used our app in London and then flew to Ireland, no problem — their data was still close enough.
Our customers flying to Osaka however created a more challenging problem as the latency from Japan to Europe is too high to support a realtime experience. To give our customers and drivers the experience we wanted it was necessary to place their data closer to them, with lower latency. For our drivers this was simpler, as they usually only have a taxi licence in one city we could home them to the nearest Amazon region. But for customers we needed their data to be accessible from multiple locations around the world.
To accomplish this we would need to make our customer facing data available simultaneously from our three data centres. Eric Brewer’s CAP theorem shows that it is impossible for a distributed system to simultaneously provide Consistency, Availability and Partition Tolerance guarantees, and that only two of these can be chosen. However, Partition Tolerance cannot be sacrificed, and as we wanted our service to optimise for availability as much as possible, the only option was to move to an eventually consistent data store. Some of our team had prior experience with Cassandra, and with its masterless architecture and excellent feature set, this was a logical choice for us. But, this wasn’t the only challenge we had — our APIs in particular were monolithic, and complex, so we couldn’t make a straight switch.
Additionally we wanted to launch quickly, so we couldn’t change everything in one go. But, as our drivers were based in a single city, we could take a short cut and leave our driver-facing API largely unchanged; cloning the infrastructure for each city we launched in, and deploying these to the region closest to the city. This allowed us to continue expanding, and defer refactoring this section until later.
As a result we refactored our customer facing API; moving core functionality out into a number of stateless HTTP based services which could run in all three regions, backed by Cassandra, written in either PHP or Java.
This was a big step forward, as we could serve requests to customers with low latency, account for movement, and have an increased degree of fault tolerance — both benefitting from Cassandra’s masterless architecture, and having the ability to route traffic to alternative regions in the case of failures.
However, this was merely the first step on our journey.
“My God! It’s full of stars!”
Having dealt with one monolith, we had a long journey to deal with the next monolith. In the process we had improved the reliability and scalability of our customer facing systems significantly, but there were a number of areas which were causing us problems:
- Our driver-facing infrastructure was still deployed on a per city basis, so expansion to new cities was complex, slow, and expensive.
- The per city architecture had some single points of failure. Individually these were very reliable, but they were slow to fail over or recover if there was a problem.
- Compounding this, we had a lack of automation; so infrastructure builds and failovers usually required manual intervention.
- Our services were larger than perhaps they should have been, and were often tightly coupled. Crucially while on the surface they provided a rough area of functionality, they didn’t have clearly defined responsibilities. This meant that changes to features often required modifications to several components.
A good example of the final point is amending logic around our Payment flow, which often required changes in both APIs, a PHP service and a Java service; with a correspondingly complex deployment.
These difficulties made us realise we needed to radically shift the way we worked; to support growth of our customer base, our engineering team, and to increase the speed of our product and feature development.
Working on a small number of large code bases meant that we had a lot of features in play at once, and this made scaling up our team difficult — communication, keeping track of branches, and testing these, took up progressively more time (due to Brooks’s Law). Some of these problems could possibly have been solved with alternative development strategies such as continuous integration into trunk and flagging features on and off, but fundamentally having a small number of projects made scaling more difficult. Increasing team size meant more people working on the same project with a corresponding increase in communication overhead; and increasing traffic often meant the only option was to inefficiently scale whole applications, when only one small section needed to be scaled up.
Our first forays into a service oriented architecture had largely been a success, and based on both these and the experiences coming from other companies such as Netflix and Twitter we wanted to continue down this path. But, with most of our developers not having experience of the JVM (which would have allowed us to use parts of the brilliant Netflix OSS) we would need to experiment.
A Cloudy Beginning
From the start there were a number of guiding principles we wanted to instill into our new platform. One of these was taking a Cloud Native approach; pioneered by Netflix. Adrian Cockcroft has spoken about the approach Netflix took, with their aim during this process being to:
“Construct a highly agile and highly available service from ephemeral
and assumed broken components”
Infrastructure as a service obviously makes scaling much easier, and in Hailo’s case we use a wide range of Amazon’s products; allowing us to match customer demand and react quickly to changing markets, while keeping costs low.
However, regardless of hosting provider, any large distributed system will have components that are failing or degraded at any point in time. This is something we all continue to strive to fix — trying to achieve a utopia of perfect hardware systems, running perfect apps, connected by a perfect network. Unfortunately this isn’t really attainable, and instead we are stuck with a dystopia of buggy apps, running on hardware which often fails, or disappears. This isn’t necessarily a bad thing though — it forces us to face these problems head on, and design software which can benefit from a rapidly changing environment.
This concept of antifragility was popularised by Nassim Nicholas Taleb and is central to becoming cloud native. Most systems become weaker when under stress, however by deliberately introducing stressors into our systems, or chaos in Netflix’s case, we can identify these issues, and design out these weaknesses — vastly improving the reliability of our service.
With these concepts in mind we tried out several different prototypes, and eventually settled on a service oriented architecture which supported both Go and Java services, communicating Protocol Buffer formatted messages over a RabbitMQ message bus. This allowed interesting routing patterns, such as routing to specific versions of a given service, or point to point messaging. Cassandra remained our primary data store given that it was working so well, and fitted in perfectly with our requirements.
So, why Go? Previously we had been using PHP and Java, and while we wanted to retain the ability to write Java on our platform, we were less keen to keep PHP. Go is a small language, and is easy to learn. Being compiled it is mind bogglingly fast (especially when you are moving from an interpreted language like PHP), and features such as its type system and fantastic interface support make writing code fast; giving us improvements both in development and compute time. Go also has excellent concurrency primatives — something of particular importance for us as our infrastructure would require a lot of inter-service communication. Finally, it’s fun to write, and there is an amazing community! So not only did our developers enjoy writing new services on our platform, but we could recruit from an incredibly talented pool of developers in London and across Europe.
But, what even is a Service?
The definition of a service, microservice, macroservice, mediumservice or even a moderatelylargeservice seems to vary wildly depending on who you talk to. As we previously discussed in Hailo’s Webapps as microservices post, Martin Fowler defines the microservice architecture as such:
“The microservice architectural style is an approach to developing a single application as a suite of small services … these services are built around business capabilities and are independently deployable … there is a bare mininum of centralized management of these services, which may be written in
different programming languages”
In our case we defined a service as a small, distinct, unit of responsibility — something that did one job, and did it well. These ranged widely from some very small services with maybe a hundred lines of logic (excluding libraries), to a few which were much larger; such as the system which tracks our drivers in realtime. Regardless, the guiding principle remained the same — that these services should each have clearly defined responsibilities.
Secondary to the responsibilities of the service, was the interface it provided to the outside world. In our case we chose Protocol Buffers, which gave us well defined, strongly typed messages to be passed between our services. Each handler (or endpoint) on a service defined a Request and Response envelope message which it would accept and reply with. As Protobuf is extensible, these could be changed and added to during the lifetime of the service, while still supporting older versions of clients.
Finally there was the internal implementation of the service. Arguably this was the least important part of the service as most services were small, and could easily be rewritten or replaced if necessary. In comparison, changing the interface or responsibilities of the service would usually require changes to other services, with the corresponding cross team communication overheads.
Developers, Developers, Developers
Having established what a service should look like, we now needed to provide tooling to make it extremely easy for developers to build services, so we could get on with solving real-world problems!
Services were based on our ‘platform-layer’ library which abstracted away the message bus transport layer, and delivered messages to and from registered message handlers within the service. This also provided the framework for inter-service RPC calls, service discovery, monitoring information, authentication and authorisation, and A/B testing.
In addition to this our ‘service-layer’ libraries provided abstraction layers over most of our internal 3rd party services, such as Cassandra, Memcache, Zookeeper and NSQ, adding convenience methods, host discovery, automatic configuration, and logic to refresh configuration dynamically as it changed.
To increase productivity we created templates for services, and added Hubot integration so we could quickly provision a new project with a single command:
Finally, the end result of a successful build was a deployable artifact stored on Amazon’s S3. This could then be provisioned to any of our environments using our internal deployment tools.
Always be Shipping
Once developers had built their services, the next step was deploying these. Docker was around version 0.4 when we started working on our platform, and although it was tempting to try using this in production, we didn’t want to spend our time debugging issues until it was more stable. We also wanted something fairly lightweight, so we could swap it out in the future, and it had to be simple to use.
Go made this very easy, as the output of our build was a statically linked binary which could be uploaded to S3 and then downloaded and executed on any machine. Couple this with a command line tool for the platform, along with a snazzy Web dashboard, and we had a deployment system.
Provisioning a service through either the Provisioning dashboard or platform command line communicates with our provisioning manager service (a scheduler which itself runs on the platform). This decides where the service instances should be run. A provisioning service daemon, which runs locally on every machine, polls this and identifies when it needs to run a new service. This then pulls down the service from Amazon’s S3 and manages execution of the binary during its lifetime.
As Docker stabilised we added this into our build chain, so that services could be run within containers. This process is almost identical, with the exception that we pull down a Docker image, and request that the Docker daemon execute the container:
Once a service starts running it automatically discovers and connects to RabbitMQ, and publishes its existence, registering with our service discovery system.
The Binding Service then sets up the the correct queue binding rules within RabbitMQ, connecting the delivery queues to this new service instance. This supports some advanced traffic routing features, including per-version weighting, so particular versions of a service can receive a weighted amount of traffic.
Dealing with Complexity
Decomposing our infrastructure into a large number of small simple services, each with a single responsibility, has benefitted us hugely. It allows us to more easily understand each component and their behaviour, allowing us to scale both our software and teams far more easily, and letting us to develop new functionality extremely quickly.
However, all this speed has drawbacks. At the time of writing Hailo have over 200 services in production, running across three continents, each with three availability zones. This is clearly a massive increase in moving parts, and as such, complexity. Rationalising the behaviour of each individual component may be simpler, but understanding the behaviour of the whole system, and ensuring correctness, is more difficult. Dealing with this complexity was our next challenge.
Rise of the Machines
Testing a complex system for correctness is clearly a good starting point, and like everyone else we built suites of unit tests and integration tests which tested functional behaviour. But if we wanted a high percentage of uptime, and a fault tolerant system we needed to do a lot more.
Testing our systems under load, with failed or failing components, and with degradation was the important next step, and we built an integration framework around this. Simulating cities with ‘robot’ customers booking ‘robot’ taxis for journeys allowed us to put significant load through our systems. Running tools similar to Netflix’s Simian Army during these tests, and ensuring correctness, identified a lot of issues, both in our code and in third party libraries; and fixing these massively increased the resiliance of our systems.
Developing this system further we began running the simulations against our production infrastructure, continuously, in a real world ‘city’. This identified problems which would directly affect customers or drivers using our service extremely quickly, so much so that we now use it as one of our primary monitoring tools.
In addition to using our robot drivers and customers, we built more conventional monitoring systems. Each service automatically published a number of healthchecks via Pub/Sub over RabbitMQ, which were collected by our monitoring service. These included some built in healthchecks which our platform-layer library added, so all services automatically included them, giving the service author health indicators right off the bat.
Built-in healthchecks included service level system metrics, such as if configuration had correctly loaded, or if the service had enough capacity to serve its current request volume. In addition, our service-layer libraries registered healthchecks for the appropriate third-party services each service utilised (such as connection status to Cassandra), handlers were compared against their performance expectations, and custom healthchecks could be registered by the service’s author. This information was aggregated, and displayed in our monitoring dashboard (seen below) which provided auto-generated dashboards for all services discovered by our service discovery mechanism.
We also take measuring everything very seriously at Hailo, and are all paid up members of the Church of Graphs. Instrumenting timing data into Graphite via statsd is almost free, and we built this into all of our internal libraries. This meant that all of our services have a huge amount of performance information available when necessary, and we can provide dashboards for all services automatically.
Finally, distributed tracing tools, such as Twitter’s Zipkin or Google’s Dapper, are invaluable for determining issues in production systems, as they enable the tracing of requests as they traverse through disparate systems. Our tracing infrastructure was built into the RPC library fairly early on, and enables free tracing to developers without any additional code. This was then augmented with a number of web applications which let developers dig into their tracing information.
A good example of this is the diagram below. This was taken from our production environment when we were debugging performance issues on a particular endpoint after new features had been added. Looking at the web sequence diagram we can see that we are calling a number of services, but some of these calls are happening sequentially — this is likely the cause of the performance issues.
Having investigated this, it turned out we could refactor the api endpoint to make a number of these calls in parallel, and aggregate the responses once they had returned.
This reduced the response time of the endpoint from 120ms to under 70ms. And overall this was reduced from nearly 500ms when running through the previous PHP based API.
In addition, we trace (but do not persist to storage) a percentage of internal inter-service requests. This allows us to aggregate performance and success information in memory using Richard Crowley’s go-metrics library, giving us 1, 5 and 15 minute rates and system health, which we can then visualise.
This diagram, taken from late 2014, illustrates the interactions between services running on Hailo’s platform.
Conclusions, aka TL;DR.
During the process of migrating to our new microservice platform we have completely changed the way we build software as a company, enabling us to become significantly more agile, and develop features much faster than before.
Building our platform up first, with a small specific use case, allowed us to test, and gain valuable experience running our new systems in production. We could then expand the scope, gradually replacing areas of functionality and API endpoints with zero downtime; and by picking off specific use cases, and continuously shipping to production, we avoided the common pitfall of the never ending rewrite.
Moving to a microservice architecture is not a silver bullet, and the increased complexity means there are a lot of areas which need to be carefully considered. However, we have found that there are a huge number of benefits which vastly outweigh any disadvantages.
Hailo’s infrastructure is decomposed into a large number of very simple pieces of software — each of which is independently deployed and monitored, and can easily be reasoned about. Tooling and automation simplify operational burdens, and by adopting a cloud native approach with antifragility as a core concept, we have significantly increased the availability of our service.
Crucially with a well developed toolchain it’s extremely easy to create new services, which has lead to emergent behaviour with unexpected novel use cases and features. Developers are freed up to take features from inception to production in hours rather than weeks or months, which is completely game changing (the current record is 14 minutes to staging, and 25 minutes to production); and this ability allows experimentation — our Go based websocket server Virtue is an example of a side project which would likely not have happened otherwise.
Tackling large projects, especially rewrites or replatforms, is always a daunting prospect, and we would never have succeeded without the involvement and support of everyone in the business, and for that we are all truly grateful.
— — — —
- Microservices — Martin Fowler
- Cloud Native at Netflix — Adrian Cockcroft
- Building Products at Soundcloud — Parts 1, 2, 3
- Microservices, Not a free lunch — Benjamin Wootton
- Microservices: Next steps — Boyan Dimitrov
- Moving to Microservices — Boyan Dimitrov