From Monolith to Microservices

How we’re rebuilding our infrastructure from scratch.

Poki Engineering
Feb 20, 2017
Containerizing Minecraft 🤖

The first step in the rebuild of our web platform is a proper foundation: the infrastructure. And our title has already given away the first decision that we’ve made: we’re going for microservices.

This post will explain our move to microservices, our choice for Kubernetes, and some other infrastructure-related decisions. For more context on our project and its objectives, check out our previous post on this topic, in which we explain why we’re rebuilding from scratch.

So why microservices? Many blog posts have been written about the pros and cons of microservices in general, so we’re not going into that.

Instead, we’ll dive a bit deeper into what we as a team like about microservices in particular.

Microservices are designed for change

We’re building a web platform, and the web is complex and changing rapidly. We’re assuming from the start that parts of our platform will eventually be scrapped. Microservices make that easier.

Microservices also require us to design our systems for failure. Certain services might be unavailable at times, and our systems need to deal with that gracefully.

The latter could be seen as a downside, because designing systems this way requires extra thought and introduces additional overhead. However, it also means that our teams have to constantly reflect on how service failures impact the user experience. We encourage our teams to take that kind of ownership, so we see it as a pro rather than a con.
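
To make this concrete, here’s a minimal Go sketch of what we mean by degrading gracefully when a dependency fails. The recommendations service and the fallback list are hypothetical, purely for illustration:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// fetchRecommendations stands in for a call to a hypothetical
// recommendations microservice that may be slow or unavailable.
func fetchRecommendations(ctx context.Context, userID string) ([]string, error) {
	select {
	case <-time.After(2 * time.Second): // simulate a slow service
		return []string{"game-a", "game-b"}, nil
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}

func main() {
	// Give the dependency a strict time budget.
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()

	recs, err := fetchRecommendations(ctx, "user-123")
	if err != nil {
		// The service failed or timed out: degrade gracefully by
		// serving a static fallback instead of failing the whole page.
		recs = []string{"popular-game-1", "popular-game-2"}
	}
	fmt.Println(recs)
}
```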

Microservices empower small, cross-functional teams to take ownership of our product

Microservices are standalone services that can be built, deployed and maintained in isolation from other services and systems. Because of their small scope and separated nature, teams can take full responsibility for individual services, and thus for our product.

Amazon coined the phrase “You build it, you run it” to describe this ethos:

The traditional model is that you take your software to the wall that separates development and operations, and throw it over and then forget about it. Not at Amazon. You build it, you run it. This brings developers into contact with the day-to-day operation of their software. It also brings them into day-to-day contact with the customer. This customer feedback loop is essential for improving the quality of the service. — Werner Vogels (CTO @ Amazon)

This type of ownership could technically also be achieved within a monolith, but because monoliths aren’t as strict on modularity, it’s a lot harder to keep clear modular lines between product domains. Microservices explicitly require separation, which makes it easier to keep team boundaries and responsibilities clear.

One additional benefit of their isolated nature is that microservices don’t force you to use a single programming language. Even though we like Go (it’s fast and it works well for us), we believe in using the best tool for the job, so we’re not ruling out the use of other languages in the future.

Microservices support the way we envision our organizational culture

Conway’s Law tells us that “any organization that designs a system will produce a design whose structure is a copy of the organization’s communication structure”.

Like our systems, we want our teams to be tightly aligned and loosely coupled. We believe in the effectiveness of interdisciplinary product teams of developers, designers and data scientists that understand and tackle business problems from different angles.

Spotify did a great job implementing these principles in their engineering culture.

Microservices are going to help us scale

Microservices allow you to scale just the part of the application that needs extra resources, rather than the entire monolith.

For us this makes a lot of sense, because there’s a big difference between services that get a large number of requests (say, an analytics service that gets hit many times for each user) and lightweight components (e.g. an image cropping service that only gets used a couple of times per day).

These scaling benefits of microservices will have a positive impact on speed, ease of deployment, and the costs of our infrastructure.

Monoliths and Microservices — via MartinFowler.com

Now that we’ve touched on microservices and their main benefits for us, let’s talk about putting them to work.

Orchestrating our services

At the moment, our monolith is still in production as one large service, living on dedicated instances on Amazon Web Services (AWS). In our microservice environment, however, we want multiple services to be able to run efficiently on single instances.

This is where Docker comes in — containers provide a way for us to run multiple services on a single server. Since we don’t want to manually decide which service goes where and when, we went shopping for a container orchestration solution.

We did an internal evaluation of three systems by setting up hello world applications, and then made our decision.

We first tried AWS ECS, Amazon’s fully managed container service. Its interface isn’t pretty, but ECS worked like a charm: we had logging, monitoring, and horizontal autoscaling of both containers and machines. The out-of-the-box machine autoscaling in particular was a killer feature compared to Kubernetes running on AWS. However, because of the lack of documentation and an open community, as well as the vendor lock-in, we weren’t convinced yet.

Next, we explored Mesosphere’s Marathon on DC/OS, a collection of container-related technologies built around Apache Mesos. We took a deep dive into the communities and documentation, and concluded that, given the alternatives, DC/OS did not feel right for us. We liked the interface, but the overall feeling was that DC/OS is aimed more at enterprises that also want to efficiently manage dedicated servers somewhere in a rack. We don’t see ourselves leaving the cloud anytime soon, and we felt that there was too much overhead for our use case.

The final option that we tried and considered was Kubernetes: an open-source container cluster manager originally designed by Google. Kubernetes was born out of Borg, Google’s internal cluster management system, and is, according to Wired, “one of the best-kept secrets of Google’s rapid evolution into the most dominant force on the web”.

What we liked about Kubernetes in particular:

  • It’s open-source and cloud agnostic.
  • It’s widely used and has enormous vendor support.
  • It was designed for scaling, replication and service discovery, whereas those things are added via frameworks in Mesos.

Given our situation, Kubernetes felt like the right choice, so we went for it and moved on to getting it in production.

Going cloud agnostic

We first tried deploying Kubernetes on AWS, where we’re currently hosting our monolith. To get started, we set it up using the kube-up.sh script. Everything seemed fine until one day our cluster broke: it turns out that writing logs to small AWS EBS volumes will eventually break the system.

Lesson learned: on AWS you really have to manage your Kubernetes cluster, and you’re going to need dedicated DevOps time for that.

Additionally, since Kubernetes isn’t natively integrated with AWS, we had to resort to CloudWatch/Lambda hacks for automatic scaling, and to in-cluster Elasticsearch for pod logging.

Around this time we more formally decided that we wanted to be cloud agnostic. To prove that we were not locked in, we deleted our broken cluster on AWS and started a test cluster on the Google Cloud Platform (GCP).

By doing so, we gained horizontal scaling at the server level, and our pod logs were automatically scraped and centralized in Stackdriver Logging. Google Container Engine (GKE) gives us a one-click, ready-to-go cluster, so we’re happy for now.

So are we now locked in at GCP? Not really: automatic horizontal scaling at GCP is a nice bonus, but that doesn’t mean that we’re married to Google. We can still decide to move our cluster away from GCP since Kubernetes can live anywhere. Sure, it might mean a little more effort to keep automatic node and pod scaling, but it’s definitely possible. And with Cluster Federation coming up, it will soon be possible to run Kubernetes across multiple cloud providers. So while Kubernetes did nudge us in the direction of GCP, we’re more cloud agnostic than ever.

Now that we’ve got orchestration and our basic infrastructure in place, let’s quickly loop through some important packages and tools that we use.

Communication between services

Our first implementation of microservices communicated using REST over HTTP. This introduced considerable overhead, because every request required its own connection and handshakes. To improve speed and user experience, we moved away from REST calls and went for remote procedure calls (RPC) instead. We use gRPC to make the calls, Protocol Buffers for serialization, and HTTP/2 to make sure we have persistent and multiplexable connections.
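
As a minimal, self-contained illustration (not our actual service code), here’s how a Go client makes gRPC calls over one persistent HTTP/2 connection. It uses gRPC’s standard health-check service so no generated protobuf code is needed; the address is a placeholder and assumes the target server registers that health service:

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	// Dial once: gRPC keeps a single persistent HTTP/2 connection and
	// multiplexes all subsequent calls over it, avoiding the per-request
	// connection and handshake overhead we had with plain REST.
	conn, err := grpc.Dial("some-service:9090", grpc.WithInsecure())
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	// Each call is a Protocol Buffers message over the same connection.
	client := healthpb.NewHealthClient(conn)
	resp, err := client.Check(ctx, &healthpb.HealthCheckRequest{})
	if err != nil {
		log.Fatalf("health check: %v", err)
	}
	log.Printf("status: %s", resp.Status)
}
```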

An additional benefit of using Protocol Buffers is that the communication between services is well-defined, reducing the risk of compatibility issues. Communication between the front- and back-end is done through a gateway that acts as a reverse proxy, translating RESTful JSON into gRPC calls.
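
One common way to build such a gateway in Go is the grpc-gateway project, which generates the JSON-to-gRPC translation layer from the same .proto files. Whether or not that matches our exact setup, a rough sketch looks like this; the GameService names, import path, and endpoint below are hypothetical stand-ins for generated code:

```go
package main

import (
	"context"
	"log"
	"net/http"

	"github.com/grpc-ecosystem/grpc-gateway/runtime"
	"google.golang.org/grpc"

	// Hypothetical package generated by protoc-gen-grpc-gateway from a
	// service's .proto definition.
	gw "example.com/platform/proto/game"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// The gateway mux translates incoming RESTful JSON requests into
	// gRPC calls against the backend service.
	mux := runtime.NewServeMux()
	opts := []grpc.DialOption{grpc.WithInsecure()}

	// RegisterGameServiceHandlerFromEndpoint is generated code; the
	// endpoint is the in-cluster address of the gRPC backend.
	if err := gw.RegisterGameServiceHandlerFromEndpoint(ctx, mux, "game-service:9090", opts); err != nil {
		log.Fatalf("register gateway: %v", err)
	}

	log.Fatal(http.ListenAndServe(":8080", mux))
}
```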

One issue we had here: Go 1.7 had a regression forcing you to use HTTP/2 if you want to use SSL. It required some certificate management and a bit of trickery to make this work. We used Kelsey Hightower’s Kube Cert Manager to automate certificate retrieval and renewal from Let’s Encrypt, making it virtually effortless to manage SSL certificates.
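
For the curious, a sketch of the kind of trickery involved. This is the handler-splitting approach popularized by the grpc-gateway examples rather than necessarily our exact code, and the certificate file paths are hypothetical (in our case they would come from Kube Cert Manager):

```go
package main

import (
	"log"
	"net/http"
	"strings"

	"google.golang.org/grpc"
)

// grpcHandlerFunc multiplexes gRPC and regular HTTPS traffic on a single
// TLS port: HTTP/2 requests with a gRPC content type are handed to the
// gRPC server, everything else goes to the normal HTTP handler.
func grpcHandlerFunc(grpcServer *grpc.Server, otherHandler http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.ProtoMajor == 2 && strings.Contains(r.Header.Get("Content-Type"), "application/grpc") {
			grpcServer.ServeHTTP(w, r)
		} else {
			otherHandler.ServeHTTP(w, r)
		}
	})
}

func main() {
	grpcServer := grpc.NewServer()

	httpMux := http.NewServeMux()
	httpMux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})

	// ListenAndServeTLS negotiates HTTP/2 automatically, which gRPC
	// requires; hence SSL and HTTP/2 come as a package deal.
	log.Fatal(http.ListenAndServeTLS(":443", "tls.crt", "tls.key",
		grpcHandlerFunc(grpcServer, httpMux)))
}
```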

Logging and monitoring

Currently we use Stackdriver Logging for logging purposes. It works out of the box and fulfils our needs, but if we ever run into issues with it, we’ll probably fall back on Elasticsearch and Kibana, which we’ve used in the past.

Stackdriver also provides an out-of-the-box monitoring solution for instances and pods. However, we opted for Prometheus and Grafana, which let us monitor our services as well.

All services expose their relevant metrics — CPU usage, garbage collection details, gRPC calls with response times, etc. — for Prometheus to scrape. We then use Grafana to create overview dashboards for our office displays and more in-depth dashboards to spot issues quickly.
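
To sketch what that looks like in practice (the metric name, label and port are made up for this example), a Go service can expose metrics with the official Prometheus client library; Prometheus then just needs a scrape target pointing at /metrics:

```go
package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// A hypothetical histogram tracking gRPC call durations per method.
var rpcDurations = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name: "grpc_call_duration_seconds",
		Help: "Duration of gRPC calls.",
	},
	[]string{"method"},
)

func init() {
	prometheus.MustRegister(rpcDurations)
}

func handleSomeRPC() {
	start := time.Now()
	// ... the actual work would happen here ...
	rpcDurations.WithLabelValues("GetGame").Observe(time.Since(start).Seconds())
}

func main() {
	handleSomeRPC()
	// The default Go collectors (CPU, memory, GC details) are registered
	// automatically; /metrics exposes them alongside our own metrics.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9102", nil))
}
```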

We don’t have specialized distributed tracing software, like Zipkin, in place yet. At the moment we only have one layer of services communicating with each other in our MVP, so there are no long chains of communication to trace. We intend to keep things simple, but if communication between services becomes more complex, we’ll add tracing.

Dashboards make us happy

Version control and deployment

For version control we switched from GitHub to GitLab, since they provide GitLab CI for free, which eliminated the need to use an additional CI system like Jenkins or Wercker.

We have our own dedicated machine which is hooked up to GitLab using their GitLab CI runner package. This machine has access to our Kubernetes cluster through Google’s gcloud command-line tool. Since Kubernetes configuration lives in .yml files, we store all of them in a separate configuration repository and pull them during deployment, so we always have the latest cluster configuration available in our CI.

We think GitLab is great: it’s free and the CI is powerful. The only downside we see is the somewhat slow website. Still, we’re happy with their plug-and-play solution, so we’re not complaining.

In conclusion

So that’s it for our current infrastructure. We’re interested in hearing different opinions and discussing our decisions, so feel free to leave a message and we’ll get back to you.

This post is part of a series by our team covering:

  • Our Web Platform (Part 0)
    Explaining the context for our stack change and objectives for our new architecture.
  • Our Back-end (Part 2)
    Discussing the architectural choices that led us to switch from PHP to Go.
  • Our Front-end (Part 3)
    React/Redux and its implications for building a brand new front of house.

→ Enjoyed this story? Follow Poki to stay up to date about future posts!
