Microservices: We all know “why” but what about “how”? — Part 1

Juan Gutierrez
Ordergroove Engineering
13 min read · Dec 7, 2018

Follow our journey into the wonderful world of migrating from the well known monolithic architecture to the fabled Garden-of-Eden-microservices architecture.

Sugar coating aside, we know the starting recipe of most applications: one load balancer, two web servers, one database, one or two engineers to support it and voilà — your application is born and supporting clients. But as your clientele expands, your application grows with new robust features to support new use cases; your team grows to build these features; other departments also expand; the code base grows at a rate that no single engineer can manage entirely. Rigidity and fragility become more difficult to identify and prevent.

Everyone’s talking about microservices. They seem to help prevent rigidity and fragility by enforcing a high degree of decoupling at a system level. Interesting…

This undertaking, however, forces us to start taking into account the complexities of a distributed system and the CAP theorem. We move through our lives making all kinds of assumptions, some more accurate than others. Software engineering is no different. We assume the database is there; that the network is reliable. These aren’t assumptions you can keep making as you decouple more and more parts of your system.

Distributed systems have a tendency to challenge the assumptions of network reliability. Can we build software that takes into account these kinds of failures? Here’s a talk by Andrew Godwin at PyCon Israel 2018 that deeply resonated with me, where he discusses the idea of coding for failure. This is a topic I’ve heard a lot about over my career, but for some reason this talk made something click (probably worth exploring in a different post). Good timing! We can get away with a lot in the monolith (sinister connotations intentional), but coding for failure is something we should be doing, even in our monoliths. I digress. Microservices…I hear they’re great! One might distill the many articles out there championing microservices as follows:
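
To make “coding for failure” a little more concrete, here’s a minimal sketch (the endpoint, function name, and timeout are placeholders of my own, not anything from the talk) of wrapping a dependency call with a timeout and a fallback instead of assuming the network will always be there:

```python
import logging

import requests

logger = logging.getLogger(__name__)

# Hypothetical internal endpoint -- stands in for any dependency we assume is "always there".
OFFERS_URL = "https://offers.internal.example.com/api/offers"


def get_offers(customer_id, timeout=0.5):
    """Fetch offers for a customer, degrading gracefully if the call fails."""
    try:
        response = requests.get(
            OFFERS_URL, params={"customer_id": customer_id}, timeout=timeout
        )
        response.raise_for_status()
        return response.json()
    except requests.RequestException:
        # The network (or the service) failed us: log it and degrade to an
        # empty result instead of letting the whole request blow up.
        logger.warning("offers service unavailable for customer %s", customer_id)
        return []
```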

Your one application should be many applications talking to each other in a grand symphony of connectivity, with graceful degradation of applications during failures, and self contained data integrity with eventual consistency! Your teams will be able to innovate faster, deploy faster, and everyone will be so much happier!

Truth time — there’s one part naivete with a dash of sarcasm there, but this should be familiar to anyone who reads about what microservices can provide. And whoa nelly, doesn’t it sound great!

So we have a strong sense of “why” microservices are useful, but as we began our research about this “new” approach to application development, I quickly realized that one thing was extremely elusive: no one can tell me HOW the hell you go from point A (monolith) to point B (micro-lith). That is to say: I have an existing system that supports existing clients. I can’t just halt everything or interrupt service. The team is still growing; development and deployments are getting more cumbersome. Something needs to change.

Before we get to “how” let’s briefly discuss “what”

It’s been established why we want to move to microservices. That doesn’t mean we can go into the existing code base and start hacking away at things willy nilly. We need to define what responsibilities the different microservices will own. Even before we get to multiple microservices, an analysis can be done to determine the responsibilities of a single microservice.

How do we do that?

Hmm — we’re already back to the question of how…that was fast! Two guiding principles that have been invaluable in our research revolve around the idea of “context boundaries” and refining problem statement definitions.

Context boundaries require us to have a clear definition of what things a service does and owns and more importantly what it doesn’t do and doesn’t own. There’s no cookie cutter answer here. You know your system; you’ll have to determine your own context boundaries. Fortunately this is just another flavor of responsibility separation, which anyone involved with object-oriented programming should be familiar with. This term was brought to my attention in the CONTEXT ** wink wink ** of microservices, but I’ve adopted it to help with project prioritization, code I write, and most things in between: what is and what isn’t “your” responsibility?

Generally speaking, context boundary definitions can be boiled down to a kind of probability analysis: “Do a set of things/entities in my system tend to change together or not? If they do, there’s a good chance they’re within the same context boundary.” That’s about as “cookie-cutter” as we’ve been able to get with this. We’ve found that discussing the same sections of our platform with different people still yields different results. It’s partially subjective, partially trial and error. I never said this would be easy… ;-)

Our second principle is something we’ve talked about in some of our other posts: problem statement definitions. I’m a big fan of clearly defined problem statements everyone can agree on. “What’s the problem we’re trying to solve?” To me, they’re more important than requirements. Both require refinement, but problem statements, when curated well, tend to encapsulate broader implications than those of a specific feature’s requirements. When a feature request comes in, a solution for that particular feature can be implemented, but as discussed in other posts, sometimes multiple features that seem disparate are actually part of a broader, more abstract problem statement.

A trap we fell into: How != What tools should I use?

For most engineers, myself included, we learn by getting our hands dirty. New language? New tool? New thing? Get in there; break things; learn from your mistakes; challenge your assumptions; get that feeling of success when something works as expected AND you understand why it works.

In this vein, something very commonly juxtaposed with microservices is containers. Many folks once-upon-a-time-not-too-long-ago started making the move from dedicated servers for hosting to virtual machines. We now find the industry in a similar migration from virtual machines to containers. Given the pairing of the architecture and the infrastructure tool in most conversations, we asked ourselves “Should we start the move to containers with this initial, application-dependent microservice?”

We want to move to containers!

OK. Microservices checklist — point 1 — ✅!

But that quickly led us to another thought:

We might need to move to a different data center…

Hmm. OK…anyone who’s done a data center move knows how complicated it can be. If you haven’t…well…just take my word for it: it’s really hard. This isn’t quite the same as a full blown data center move though — it’s only a part of the existing application. So there must be parallels, but because it’s less than the entire platform it must be simpler…right?

As we pulled on that thread more we naturally landed on:

We need to test network latency between the two data centers!!

We have an existing application. It lives in data center “A.” We’re going to put our new microservice in data center “B,” given that “B” has better container support. Now we need to know: how long does it take to travel between A and B? Is that going to be acceptable for existing SLAs? How will response times be affected? Hmm — more complexity on top of all of the complexities around the previously mentioned CAP theorem, the new “failure aware development” we’d like to enact, and the “mini” data center move.
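
A crude way to get that baseline (a sketch, not our actual harness — the URL and sample count are placeholders) is to time repeated round trips from a box in data center A against a trivial endpoint in data center B:

```python
import statistics
import time

import requests

# Hypothetical health-check endpoint for the new service in data center "B".
TARGET = "https://service-b.example.com/health"
SAMPLES = 100

timings_ms = []
for _ in range(SAMPLES):
    start = time.perf_counter()
    requests.get(TARGET, timeout=2)
    timings_ms.append((time.perf_counter() - start) * 1000)

timings_ms.sort()
print(f"median: {statistics.median(timings_ms):.1f} ms")
print(f"p95:    {timings_ms[int(SAMPLES * 0.95)]:.1f} ms")
```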

:-|

There are tools that can help you! You’re in microservices land and in microservices land you can use whatever technology you want! It’s so liberating! You can use HTTP/2 with open connections and multiplexed streams! Other microservice implementations solved this with gRPC!

But…how do long-lived open connections work with a load balancer? Also, I can’t expect my existing clients to migrate to a new protocol. We move faster than they do and quite frankly: I thought this was going to help me innovate and move faster! I’m finding the only thing I’m getting is dizzy.

We spent some time exploring and measuring different frameworks — Django (which we predominantly use in most of our applications), Flask, and aioHTTP, to name a few; there were others — which leveraged different underlying protocols, to determine what would provide the lowest latency, fastest response times, and highest throughput. We started with Flask and just kept with it because we found our latency timings to be “good enough” for the time being. This sense of comfort was short-lived.

WHAT ABOUT DATABASE LATENCY BACK TO THE OLD DATACENTER????

You might say at this point “Hold the phone here…why on earth would you build a microservice that doesn’t have the database right next to it? That’s the way this stuff is supposed to work! The microservice acts as a gateway to any data other micro/macro-services rely on, but don’t own.” Totally! But…the application is running 24/7. Similar to the data center move, a database migration, especially of the primary database, is especially non-trivial. It’s one of the most sensitive and complicated aspects of the aforementioned data center move. Besides that, the application is actively in use. Using my friend’s analogy: the car is rolling along the highway and you need to change the tires without stopping. It’s probably in everyone’s interest that you surgically change one tire at a time as opposed to all four at once. Bad things are likely to happen otherwise.

We did our due diligence here as we did with the different frameworks. PeeWee vs Django vs SQLAlchemy: again, load testing, measuring, comparing, discussing — not to mention trying to quickly learn the inner workings of some of these frameworks to answer important questions when you have an application that relies on a database elsewhere: How do we configure this framework to use persistent DB connections? Is there connection pooling?
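
As an example of where those questions lead (the hosts, names, and numbers below are placeholders, not our production settings): in Django, persistent connections come down to CONN_MAX_AGE, and in SQLAlchemy it’s the engine’s pool configuration:

```python
# Django settings.py -- keep database connections open across requests
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "orders",
        "HOST": "db.datacenter-a.example.com",  # the database still lives in the old data center
        "CONN_MAX_AGE": 60,  # seconds to keep a connection alive; 0 means close after every request
    }
}

# SQLAlchemy -- an explicit connection pool, checked before each use
from sqlalchemy import create_engine

engine = create_engine(
    "postgresql://app@db.datacenter-a.example.com/orders",
    pool_size=5,         # persistent connections kept in the pool
    max_overflow=10,     # extra connections allowed under burst load
    pool_pre_ping=True,  # verify a connection is still alive before handing it out
    pool_recycle=300,    # recycle connections older than five minutes
)
```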

So — wait…wait wait wait…how does this get us to microservices?

The “how” turns out to be independent of microservices

Yeah! How does all of this tooling and technology get us to microservices? Short answer: it doesn’t. Like any tools, they can’t solve your problems for you on their own: it’s all about how you use them. We kept returning to our problem statement and refining what we were doing by asking: “What problem are we solving? What problem do we need to solve right now? What’s the most important problem we need to solve?”

Our team prides itself on implementing quality solutions and features by spending a good amount of time refining problem statements and ruthlessly prioritizing what we’re doing based on those discussions, the short term needs of the business, and the long term vision of the platform and architecture. Do I need to solve all these round trip latency variations? Do I need to use containers or VMs? None of these questions matter right now. They will. Right now, though, they are just noise. The most important problem to solve in the monolith in order to move to microservices is: How do we remove the assumptions that the current applications make regarding access to a particular data source?

At a high level: microservices have absolutely NOTHING to do with the problem statement mentioned above. The problem statement just makes that move easier. Here’s how: if you have a function that makes a call directly to the database, you can replace it with an interface that provides a way to request data that is “source agnostic.” You can (should) also make the different data sources pluggable (i.e. “database” vs “web”). Now, when you’re ready, you can plug in a new data source.

This is the most powerful and useful thing: the code base using this interface shouldn’t have to change. It no longer assumes a source. It must conform to the rules of the interface to manage data, which is independent of the source of the data. You wanna use HTTP/2? Go for it. gRPC? Go nuts. Write the protocol, plug it in, the app still works. (Let’s just gloss over the assumption I made about “things just work” and hope there are some things in the application that were coded for failure…nothing to see here…move along…)
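
Here’s a minimal sketch of the idea — the names are illustrative, not our actual code. The application codes against a small interface, and the concrete source (a direct database read today, an HTTP call to the new microservice tomorrow) is plugged in behind it:

```python
from abc import ABC, abstractmethod

import requests


class CustomerSource(ABC):
    """Source-agnostic interface the rest of the application codes against."""

    @abstractmethod
    def get_customer(self, customer_id):
        ...


class DatabaseCustomerSource(CustomerSource):
    """Today: read straight from the monolith's database."""

    def __init__(self, connection):
        self.connection = connection

    def get_customer(self, customer_id):
        with self.connection.cursor() as cursor:
            cursor.execute(
                "SELECT id, email FROM customers WHERE id = %s", [customer_id]
            )
            row = cursor.fetchone()
        return {"id": row[0], "email": row[1]}


class WebCustomerSource(CustomerSource):
    """Tomorrow: ask the microservice that owns customer data."""

    def __init__(self, base_url):
        self.base_url = base_url

    def get_customer(self, customer_id):
        response = requests.get(f"{self.base_url}/customers/{customer_id}", timeout=1)
        response.raise_for_status()
        return response.json()


def send_welcome_email(source: CustomerSource, customer_id):
    # The calling code never changes -- it only knows about the interface.
    customer = source.get_customer(customer_id)
    print(f"Sending welcome email to {customer['email']}")
```

Want gRPC or HTTP/2 later? That would just be another CustomerSource implementation; the callers never know the difference.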

We also recognized that this was the most complicated problem to solve. There are about 7–8 years’ worth of code we’re talking about here. The other problems we explored can be solved with tooling, especially if we start breaking apart those assumptions in a way that affords us the wiggle room to pivot tool-wise whenever necessary. Part 2 of this series will elaborate in greater depth on how we’re starting to pull off that magic trick.

How did we get out of the rabbit hole?

I don’t mean to give the impression that we were in a kind of “analysis paralysis,” spending inordinate periods of time researching and analyzing. We spent a couple of weeks down the rabbit hole, but it was extremely useful. We learned about Kubernetes and container deployments; we helped our DevOps team gain more insight into what it’ll be like moving teams and their applications to containers, which is going to happen independent of microservices. Also, no one ever said microservices can only exist in containers.

It might also help to elaborate more on our team’s process. We’re an Agile shop, but Agile is a tool — it’s all about how you use it. We don’t sit around a conference room or video chat (more than half of my team is remote) debating indefinitely over our subjective opinions of what a more refined problem statement could be. It’s data driven. “How do we learn about this as quickly as possible?” Fortunately, Agile prioritizes learning. Our goal isn’t to get it right on the first try with no intention of revisiting previously made decisions. That fear is what causes analysis paralysis. Again, the goal is to learn as fast as possible what to do or what not to do.

This continued focus on learning helped us navigate down the rabbit hole and, arguably more importantly, back out of the rabbit hole. If the plan is to have a data source exist elsewhere, the fastest way to learn about it is: put a data source elsewhere. Sounds obvious, right? Maybe, but I’ve read enough horror stories about projects that are curated and perfected in a lab setting and then when released into the wild, crash and burn…miserably.

Here’s the summary of our first three weeks:

  1. We’re going to need to change all the code that depends on this particular data set and depends directly on those database tables. We’ll need to do an audit to find and understand all the usages, but first…
  2. In order for the application to talk to a different system for a data source, we need a different system up (infrastructure, containers)
  3. We need to be able to deploy changes there (Jenkins, Kubernetes)
  4. We should get a baseline idea as to how long round trips might take (network latency)
  5. What do other tools look like against a generic throughput benchmark? (frameworks)
  6. How long does it take to actually read some data with these different frameworks? (database latency across data centers)
  7. Let’s hook this up to a reasonable-traffic, low-risk spot in the existing application (see the sketch after this list)
  8. Hey, look at this other part of the application. That’s going to be tricky — we might need to refactor parts of it. For now, let’s see how the application behaves with this tweak.
  9. Hey this other spot looks like it might need to be refactored…and this other spot…and this other spot…wait a minute…
  10. Remember that audit? I don’t think it’s going to be an audit of “how is this used?” We know how it’s used; this is going to be an audit of “what range of complexity are you going to experience breaking the existing application’s assumptions about a particular data source?”
  11. Oh boy…
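
For item 7 above, here’s a sketch of what “hooking it up to one low-risk spot” could look like, reusing the CustomerSource classes sketched earlier (the settings names are hypothetical): a flag picks the source at that one call site, and flipping it off falls straight back to the database path.

```python
from django.conf import settings
from django.db import connections

# WebCustomerSource and DatabaseCustomerSource come from the earlier sketch.


def customer_source():
    # USE_CUSTOMER_SERVICE and CUSTOMER_SERVICE_URL are hypothetical settings;
    # the flag defaults to off so every other call site keeps its old behavior.
    if getattr(settings, "USE_CUSTOMER_SERVICE", False):
        return WebCustomerSource(settings.CUSTOMER_SERVICE_URL)
    return DatabaseCustomerSource(connections["default"])
```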

By generating data we could examine and challenge assumptions, refine the problem statements, and iterate. The questions above regarding network latency, DB latency, and so on were not neatly laid out the first time we sat down and said we wanted to do microservices. That took time. Some we knew would eventually come up, but that becomes a question of “when,” not “what.” We thought about where we were, where we wanted to go, and bit by bit, we let our questions guide us. The first time we sat down the question was “Where do we begin? This is a massive undertaking…” It’s still a massive undertaking. There was and still is a beautiful mixture of excitement, fear, dread, and curiosity.

So what’s next?

Maybe we won’t need to deal with all those problems we thought were important; maybe we will; we’ll probably need to deal with a subset. When that time comes, we’ll have at least broken the data source assumptions in the monolith. Turns out there’s a happy accident/byproduct that arises from this approach: we may find that there are no global latency problems. They may only exist in subsections of the monolithic application. Depending on how you break down the assumptions, it stands to reason that you should be able to tackle those issues on a case by case basis. The next part in this series will shed some light on the details of how we’re trying to ensure we have the flexibility we need, where we need it, should any emergencies arise (and they most likely will).

All that said, we actually already have a handful of microservices in our platform. They have separate databases; they own specific data sets that others rely on and provide access to when necessary; they can be deployed independently; they provide specific sets of functionality independent of one another; they do so with HTTP 1.0. Granted, they’re all in the same data center, but still — we’re doing it! Speaking from my own personal experience — don’t get caught up in the “micro” of microservices. If you have two separate systems that own different responsibilities and can communicate with one another, then you’re already doing it.

My next questions to you would be: What’s the most important part of your system that generates value that would benefit from greater independence? What does your team think gets in the way and would be helpful if it lived someplace else? What’s the most important problem you need to solve right now? Figure out your problem statements and let them be your guides.
