Migrating a Monolith to Google Kubernetes Engine (GKE) — Customer Story

Get Cooking in Cloud

Priyanka Vergadia
Google Cloud - Community
11 min read · Mar 24, 2020


Introduction

Get Cooking in Cloud is a blog and video series to help enterprises and developers build business solutions on Google Cloud. In this third miniseries we are covering Migrating a Monolith to Google Kubernetes Engine (GKE). Migrating a monolith to microservices can be intimidating. Once you decide to take it on, what should you consider? Keep reading…

In these articles, we will take you through the entire journey of migrating a monolith to microservices: the migration process, what to migrate first, the different stages of migration, and how to deal with data migration. Our inspiration for these articles is this solutions article. We will top it all off with a real customer story walking through those steps in a real-world application.

Here are all the articles in this miniseries for you to check out.

  1. Migrating a Monolith to GKE: An Overview
  2. Migrating a monolith to GKE: Migration Process
  3. Migrating a monolith to GKE: Migrate in stages
  4. Migrating a monolith to GKE: What to migrate first?
  5. Migrating a monolith to GKE: Data migration
  6. Migrating a monolith to GKE: Customer Story (this article)

In this article, we are exploring a real-world use case with xMatters and how they went about their migration journey from a monolith to a microservices-based architecture on Google Kubernetes Engine. So, read on!

What you’ll learn

  • Tips to execute an end-to-end migration from a monolith to microservices

Prerequisites

  • Familiarity with the basic concepts and constructs of Google Cloud, so you can recognize the names of the products.
  • Check out the introduction to the Get Cooking in Cloud series.

Check out the video

Interview with xMatters on their monolith to microservices migration journey

Questions:

I had the opportunity to sit with Travis DePuy from xMatters. Here is the rundown of our conversation about their migration journey:

❓Priyanka: What is xMatters and what does the tool do?

🅰️ Travis: We help keep digital services up and running. When stuff explodes, we help minimize the blast radius. We do this by tracking down the current on-call resources and delivering critical information to the right people at the right time.

Here we see part of the xMatters flow designer canvas showing a Stackdriver alarm workflow. The first step is the inbound HTTP trigger that parses the payload and kicks things off. Then we start enriching the alarm with metadata that Stackdriver might not have; in this case we are getting the last commit and deployment info from GitHub and Jenkins. Finally, the last step is what actually fires the xMatters event and notifies people.
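
That first parsing step can be sketched in a few lines. This is a minimal illustration, not xMatters' actual flow code; the payload fields follow the shape of a Cloud Monitoring (Stackdriver) alerting webhook, trimmed and with illustrative values.

```python
import json

# Illustrative Stackdriver (Cloud Monitoring) alerting webhook payload.
# Field names follow the webhook format; values are made up for the example.
raw = json.dumps({
    "incident": {
        "incident_id": "0.abc123",
        "policy_name": "high-error-rate",
        "state": "open",
        "summary": "Error rate above 5% for service checkout",
    }
})

def parse_alarm(body: str) -> dict:
    """Extract the fields that downstream flow steps would need."""
    incident = json.loads(body).get("incident", {})
    return {
        "id": incident.get("incident_id"),
        "policy": incident.get("policy_name"),
        "state": incident.get("state"),
        "summary": incident.get("summary"),
    }

alarm = parse_alarm(raw)
print(alarm["policy"])  # high-error-rate
```

From here, later steps in the canvas would take `alarm["id"]` and friends and enrich them with the GitHub and Jenkins lookups described above.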

Inbound trigger from Stackdriver and how it is parsed in xMatters Canvas

Now we see Dan receiving the Stackdriver notification on his mobile device. He reviews the stock info from Stackdriver, and then he can review the enriched data from GitHub and Jenkins.

He has a couple of different response options, which I'll talk more about in a minute, but he decides this is severe enough that we need to roll back from the current green deployment to our blue deployment. So he selects the response option and adds an optional comment.

Back in the xMatters canvas, we see the various flows that kick off when a user selects the relevant response option. Every organization has major incidents and has processes around kicking them off. In this case we see that xMatters will automate creating a Jira issue, then a Status page incident and finally a Slack channel so that the teams can start collaborating.

xMatters Canvas

We also see the Rollback to blue option that Dan selected, and we see that a new Jira issue is created before triggering the redeploy using a Kubernetes command. The Jira issue helps keep track of who triggered the deployment and when, and also serves as a record for future investigation.
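
For illustration, a blue/green rollback like the one Dan triggered can be as small as repointing a Kubernetes Service selector from the green Deployment back to blue. The sketch below only builds the `kubectl` command rather than running it; the service name and color labels are assumptions for the example, not xMatters' actual resources.

```python
import json
import shlex

def rollback_command(service: str, target: str = "blue") -> str:
    """Build a kubectl command that repoints the Service selector to the
    target color, shifting traffic back to the blue Deployment."""
    patch = json.dumps({"spec": {"selector": {"app": service, "color": target}}})
    return f"kubectl patch service {service} -p {shlex.quote(patch)}"

print(rollback_command("notification-service"))
```

In a flow like the one above, a step would execute this command (or the equivalent API call) right after the Jira issue is created.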

The last one is my favorite, though: using a playbook tool such as Ansible, we can trigger an existing playbook to attempt to remediate the problem.

There are limitless possibilities and all of this can be tailored to the specific processes of the organization.

❓Priyanka: xMatters provides notifications to the right teams when things occur, like spikes in traffic or error rate.

🅰️ Travis: Yes, any time you have an application that needs to reach a human, that’s where we can provide value. But notifications are just one small piece in a larger workflow. Once you get the notification, you need to go look up other assorted details and then potentially take action depending on severity or other factors. And all of this is manual work, known as toil. It’s critical work, but takes away time from working the issue at hand. In response to this and the needs of our customers, we have built a workflow platform to enrich alarms before the notifications are sent out, but then also provide response options to automate that toil.

❓Priyanka: Does xMatters provide innovative DevOps teams the tools to maintain operational visibility and control in today’s highly fragmented IT environments?

🅰️ Travis: Exactly. We also work across a variety of industries, including healthcare, retail, manufacturing, and finance, to support a number of use cases such as DevOps, SRE, major incident management, and incident response.

❓Priyanka: How would you define the difference between a monolithic architecture and a microservices-based architecture?

Turducken — monolith

🅰️ Travis: Well, think of a monolith like a turducken, you know, that delicious monstrosity of meat? It is a chicken stuffed inside a duck, all stuffed inside a turkey. A monolith is hand-crafted and delivered to the ops teams to install and deploy. New versions are new artifacts and require new procedures and processes. Heaven forbid you want to change something: you have to pull everything apart, tweak that one thing, then recompile it all into one big monstrosity. All the pieces are highly intertwined and highly dependent on each other.

Cupcakes — microservices

Conversely, you can think about microservices as several different flavors of cupcakes. You have a couple different recipes and if you decide you need to tweak one recipe, make up new batter and deploy into your baking tins. Then you roll them out to your customers, and they all get the benefits of the latest and greatest you have to offer. Each recipe can be tweaked largely independently of the others.

❓Priyanka: Wow! That is the most delicious explanation of microservices I have ever heard! It seems that xMatters approached the cloud initially from the monolith perspective and over time came across challenges that led to the decision to move to a microservices-based architecture. What were some of those challenges?

🅰️ Travis: Each customer was running a very unique instance, which led to monitoring headaches, and debugging was a nightmare. Additionally, it took days to onboard a new customer, and the infrastructure drifted into a very bespoke solution as we released new features and moved off older operating systems. We also found that our Ops teams were focused on building the infrastructure platform instead of working with the developers to build the application.

❓Priyanka: So the main challenges with the monolithic application were operational inefficiencies, monitoring and debugging nightmares, and long customer onboarding. I am curious, what was the original monolithic state of the infrastructure and application?

🅰️ Travis: We were originally an on-premises application, so it made sense to have one big monolith.

xMatters Monolithic architecture before the migration

As our customers started embracing the cloud transition, so did we, and naturally we took our monolith and installed it onto servers we physically maintained.

We had space in 6 data centers all over the world, and soon our service mushroomed into 500 servers running OpenStack and a home-grown Platform as a Service.

This made for more than 5000 VMs to manage!

❓Priyanka: 5000 VMs is a lot! Is this about when you decided to move to Google Cloud?

🅰️ Travis: Yep, we realized this wasn’t sustainable. We really just wanted to build an enterprise application, not babysit hardware.

❓Priyanka: When you started out thinking about microservices based architecture, what was the end goal in mind?

🅰️ Travis: Primarily, scale. We knew maintaining 5000 VMs wasn’t going to be scalable, and there was so much redundancy: each instance had its own version of the web UI, its own notification service, its own scheduling service.

Application Gateway routing calls to different microservices

❓Priyanka: Yes, the application gateway approach is great when you are new to microservices. And how is this deployed on Google Cloud?

🅰️ Travis: We use Google Cloud DNS for IP address lookups; user devices such as browsers and mobile clients then end up on the global load balancer. They are then routed through our application gateway, which is a home-grown application composed of a few different services, including Consul, HAProxy, and our own configuration tool. This makes it easy for us to route devices to the appropriate data in the appropriate region, as well as providing service discovery.
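
As a toy model of what that gateway does (not xMatters' actual implementation, which combines Consul, HAProxy, and their own configuration tool), the routing logic boils down to two lookups: customer to region, then (region, service) to a backend. All names and endpoints below are illustrative assumptions.

```python
# Hypothetical routing tables; a real gateway would populate these from
# service discovery (e.g. Consul) rather than hard-coding them.
REGION_BY_CUSTOMER = {"acme": "us-east1", "globex": "europe-west1"}
BACKENDS = {
    ("us-east1", "notifications"): "http://notify.us-east1.internal",
    ("europe-west1", "notifications"): "http://notify.europe-west1.internal",
}

def route(customer: str, service: str) -> str:
    """Resolve a request to the backend holding that customer's data."""
    region = REGION_BY_CUSTOMER[customer]
    return BACKENDS[(region, service)]

print(route("acme", "notifications"))  # http://notify.us-east1.internal
```

Keeping this mapping in one place is what lets the gateway pin each customer's traffic to the region where their data lives.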

High level view of how microservices architecture can be setup

The main application microservices run in containers on Google Kubernetes Engine. These containers are all stored in Google Container Registry. The data back end is composed of PostgreSQL with Kafka and RabbitMQ, which run on Google Compute Engine. Monitoring is done with Prometheus, Splunk, and Stackdriver.

❓Priyanka: So, GKE is the star of the show for microservices, and GCE for the data back ends.

❓Priyanka: So, how did you map out a plan to break up the monolith? Did you pick out a few services to migrate first?

🅰️ Travis: Well, we started with some of the obvious aspects, such as the various user interfaces, APIs, and our integration builder services. As we went, we found other logical separations that made sense to break out, such as the event processing for alert suppression and on-call scheduling. The notification and data processing were also ripe for pulling out.

❓Priyanka: Yes, and the biggest challenge that I have seen in the process for most applications is data dependency. How did you deal with highly dependent services that shared data?

🅰️ Travis: Dependencies are inevitable. Microservices rarely function completely independently of one another. We made sure to release each service in cadence with the others it depended on.

❓Priyanka: So, basically: identify the dependencies between services early on, then plan a microservices rollout cadence that makes sense based on those dependencies. This definitely was a pretty large-scale migration; it must have taken a long time?

🅰️ Travis: Actually, not that bad. We completed it in about 12 months, which is good, because we had data center contracts coming up for renewal, and it was either spend the money to renew or get everything out.

With the flexibility of the GCP environment, we were able to do a soft rollout and keep our existing datacenter infrastructure running in parallel, which we did for about 6 months.

❓Priyanka: Yes, that is a very important point in any migration. You are going to be running on a hybrid platform for some period as you slowly migrate service by service into the cloud.

🅰️ Travis: And I think one of the biggest contributors to our success was that once we got the infrastructure in place, we mapped out the monitoring and observability. This was critical for knowing the impact of our changes.

❓Priyanka: Stackdriver becomes your best friend for such observability and monitoring. What role did Google Kubernetes Engine (GKE) play in this journey, as you were making these incremental changes?

🅰️ Travis: We use the sidecar pattern, which helps us leverage our investment in Splunk as well as our other observability applications. Using sidecars means we can be confident we can swap other applications in and out as our needs change.
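
A minimal sketch of that sidecar pattern in a Pod spec might look like the following. This is an illustration, not xMatters' actual manifest: the names, images, and log paths are assumptions, and the point is only that the forwarder container can be swapped out without touching the application container.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: notification-service
spec:
  volumes:
    - name: app-logs
      emptyDir: {}        # shared scratch volume between the two containers
  containers:
    - name: app
      image: gcr.io/example/notification-service:1.0   # hypothetical image
      volumeMounts:
        - name: app-logs
          mountPath: /var/log/app
    - name: log-forwarder  # swap this container to change observability backends
      image: example/log-forwarder:latest              # hypothetical image
      volumeMounts:
        - name: app-logs
          mountPath: /var/log/app
          readOnly: true
```

Because the app only writes to the shared volume, moving from Splunk to another backend is a change to the sidecar, not to the service itself.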

❓Priyanka: What is the most challenging part of running the application as microservices?

🅰️ Travis: Two things: learning the ropes of service ownership, and deploying those services effectively.

❓Priyanka: This change must have been new for the team. How did you make sure that the Dev and Ops teams felt comfortable with the changes?

🅰️ Travis: Jez Humble talks about how bad behavior occurs when you abstract people away from the consequences of their actions. So we aligned the teams to the services they built. This let them own the code they built and aligned them with their service level objectives.

❓Priyanka: Yes, that sense of ownership does wonders and keeps the teams accountable for their SLOs.

Now that you have gone through this migration process, how has the move proved to be helpful?

🅰️ Travis: Happier customers, which means happier teams, fewer late nights, and a reduction in the overall complexity of the application.

Empirically, we found a performance improvement of 43%, thanks both to the microservice architecture and to the network provided by Google Cloud. As I mentioned, we had happier teams, and we calculated a 60% reduction in incidents!

❓Priyanka: Happy teams, happy customers, a 60% reduction in incidents, and a 43% performance improvement! Sounds like a win to me!

What do you see as next steps for the xMatters team?

🅰️ Travis: We will be looking into service meshes. Right now the services talk to each other through the application gateway, so getting them to talk directly to each other in a true service mesh might gain us some additional improvement.

❓Priyanka: Yes, a service mesh like Istio seems like a logical next step to connect, secure, control, and observe services.

Thanks Travis, this was very insightful!

Conclusion

The main point: migrating a monolith to microservices on GKE is a complex process that won’t happen overnight. Planning beforehand (how will my services communicate, how will I manage data, what should I migrate first) helps make sure the process goes smoothly.

If you’re looking to migrate your existing monolithic platform to the cloud, you’ve got a recipe from xMatters! Stay tuned for more articles in the Get Cooking in Cloud series, and check out the references below for more details.

Next steps and references:

