PlanGrid’s migration from Heroku to Kubernetes

David Black
PlanGrid Technology
7 min read · Jul 26, 2018

When PlanGrid was first starting out, we were focused exclusively on rapidly delivering product and proving out our business model. Heroku was a great fit at that time, because we did not have to worry about provisioning servers or dealing with CI/CD pipelines.

However as we grew and our needs changed, we ran into several problems:

  1. We had to split our infrastructure between Heroku and EC2, which forced us to poke holes in our firewalls.
  2. Breaking apart our monolithic main service was impossible because of Heroku’s routing delays and lack of SSL termination capacity; we worked around this by adding HTTP request pools and threads to our apps, but never completely fixed the problem.
  3. We needed finer-grained control over where data is stored and how network requests are routed in order to support a global build-out.

(These three problems are discussed in further detail below.)

To solve these problems, we spent a lot of time investigating and then spent nearly a year migrating to Kubernetes running on EC2. We also implemented continuous delivery pipelines using Netflix’s Spinnaker tool in order to ease the transition over to Kubernetes.

The problems

Split Infrastructure

Running our heavyweight async workers on Heroku was too costly, so we ran these workers on EC2 instead, which resulted in our apps being split between Heroku and EC2. This split required us to maintain separate deployment tools for Heroku and for EC2, and it prevented us from developing tools and processes that applied equally to apps deployed in the separate environments.

Splitting our infrastructure between Heroku and EC2 also caused serious security problems. Our PostgreSQL database server was too large to fit on Heroku Postgres, so we ran it on AWS EC2 instead. We then had to establish a way for Heroku-based apps to communicate with the EC2-based Postgres server. Because Heroku cannot tell you what IP address your app will be running on, it was impossible to lock down ingress to our Postgres server by IP address; we ended up running our Postgres server fully exposed to the internet (with all connections encrypted with SSL and authenticated with Postgres passwords). Quite apart from being a potential security vulnerability, this exposure became a logistical problem when it came time to become SOC 2 certified, because we were unable to make blanket statements like “all of our servers/databases are not directly accessible from the internet.”
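To illustrate the kind of lockdown that Heroku’s dynamic dyno addresses made impossible: once both the apps and the database live in infrastructure you control, restricting Postgres ingress to a private address range is a single security-group rule. The security group ID and CIDR below are placeholders, not our actual configuration:

```sh
# Allow Postgres (port 5432) traffic only from hosts inside a private VPC CIDR.
# Group ID and CIDR block are placeholder values for illustration.
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp \
  --port 5432 \
  --cidr 10.0.0.0/16
```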

Whenever there are special-case exceptions to things like firewall rules, it becomes exponentially more difficult to reason about your infrastructure and to ensure that it has been securely locked down. And it was impossible for us to make blanket statements in response to customer security questionnaires as long as our infrastructure was split between Heroku and EC2.

Routing

Every HTTP request to our Heroku apps traversed Heroku’s custom HTTP Router as well as their SSL termination service. The router and SSL termination introduced highly variable and totally opaque latency to every HTTP request. That added latency proved untenable for service-to-service communication between Heroku apps, because each internal hop paid the routing and SSL-termination penalty again, so the delays compounded. As we began adding significant service-to-service communication to our architecture, this request latency made it impossible to stay with Heroku.

Here’s what our request queuing looked like under Heroku:

And here is what it looks like today, running on Kubernetes:

Fine-grained control over network/data

As our customer base grows, we will eventually need to expand our datacenter footprint to keep latency low for all customers, regardless of where they are. Heroku does offer the ability to launch applications in data centers outside the USA, but we wanted more control than Heroku provides. In particular, we needed to be able to simulate a region failure, which requires fine-grained control over request routing and networking internals that Heroku intentionally shields from view.

Getting Started

In order to decide how to proceed with our migration away from Heroku, we spent 6 weeks trying out various deployment tools and cluster managers and eventually settled on using Netflix’s Spinnaker continuous delivery tool in conjunction with Kubernetes. Because we would completely own Kubernetes ourselves, we would have enough network-level control to solve the three problems driving us away from Heroku.

However, we wanted to give developers a better experience than the bare kubectl command line tool provides, and we also wanted a way to automate our deployment pipelines. Spinnaker solved both of these problems. A future blog post will delve into the tools we added to keep developers as productive as they had been on Heroku.

Why we like Kubernetes

When we surveyed the landscape in early 2017 to decide on our path forward, it was clear that there were basically only two viable options: Kubernetes or Amazon ECS. We had some minimal experience running trivial apps in ECS (with unimpressive results), so I decided to spend one of our bi-annual hack weeks playing around with Kubernetes. I was impressed with Kubernetes’s feature completeness, even 18 months ago. Since we were only looking to run stateless apps, it met our requirements and was easier to deploy and manage than ECS.

In particular, Kubernetes made it easy for us to launch a proof-of-concept app and expose it as a public-facing HTTP service with just a few CLI commands. That’s because Kubernetes itself creates and manages AWS ELBs, unlike ECS, where the ELBs are managed separately from the app. The kubectl command line tool was also vastly superior to the generic, all-purpose awscli tool that’s used to manage everything in the AWS universe.
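As a rough illustration of how little is involved, a proof-of-concept deployment boils down to a couple of kubectl commands. The app name and image below are hypothetical, and exact flags vary by kubectl version; the key point is that a Service of type LoadBalancer is enough for Kubernetes to provision and manage the ELB itself:

```sh
# Run a hypothetical app as a Deployment
kubectl create deployment hello-poc --image=nginx:1.15

# Expose it publicly; on AWS, type=LoadBalancer makes Kubernetes create and manage an ELB
kubectl expose deployment hello-poc --type=LoadBalancer --port=80 --target-port=80

# Wait for the ELB hostname to show up under EXTERNAL-IP
kubectl get service hello-poc --watch
```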

We’ve been quite happy with our choice of Kubernetes. Kubernetes appears to have won the container war, with Amazon releasing its managed EKS offering. The growing acceptance of Kubernetes as the de-facto standard for running production containers has enabled the amazingly rapid development and adoption of the Istio service mesh, which we are in the process of implementing.

Why we like Spinnaker

Pipelines

Spinnaker’s job is to take a deployment artifact (e.g. a git commit or a Docker image) and deploy it. To accomplish this, Spinnaker has a JSON-based workflow language built around pipelines, stages, and tasks. As a developer, you declare your desired deployment flow by building out a pipeline in the Spinnaker web UI. Each pipeline is composed of one or more stages, executed serially or in parallel.

Here’s an example of what a pipeline looks like in the Spinnaker UI:

By chaining together stages, developers can model almost any deploy workflow. A particularly flexible stage is the “Run Jenkins Job” stage type, since it allows for executing arbitrary code as part of a deployment pipeline and integrates with existing Jenkins jobs for running test suites.
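To make the stage-chaining idea concrete, here is a heavily simplified sketch of the JSON behind a pipeline that runs a Jenkins test job and, only if it succeeds, deploys to Kubernetes. The application, Jenkins master, job, and account names are made up, and the exact fields differ across Spinnaker versions and cloud providers:

```json
{
  "application": "example-app",
  "name": "Deploy to staging",
  "stages": [
    {
      "refId": "1",
      "type": "jenkins",
      "name": "Run test suite",
      "master": "jenkins-ci",
      "job": "example-app-tests"
    },
    {
      "refId": "2",
      "requisiteStageRefIds": ["1"],
      "type": "deployManifest",
      "name": "Deploy to Kubernetes",
      "account": "k8s-staging"
    }
  ]
}
```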

Our Release Operations (RelOps) team is currently adding a pluggable post-deploy step that lets developers execute jobs after their apps are deployed, such as running a Cypress test suite against a newly deployed app. RelOps was able to do this because of Spinnaker’s built-in Jenkins integration.

“Just Enough Ops”

In addition to providing an easy way to deploy code, Heroku also provides a web UI for managing applications in production. We wanted to allow developers to be able to make certain limited changes in production while also not giving them full access to the kubectl Kubernetes CLI tool, which we felt was too permissive and also not a great developer experience.

Using the Spinnaker web UI, developers can scale up capacity, gracefully restart their apps, or roll back to a specific release. Because Spinnaker does not attempt to be a comprehensive Kubernetes web front-end, its UI is much simpler and more approachable than more full-featured competitors such as OpenShift. We viewed this UI simplicity as a very important feature in its own right.
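For comparison, the raw kubectl operations that these UI actions correspond to look roughly like the following (the deployment name and revision number are hypothetical); Spinnaker wraps this in a simpler, more guarded interface:

```sh
# Scale up capacity
kubectl scale deployment example-app --replicas=10

# Inspect rollout history and roll back to a specific revision
kubectl rollout history deployment/example-app
kubectl rollout undo deployment/example-app --to-revision=3
```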

Active development

Spinnaker has a very active community, with dedicated engineers from Netflix, Google, and many other companies improving it all the time. There’s at least one company (armory.io) offering a hosted/supported version, and there are Spinnaker meetups in San Francisco.

Spinnaker recently added support for automated canary analysis, which we’re looking forward to playing around with.

How it went

It took us the better part of a year to fully transition off of Heroku and onto Kubernetes on EC2. Future blog posts will delve into the various challenges we faced during the move. In the end, we were able to pull off the migration with minimal customer disruption. Our developers are much happier on Kubernetes, and we’re able to move faster and with more confidence.

In future blog posts, we will dive into more detail around what alternatives we rejected and the various components of our new, Kubernetes-based deployment pipelines.
