Moving a platform into Kubernetes

Steve Brunton
Yik Yak Engineering
Jan 30, 2017

Previously we discussed auto scaling within AWS and programming language selection for re-platforming the Yik Yak backend. These were precursors to the discussion of moving platform services off individual EC2 instances and into Kubernetes. There was a process involved, though: first proving that containers were a valid solution, and then migrating services into Kubernetes over time. This is that story.

Containerize things:

CodeDeploy was working, but it wasn’t efficient once the development lifecycle of the existing platform began to speed up. Deploys took a while, autoscaling was slow, and rollbacks were cumbersome. An AMI baking service was created that used Packer to build AMIs, so the core codebase was already on the machine instance at boot time. A simple wrapper script around Chef Solo then configured the application for the environment the machine was booted into. This was an acceptable solution until we started running into AWS API request limits and images that failed to build.

By the middle of Q1 2016 we had our first new platform service written in Go and ready for deployment into QA and then Production, supporting a new product feature that would allow Yaks to be shared between users. This was a perfect candidate to be built into a container and deployed. To prove that this would work, an AutoScale group was created in AWS with Terraform, and Ansible was used to deploy new versions to the hosts. This worked fine, and we proceeded with the next new service. A more in-depth blog post about building and deployment will be forthcoming.

Our second service, responsible for SMS notification pushes, was built into a container and started to show the AWS Auto Scaling issues mentioned in an earlier post. It ran fine on a couple of AWS EC2 instances with normal everyday use, but now that we had a better, more performant service we wanted to start sending more targeted notifications out to users. As the bulk payloads increased, the autoscale group would slowly spin up an instance, pull down the Docker image, and start to process the queue. This would repeat, slowly, until we hit the maximum number of instances in the group, at which point it would chug along. The problem is that scaling up took a long time, and push notifications weren’t being delivered fast enough around specific timed events. Of course, the workaround was to spin up two instances at a time, pay for the minimum hour of usage on two or four instances, and then scale down. This was suboptimal and just reiterated the downsides of AWS Auto Scaling.

Toss it into Kubernetes:

By now we had gone through a couple of iterations of building up and tearing down Kubernetes clusters to find the best fit for what we were planning to do. A cluster was created using Terraform and Ansible for both QA and Production, with the QA cluster using namespaces to differentiate between the dev and qa environments. The Production cluster had its own namespace as well, but was not shared with other environments. We built a tool (the subject of a later blog post) to deploy services onto the appropriate cluster and into the correct namespace, along with the necessary config information for the environment.
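As a rough sketch of that layout (the namespace names and the service name below are assumptions, not necessarily what we used), the QA cluster simply carried one namespace per environment, and each deployment was pinned to one of them:

```yaml
# Two namespaces in the QA cluster, one per environment (names assumed).
apiVersion: v1
kind: Namespace
metadata:
  name: dev
---
apiVersion: v1
kind: Namespace
metadata:
  name: qa
---
# A service deployment targeted at one environment via its namespace.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: share-service        # hypothetical service name
  namespace: qa
spec:
  replicas: 2
  selector:
    matchLabels:
      app: share-service
  template:
    metadata:
      labels:
        app: share-service
    spec:
      containers:
      - name: share-service
        image: registry.example.com/share-service:latest   # placeholder image reference
```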

The timing lined up with the start of the second quarter of the school year and a very large push notification campaign. This was one of the primary driving forces to get Kubernetes up and running when we did: we knew the existing auto scaling of the notification system would work, but that it would take a long time to catch up given the size of this campaign.

The campaign kicked off and notification messages started flowing. Where in AWS we had limited the auto scaling to six hosts, we configured the Horizontal Pod Autoscaler in Kubernetes to max out at ten pods. We watched the system almost instantly scale up to eight pods and then max out at ten as it processed through the campaign. Had this been running on AWS instances, it would have taken around 15–20 minutes to scale up to full group capacity.
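For reference, an HPA capped at ten replicas looks roughly like the sketch below; the deployment name, namespace, minimum replica count, and CPU target are assumptions, since the post doesn’t specify the exact metric we scaled on.

```yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: push-notifications            # hypothetical name for the notification service
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: push-notifications
  minReplicas: 2                       # assumed floor
  maxReplicas: 10                      # the ten-pod cap mentioned above
  targetCPUUtilizationPercentage: 70   # assumed scaling signal
```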

OK, now what?

Once we had proven that Kubernetes works, and works pretty well, we started to dig into all the little details: logging and monitoring, along with how to integrate current and new services into a healthy hybrid relationship.

In solving the logging and monitoring pieces we made some very opinionated choices (which will be discussed in a later post) in how we named clusters, environments and services. This allowed us to add attributes to Kubernetes nodes that match labels on application pods, so we can correlate data throughout the infrastructure. For example, custom DataDog and Fluentd services are deployed as DaemonSets in each cluster with these attributes applied, so the SRE team can monitor and be alerted as necessary.
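A minimal sketch of that pattern is below; the label keys and values are assumptions rather than our actual naming scheme, and the image is a placeholder.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd                    # stands in for our custom Fluentd service
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: fluentd
  template:
    metadata:
      labels:
        app: fluentd
        cluster: prod-1            # assumed label: which cluster this pod runs in
        environment: production    # assumed label: which environment it belongs to
    spec:
      containers:
      - name: fluentd
        image: fluent/fluentd:edge   # placeholder image reference
```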

For the integration piece, we built and deployed an nginx reverse proxy that handles all inbound requests from the client. This gave us better ingress metrics than we had previously achieved through statsd information sent to DataDog, and it also marked the start of standard logging practices: JSON output to Fluentd and an EFK solution. The reverse proxy started off sending all traffic to the old PHP stack, and as new services were deployed into the Kubernetes cluster, requests were routed to the correct service based on path matching rules. (Such are the acrobatics needed to migrate backends while keeping the services live.)
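The routing idea looks roughly like the sketch below. Everything in it, from the ConfigMap name to the paths, service names, and legacy upstream host, is hypothetical; it is shown as a ConfigMap holding an nginx config fragment simply because that is a common way to configure such a proxy in-cluster.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: edge-proxy-conf              # hypothetical name
  namespace: production
data:
  default.conf: |
    server {
      listen 80;

      # Requests matching a known path go to the new service in the cluster.
      location /api/share {
        proxy_pass http://share-service.production.svc.cluster.local;
      }

      # Everything else still falls through to the legacy PHP stack.
      location / {
        proxy_pass http://legacy-php.example.internal;
      }
    }
```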

This initial roll out into Kubernetes in AWS taught us a great deal about migration, monitoring and logging, and about moving from services on individual EC2 instances to containers within an orchestration system. These lessons proved very beneficial as we migrated out of AWS and over to Google Cloud Platform, which will be discussed later in this series.
