A Journey to Kubernetes

Published in Engineering @ Wave

8 min read · Jun 21, 2023

Preface

Our journey starts in a world of mess. Like many other growing software companies, we have accumulated partially completed upgrades and migrations, leaving us with a mix of solutions that all need attention. Wave’s application deployment platforms need some love.

We have collected a hodgepodge of versions of our legacy solution: an off-the-shelf, ECS-based platform as a service on top of which we have built a lot of custom tooling. Deployments take forever and recovering from failed deployments takes even longer. Our costs for this tool are about to go way up and support is dwindling as we fall further and further behind in our upgrades.

We have known for some time that we are outgrowing our current environment. We have tested the waters with a variety of products and alternatives, some of which made it to production. One of these was a Kubernetes (EKS) solution that opened our eyes to both the possibilities and the challenges of maintaining a Kubernetes environment. With all these solutions in the pool, Wave’s OpsEng team is struggling to keep up with demand.

It’s time for change.

Consultations

We started off by consulting developers. As you would expect, we came back with a laundry list of wants and needs. Some people wanted pie in the sky and others wanted to leave everything just the way it was. More importantly, we came out the other end with a newfound connection to the dev teams.

Not everyone’s ideas would make it to the final product, but we were there and we listened. The developers felt heard. We compiled our notes and came up with our wish list.

Research

With a dream in hand, we set out to discover a new solution. We quickly realized how many options were out there.

Upgrading our legacy solution was at the top of our list. Wouldn’t it be nice if a simple upgrade could fix our problems? When we tested the latest version of our current platform, we found that a lot had changed but the product was fundamentally similar. Many of the limitations remained. An upgrade would not improve our ability to move at pace.

Through our investigations the way forward became clear: our new solution would be built on Kubernetes, running on EKS. We all signed up for a Certified Kubernetes Administrator (CKA) course and leveled up our Kubernetes knowledge as a team to make the path forward possible.

An Amalgamation

The Platform team at Wave was structured with multiple sub-teams. Within those, the DevSystems team was responsible for developer tooling and CI/CD while the OpsEng team was responsible for infrastructure and operations. This divide had been a struggle for Wave’s long-held vision of a DevOps culture. Our path to Kubernetes started out as an OpsEng team project, but it was clear that we needed to break down these barriers to make the project succeed. We merged the DevSystems and OpsEng teams into one DevOps team. It didn’t take long to realize this was a great move. The expanded team gave us a greater understanding of the Wave ecosystem. We had a wider breadth of experience to lean on when making decisions. Perhaps most importantly, everyone felt empowered to collaborate with ease and learn from one another.

Proof of Concept

We started working on our ground-up solution. This would be a proof-of-concept development environment. We would build our own deployment pipelines, support branch deployments and have database snapshots pre-filled with sample data to use in testing. Our four environments used for development and staging would be retired. In production we would have canaries and blue/green deployments. It would be magical and the sun would shine every day.

As we built out our dream solution we discovered a lot of things. We learned how tightly our legacy solution was integrated with our applications. There were hard-coded assumptions throughout the code base that needed fixing. Our container images were not portable between environments: each environment required its own rebuild. This would all need fixing before we could move forward.
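One common way to remove hard-coded, per-environment assumptions is to read settings from the environment at startup, so that a single container image can run unchanged everywhere. The sketch below illustrates that general pattern only; it is not Wave’s code, and the variable names (`DATABASE_URL`, `API_BASE_URL`, `DEBUG`) are hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    database_url: str
    api_base_url: str
    debug: bool


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from environment variables instead of baked-in values.

    The variable names here are illustrative assumptions, not Wave's.
    """
    env = os.environ if env is None else env
    required = ("DATABASE_URL", "API_BASE_URL")
    missing = [name for name in required if name not in env]
    if missing:
        # Fail fast at startup rather than falling back to a hard-coded default.
        raise RuntimeError(f"missing required settings: {', '.join(missing)}")
    return Settings(
        database_url=env["DATABASE_URL"],
        api_base_url=env["API_BASE_URL"],
        debug=env.get("DEBUG", "false").lower() == "true",
    )
```

With something like this in place, only the injected values differ between development, staging and production; the image itself does not need rebuilding.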

We spent months developing and iterating, trying to get Wave to work on our shiny new PoC called “makara”. Makara was a completely new development environment built with EKS from the ground up. We eventually came to realize that we were wasting our time. We continually stumbled on creating net-new databases, setting up VPCs and security groups, managing third-party allow lists, etc. There were so many supporting changes that needed to happen. We were spending so much time dealing with the intricacies of duplicating Wave that our dream platform goals had taken a back seat. We decided to cut our losses, close out our PoC and move on.

Ok, Let’s Party

Photo by Pineapple Supply Co. on Unsplash

We would proceed by building our solution on one of our existing development environments. We picked one that got infrequent use so the developers wouldn’t miss it. The nuances of Wave were already configured there, so we could focus on the project.

We integrated ArgoCD to reduce the amount of work we needed to do. We began by working on tooling and getting apps migrated over. With new budget and time constraints, we focussed on an MVP. We cut out a lot and moved it to our “tech debt” pile.

We built migration runbooks to assist us in migrating applications. We built new cluster runbooks. The runbooks would be our migration bible for iterating through Wave’s environments.

We mobbed on hard challenges once a week. Mob programming is a technique where a group of collaborators work together in real time on one task. Mobbing helped us to level up and share knowledge across the team. The day we mobbed on running our end-to-end tests against our first completed environment was an exciting one. Dan (manager of DevOps) jumped right out of his chair and screamed in excitement.

We migrated many of our environments efficiently with the aid of our runbooks. We avoided mistakes and had consistency in our changes.

The Eye Opener

Part way through migrating Wave’s environments we started getting complaints from developers. They were frustrated and upset with being unable to test their code changes. They claimed they were down to one legacy environment to test in. Wait, what? In our minds we had released several Kubernetes-based environments and they had been in use for weeks. What happened?

Developers didn’t realize that they could use the recently released environments. They thought the environments were still off-limits. We were nearing production readiness, yet because of our failure to communicate, no one had tried out our new developer tooling.

We re-communicated and found out we had a lot of work to do. We had been so focussed on team-oriented documentation and on making progress that we completely forgot to write any developer-oriented documents. Our developers were lost and didn’t know how to deploy, look at logs or run database migrations. They were like fish out of water.

We split into teams on a mission. We made it our top priority. We divided up the work to document and make these Kubernetes environments usable. Within a week developers were up and running in the new Kubernetes cluster.

This was a huge eye opener for us. Early feedback and developer input are very important to us. We wanted to have free-flowing communication channels. Instead we ended up with an episode of Trading Spaces gone wrong. We changed how we communicated with the developers. We started being more specific and clear. We involved managers to ask for time from teams. Engineering management began promoting us and became the biggest advocates of our work.

Ship-It!

Well, it was finally time to roll up our sleeves and migrate production. We decided to migrate our staging environment and production together, app by app. This would give us immediate practice before each production migration as well as a way to reproduce any issue that might arise.

There were some tradeoffs to this approach. We would be blocking our most prod-like environment for the duration of the migration. Developers looking to test certain features would be unable to. Migrating two environments would also take us twice as long. We judged it well worth it for a smoother transition in production.
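The staging-first ordering described above can be sketched as a small loop. This is only an illustration of the sequencing, not our actual runbook tooling: `cutover` is a hypothetical callable standing in for whatever performs and verifies a cutover, and the application names in the usage note are made up.

```python
from typing import Callable, Iterable, List, Tuple


def migrate_all(
    apps: Iterable[str],
    cutover: Callable[[str, str], bool],
) -> List[Tuple[str, str]]:
    """Cut each app over in staging, then in production.

    `cutover(app, env)` is a hypothetical hook that performs the cutover and
    returns True when post-cutover checks pass. The first failure halts the
    whole run, so production is never touched for an app that failed in staging.
    """
    completed: List[Tuple[str, str]] = []
    for app in apps:
        for env in ("staging", "production"):
            if not cutover(app, env):
                raise RuntimeError(f"{app} failed in {env}; halting migration")
            completed.append((app, env))
    return completed
```

Calling `migrate_all(["billing", "invoices"], cutover)` would attempt billing in staging, then billing in production, and only then move on to invoices.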

We split into two teams to migrate three applications per week. We would have one month of leeway at the end to meet our timelines. For each application we planned a zero-downtime cutover of staging, followed by the same in production. We met on Zoom with four DevOps staff and one or more application developers on the call. We had five responsibilities:

Driver: Inputs commands and writes code.
Navigator: Leads and directs the team’s actions.
Scribe: Meticulous note taker to document the migration.
Runner: Runs communications outside the migration team.
Application Support: Application developers. Subject matter experts on the application and its code.

This worked very well. We could combine some of the support tasks if someone was on vacation or required for other work. For example, the navigator could take over scribe duty. Similarly, the scribe could perform runner tasks. We always had a quorum of staff to handle the migration as it was happening.

Having application developers on the call was a godsend. They could test as we migrated each service and worker to verify success. They had domain knowledge to identify risky areas and highlight things that we would not have otherwise known about.

We finished the project successfully and on time. This was a milestone victory for the team, and we learned a lot along the way.

Key Take-Aways

Photo by Glenn Carstens-Peters on Unsplash

There are several key learnings that we had during this project.

Consult:
– Consult developers early and often to foster relationships and set expectations.
Scope:
– Tightly scope proofs of concept around explicit goals.
Communicate:
– Communicate clearly and get managers involved.
Document:
– Documentation needs to be from multiple perspectives.
– Keep detailed documentation of your rollout for incident response.
Overstaff:
– Err on the side of too many people on migrations and always include application developers. Better to have someone bored than missing.

What’s Next?

Kubernetes has opened the floodgates for DevOps at Wave. We are already testing out canary deploys. By reusing container builds, we ship exactly what we test. We are currently interviewing engineers as part of reimagining our developer experience. We dream of a day when developers make meaningful, impactful code changes that ship to production, all on day 1.
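To make “ship exactly what we test” concrete, here is a minimal sketch of a promotion gate that only lets a build reach production if that exact image digest already passed in staging. It shows the general idea under stated assumptions and does not describe our pipeline: the `PromotionGate` class, the application names and the digest strings are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class PromotionGate:
    """Remembers which image digests passed tests in staging, per application."""

    tested: Dict[str, Set[str]] = field(default_factory=dict)

    def record_pass(self, app: str, digest: str) -> None:
        # Called after a build passes its staging checks.
        self.tested.setdefault(app, set()).add(digest)

    def can_promote(self, app: str, digest: str) -> bool:
        # Only the exact bytes that were tested may ship; a rebuild gets a
        # new digest and therefore has to be tested again.
        return digest in self.tested.get(app, set())


gate = PromotionGate()
gate.record_pass("billing", "sha256:abc")
print(gate.can_promote("billing", "sha256:abc"))  # the tested build
print(gate.can_promote("billing", "sha256:def"))  # an untested rebuild
```

Keying the check on the digest rather than a tag is the design choice that matters here, since a tag can be moved to point at a different build.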