K8s all the things!

We recently migrated our whole app stack to Kubernetes. Here is how it went (spoiler: very well).

Careship Engineering
5 min read · Oct 28, 2019


Photo by chuttersnap on Unsplash

Here at Careship we run and maintain several services that altogether compose our application stack (frontend apps, backend services, utilities, our main website, etc.). Before the migration, some of those services were running on ECS while some were manually managed on EC2 instances.

At the beginning of Q3 2019 we evaluated the situation and decided to migrate all of them to be managed by K8s. This is our story from the trenches: why we decided to migrate, the strategy we adopted and what we learned along the way.

Why did we take the decision to migrate?

Honestly, things were quite cluttered. Although we were using Docker images with ECS as the orchestrator, it was very difficult to scale our delivery rate because we had no proper delivery pipelines.

Engineers had to spend 4–5 hours of their time making sure each release went out as we expected, and rollbacks were difficult as well.
Updating ECS services came with its own challenges, and we had little visibility into what was happening behind the scenes.

It is also worth mentioning that our application stack was constantly growing, and ECS costs started to become a relevant factor. We estimated that with only a couple more services in the stack, we would have exceeded our budget running them on ECS.

K8s then seemed the obvious choice to a) bring down infrastructure costs and b) streamline the management of our running services. We also had the right knowledge in our team, so all checkboxes were ticked.

We initially estimated that the migration would take no more than a quarter. In hindsight that estimate was fairly realistic: we started at the beginning of Q3 and finished by the beginning of Q4.

Which strategy was decided on for ensuring a smooth migration?

We wanted to build the basics right, so we first tried to figure out which services were more critical to us and our product.

Once we were done with this assessment, we picked the most important service and migrated it first. This taught us a lot, as we had to make sure we were doing things properly from the start. Once the first and most important service was done, migrating the others was simpler and quite fast.

The biggest challenge was keeping the DNS entries pointing to the right place while we were migrating, and in the end we handled it quite neatly.

We started with our in-house cluster, which is meant for internal use; the QA, Dev and Staging environments are all deployed there. Later on we ran some stress tests and replicated the same setup for Production. It has worked very well for us so far.

Which kind of K8s setup was decided on in order to replace the current stack?

Regarding our cloud provider, we decided to stick with AWS, simply because we were already using it without any particular problems.

We decided, though, not to adopt an automated K8s operations tool like kops, and went with kubeadm instead: a single master and two worker nodes. We also keep regular etcd backups so that we can recover from a master failure. One of the important reasons for choosing kubeadm was to avoid vendor lock-in with AWS, and as a result our K8s installation is indeed quite decoupled from AWS.
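
For a rough idea of what this looks like in practice, the bootstrap and the backups boil down to a handful of commands. The pod CIDR and backup path below are illustrative rather than our exact values, and the certificate paths assume the default kubeadm layout with stacked etcd.

```
# On the single master: bootstrap the control plane (CIDR is illustrative).
sudo kubeadm init --pod-network-cidr=10.244.0.0/16

# Print a join command with a fresh token, then run the printed
# "kubeadm join ..." command on each of the two worker nodes.
sudo kubeadm token create --print-join-command

# Regular etcd snapshot to guard against a master failure.
sudo ETCDCTL_API=3 etcdctl snapshot save /var/backups/etcd-$(date +%F).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
```

The snapshot is the kind of thing you put on a schedule and ship off the master node, so the backup survives the very failure it is meant to protect against.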

We decided to use Terraform as our Infrastructure-As-Code tool and Ansible as our configuration manager.
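
In practice the two tools chain together into a short provisioning loop; the inventory and playbook names below are illustrative, not our actual repository layout.

```
# Provision the EC2 instances for the cluster nodes with Terraform.
terraform init
terraform plan -out=cluster.tfplan
terraform apply cluster.tfplan

# Configure the freshly created instances with Ansible: container runtime,
# kubelet, kubeadm and any node-level tuning live in the playbooks.
ansible-playbook -i inventory/hosts.ini playbooks/kubernetes-nodes.yml
```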

What are the benefits that came from adopting K8s?

We have no doubt K8s is currently the best container orchestrator available.

It brought us scalability and reliability in running our services, but most importantly it helped us to monitor our applications better and improve them.

We have more control over what and where we want to improve.
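
To make "more control" a bit more concrete: a release and a rollback on K8s each boil down to a couple of commands, compared to the hours we used to spend on ECS. The deployment and image names here are made up for the example.

```
# Roll out a new image and watch the rollout until it completes or fails.
kubectl set image deployment/backend-api backend-api=registry.example.com/backend-api:1.4.2
kubectl rollout status deployment/backend-api

# If something looks wrong, revert to the previous revision in one command.
kubectl rollout undo deployment/backend-api

# And it is easy to see what is actually happening behind the scenes.
kubectl describe deployment backend-api
kubectl get events --sort-by=.metadata.creationTimestamp
```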

We were quite confident that monitoring our services running in our K8s cluster would help us to save money and scale more efficiently.

Presently we run Prometheus and Alertmanager to monitor every aspect of the application, from our very specific business KPIs down to the underlying infrastructure.
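
For illustration, a Prometheus alerting rule looks roughly like the one below. The metric name and threshold are invented for the example, and the file path assumes a plain Prometheus installation rather than any particular operator.

```
# Write a sample rule file and validate it before reloading Prometheus.
cat <<'EOF' | sudo tee /etc/prometheus/rules/api-availability.yml
groups:
  - name: api-availability
    rules:
      - alert: HighErrorRate
        # Fire when more than 5% of requests fail over a 5-minute window.
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "More than 5% of API requests are failing"
EOF

promtool check rules /etc/prometheus/rules/api-availability.yml
```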

We run almost seven stages of the entire application stack, including the QA, Staging and Dev environments, and we make sure all of them are identical to Production while staying within a reasonable price range.
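
One common way to keep several near-identical stages affordable on the same cluster is a namespace per stage with a resource quota on top; the namespace name and limits below are purely illustrative.

```
# One namespace per stage keeps the environments isolated on shared nodes.
kubectl create namespace staging

# Cap how much CPU and memory the stage may claim in total.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: ResourceQuota
metadata:
  name: staging-quota
  namespace: staging
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
EOF
```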

What kind of problems did we face during or after the migration?

The biggest challenge was to migrate the application with the least possible downtime, and we had to deal with 2 important factors:

  • DNS propagation to new load balancers
  • Database migrations

After giving it some thought, we decided to connect both the old ECS services and the new K8s cluster to the same database cluster. Then, once the services were running on K8s, we migrated our DNS entries.

This trick worked quite well and kept downtime to very little, possibly zero. We monitored both the old and the new cluster to make sure all the traffic had switched over, and only then did we tear down the old one.
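
For those wondering what migrating the DNS entries looks like in practice: assuming the zone lives in Route 53, the cut-over is essentially a record update with a short TTL, followed by a propagation check before the old cluster goes away. The zone ID, record name and load balancer hostname below are placeholders.

```
# Point the record at the new cluster's load balancer with a short TTL,
# so traffic drains away from the old ECS setup quickly.
aws route53 change-resource-record-sets \
  --hosted-zone-id Z0000000EXAMPLE \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "api.example.com",
        "Type": "CNAME",
        "TTL": 60,
        "ResourceRecords": [{"Value": "new-k8s-ingress-lb.example.com"}]
      }
    }]
  }'

# Verify propagation before tearing the old cluster down.
dig +short api.example.com CNAME
```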

What kind of skill set and team capacity is needed in order to migrate and manage a K8s cluster? Would you recommend it to companies of any kind/size?

Maintaining a self-managed K8s cluster requires a good amount of work and a well-thought-out plan. It’s more about getting the basics right and leaving room for improvement. There will always be ongoing work in your cluster, so it is important to allocate proper resources for it.

If it is your first time, it is better to use a fully managed K8s offering for Production environments and, in parallel, run a self-managed cluster for in-house usage. Once the team has built up enough confidence, migrating from fully managed to self-managed will not be a problem. Whether to choose a fully managed or a self-managed cluster depends heavily on the team’s requirements and preferences.

We believe K8s should be the de facto choice for container orchestration and management, but the question of fully managed versus self-managed is best answered by looking at the experience of the team.

We think the team should practice DevOps. With the right mindset and attitude it is not a problem to manage a midsize cluster (10–15 nodes). The Kubernetes community is quite healthy, and if the team truly embraces DevOps, running and maintaining a cluster will be rewarding and competitive with other solutions.

Any conclusive words? Anything you would like to suggest to the reader or warn about?

There are a few important things to keep in mind when you run a K8s cluster.

  • Monitoring
  • Alerting
  • Disaster Recovery
  • Scaling

There should be a well-thought-out plan for all of the above right from the start.
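
On the scaling point in particular, pod-level autoscaling is a one-command starting point; the deployment name and thresholds are illustrative, and CPU-based autoscaling assumes the metrics-server add-on is installed.

```
# Scale the deployment between 2 and 10 replicas based on CPU usage.
kubectl autoscale deployment backend-api --min=2 --max=10 --cpu-percent=70

# Check what the autoscaler is doing.
kubectl get hpa backend-api
```

Node-level capacity is the other half of that plan: the autoscaler can only add pods if there is somewhere to schedule them.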

Always read and learn about the best practices, but at the same time never take any suggestion or solution for granted.

Things that work for others might not work for you, so it is very important to know your own system and services well.
