Abstracting Kubernetes Complexity: Simplifying Our Platform

Maximiliano Di Pietro
CODE + CONTOUR by IPSY
12 min read · Mar 14, 2024

In the dynamic, fast-paced world of technology, a classic infrastructure stack can be challenging to maintain since developers need to deploy new versions of their applications as fast as they are released while also scaling at the speed of light.

On the infra-platform engineering team at IPSY, we took this into account, along with cost-effectiveness and platform resiliency, when considering a change to our platform. From there, we decided to evolve our applications from the classic Amazon Web Services (AWS) Elastic Compute Cloud (EC2) to the state-of-the-art in the industry, Kubernetes. This is what that journey looked like.

Where We Started

Our infra stack was a classic EC2 setup, managed with a configuration manager and an AWS CodeDeploy configuration.

  • Terraform: We used this for managing the infrastructure needed for every application to run.
  • Puppet: For configuring our instances, we used a basic CentOS 7 AMI, so Puppet could take several minutes to finish (and since the instance is running, and being billed, all that time, you don’t just lose time but also money).
  • Jenkins: To build and deploy our applications, we used Jenkins to build our artifacts based on the runtime environment. Of course, we can debate the pros and cons of Jenkins, but we did manage to create a shared pipeline library for everyone to use that helped streamline work. While I don’t think this is a pain point for the end user, it could be if you were writing your own pipeline code.
  • CodeDeploy: We used this for deploying the artifacts to our EC2 infrastructure. We deployed our applications using a combination of a Jenkins pipeline and the codedeploy-agent that runs on our Linux EC2 instances.

Drawbacks of This Setup

Our original setup was outdated and had some drawbacks:

  • Slow scaling time: Auto Scaling groups, with a custom AMI and a configuration manager, could take forever to bring up new capacity.
  • Terraform code management: Managing a lot of modules can be tedious for developers.
  • Jenkins code: Similar to Terraform, it can be a hassle if you need to code your own functions for every case.
  • Costs: Consider that on EC2, we ran just one application per instance.

Why Kubernetes?

We considered these killer features of Kubernetes:

  • Automatic scaling: You can scale not only your application but also your cluster automatically.
  • Deployment speed: If you have an app with low start-up time, you can easily and quickly scale it to hundreds of pods.
  • No vendor lock-in: When running on containers, you can run your application on any Kubernetes engine (e.g., EKS, AKS, GKE, or an on-premises cluster).
  • Cost efficiency: As an example, let’s compare running a simple Java application on EC2 instances versus the same application on a Kubernetes cluster. At IPSY, we might see this:

EC2:

  • Number of instances needed to support the workload (ASG): 36
  • AWS Instance type: t3.small

Kubernetes:

  • Number of pods needed to support the workload (HPA): 10
  • AWS Instance type for Kubernetes nodes: c5.2xlarge
  • Percentage used from the cluster: approximately 1.26%
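
For a sense of what sits behind those pod numbers, a minimal HorizontalPodAutoscaler could look like this (names and thresholds are illustrative, not our production values):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-java-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-java-app
  minReplicas: 2
  maxReplicas: 10   # the ceiling from the example above
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70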

Costs can change depending on your Kubernetes setup, your application runtime, and of course its size. According to our estimates, running IPSY’s applications in Kubernetes is 70% cheaper than running them on a classic EC2 infrastructure.
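
As a rough, back-of-the-envelope illustration (using public us-east-1 on-demand prices as an assumption, not our actual bill): 36 t3.small instances cost about 36 × $0.0208 ≈ $0.75 per hour, while 1.26% of a hypothetical 20-node c5.2xlarge cluster (20 × $0.34 ≈ $6.80 per hour) comes to roughly $0.09 per hour. The exact number depends on cluster size and how densely it is packed, so figures like our 70% will vary from setup to setup.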

The Evolution

We planned to deprecate some pieces of the stack mentioned above, but we expected a few challenges along the way. First, not all teams are able to migrate as soon as you put new infra components in place. Also, the Kubernetes infrastructure’s complexity can grow, which requires teams to have sufficient knowledge about containers. Containerized apps also require fine-grained tuning to run efficiently in Kubernetes. But above all, when adding a new technology to your stack, the migration process can be extremely complex due to the number of changes and new concepts involved.

Our Approach as Infrastructure Engineers

We understood that, from our developers’ perspective, they might just want their applications to be magically deployed and running as smoothly as possible. As infrastructure engineers, we wanted to make our developers’ lives easier when moving to Kubernetes, so we created our own solution: a combination of technologies, all of them managed and abstracted by a Python script we called Jarvis.

What Jarvis Did for Our Developers …

Jarvis is in charge of creating templates for every component a user needs to bootstrap an application inside our clusters.

  • Terraform: Terraform configurations are stored in the very same repository where EC2 apps are coded, but in our case, we created all the files automatically based on user-provided details (like their desired deployment environment).
  • Advantage: Developers don’t need to change the repo or start from scratch. Jarvis will take care of everything.

Suggestions: Modularize everything. It is always better to have your own modules with your standards than to call resources directly. This will, of course, speed up the use of your code, since calling a well-defined module is easier for an end user than coding everything they need from scratch. You can also enforce your own standards.

A better way to pin your Terraform version:

terraform {
  required_version = "~> 0.12.0"
}

The less desirable way to pin your Terraform version (with this config you can end up on the latest Terraform version, and in most cases you don’t want that):

terraform {
  required_version = ">= 0.12"
}
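
Following the modules suggestion, here is a minimal sketch of what calling a standardized module could look like (the module path and inputs are illustrative, not our actual code):

module "my_app" {
  # A single call to your in-house module replaces a pile of raw resources
  # and enforces your standards (naming, tagging, versions) in one place.
  source = "./modules/app-service"

  name        = "my-app"
  environment = "staging"
}
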
  • Helm: Since we needed templates for our Kubernetes resources, we decided to use the best templating tool on the market right now.
  • Advantages: Flexible and easy to code and understand with plenty of examples due to huge community adoption.
  • Disadvantages: None.

Suggestions: Try to keep every resource in a separate YAML file instead of one big file so the configuration is easier to manage.
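
A minimal chart layout following that one-resource-per-file rule might look like this (file names are illustrative):

my-app-chart/
  Chart.yaml
  values.yaml
  templates/
    deployment.yaml
    service.yaml
    hpa.yaml
    ingress.yaml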

  • Amazon Elastic Kubernetes Service (EKS): We have 99% of our infrastructure on AWS, so we decided to run our workloads on top of EKS with Amazon Linux 2.
  • Advantages: AWS manages the cluster, so you don’t have to worry as much about maintenance.
  • Disadvantages: Managing the EKS lifecycle and keeping clusters up to date can be challenging since you can’t guarantee zero downtime for your applications. For that reason we are using a blue/green deployment strategy with applications on different clusters, migrating from one EKS version to another when there are major or breaking changes.

Suggestions: If possible, use Karpenter to optimize your cluster scaling process and, in non-prod environments, use a high percentage of Spot instances to save on costs.
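
To sketch the Spot suggestion, a Karpenter NodePool for a non-prod cluster could look roughly like this (this assumes Karpenter’s v1beta1 API and an existing EC2NodeClass named "default"; the schema has changed between Karpenter versions, so check the docs for yours):

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: nonprod-spot
spec:
  template:
    spec:
      requirements:
        # Prefer cheap, interruptible Spot capacity in non-prod
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
      nodeClassRef:
        name: default
  disruption:
    # Consolidate underutilized nodes to keep the cluster tightly packed
    consolidationPolicy: WhenUnderutilized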

  • Docker files: We prepared a set of images that support the different runtimes used in the company (Java, Node.js, Python, etc.). At build time, each app chooses a runtime, compiles its code in it, and produces an executable image that follows our standards. All this happens automatically when a developer merges their code. This way we guarantee that the image will be as small as possible, making it easier to manage (from a cluster point of view) when pulling from the registry.
  • Advantages: With this approach, you have control of your standard processes from end to end, including the versions of your app’s dependencies. This involves creating an initial Docker image where all the dependencies needed to build the artifact are installed. Then, we move the artifact from the initial image to the final one, discarding the build image.
  • Disadvantages: One size fits all is not always the case. You may need to have some custom Docker files for specific applications.

Suggestions: Pin all the dependencies that your apps need to run, and keep your final image as small as possible.
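
A stripped-down sketch of that multi-stage idea for the Java case (the base images, build tool, and paths are illustrative, not our internal standard images):

# Build stage: full JDK and build tooling, discarded after the build
FROM eclipse-temurin:17-jdk AS build
WORKDIR /app
COPY . .
# Assumes a Gradle-based Spring Boot project; swap in your build command
RUN ./gradlew --no-daemon bootJar

# Final stage: slim JRE image containing only the artifact
FROM eclipse-temurin:17-jre-alpine
WORKDIR /app
COPY --from=build /app/build/libs/*.jar app.jar
ENTRYPOINT ["java", "-jar", "app.jar"]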

  • Kapp: How will you apply and manage your Helm templates? With Kapp. At this point you might be wondering: Why Kapp? Why not just Helm? Here is why…
  • Advantages: Kapp works like Terraform in that it will calculate differences between the cluster and your configuration files while also ensuring that resources are ready before proceeding. So combining this with a powerful tool like Helm can be a good way to go.
  • Disadvantages: This isn’t a true disadvantage, but Kapp doesn’t manage templates or packages; it just manages your deployment workflow.

Suggestions: Not many. If your Helm templates are well-defined, then Kapp will process them.
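
In practice, the combination can be as simple as rendering the chart with Helm and piping the result to kapp, along these lines (the app name and chart path are illustrative):

helm template my-app ./my-app-chart --values values.yaml \
  | kapp deploy --app my-app --file - --diff-changes --yes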

  • Trivy: This adds a security layer to your pipeline. Since it’s common practice to use public Docker images, it’s important to add something that can warn you: “This image is full of vulnerabilities, please don’t deploy it.” You can include a step for scanning your images inside your pipeline, as we did with Trivy running on our Jenkins pipelines.
  • Advantages: Easy to use. You can even scan your whole cluster. For example, if you want to scan a single image, you can do it like so:
trivy image python:3.4-alpine

This will give you a full report of the image vulnerability status.

  • Disadvantages: One small drawback is that if you use the public vulnerability database for scanning and it isn’t reachable, your pipeline will fail, so it’s a good idea to keep the previous version of the database stored locally as a fallback.

Suggestions: Include Trivy as a key piece in your pipeline.
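
In a pipeline you will typically also want to gate on severity, so the build fails only on findings you actually care about. Something like this (the image name is illustrative):

trivy image --exit-code 1 --severity HIGH,CRITICAL my-registry/my-app:latest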

  • Jenkins: But of course! Even though we were able to adapt what we had to make it work for EC2 and Kubernetes, Jenkins is still a key piece for us at IPSY.

Suggestions: Use a shared library to make pipelines smaller and easier to read. Granting access to your code selectively ensures that only authorized individuals can contribute, reducing the likelihood of introducing buggy code.
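
With a shared library in place, an application’s Jenkinsfile can shrink to a few lines like these (the library name and the deployToKubernetes step are hypothetical stand-ins for your own shared steps):

@Library('shared-pipeline') _

// One shared step hides the build/scan/deploy plumbing behind a standard interface
deployToKubernetes(
    app: 'my-app',
    environment: 'staging'
)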

The IPSY Containers Lifecycle

[Figure: diagram of the IPSY containers lifecycle]

Highlights of Our Journey to Kubernetes

If you’re thinking, “OK, I’m familiar with this technology stack,” or “We had the same issue and solved it with a simple Helm chart,” that may all be true. But now that you’ve had a proper introduction to what we had at IPSY and where we wanted to be, I’m going to tell you about the things that aren’t purely technical but still take a long time to design, test, and put into production with zero downtime.

Standardize Everything

Here at IPSY we strongly believe in standardizing everything. We have well-defined standards for infrastructure resource creation and application deployment — standards that have evolved and are being constantly improved. Of course, there is a trade-off with this: What happens when something doesn’t meet the standard but still needs to be deployed? We’ll get into that later.

Well-Defined Standards and Jarvis Bootstrap

As I mentioned earlier, we created an in-house bootstrapping tool called Jarvis to automate the deployment process. With EC2 we had to code our infrastructure and pipeline, but with Jarvis we were able to do everything automatically with the technologies I mentioned before (awesome, right?). Why did we do this? Because we needed a powerful tool to speed up our migration process and we couldn’t find a good fit. So as an engineering team, we created our own tool (yes, we rock). Here’s how we made this work in a big and dynamic environment like ours.

Always define and follow the standard: Think more, code less! This gives engineers the ability to build things in the best possible way using the tools provided. And not just that: if good standards are followed, errors are easier to solve and processes easier to manage.

Always work hand in hand with dev teams: Collaboration is key at IPSY. You always have the opportunity to discuss technical topics with other teams and develop solutions that meet the needs of the majority of applications. We propose, discuss, act, and evolve our thinking on the best approach for everyone, which also means listening to other teams and trying to give them what they need to be comfortable using a new solution. It may take a lot of time and work, but the results are awesome when everyone can contribute not only knowledge but code, so that on this journey we are not just coding but learning together.

Document as much as you can: As you know, technology people spend most of their time reading, so while we were on this journey, we created a ton of internal documentation to share across the organization. We added as much graphical information as possible so that engineers could understand what they were doing and speed up the process overall.

Support your teams: No matter how good your pipeline is, no matter how good your tool is, teams will always need support from your side. But if you invest time in designing a good tool with good documentation, the need for support will be much less.

What We Learned

Like every big project, this one came with trade-offs, which we took as challenges and, at the end of the day, solved. I’ll mention some of our biggest learnings.

Strong standards should also be flexible: What happens when something doesn’t meet the standard you’ve established? Instead of thinking, “Oh no, my design isn’t bulletproof,” try thinking, “Let’s evolve to add more flexibility.” Thanks to the people using the tools we designed, we learned that we needed to evolve and change our standards. One of the most important things we had to add was the ability for users to add custom settings to their deployments. How did we do this? We added it to our Jenkins pipeline via the Advanced Settings feature, which enables users to add custom settings to the final Helm template used by Kapp at the end of the deployment. Can you see how the dots are connected?

Simplify interactions with infrastructure: We added the ability to update the image of any deployment while keeping its parameters (like CpuLimits and Hpa), saving the time and effort of copying and pasting them. You can choose your own adventure: execute a pipeline that asks you for clicks and configuration, do it yourself in just a few steps, or automate the process entirely.

You don’t need to migrate everything: In the process of exploring new technologies, we thought at some point, “Yes, let’s move everything to Kubernetes.” But something you should know if you are walking this path: not all workloads are a good fit for Kubernetes. If something doesn’t fit, don’t force the migration; instead, redesign that app so it can fit in a container.

How to canary test from EC2 to Kubernetes:

This was one of the most challenging tasks for us until we found a proper solution. What happens when you have your app working on EC2 and also running in Kubernetes, but you are not confident about moving traffic to it? Well, we found a solution that suited our way of working and might also suit yours. Since we work on AWS, we have Application Load Balancers (ALB) for all deployments, so with a combination of those and the TargetGroupBinding feature of the AWS Load Balancer Controller, we managed to start sending a portion of the traffic from EC2 to Kubernetes with zero downtime. This gave us the ability to test whether the cluster supported the workload and whether the HPA and cluster autoscaler worked correctly, and, most importantly, it gave us a fast rollback method if something went wrong. We used this in combination with what we called “Advanced Settings” to pass the target group ARN used by the TargetGroupBinding to the deployment config and direct traffic.
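
A minimal sketch of such a binding, assuming the AWS Load Balancer Controller is installed (the names are illustrative, and the target group ARN is the one passed in through Advanced Settings):

apiVersion: elbv2.k8s.aws/v1beta1
kind: TargetGroupBinding
metadata:
  name: my-app
spec:
  serviceRef:
    name: my-app   # the Kubernetes Service fronting the pods
    port: 80
  targetType: ip
  targetGroupARN: arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-app/0123456789abcdef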

  • What to do when the cluster autoscaler isn’t fast enough: We worked around this limitation using the “overprovisioning” method (sketched after this list). This is what we did:
  • First, we created a LowPriorityPolicy (a PriorityClass with value -10) for dummy pods.
  • Then we deployed some dummy pods with that policy attached. This reserved “buffer” nodes, making room for pods with higher priority. If there was no room for real pods (default priority is 0), the scheduler would evict the dummy pods, which would go into a “pending” state, prompting Kubernetes to create a new, almost empty node for apps to scale onto. You can also use this approach with Spot and OnDemand instance types.
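
A minimal sketch of that overprovisioning setup (names and sizes are illustrative):

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10   # below the default priority (0), so these pods are evicted first
globalDefault: false
description: "Placeholder pods that reserve headroom for real workloads"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
spec:
  replicas: 2   # tune to the amount of buffer capacity you want
  selector:
    matchLabels:
      app: overprovisioning
  template:
    metadata:
      labels:
        app: overprovisioning
    spec:
      priorityClassName: overprovisioning
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9   # does nothing, just holds resources
          resources:
            requests:
              cpu: "1"
              memory: 1Gi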

Conclusion

We are still working to move all of our workloads to Kubernetes. Although the journey can be hard since Kubernetes is a complex technology, it is worth it to learn and implement if you want to have a resilient and highly scalable infrastructure with one of the most modern and industry-standard technologies on the market. Here are some final thoughts to keep in mind…

  • You’ll need to define clear limits for what every piece of technology will do. For example, what to do with Terraform versus what to do with Helm. At one point we had to recalculate and start managing our load balancers with Terraform, but maybe for you it will be better to manage them with Helm (via the AWS Load Balancer Controller). This will depend not only on your team but on your environment.
  • There are no golden rules on this journey. As you start down this path, you might find limitations (in the technologies, the time, the design) and drawbacks. The idea of this article is to give you some insights and hints if you are ready to start this adventure.
