Using Kubernetes custom resources to manage our ephemeral environments

Peter Hewitt
Published in Beam Benefits
8 min read · Mar 26, 2021

At Beam we built a system to easily spin up ephemeral testing environments using Kubernetes. We use Kubernetes custom resources to model the state of each ephemeral environment. The result is a powerful way to manage a complex system in a simple, declarative manner.

Old ephemeral containers just disappear into the Kubernetes fog, never to be seen or heard from again

The old staging1/staging2 days

Years ago, when Beam had far fewer engineers, we maintained a couple of staging environments where we could test larger changes in a production-like setting. As our teams grew, getting time on a staging environment became harder than finding 2-ply in a pandemic! We resorted to various methods to coordinate access, but it was obvious to everyone that we needed a better way, and creating a staging3 environment wasn’t going to cut it.

We wanted to empower every engineer to spin up their own “staging” environment on demand. The need for these environments is normally short-lived, so we built in a TTL so that the system removes them automatically. Hence the name “ephemeral environments”.
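The expiry check itself can be as simple as comparing an environment’s creation time against its TTL. Here’s a minimal Python sketch; the `ttl_hours` field and the 72-hour default are hypothetical stand-ins, not our actual schema:

```python
from datetime import datetime, timedelta, timezone

def is_expired(ephemeral, now):
    """Return True if the ephemeral's TTL has elapsed since it was created."""
    created = datetime.fromisoformat(ephemeral["metadata"]["creationTimestamp"])
    # ttl_hours is an illustrative field name; default to 72 hours if unset.
    ttl = timedelta(hours=ephemeral["spec"].get("ttl_hours", 72))
    return now - created > ttl

# An ephemeral created four days ago with the default 72h TTL is expired.
eph = {
    "metadata": {"creationTimestamp": "2021-03-22T00:00:00+00:00"},
    "spec": {},
}
print(is_expired(eph, datetime(2021, 3, 26, tzinfo=timezone.utc)))  # True
```

A periodic sweep over all ephemeral objects with a check like this is enough to garbage-collect forgotten environments.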

We’ve gone through at least three major revisions of the ephemeral architecture over the last couple of years. Currently we run everything in a Kubernetes cluster that closely mirrors our production cluster. There are many pieces to the architecture that we may explore in future blog posts. In this post I want to talk about how we model and manage these ephemeral environments using the same declarative approach Kubernetes uses for its core objects (deployments, jobs, configmaps, etc.).

Declarative approach

The declarative approach has been well documented and discussed elsewhere. The quick explanation is that you describe what you want in a set of configuration files. For example, you could say you need an nginx server listening on port 8080 and accepting traffic at https://nginx.example.com/demo. You write some beautiful YAML, apply it to Kubernetes, and it figures out how to spin up a container for you and keep it running. If for some reason nginx can’t be started immediately (maybe you’re experiencing network issues and can’t pull the nginx image, or you’ve run out of resources on your nodes and need a new one added to the cluster), Kubernetes will keep trying and should eventually succeed once the network issues clear up or a new node joins the cluster.

One of the additional benefits of the declarative style shows up when you want to make changes. If, for example, you want to upgrade to a newer version of nginx, all you need to do is edit your configuration and apply it again. Kubernetes figures out how to take down your old containers and spin up new ones running the new version; you don’t have to worry about any of that.

We saw the power of the declarative approach and decided to model our ephemeral environments as custom resources. At Beam we’ve started to break our monolithic application into smaller services, so deploying an ephemeral environment means deploying many of the core services before engineers can use the environment. A typical use case is an engineer spinning up an ephemeral running a feature branch feature/xyz of our beam-api service and the develop branch for the rest of the services. Sometimes engineers need a few services running different branches for a more complicated feature. Our ephemeral custom resource captures that information and describes what the ephemeral should look like. Here is an excerpt of an ephemeral custom resource:

spec:
  build_id: zrjj
  databases:
    beam-api-elasticsearch:
      availabilityzone: us-east-1a
      dbtype: elasticsearch
      size: 2Gi
      volumesource: volume
    ...
  services:
    authentication-server:
      branch: develop
      canary: false
      chart_version: 0.1.33
      git_sha: d0b80904922327537392ec1999fcea38c640470c
      tracking: false
    beam-api:
      branch: feature/xyz
      canary: false
      chart_version: 0.1.33
      git_sha: 6948643c70c5f9665891c0e1ad97d2eee7328f01
      tracking: true
    dentist-search-map:
      branch: develop
      canary: false
      chart_version: 0.1.33
      git_sha: 4c22040b1c30ef9108f824bf010ff1ae6638b16a
      tracking: true
    ...
  subdomain_name: my-new-feature
  version: ca8c60c4071c4ce5a4b38cf56ab577bb

As you can see, we can succinctly describe which versions of various services we want to run, what subdomain to use, which databases we need, and so on.
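Because the spec is plain data, any of our tooling can consume it directly. As a small illustration, here’s the same information from the excerpt above as a Python dict, with a hypothetical helper that pulls out the services whose branches are being tracked (more on tracking later):

```python
# The services portion of the excerpt above, as plain data
# (other fields omitted for brevity).
spec = {
    "build_id": "zrjj",
    "subdomain_name": "my-new-feature",
    "services": {
        "authentication-server": {"branch": "develop", "tracking": False},
        "beam-api": {"branch": "feature/xyz", "tracking": True},
        "dentist-search-map": {"branch": "develop", "tracking": True},
    },
}

def tracked_services(spec):
    """Names of services whose branch is tracked for automatic refreshes."""
    return sorted(name for name, svc in spec["services"].items() if svc["tracking"])

print(tracked_services(spec))  # ['beam-api', 'dentist-search-map']
```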

The ephemeral operator

Great, you’re thinking, I can describe what I want, but how do we create a new ephemeral environment from that information? That is where our ephemeral-operator comes in. Kubernetes has controllers (also called operators) that manage most of the native Kubernetes resources. For example, there is a deployment controller and a jobs controller to manage deployments and jobs. We built our own operator to manage our ephemeral custom resources, using kubebuilder to make this easier. The way it operates (see what I did there) is fairly simple: it watches for changes to ephemeral custom resource objects. When it sees a new or modified ephemeral, the operator knows it needs to create a new environment or update an existing one.
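The real operator is Go code scaffolded by kubebuilder, but the dispatch at its heart amounts to something like this Python sketch. The event names follow the Kubernetes watch API; the action names are illustrative:

```python
def action_for_event(event_type):
    """Map a watch event on an ephemeral object to what the operator does."""
    if event_type == "ADDED":
        return "create-environment"   # brand new ephemeral object
    if event_type == "MODIFIED":
        return "update-environment"   # an existing ephemeral's spec changed
    return "ignore"                   # deletion/teardown omitted from this sketch

print(action_for_event("ADDED"))     # create-environment
print(action_for_event("MODIFIED"))  # update-environment
```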

We could have built features into the ephemeral-operator to deploy new environments or update existing ones directly, but these operations are essentially what our deployment toolchain already handles for production deployments. So we decided to keep the operator fairly dumb and have it call out to our deployment toolchain when it decides work needs to be done.

We do that by running Kubernetes jobs: either “up” jobs or “refresh” jobs. By using Kubernetes jobs we gain the benefit of retries, so if the environment can’t be created the first time we’ll give it another few chances. The ephemeral-operator watches these jobs as well, and once they complete it updates the status of the ephemeral custom resource. By comparing the spec and the status of the ephemeral custom resource we can tell whether it is in the desired state. If it isn’t, the ephemeral-operator can decide to keep trying additional jobs, or give up and report an error so engineers can see what the issue might be.
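Concretely, the convergence check can be sketched like this. The `applied_spec` status field is an assumption I’m using for illustration, not our actual status schema:

```python
def in_desired_state(ephemeral):
    """An ephemeral is converged when its status reflects its spec."""
    return ephemeral.get("status", {}).get("applied_spec") == ephemeral["spec"]

def on_job_complete(ephemeral, succeeded):
    """What the operator might record once an up/refresh job finishes."""
    if succeeded:
        # Snapshot the spec we just applied so the next comparison is cheap.
        ephemeral.setdefault("status", {})["applied_spec"] = dict(ephemeral["spec"])
    else:
        ephemeral.setdefault("status", {})["error"] = "deploy job failed"

eph = {"spec": {"subdomain_name": "my-new-feature"}}
on_job_complete(eph, succeeded=True)
print(in_desired_state(eph))  # True
```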

Deploy jobs

These deploy jobs run a tool we call deploykit. The main reason for creating another tool instead of just using our existing deployment tools directly is to handle some of the intricacies unique to the ephemeral cluster. deploykit is only responsible for deploying to ephemeral environments; it is a very thin wrapper around our other deployment tools.

Keeping this code outside of the operator has several benefits. Most importantly, it allows each part of the system to be developed independently. The operator is fairly stable, but deploykit is still under active development. In fact, the original version of deploykit was written in a different language and deployed each ephemeral instance to its own EC2 server instead of Kubernetes. We were able to rewrite it to deploy to Kubernetes without changing the ephemeral-operator at all!

Another benefit of reusing much of our deployment tooling is that we now have an environment where we can test changes to the deployment tooling itself! As I mentioned, our ephemeral environments are very similar to production: we run the same service mesh, canary service, ingress controller, and other Kubernetes infrastructure. It quickly becomes complex, and it’s essential for our SRE team to have a place to develop and test changes to our architecture and deployment tooling.

A final benefit is that because deploykit is now primarily written in Python, it opens up development to engineers who aren’t comfortable with Go or Kubernetes operators by providing a simple job construct to work with.

Launching an ephemeral environment

To create a new ephemeral environment, all we need to do is create an ephemeral custom resource object in our ephemeral Kubernetes cluster. Because this object can be described in simple YAML, it is easy to create and modify. We have a number of different ways to create ephemerals.

  1. beam-cli eph up
    This is the most widely used way to create a new ephemeral, using the custom CLI we developed. Our engineers use the CLI for all sorts of tasks, and managing ephemerals was one of its first use cases. Using simple command line flags you can specify which branch to use for each service, or leave them off to deploy the latest from develop.
  2. /ephemeral
    An even quicker way is to simply comment /ephemeral on a GitHub pull request! We run a GitHub Action to detect that comment and create an ephemeral using the branch of the PR.
  3. eph-api
    We also built a simple REST API to manage ephemeral objects. Users could of course interact directly with the Kubernetes API instead, but eph-api provides a simpler interface. A REST API makes it easier to handle more programmatic tasks, such as creating an ephemeral to run a set of end-to-end tests nightly.
  4. kubectl
    We don’t use this much for creating new ephemerals, but I often use kubectl to edit existing ones, as it is a quick and easy way to make changes.
The core pieces to our ephemeral architecture

Tracking branches

Another powerful feature of our ephemeral architecture, tracking branches, is made simple by our declarative ephemeral custom resources. We wanted to make it easy for engineers to push code changes to their ephemerals. A typical workflow is to create a feature branch, spin up an ephemeral, then iterate on changes while requesting feedback from other engineers or business stakeholders. With tracking branches, engineers don’t need to worry about manually updating their ephemeral environment every time they push a commit; we handle that for them!

The way we implemented this is not too complicated. Remember, all we have to do is update our ephemeral custom resource object with the git_sha we want to run, and the ephemeral-operator and deploykit will handle getting it running. We just need a way to track code changes. We do that by watching our AWS ECR Docker repositories. Our CI/CD system is set up to build Docker images and push them to ECR on every git push. We modified that process slightly to also tag each image in ECR with the branch it was built from. Then we set up an AWS CloudWatch event rule to send a notification to eph-api, which figures out which ephemerals are interested in the new ECR image and updates their ephemeral custom resources accordingly.
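The matching step in eph-api boils down to: find every ephemeral tracking the branch the image was built from, and bump its git_sha. A simplified Python sketch, where the `event` dict is a stand-in for the actual CloudWatch payload:

```python
def refresh_targets(event, ephemerals):
    """Update tracking ephemerals for a freshly pushed image.

    Returns the names of the ephemerals whose spec was updated; updating
    the spec is what triggers the operator to run a refresh job.
    """
    service, branch, sha = event["service"], event["branch"], event["sha"]
    targets = []
    for eph in ephemerals:
        svc = eph["spec"]["services"].get(service)
        if svc and svc["tracking"] and svc["branch"] == branch:
            svc["git_sha"] = sha
            targets.append(eph["metadata"]["name"])
    return targets

ephemerals = [
    {"metadata": {"name": "my-new-feature"},
     "spec": {"services": {"beam-api": {"branch": "feature/xyz",
                                        "tracking": True,
                                        "git_sha": "6948643"}}}},
]
event = {"service": "beam-api", "branch": "feature/xyz", "sha": "abc1234"}
print(refresh_targets(event, ephemerals))  # ['my-new-feature']
```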

Tracking branches refresh flow

We’ve found that for most use cases, having tracking turned on is what engineers expect. However, there are times when it isn’t what we want. For example, during a live demo you probably don’t want the system upgrading the environment midway through your presentation! In scenarios like this, engineers can turn tracking off via the tracking flag in the ephemeral custom resource.

Conclusion

I hope you can see the power of using Kubernetes custom resources to extend the functionality Kubernetes provides into your own domains, and the many benefits of empowering your engineers to stand up self-service infrastructure to complete their tasks.

Finally, I’d like to give a special thanks to Silas Baronda, who came up with most of the ideas described here way back before we even ran anything on Kubernetes!
