Deploying to Ephemeral Environments with Helm Orchestration
At Upstart we have built a system for software engineers to easily spin up ephemeral environments on demand. This allows them to quickly test changes in a production-like system that is isolated from other environments.
I have now had the opportunity to work on creating ephemeral environments at two companies! See my earlier article, Using Kubernetes custom resources to manage our ephemeral environments. The system we’ve built at Upstart is similar, and that earlier article is still relevant. We use Kubernetes custom resources to model our environments, and custom Kubernetes operators built with the Operator SDK manage their lifecycle. Taking it a step further at Upstart, we have built operators for each third-party integration and dependency our services need. We deploy our microservices with Helm and have built a Helm orchestration layer that handles ordered deployments and connects all the services and dependencies together.
Build a platform
At Upstart, our software engineers are busy breaking up our monolith and new microservices are always being developed. A primary goal of our ephemeral environments architecture is to make it easy to deploy new microservices. We decided we needed a way to declaratively describe how to deploy new microservices and their dependencies to ephemeral environments. We built a platform for software squads to be able to self-service deploy their microservices, instead of the platform team building one-off solutions for each and every new microservice. Read on to learn more about our journey and the platform we built.
Solve smaller problems
A common approach to solving software problems is to break a large problem up into smaller problems and then solve those. An ephemeral environment requires many resources to be created, deployed, and linked together (e.g., the services themselves, databases, AWS resources, and third-party integrations like Bugsnag and LaunchDarkly). Initially, we deployed everything with one giant monolithic operator, but we quickly transitioned to multiple operators, striving for a single operator per kind of resource. Separating operators makes it easier to work on each smaller problem independently. It also allows for greater reuse and follows the Unix principle of “do one thing well”.
Our first attempt at making it easier for squads to manage their microservices in ephemeral environments was to create a way to deploy any Helm chart, so that a microservice can be packaged as a chart and described declaratively. Asking a software squad to manage their own Helm chart is a much easier sell than asking them to write custom operator code in Go or a custom CI pipeline for their service. We built a Helm operator which can deploy and upgrade any Helm chart. Users create a HelmRelease custom resource in Kubernetes that specifies a Helm chart location, a chart version, and a set of Helm values. The operator then deploys the Helm release and continually monitors it to make sure it stays in sync. This is similar to how ArgoCD is used to deploy a Helm release. We built this to deploy our microservices, but it is now a central part of our solution to deploy everything.
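As a rough illustration, a HelmRelease resource of this kind might look like the sketch below. The chart, version, and values fields mirror the UpstartService example later in this article. The API group, the metadata names, and the specific values key are assumptions made for illustration, not the exact schema.

```yaml
# Hypothetical HelmRelease resource; the field layout is an assumption based on
# the chart location, chart version, and values described above.
apiVersion: upstartoperators.upstart.com/v1alpha1   # assumed to share the UpstartService API group
kind: HelmRelease
metadata:
  name: example-service            # hypothetical release name
  namespace: my-environment-name   # hypothetical environment namespace
spec:
  chart: upstart-charts/example-service   # chart location
  version: 0.1.0                          # hypothetical chart version to install or upgrade to
  values:                                 # passed to Helm as the release values
    replicaCount: 1                       # hypothetical value
```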
Use composition
Once we broke the problem down and had a way to manage all of the individual resources of ephemeral environments, we needed a higher-level abstraction to group them. This abstraction makes it easier to examine and reason about what is deployed to an ephemeral environment.
We created another custom resource definition for an UpstartService. An UpstartService is composed of a list of HelmReleases. Each microservice can be modeled using an UpstartService and include whichever HelmReleases it needs to create dependencies or third-party integrations. An UpstartEnvironment follows the same pattern to describe the entire ephemeral environment. A user will compose a list of UpstartServices to deploy into an UpstartEnvironment resource.
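A sketch of how this composition could look for a whole environment is shown below. The UpstartEnvironment and UpstartService kinds come from the description above, but the spec field names (upstartServices, name) and the service names are hypothetical.

```yaml
# Hypothetical UpstartEnvironment; the spec layout is assumed for illustration.
apiVersion: upstartoperators.upstart.com/v1alpha1
kind: UpstartEnvironment
metadata:
  name: my-environment-name
spec:
  upstartServices:                   # hypothetical field: the services composed into this environment
    - name: example-service          # each entry would map to an UpstartService,
    - name: another-upstart-service  # which in turn owns its own list of HelmReleases
```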
When a user requests an ephemeral environment, the top-level UpstartEnvironment resource is created, and from that a cascading collection of UpstartServices and HelmReleases. Finally, Kubernetes resources like Deployments, ConfigMaps and our own custom resources are created. Everything an ephemeral environment needs can be deployed using this concept.
Because everything a microservice requires to run can be modeled this way, it is something we feel comfortable asking software squads to own and self-serve.
In this diagram, every box is a Kubernetes custom resource. We have different operators to manage the different kinds of resources. All of the yellow boxes are HelmReleases. Some Helm releases deploy traditional workloads like a microservice deployment or Postgres StatefulSet, and others deploy custom resources that are managed by operators. For example, DynamoDB tables are managed by the AWS Controllers for Kubernetes (ACK) project, while Bugsnag projects and AWS secrets are managed by our own operators.
Orchestration via native Helm?
Helm is great for templating the creation of Kubernetes resources, and has some excellent features like chart hooks which are useful to run jobs pre-installation or post-installation. We use hooks for running database setup and migrations after our deployments. You can order your hooks if there are some operations that have to happen before others. Helm also has the concept of dependencies — you can reference other charts as “subcharts” of your Helm chart.
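For readers less familiar with hooks, here is a minimal sketch of a migration Job run as a post-install and post-upgrade hook. The helm.sh/hook annotations are standard Helm. The Job name, image, weight, and command are placeholders, not taken from our charts.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-service-db-migrate                        # placeholder name
  annotations:
    "helm.sh/hook": post-install,post-upgrade             # run after the release's resources are applied
    "helm.sh/hook-weight": "10"                           # hooks run in ascending weight order
    "helm.sh/hook-delete-policy": before-hook-creation    # replace the previous Job on the next run
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: registry.example.com/example-service:1.0.0   # placeholder image
          command: ["./bin/migrate"]                          # placeholder migration command
```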
It is therefore possible to deploy a microservice and all its dependencies with just a single Helm chart, hooks, and subcharts. We followed this approach in some of our earlier solutions, but ultimately found some challenges with it. The rest of this section discusses these challenges and why we moved away from doing orchestration with the tools that native Helm provides. You might find that some of these other solutions are all you need if you are building a similar system at your company.
Challenge #1: Subcharts are deployed at the same time
Resources in subchart dependencies are deployed by Helm at the same time as the main chart. This presents a challenge when the order of deployments matters. For example, many of our services require a database. We deploy a containerized database in ephemeral environments for each service that needs one. Services require the database to be ready before attempting to create tables and seed or mock data. An earlier solution we used was to add a subchart to deploy the database, and then use an init container in the job that prepares the database to wait for the database to be up.
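That earlier approach can be sketched roughly as follows: the database is declared as a subchart in Chart.yaml, and the setup Job blocks in an init container until Postgres accepts connections. The repository URL, host name, images, and setup command are placeholders. pg_isready is the readiness utility that ships with Postgres.

```yaml
# Chart.yaml (fragment): the database is pulled in as a subchart
apiVersion: v2
name: example-service
version: 0.1.0
dependencies:
  - name: postgres
    version: 0.1.3
    repository: https://charts.example.com   # placeholder repository
---
# templates/db-setup-job.yaml: waits for the database, then prepares it
apiVersion: batch/v1
kind: Job
metadata:
  name: example-service-db-setup
spec:
  template:
    spec:
      restartPolicy: OnFailure
      initContainers:
        - name: wait-for-postgres
          image: postgres:14.2
          # Poll every two seconds until the server accepts connections
          command: ["sh", "-c", "until pg_isready -h example-service-postgres -p 5432; do sleep 2; done"]
      containers:
        - name: db-setup
          image: registry.example.com/example-service:1.0.0   # placeholder image
          command: ["./bin/db-setup"]                         # placeholder: create tables, seed data
```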
Eventual consistency with Kubernetes?
Some would argue that you should just deploy everything at once and let Kubernetes handle reconciling the system into a working state. Eventual consistency and continuous retries are key features of Kubernetes. In the database setup example, you could create a job to prepare the database tables and seeds and deploy it at the same time as everything else. The job will fail until the database is ready but continually retry until successful. That can work, but in our opinion it becomes very hard to debug and understand, especially when you have hundreds of these dependencies across an ephemeral environment.
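A sketch of that retry-based alternative is below. Note that a Job does not retry forever: Kubernetes stops after backoffLimit failures (6 by default) and spaces retries with an exponential back-off, so the limit has to be generous enough to cover database startup. The names, image, command, and limit are placeholders.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-service-db-setup
spec:
  backoffLimit: 20          # placeholder: allow enough failed attempts while the database starts
  template:
    spec:
      restartPolicy: Never  # each failed attempt creates a new Pod and counts toward backoffLimit
      containers:
        - name: db-setup
          image: registry.example.com/example-service:1.0.0   # placeholder image
          command: ["./bin/db-setup"]                         # fails until the database is reachable
```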
Use Helm hooks to order deployments?
We were tempted to build a series of Helm hooks that deploy in order as a way to orchestrate dependencies. However, Helm does not track resources created by hooks as part of the release, so by default they are not removed when the release is uninstalled, which creates the new responsibility of resource cleanup. Typically when running a job as a Helm hook, we found we also needed to create a number of other resources via hooks (like ServiceAccounts, ConfigMaps, Secrets) so that these resources are available to the job.
Using hooks for ordering makes sense on smaller projects, but it is a weak substitute for dependency management and doesn’t scale well to larger systems. The only ordering mechanism is a single integer weight set on each hook. Helm gives no warning and enforces no constraint when multiple hooks use the same weight, so you have to coordinate the ordering of hooks across charts and subcharts so that they do not conflict. The only way to find out the deployment ordering of the entire system is to analyze all Helm templates. If you ever want to add hooks later, you may have to re-order everything.
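To make the coordination burden concrete, here is a sketch of the supporting resources a hook Job might need, each declared as a hook with a low weight so that it exists before the Job runs. The annotations are standard Helm. The resource names, weights, and data are hypothetical.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: db-setup
  annotations:
    "helm.sh/hook": pre-install
    "helm.sh/hook-weight": "-10"   # must sort before the Job that uses it
    # Deletes the old copy only when the hook runs again, not on uninstall
    "helm.sh/hook-delete-policy": before-hook-creation
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: db-setup-config
  annotations:
    "helm.sh/hook": pre-install
    "helm.sh/hook-weight": "-10"   # same weight as the ServiceAccount; Helm accepts this silently
    "helm.sh/hook-delete-policy": before-hook-creation
data:
  SEED_FILE: seeds/ephemeral.sql   # hypothetical setting read by the Job
```

Nothing in these manifests stops a subchart from choosing a weight that lands between these resources and the Job that depends on them, which is the kind of conflict that has to be caught by reading every template.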
Challenge #2: How to connect all the resources?
Ephemeral environments create some unique challenges because we must create infrastructure and third-party integrations on the fly. Doing so means software engineers don’t have to worry about creating resources beforehand, and it allows the environments to spin up quickly.
Another challenge we faced with using native Helm for orchestration was connecting all these resources together. Since all dependencies and services can be deployed independently, how do we configure each resource to know about the other resources so they can work together? When we had a monolith operator we could make all these connections in code, but code-based configuration is not an option in our decentralized system.
Service discovery?
Service discovery is a partial solution to this problem so that microservices know how to connect to each other when they need to make network requests. Our service discovery would be a great topic for another article — in ephemeral environments it provides a very powerful way to hot swap different “implementations” of a service (e.g., real or mock, or proxy to another shared environment). But it isn’t a full solution for our needs.
Convention over configuration?
I’m a big advocate for convention over configuration and standardization. If we could convince all squads to build their microservices using the same patterns for configuration (e.g., all services could expect database configuration to be passed to them using an environment variable named DATABASE_URL) then our ephemeral environment system could pass along the configuration the same way for all services. Unfortunately, right now our existing microservices are each configured quite differently, so we needed a more flexible solution.
Even if we had service discovery and all services used convention over configuration, we’d still have some challenges. For example, we create a Bugsnag project dynamically and need to retrieve the API key from Bugsnag to pass it to our microservices as an environment variable. In this example we must wait for something to be created asynchronously and then proceed with dependent deployments. We could use hooks and init containers for this, but we’ve already discussed issues with that solution.
A simpler Helm orchestration system
We still really love Helm, but we wanted a simpler solution to orchestrate the deployment of resources. We also wanted a solution that was easier to visualize and reason about. We didn’t want software engineers using our system to have to learn all the intricacies of Helm hooks, init containers and subcharts or write custom Go operator code to deploy their microservices. We looked briefly at Orkestra which appears to solve the orchestration and dependency problems of Helm releases quite well.
We also wanted an easy way to gather “outputs” from dependency Helm releases and use them as inputs to other Helm releases to solve the problem of connecting resources together. I have done a lot of work with Terraform and Terragrunt which use the same pattern of linking outputs from one module to inputs of another module. We also drew inspiration from Crossplane.io which allows external resources to be created via Kubernetes resources and has a way to link outputs from one resource to inputs of other resources.
With those requirements in mind, we built additional features into our UpstartService abstraction described earlier. The Helm releases defined inside an UpstartService can now depend on other Helm releases or other UpstartServices. A HelmRelease will only be created once all of its dependencies are ready and all the outputs from dependencies are created. If a HelmRelease has dependencies, then it can use the outputs of those dependencies to dynamically fill in variables in the Helm values at deployment time.
Here is an example of an UpstartService that has several dependencies.
apiVersion: upstartoperators.upstart.com/v1alpha1
kind: UpstartService
spec:
  helmReleases:
    bugsnag:
      chart: upstart-charts/bugsnag-project
      version: 0.1.0
      values:
        type: flask
      outputsFrom:
        - secretRef:
            name: bugsnag-bugsnag
    postgres:
      chart: upstart-charts/postgres
      version: 0.1.3
      values:
        dbName: my_service
        image:
          registry: upstart-charts/docker-io-proxy/library
          repository: postgres
          tag: "14.2"
      outputsFrom:
        - secretRef:
            name: postgres
    redis:
      chart: upstart-charts/redis
      version: 0.1.1
      values:
        image:
          tag: 6.2.6-alpine
      outputsFrom:
        - configMapRef:
            name: redis-outputs
    exampleService:
      chart: upstart-charts/example-service
      dependencies:
        - name: bugsnag
        - name: postgres
        - name: redis
        - name: my-environment-name-launch-darkly
        - name: another-upstart-service
      outputsFrom:
        - configMapRef:
            name: outputs
          scope: external
      values:
        ...
        global:
          env:
            BUGSNAG_API_KEY: << bugsnag.outputs.BUGSNAG_API_KEY >>
            REDIS_BACKEND_URI: 'redis://<< redis.outputs.SERVICE_FQDN >>:6379/1'
            LAUNCHDARKLY_SDK_KEY: << my-environment-name-launch-darkly.outputs.LD_SDK_KEY_PRICING >>
            DATABASE_URL: "postgres://<< postgres.outputs.POSTGRES_USER >>:<< postgres.outputs.POSTGRES_PASSWORD >>@<< postgres.outputs.PG_HOST >>:5432/<< postgres.outputs.POSTGRES_DB >>"
            ANOTHER_UPSTART_SERVICE_URL: "<< another-upstart-service.outputs.url >>"
First, there are Helm releases defined for a Bugsnag project, a Postgres database, and a Redis database that the service requires. Then there is a Helm release for exampleService, which depends on those earlier Helm releases and on two other UpstartServices. This Helm release references the outputs of its dependencies via the syntax << dependencyName.outputs.outputName >>. A Helm release can generate and store outputs in any Secret or ConfigMap and choose what to share with other Helm releases.
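To make the substitution concrete, suppose the redis release publishes its output in the redis-outputs ConfigMap referenced in the example above. The SERVICE_FQDN value below is made up for illustration. The second document shows how the corresponding line of the exampleService values would look once the placeholder is filled in.

```yaml
# Output published by the redis release. The ConfigMap name and key come from
# the example above; the value is a hypothetical illustration.
apiVersion: v1
kind: ConfigMap
metadata:
  name: redis-outputs
data:
  SERVICE_FQDN: redis.my-environment-name.svc.cluster.local
---
# Fragment of the exampleService values after << redis.outputs.SERVICE_FQDN >>
# has been replaced with the output above
global:
  env:
    REDIS_BACKEND_URI: 'redis://redis.my-environment-name.svc.cluster.local:6379/1'
```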
Even without much documentation we feel that this method of describing dependencies and referencing outputs is intuitive for software engineers and much simpler than working with the earlier solutions we described.
Conclusion
At Upstart, we found that we needed a solution for software squads to own the management of their microservices, from local development all the way to production. Our small squad inside the platform organization can’t scale to build everything for every squad for deploying to ephemeral environments. That need led to the decentralization of our own solution and the creation of a simple declarative system to describe how to deploy microservices in an ephemeral environment.
There are many challenges to building an ephemeral environment system and we feel confident we have solved the orchestration and configuration problems for deploying microservices. We are excited to have built a platform that enables our software squads to build microservices that work in ephemeral environments.