One of the key epiphanies I had when learning Kubernetes was that it is, at its core, a declarative system. You provide Kubernetes with the resource manifests that represent the overall workload you’d like to have in your cluster and Kubernetes works like mad to make that the current state of affairs. More importantly, it also works to keep it that way: working to maintain this state through all sorts of failures like that of a pod or node.
The implication of this is that there is a set of textual resource manifests that completely specify the current state of what should be running in our Kubernetes cluster, and this observation makes this set of resource manifests a natural fit for source control systems.
Using source control like this as a central system of record, known as GitOps, is increasingly popular with folks running large scale Kubernetes deployments in production. It enables easy recovery from disaster, a simple and well understood operational model, and better security and auditability. The rest of this blog post will talk about the tooling and processes we have used to establish a GitOps workflow in our own deployments.
Simple Cluster Management
In a GitOps based deployment, a pod running in the cluster watches a specific git repo that contains the set of resource manifests that should be running in the cluster. One implementation of this, and the one we have utilized in our own deployments extensively, is Weaveworks’ Flux.
The configuration of Flux is fairly simple: you provide a git endpoint that acts as the resource manifest repo of record — and the secrets it needs to access this git repo — and Flux will then check this repo periodically for new commits and reconcile these with the resource manifests running in the cluster.
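To make the reconcile step concrete, here is a minimal sketch in Python of the comparison between the manifests in the repo and the manifests in the cluster. It is not Flux's implementation: the `plan_reconcile` function, the `(kind, namespace, name)` key, and the `prune` flag are all hypothetical names chosen for illustration, and manifests are modeled as plain dictionaries.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical identity for a resource: (kind, namespace, name).
Key = Tuple[str, str, str]
Manifest = Dict[str, Any]


def plan_reconcile(
    desired: Dict[Key, Manifest],
    live: Dict[Key, Manifest],
    prune: bool = False,
) -> List[Tuple[str, Key]]:
    """Return the actions needed to move `live` toward `desired`."""
    actions: List[Tuple[str, Key]] = []
    for key in sorted(desired):
        if key not in live:
            actions.append(("create", key))
        elif live[key] != desired[key]:
            actions.append(("update", key))
    if prune:
        # Deleting resources that are no longer in the repo is treated as
        # an opt-in policy in this sketch.
        for key in sorted(set(live) - set(desired)):
            actions.append(("delete", key))
    return actions


repo = {
    ("Deployment", "web", "api"): {"replicas": 3},
    ("Service", "web", "api"): {"port": 80},
}
cluster = {
    ("Deployment", "web", "api"): {"replicas": 2},
    ("ConfigMap", "web", "old"): {},
}
for action, key in plan_reconcile(repo, cluster, prune=True):
    print(action, key)
```

Running this on every new commit, and again on a timer, is what gives the "keep it that way" behavior: drift in the cluster shows up as a non-empty plan.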
Familiar Git Workflow
GitOps provides a familiar workflow for developers. The same workflow that they use for making code changes can be used with the changes that they make for operations, including pull requests for reviews of operational changes, branches for testing commits in different cluster environments as part of the pull request process, and an audit trail of commits that describe how the system has evolved and who made those changes.
GitOps implemented in this manner is also more secure. As described in Figure 1, Flux makes an outbound connection to this repo, verifying that it is talking to the right git repo cryptographically through TLS.
This last point is important, because other approaches, popularized by Helm’s Tiller, rely on opening an external endpoint, backed by a pod running with RBAC roles that enable creating and deleting workloads. For many enterprise deployments, having an external endpoint with such wide ranging cross cluster privileges is a non starter, and the next version of Helm will remove Tiller in favor of an approach like this. For this reason, most GitOps implementations continue to use Helm’s excellent template capabilities but not its Tiller deployment mechanism.
Simple Disaster Recovery
A key observation is that our resource manifest repo is a store of record on exactly what should be deployed in our cluster. This means when we have a Kubernetes cluster failure we can simply spin up a new one, point it to our repo of record, and all of the manifests will be reapplied to it. We don’t need to manually reinstall components with a shell script or manually poke a CI/CD system to restart deployments for each of our microservices. This is an operationally crucial point, since it dramatically lowers the recovery complexity for someone responding to a page at 3 AM.
High Level Deployment Descriptions
As we’ve mentioned, the git repo of record that Flux watches contains low level Kubernetes resource manifests. These manifests, typically in YAML, are compared against the current state of the Kubernetes system to determine which actions it needs to take to bring the cluster into the same state. Each of these manifests is typically complex in real world situations. For example, all of the resource manifests needed to deploy Elasticsearch can run to more than 500 lines and the complete deployment of an Elasticsearch / Fluentd / Kibana (EFK) logging stack can be more than 1200 lines. This YAML is typically also very dense, context free, and very indentation sensitive, which makes it dangerous to edit directly without a high risk of operational disaster.
This has traditionally been solved in the Kubernetes ecosystem with higher level templating. Tools like Helm provide templating around the boilerplate inherent in these resource definitions and provide a reasonable set of defaults for configuration values. We believe that Helm continues to be the best way to generate these resource manifests, and we use Helm as a templating engine in our GitOps CI/CD process, checking the generated resource manifests into the resource manifest git repo.
That said, typical production Kubernetes systems tend to compose many Helm charts to arrive at the final set of resource manifests that are applied to the system. For example, to deploy the EFK logging stack above, you might want to generate resource manifests using four charts from helm/charts: stable/elasticsearch, stable/elasticsearch-curator, stable/fluentd-elasticsearch, and stable/kibana. While you could utilize shell scripts to do this, this is brittle and not easy to share between deployments, something that is essential in large companies that may have hundreds of clusters running and where reuse, leverage, and central maintenance are crucial.
Instead, we’ve been utilizing the concept of a “stack”, which collects one or more subcomponents so that they can be referenced in a higher level deployment definition. Such a stack for the above EFK logging stack looks like:
name: "elasticsearch-fluentd-kibana"
generator: "static"
path: "./manifests"
subcomponents:
- name: "elasticsearch"
  generator: "helm"
  source: "https://github.com/helm/charts"
  method: "git"
  path: "stable/elasticsearch"
- name: "elasticsearch-curator"
  generator: "helm"
  source: "https://github.com/helm/charts"
  method: "git"
  path: "stable/elasticsearch-curator"
- name: "fluentd-elasticsearch"
  generator: "helm"
  source: "https://github.com/helm/charts"
  method: "git"
  path: "stable/fluentd-elasticsearch"
- name: "kibana"
  generator: "helm"
  source: "https://github.com/helm/charts"
  method: "git"
  path: "stable/kibana"
Figure 3: High Level Stack Deployment Definition for an EFK Logging Stack
This high level definition is rendered to Kubernetes resource manifests as part of a CI/CD pipeline by Fabrikate, a tool we built and open sourced. This lets the components of a deployment be written at a higher, and less error prone, level and reused between deployments. That gives us much higher leverage and lets us focus on the actual microservices we are trying to launch with each deployment rather than the common set of infrastructure that surrounds them.
As most people that have used Helm in the real world can attest, it’s usually not possible to directly use community built Helm charts like this off the shelf. In almost all real world scenarios, changes to the default configuration values provided with the chart are required. Complicating this, the config can differ between the clusters that employ the chart (e.g. a ‘prod-east’ cluster running in East US and a ‘prod-west’ cluster running in West US might need different configuration).
Fabrikate solves this with composable configuration files. These configuration files are loaded and applied at generation time to build the final set of configuration values that are used during templating with Helm. Using our EFK stack example from above, and since we know the different subcomponents that make up this stack, we can preconfigure the connections between these subcomponents with a configuration file that looks like this:
config:
subcomponents:
  elasticsearch:
    config:
      namespace: elasticsearch
      client:
        resources:
          limits:
            memory: "2048Mi"
  fluentd-elasticsearch:
    config:
      namespace: fluentd
      elasticsearch:
        host: "elasticsearch-client.elasticsearch.svc.cluster.local"
  kibana:
    config:
      namespace: kibana
      files:
        kibana.yml:
          elasticsearch.url: "http://elasticsearch-client.elasticsearch.svc.cluster.local:9200"
Figure 5: Configuration file for EFK Logging Stack
These values will be applied directly to Helm during the templating step we do for these components at resource manifest generation time.
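As a rough illustration of what "applied to Helm" can mean, the Python sketch below flattens a nested block of values into the dotted `path=value` form that Helm's `--set` flag accepts. This is an assumption made for illustration, not a description of how Fabrikate actually hands values to Helm (a values file would work equally well), and `to_set_pairs` is a hypothetical helper. The backslash escaping of dots in keys and commas in values follows Helm's documented `--set` syntax.

```python
from typing import Any, Dict, List


def to_set_pairs(values: Dict[str, Any], prefix: str = "") -> List[str]:
    """Flatten nested values into `path=value` strings for `helm --set`."""
    pairs: List[str] = []
    for key in sorted(values):
        # A dot inside a single key (e.g. "kibana.yml") is escaped so that
        # Helm does not read it as a path separator.
        escaped_key = str(key).replace(".", "\\.")
        path = f"{prefix}.{escaped_key}" if prefix else escaped_key
        value = values[key]
        if isinstance(value, dict):
            pairs.extend(to_set_pairs(value, path))
        else:
            # Commas separate multiple pairs in --set, so escape them too.
            escaped_value = str(value).replace(",", "\\,")
            pairs.append(f"{path}={escaped_value}")
    return pairs


elasticsearch_values = {
    "client": {"resources": {"limits": {"memory": "2048Mi"}}},
}
print(to_set_pairs(elasticsearch_values))
```

For the Elasticsearch values above this yields the single pair `client.resources.limits.memory=2048Mi`.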
Reuse across environments and deployments
Fabrikate can also generate the final resource manifests for a particular deployment by combining several of these configuration files. This enables reuse of the same high level definition across cluster environments (e.g. qa, staging, and prod).
For example, if we had a set of geo redundant production clusters in our East and West US regions, we might have three configuration files: ‘common’, which covers configuration that is common across qa, staging, and production environments; ‘prod’, which covers configuration that is specific to production deployments but common across them; and ‘east’, which covers configuration that is specific to clusters running in the East region. Fabrikate combines these files in priority order to produce the final configuration that is applied to the cluster. This lets us factor config out into the appropriate files so that we don’t need to repeat ourselves.
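The layering described above can be sketched as a deep merge in Python. This is only an illustration under stated assumptions: `merge_layers` is a hypothetical function, the layers are plain dictionaries rather than files, and we assume the more specific layer wins (east over prod over common), which may not match Fabrikate's exact rules.

```python
import copy
from typing import Any, Dict, List

Config = Dict[str, Any]


def _fill_missing(target: Config, source: Config) -> None:
    """Copy into `target` only the keys it does not already define."""
    for key, value in source.items():
        if key not in target:
            target[key] = copy.deepcopy(value)
        elif isinstance(target[key], dict) and isinstance(value, dict):
            _fill_missing(target[key], value)
        # Otherwise the higher priority value already in `target` is kept.


def merge_layers(layers: List[Config]) -> Config:
    """Merge config layers, listed from highest to lowest priority."""
    result: Config = {}
    for layer in layers:
        _fill_missing(result, layer)
    return result


common = {"elasticsearch": {"namespace": "elasticsearch", "replicas": 1}}
prod = {"elasticsearch": {"replicas": 3}}
east = {"elasticsearch": {"region": "eastus"}}

print(merge_layers([east, prod, common]))
```

Here `replicas` comes from ‘prod’, `region` from ‘east’, and `namespace` falls through from ‘common’, so each value is written down exactly once. The `replicas` and `region` keys are made up for the example.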
Our EFK stack above can be itself checked into a git repo and referenced from another deployment definition file. For example, if we wanted to define a “cloud native” stack with observability, service mesh, and management components included, we could express this with a deployment config that looks like:
name: "cloud-native"
subcomponents:
- name: "data"
  source: "https://github.com/timfpark/fabrikate-elasticsearch-fluentd-kibana"
  method: "git"
- name: "prometheus-grafana"
  source: "https://github.com/timfpark/fabrikate-prometheus-grafana"
  method: "git"
- name: "istio"
  source: "https://github.com/evanlouie/fabrikate-istio"
  method: "git"
- name: "kured"
  source: "https://github.com/timfpark/fabrikate-kured"
  method: "git"
- name: "jaeger"
  source: "https://github.com/bnookala/fabrikate-jaeger"
  method: "git"
Figure 4: Cloud native stack that builds on reusable lower level components
Such a hierarchical approach to specifying deployments allows for the reuse of lower level stacks (like the EFK example above) and for updates to these dependent stacks to be applied centrally at the source for the stack — as opposed to having to make N downstream commits in each deployment repo.
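To show how a hierarchical definition unfolds, here is a small Python sketch that walks a component tree and lists every Helm-generated leaf. It is a simplification rather than Fabrikate's algorithm: `expand` and the `fetch` callback are hypothetical, `fetch` stands in for cloning a stack's repo and reading its definition, the example URL is a placeholder, and manifests contributed by "static" components are ignored.

```python
from typing import Any, Callable, Dict, List

Component = Dict[str, Any]


def expand(
    component: Component,
    fetch: Callable[[str], Component],
    trail: str = "",
) -> List[str]:
    """Return the slash-separated path of every Helm-generated leaf."""
    path = f"{trail}/{component['name']}" if trail else component["name"]
    if component.get("generator") == "helm":
        return [path]
    if "source" in component and "subcomponents" not in component:
        # A reference to a stack defined elsewhere: load its definition.
        component = fetch(component["source"])
    leaves: List[str] = []
    for sub in component.get("subcomponents", []):
        leaves.extend(expand(sub, fetch, path))
    return leaves


efk = {
    "name": "elasticsearch-fluentd-kibana",
    "generator": "static",
    "subcomponents": [
        {"name": "elasticsearch", "generator": "helm"},
        {"name": "kibana", "generator": "helm"},
    ],
}
# Placeholder URL; in practice this would be the stack's git repo.
remote_stacks = {"https://example.invalid/efk": efk}
cloud_native = {
    "name": "cloud-native",
    "subcomponents": [
        {"name": "data", "source": "https://example.invalid/efk", "method": "git"},
    ],
}

print(expand(cloud_native, remote_stacks.__getitem__))
```

Because the EFK definition is resolved at generation time, a fix committed to its repo reaches every deployment that references it the next time manifests are generated.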
Another crucial observation is that in higher level definitions like this we also have more global context. For example, in the cloud native stack above we know that the deployment will have both Prometheus and Kured. We can exploit this context by providing a common configuration file that preconfigures Kured to point at Prometheus, since we know it will exist in the cluster:
config:
subcomponents:
  kured:
    config:
      extraArgs:
        prometheus-url: http://prometheus-server.prometheus.svc.cluster.local
This higher level configuration will be applied first and combined with lower level configuration in the Kured stack itself to arrive at the final configuration applied to the component.
All together, we have found these patterns to be highly effective in the Kubernetes deployments we have worked with our customers on. If you are interested in learning more, we have two open source projects for you to look at. The first, Bedrock, provides a starting point for automating cluster creation and establishing a GitOps CI/CD pipeline. The second, Fabrikate, as mentioned in this blog post, helps you implement GitOps at a higher level of abstraction so that you can share components and definitions across deployments but still vary the configuration applied to them.
We hope these are helpful to you in your own deployments. Feel free to reach out on GitHub, or on Twitter, with feedback.