GitOps Part 4 — Application Delivery Compliance and Secure CICD

Published in the Weaveworks Blog, Sep 28, 2018

Alexis Richardson, CEO & Founder, Weaveworks is the original author of this blog

This blog post is aimed at Kubernetes users who have adopted Continuous Integration (CI) and who want to add Continuous Deployment (CD). I want to home in on security and compliance. Today we’ll show how your continuous delivery pipeline can be more secure. We also demonstrate that using GitOps best practice enables a complete audit trail of system changes.

Getting started

I shall assume you have already read about GitOps and high velocity CICD for Kubernetes. In that post I recommended you consider three ways to do GitOps.

In this post I shall focus on Weave Flux to demonstrate the key points about security and compliance. Flux is the core of our Weave Cloud deployment service for Kubernetes and enables the best practices that I set out below. I think it’s easier than doing everything by hand or by adapting an older CD tool to Kubernetes, but it is up to you.

Three best practices

This is what we think you need to do:

Keep a record in Git of important interactions with the system: who made changes, when and why
Don’t rebuild images unnecessarily if you can update config instead. Build each container image just once and ‘promote’ it through each test sequence or environment, rather than rebuilding it each time. You must still record your declarative config changes in Git.
Use pull based deployment — do not let CI push updates into the Kubernetes cluster or use kubectl by hand

I shall now discuss these. I’ll start with the last one, push vs pull.

Push vs pull deployment

In the previous blog post and accompanying video and slides, we talked about how Weave Flux implements the Kubernetes operator pattern. In plain terms an operator is an actor that is managed by Kubernetes and can inherit the cluster’s configuration, security, availability, etc.

Doing this leads to better security “out of the box”. Why? Flux is an agent that lives inside your Kubernetes cluster. It listens for updates to all code and image repos that it is allowed to access, and it pulls images and config updates into the cluster.

The pull approach is more secure because Flux can:

ONLY carry out operations permitted by Kubernetes role based access control (RBAC), policy and security. Trust is shared with the cluster and not managed separately.
Bind natively to all Kubernetes objects and know whether operations have completed or need to be retried.
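
To make the pull model concrete, here is a minimal sketch in Python of the comparison an in-cluster agent has to make: desired state (from Git) against actual state (from the cluster). This is not Flux's implementation. The Workload type, the plan_reconcile function and the image names are all hypothetical, and whether an agent should delete undeclared objects is a policy choice that I include only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    image: str      # e.g. "registry.example.com/shop/web:1.4.2" (hypothetical)
    replicas: int


def plan_reconcile(desired: dict[str, Workload], actual: dict[str, Workload]) -> list[str]:
    """Return the actions needed to make `actual` converge on `desired`."""
    actions = []
    for name, want in sorted(desired.items()):
        have = actual.get(name)
        if have is None:
            actions.append(f"create {name}")
        elif have != want:
            actions.append(f"update {name}")
    # Anything running but not declared in Git is drift (illustrative policy).
    for name in sorted(actual.keys() - desired.keys()):
        actions.append(f"delete {name}")
    return actions
```

Because the plan is computed from state rather than from a script of commands, running it again after a partial failure simply yields whatever work is still outstanding.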

This is in contrast to the “push” approach which is typical today:

An actor process lives outside the cluster and is responsible for deployment orchestration, by executing commands that load images.
For example: typing kubectl at the command line to execute a direct update, or encoding updates in scripts that run as CI jobs.

If implemented with care this model can be secure because it still uses RBAC in the cluster and in Git to constrain interactions with production. But it is easier to get wrong. You are working outside the trust domain of your cluster and integrating with it, so you have to set up the whole authentication dance by hand and take care of hardening yourself. All this is tricky, and it is not very amenable to change without a rewrite. This is why CI systems, if used carelessly, can be an attack vector and an entry point to production.

How secure is your pipeline?

If you are interested in this area, please *do* take a look at the deeper post “How secure is your CICD pipeline?” by our PM Stuart Williams, which maps a “Security by Design” model to CI system access permissions.

The role of CI in GitOps

In GitOps, the CI system does not have direct access to the cluster at all. You still use CI to run builds, regression tests and so on. Then CI writes updates and images into the relevant repos. This works just great! But don’t use CI to push updates directly, and avoid kubectl if you can.
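
As a rough illustration of "CI writes updates into the relevant repos", the sketch below rewrites an image tag in the text of a manifest. A CI job could commit the result to the config repo instead of calling kubectl. The set_image_tag helper and the registry name are hypothetical, and this is not how Flux itself performs image updates; it only shows that the CI step can be a plain text change in Git.

```python
import re


def set_image_tag(manifest: str, repository: str, new_tag: str) -> str:
    """Point every `image: <repository>:<tag>` line at `new_tag`.

    Lines for other repositories are left untouched.
    """
    pattern = re.compile(
        r"^([ \t]*(?:-[ \t]+)?image:[ \t]*)" + re.escape(repository) + r":\S+[ \t]*$",
        re.MULTILINE,
    )
    # A function replacement avoids any backslash handling in the new text.
    return pattern.sub(lambda m: f"{m.group(1)}{repository}:{new_tag}", manifest)
```

The function is idempotent, so re-running the CI job with the same tag produces no further diff and therefore no new commit.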

Speed of updates and recovery from failure

If your CI pipeline pushes changes into the cluster using scripts, you may have noticed that it can take its time, and sometimes breaks. The main reason that updates break is that scripting is brittle. And if there is a failure, your state may be unknown.

Our recommended approach is faster and more robust. Flux accesses the cluster locally and natively, and is only limited by when (and how) it is notified. You can run multiple Flux agents easily. The Flux agent lifecycle, failure, recovery, availability, and scalability are all managed by Kubernetes. And Flux uses Git as a record of cluster changes. So if something goes wrong during a deployment, you can always recover or, if necessary, roll back.
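
Here is a small sketch of why a Git record makes recovery straightforward. It models history as a list of commits and treats a rollback as a new commit that restores an earlier desired state, much as git revert does, so the audit trail is never rewritten. The Commit type and roll_back function are hypothetical names for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    author: str
    message: str
    state: dict   # desired state recorded at this commit


def roll_back(history: list[Commit], steps: int, author: str) -> list[Commit]:
    """Append a commit restoring the state from `steps` commits ago."""
    if not 0 < steps < len(history):
        raise ValueError("steps must point at an earlier commit")
    target = history[-1 - steps]
    revert = Commit(author, f"Roll back to: {target.message}", dict(target.state))
    # The original history is returned intact, with the revert added on top.
    return history + [revert]
```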

Note: of course, pushing built images to container image repos can also be slow. But usually the failure cases are easier to understand.

Testing in production

Here is a rather fine image created by Cindy Sridharan, @copyconstruct on Twitter.

This image is from one of Cindy’s tweets about her Observability book. The relevance to our blog post is that we are seeing more and more “testing in production”. Deployment and Release have to be fast, safe, and secure. That’s easier when you use the pull-based approach to deployment, e.g. Flux for Kubernetes. High velocity CICD is great because you can test fast and do what is being called continuous experimentation for customer happiness and profit :-)

Don’t rebuild images if you can change config instead

In GitOps, ideally, we have a complete description of the desired state of the system. Git is our source of truth for this desired state. As described in our earlier pipeline post, in GitOps we are building on the following best practices:

DevOps & Git backed pipelines
Infrastructure as code, aka config-as-code
Immutable deployment artefacts

We are also combining these practices with Kubernetes and other cloud native technology, into what we hope will become a fully declarative application delivery model.

Combining DevOps with Kubernetes and cloud native practices has consequences:

All config is declarative and everything can be described and observed
Config can be mutable even if images are not
We can unbundle configuration from build, and update it independently
We move from “config as code” to “ops as config”.

So what do these points imply for best practice?

From immutable infrastructure to mutable config

Immutable infrastructure is best practice, but *what* counts as infrastructure may be evolving. With containerisation, we can build code and other source information into immutable container images. With declarative configuration, we can parametrise our system as a set of values and keep all that in Git. Those values may be altered at runtime, making our system partly mutable.

Kubernetes YAML files are examples of declarative configuration. If suitably authorised and authenticated, we can update these. GitOps encourages this, because (a) you don’t always need to or want to rebuild images from code, to make an important system change, but also (b) you want all system changes to be described in Git — the source of truth for your desired state.

Weave Flux handles this case by observing the config repo as well as the image repo. If the config is updated, and the images remain unchanged, Flux will orchestrate a deployment within the Kubernetes cluster to update the application. As with pull-based deployment this has several functional benefits:

Faster & more robust: you don’t always need a rebuild to apply changes in the correct manner. Rebuilds add latency to your delivery cycle; and may not succeed every time.
Reduced attack surface: Provided you use a sensible access control model for “who gets to change things and when”, it is helpful to have a category of application update that is robust, recorded in Git correctly, but without requiring code changes.
Supports mutation patterns like latching and canary while remaining within the GitOps paradigm. For example, an administrator can do incremental rollout while taking snapshots of the last good state.
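
One way to picture "config changed, images unchanged" is to fingerprint the two inputs together and deploy whenever the fingerprint differs from the one last applied. The sketch below does that with a SHA-256 hash. It is an assumption about how such a check could work, not a description of Flux internals, and release_fingerprint and needs_deploy are hypothetical names.

```python
import hashlib
import json


def release_fingerprint(config: dict, image_digests: dict[str, str]) -> str:
    """Hash config and image digests together into one stable identifier."""
    payload = json.dumps({"config": config, "images": image_digests}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def needs_deploy(applied_fingerprint: str, config: dict, image_digests: dict[str, str]) -> bool:
    """True if either the config or any image differs from what was last applied."""
    return release_fingerprint(config, image_digests) != applied_fingerprint
```

For example, changing only a replica count in config alters the fingerprint even though every image digest is the same, so the update goes out without a rebuild.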

But here is the best part:

The approach is additive. You don’t get rid of your existing CI. You just make sure that it writes to image repos and then you add Flux. Finally, all this scales to multi-cluster pipelines where deployment may be to Dev, Test, UAT and Production clusters.
A guide to multi-cluster pipelines in GitOps is provided here.
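
To illustrate "build once, promote everywhere", this sketch moves a single image digest through Dev, Test, UAT and Production one step at a time. The environment names come from the text above; the promote function, the digest strings and the one-step-per-call policy are my own assumptions for the example.

```python
ENVIRONMENTS = ["dev", "test", "uat", "production"]


def promote(deployed: dict[str, str], image_digest: str) -> dict[str, str]:
    """Advance `image_digest` into the first environment not yet running it.

    The same digest is reused at every stage; nothing is rebuilt.
    """
    updated = dict(deployed)   # leave the caller's record untouched
    for env in ENVIRONMENTS:
        if updated.get(env) != image_digest:
            updated[env] = image_digest
            return updated
    return updated             # already promoted everywhere
```

In a GitOps setup each returned mapping would be a commit to the config repo, so every promotion is recorded with its author and reason.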

Finally, let’s turn to the third best practice, in which Git helps us record changes.

Record everything in Git to have audit and compliance

Weaveworks customers are using GitOps practices today in order to pass SOC 2 compliance audits. The big deal here is that normally people tell you to buy an expensive compliance product to pass these tests. We think that you don’t have to buy that expensive product, at least not for many day to day cases.

If you do things right, the auditor can look at Git and see who made any changes, when and why, and how that impacted the running system deployments. Note that the same approach can help with HIPAA and some PCI too. And all these tests famously introduce process overhead. In GitOps that is absorbed into normal developer practice. Let’s look at why that is.

Understanding SOC 2

The core idea is that a company needs to have divisions between roles and rules on what they can do, and keep a record of obedience. For example there must be a bright line between who can change production code and who can change the monitoring of production code. Theoretically this implies that the company would need two bad actors to sneak some evil code into production. Managing this is hard in a fast moving delivery environment with small agile teams, which is why the traditional enterprise approach is to quash velocity with process.

File integrity monitoring

I’d like to quote from the Threat Stack blog: “When’s the last time someone made an unauthorized change to your system files?”. It turns out that provided we can carve out suitable roles, we can track such changes in Git. You need at least two roles:

A role that has write access to source
A role that can look at production but can make no changes (or only “some”)

You can use GitHub RBAC to control who is in which teams and thereby who has access. This provides you with roles. And GitHub keeps track of every change and when it happened.
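
The role split can be expressed as a simple check over recorded changes. The sketch below assumes two hypothetical teams, "developers" and "operators", and adds an approver field so that no one can approve their own change. That approver rule is my assumption for illustrating the "two bad actors" idea, not something SOC 2 or GitHub prescribes in this form.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    author: str
    approver: str
    target: str   # "source" or "monitoring"


# Hypothetical mapping from what is being changed to the team allowed to change it.
WRITE_TEAM = {"source": "developers", "monitoring": "operators"}


def violations(change: Change, teams: dict[str, set[str]]) -> list[str]:
    """List every separation-of-duties rule that `change` breaks."""
    problems = []
    allowed = teams.get(WRITE_TEAM[change.target], set())
    if change.author not in allowed:
        problems.append(f"{change.author} has no write access to {change.target}")
    if change.approver == change.author:
        problems.append(f"{change.author} approved their own change")
    return problems
```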

Adding Weave Flux to complete the audit trail

Capturing source changes is not enough. We must also track releases, staging and rolling deployments. And we must understand how image and config changes map to live cluster objects. This is enabled by using Weave Flux which itself keeps notes for you, and writes them all into Git. This means that your desired state is up to date, correct and observable. And everything that you needed to record has been kept for that day when the auditors visit.
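
As a sketch of what an auditor's question looks like against such a record, the code below filters a list of audit entries for one workload since a given date and returns them in time order. The AuditEntry fields mirror the "who, when and why" from earlier in this post; the type, the field names and the changes_to function are hypothetical and do not reflect the format of the notes Flux writes.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class AuditEntry:
    commit: str       # Git commit that recorded the change
    author: str       # who
    when: datetime    # when
    reason: str       # why (the commit message)
    workload: str     # which live object was affected
    image: str        # what ended up running


def changes_to(entries: list[AuditEntry], workload: str, since: datetime) -> list[AuditEntry]:
    """Return changes to `workload` made at or after `since`, oldest first."""
    matching = (e for e in entries if e.workload == workload and e.when >= since)
    return sorted(matching, key=lambda e: e.when)
```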

GDPR and security by design

A quick note for GDPR folks; please skip otherwise. “Security by design” is a requirement of GDPR. Our approach (GitOps) is aligned with OWASP, a major project which defines practices for this. In particular we believe our best practices adhere to these OWASP info sec principles:

Confidentiality: only allow access to data for which the user is permitted
Integrity: ensure data is not tampered with or altered by unauthorized users
Availability: ensure systems and data are available to authorized users when they need them

Summary & next steps

What are you waiting for? Please try Weave Cloud now and let us know what you think. We are excited about the security and compliance benefits. If you are using containers and/or Kubernetes and other cloud native tools — I want to hear from you.

Remember that we help you implement the three best practices, and that these can help you be secure and compliant without buying hefty and expensive “solutions”.

Keep a record in Git
Avoid unnecessary rebuilds if you can use config instead
Use pull based deployment

Other blogs in this series include:

Alexis is the co-founder and CEO of Weaveworks. He is also the chairman of the TOC for CNCF, and the co-founder of the Coed:Code meetups. Previously he was at Pivotal, as head of products for Spring, RabbitMQ, Redis, Apache Tomcat and vFabric. Alexis was responsible for resetting the product direction of Spring and transitioning the vFabric business from VMware. Alexis co-founded RabbitMQ, and was CEO of the Rabbit company acquired by VMware in 2010, where he worked on numerous cloud platforms. Rumours persist that he co-founded several other software companies including Cohesive Networks, after a career as a prop trader in fixed income derivatives, and a misspent youth studying and teaching mathematical logic.

Originally published at www.weave.works.