Running Concourse-based CD on Azure Kubernetes

It’s been about a year since we started working with Varian to help their DevOps team build a Kubernetes-native CD stack on Azure AKS. You can read more about their use case at https://customers.microsoft.com/en-us/story/varian-health-provider-azure.

Today we’d like to share what our Concourse-based CD stack looks like, and what it takes to stand up and operate your own full CD stack on Kubernetes.

Interested to learn more? Keep reading below.

Deployment

First, a few words about our main Concourse-based deployment:

  • 60 CD pipelines
  • 400 jobs
  • Deployed on Kubernetes
  • Git is the source of truth for everything — configuration, application data, CD pipelines, and pipeline tasks
  • CD stack is upgraded on a regular basis, including Concourse (from 3.14.x to 4.2.x, including all intermediate versions) as well as other components of the stack — Vault, Consul, etc
  • CD upgrades are done in the staging environment first, before upgrading production

Architecture

Open-Source CD stack on Kubernetes

The entire CD stack consists of 10+ open-source components, with Concourse being the main engine for CD pipelines and Vault as a backend for secrets. All components are deployed to Kubernetes via Helm. Of course, there is monitoring, alerting, and log analytics in place, so when something happens with the infrastructure, you know right away:

  • CD — Concourse
  • Secrets — Vault (with Consul as a backend)
  • Monitoring & Alerting — Grafana, Prometheus, InfluxDB w/ Telegraf
  • Logging & Alerting — Kibana, Elasticsearch w/ Elastalert
  • Misc — Chartmuseum, Letsencrypt Certificate Manager

Configuration

Important Helm Chart Parameters

Our production instance of Concourse is installed via an upstream Helm chart. Below we are highlighting the most important parameter changes for our deployment. You may find these useful, but your mileage may vary.

# You definitely want to have parts of your database encrypted
# to avoid leaking out sensitive information such as credentials
encryption:
  enabled: true
secrets:
  encryptionKey: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

# Since some of our resources/jobs may generate fairly large outputs
# we prefer to minimize volume streaming across workers. Streaming
# of large volumes does fail on occasion and may cause your jobs
# to fail as well
concourse:
  web:
    containerPlacementStrategy: volume-locality
    baggageclaimResponseHeaderTimeout: 2m

# Concourse keeps job logs in the database. Our jobs generate A LOT
# of logs, so it's a good idea to prevent database from growing
# indefinitely
concourse:
  web:
    maxBuildLogsToRetain: 2000
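
Concourse expects the encryption key to be exactly 16 or 32 characters long. As a minimal sketch, here is one way to generate a random 32-character value for secrets.encryptionKey. The function name generate_encryption_key is ours for illustration and is not part of Concourse or the Helm chart:

```shell
# Hypothetical helper: print a random 32-character hex string that can
# be used as secrets.encryptionKey (Concourse accepts 16 or 32 chars).
generate_encryption_key() {
  # Read 16 random bytes and print them as 32 lowercase hex digits.
  head -c 16 /dev/urandom | od -An -tx1 | tr -d ' \t\n'
}
```

Keep the generated value somewhere safe: without the same key, the encrypted parts of the database cannot be read after a reinstall.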

Managed Database (on Azure)

We decided not to manage our own database and instead rely on the managed Azure Database for PostgreSQL service.

See the configuration below: you will need to specify host/db/user/password and certificate. The Azure certificate required to communicate over SSL with your Azure Database for PostgreSQL can be obtained by following these instructions.

# Hosted PostgreSQL on Azure, so we don't have to manage our
# own database
postgres:
  host: aptomi-xxxxx.postgres.database.azure.com
  sslmode: verify-ca
  database: concourse-prod
secrets:
  postgresUser: aptomi@aptomi-xxxxx-postgresql
  postgresPassword: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
  postgresCaCert: |-
    -----BEGIN CERTIFICATE-----
    ...
    -----END CERTIFICATE-----
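
Before pointing Concourse at the managed database, it can help to check that the host, credentials, and CA certificate work together. The following is a sketch only, assuming the psql client is installed and the CA certificate is saved to a local file. The function name pg_conninfo and the file name azure-ca.pem are placeholders of ours:

```shell
# Hypothetical helper: build a libpq connection string that enforces
# the same sslmode=verify-ca check Concourse is configured with.
pg_conninfo() {
  # $1 = host, $2 = database, $3 = user, $4 = path to CA certificate
  printf 'host=%s port=5432 dbname=%s user=%s sslmode=verify-ca sslrootcert=%s' \
    "$1" "$2" "$3" "$4"
}

# Usage (psql prompts for the password, then runs a trivial query):
#   psql "$(pg_conninfo aptomi-xxxxx.postgres.database.azure.com \
#     concourse-prod aptomi@aptomi-xxxxx-postgresql ./azure-ca.pem)" \
#     -c 'SELECT 1'
```

If the certificate is wrong, the connection should fail at the TLS step rather than at authentication, which makes the two kinds of misconfiguration easy to tell apart.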

Workers

Worker tuning turned out to be the most important part of Concourse configuration for us. Having Concourse workers running reliably is extremely important, as they are the ones picking up and executing your jobs. If Concourse workers are not healthy, your CI/CD jobs won’t execute reliably, and that’s not something you’d ever want.

With Concourse workers running on Kubernetes, things become a bit more fluid and dynamic. Worker pods can be re-scheduled for various reasons (node goes offline, node is under maintenance, resource limits are hit, scheduler takes an action, etc) and this behavior is normal. When a worker pod goes away, though, the Concourse TSA worker management logic, which by default expects a fairly “static” picture of workers, would often put the worker into a stalled state. A human operator would then need to step in to recover the stalled worker and re-register it in Concourse.

If you are lucky, the problematic worker can be recovered by running concourse land-worker and fly prune-worker. If you are not so lucky (that happens too), you will have to find the stalled worker in the Concourse database and carefully delete the corresponding entries.
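
The pruning step can be scripted. This is a rough sketch, not the procedure we ran in production: it assumes that fly workers prints one worker per row with the worker name in the first column and the state as a separate field, which may differ between fly versions. Both function names are hypothetical:

```shell
# Hypothetical filter: read `fly workers` table output on stdin and
# print the name (first column) of every row that has a "stalled" field.
stalled_worker_names() {
  awk '{ for (i = 2; i <= NF; i++) if ($i == "stalled") { print $1; break } }'
}

# Prune every stalled worker on the given fly target ($1).
prune_stalled_workers() {
  fly -t "$1" workers | stalled_worker_names | while read -r name; do
    # Redirect stdin so fly cannot consume the remaining worker names.
    fly -t "$1" prune-worker -w "$name" < /dev/null
  done
}
```

Running prune_stalled_workers main would prune stalled workers on the target named main. Review the output of fly workers by hand first if you are unsure about the column layout.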

Fortunately, there is support for ephemeral workers (see this pull request), which immediately go away instead of stalling. This makes life with Concourse on Kubernetes much easier. The drawback of this approach is that TSA will clean up information about worker containers/volumes from the DB as soon as the worker is gone. This may lead to certain side effects (e.g. orphan data on persistent volumes), but we still believe that, for the time being, this is better from an operator standpoint than manually dealing with stalled workers.

We are looking forward to improvements in Concourse worker management and disk space management code in 5.0+. E.g. see these for more context:

# Enable ephemeral workers to run on Kubernetes
concourse:
  worker:
    ephemeral: true

# For persistent volumes, we'd like baggageclaim to use btrfs
# and also increase volume size to 256Gi. Workers download a lot of
# stuff (resources/images/volumes/etc) and we need to ensure there
# is space for all of it. At the very least, set it to 128Gi for our
# deployment.
concourse:
  worker:
    baggageclaim:
      driver: btrfs
persistence:
  worker:
    size: 256Gi

# We also want to raise the TSA heartbeat interval to 120s, so workers
# have reasonable time to respond on Azure AKS, as networking
# sometimes gets flaky during certain parts of the day.
concourse:
  web:
    tsa:
      heartbeatInterval: 120s

Authentication & Teams (on Azure)

If you are on Azure, you will likely want to set up OAuth/OpenID authentication, so that your users can authenticate in Concourse via their Microsoft credentials. Here is how you can configure it with Azure AD 2.0 endpoints:

  1. Look up your OpenID issuer URL via https://login.microsoftonline.com/yourcompany.onmicrosoft.com/v2.0/.well-known/openid-configuration
  2. Create a new app “Concourse” under Azure Portal -> Azure Active Directory -> App Registrations. Specify the callback URL https://<concourse-external-url>/sky/issuer/callback, add the permission Sign in and read user profile, and finally edit the app manifest manually and set "groupMembershipClaims": "SecurityGroup" to ensure that Azure provides Concourse with a list of security groups for every authenticated user.
  3. Configure OIDC in values.yaml:

oidc:
  enabled: true
  displayName: MyCompany
  issuer: "https://login.microsoftonline.com/<mycompanyid>/v2.0"
  scope: "openid,profile,email"
secrets:
  oidcClientId: "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
  oidcClientSecret: "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

Finally, create your Concourse teams. For us, the best practice is that every Concourse team gets mapped to one Azure AD security group and one local service user. A local service user for every team allows us to programmatically update Concourse pipelines from git, trigger pipelines from other pipelines, and so on. Example:

fly -t main set-team -n myteam --non-interactive \
  --local-user=myteamsvcuser \
  --oidc-group xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
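
To illustrate what the local service user makes possible, here is a sketch of updating pipelines from a Git checkout. It assumes one pipeline per <name>.yml file in a directory and a fly target that is already logged in as the team’s service user. The function name sync_pipelines and the directory layout are hypothetical, not our actual tooling:

```shell
# Hypothetical helper: set every pipeline found in a directory of
# checked-out pipeline definitions, one pipeline per <name>.yml file.
sync_pipelines() {
  # $1 = fly target (already logged in as the team's service user)
  # $2 = directory containing pipeline YAML files
  for file in "$2"/*.yml; do
    [ -e "$file" ] || continue          # directory had no .yml files
    name=$(basename "$file" .yml)       # pipeline name = file name
    fly -t "$1" set-pipeline -n -p "$name" -c "$file" || return 1
  done
}
```

For example, sync_pipelines main ./pipelines would apply every definition under ./pipelines to the target named main, stopping at the first failure.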

Secrets

Given that Concourse supports Vault as a credential manager and Vault was already used internally at Varian, it was a no-brainer to start using it as a backend for secrets.

At first, it wasn’t entirely convenient as existing secrets had to be copied over from previously defined and well-known locations to the locations where Concourse expects them (i.e. it made secret rotation a bit more challenging). But overall the integration worked pretty well and it was convenient for Dev and Ops engineers to use Vault secrets from the pipelines.

Example of Concourse approle configuration in Vault:

# create approle for concourse
vault write auth/approle/role/concourse bind_secret_id=true
# get role_id
vault read -field=role_id auth/approle/role/concourse/role-id
# get secret_id
vault write -field=secret_id -f auth/approle/role/concourse/secret-id
# create vault policy for concourse 
vault write sys/policy/default policy=@concourse-vault-policy.hcl

where concourse-vault-policy.hcl is:

path "secret/concourse/*" {
  capabilities = ["read", "list"]
}

Finally, enable Vault integration in values.yaml for Concourse:

vault:
  enabled: true
  authBackend: approle
secrets:
  vaultAuthParam: "role_id=xxx,secret_id=xxx"
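
When a pipeline references ((name)), Concourse looks for the secret under a pipeline-scoped path first and then falls back to a team-scoped path. The sketch below prints both candidates. It assumes the Concourse path prefix is configured to match the policy above (secret/concourse); the function name and the team, pipeline, and secret names are placeholders:

```shell
# Print the Vault paths Concourse tries, in order, when a pipeline
# references ((name)): the pipeline-scoped path first, then the
# team-scoped one.
concourse_secret_paths() {
  # $1 = path prefix, $2 = team, $3 = pipeline, $4 = secret name
  printf '%s/%s/%s/%s\n' "$1" "$2" "$3" "$4"
  printf '%s/%s/%s\n' "$1" "$2" "$4"
}

# Usage:
#   concourse_secret_paths secret/concourse myteam mypipeline db_password
# prints:
#   secret/concourse/myteam/mypipeline/db_password
#   secret/concourse/myteam/db_password
```

By default Concourse reads the field named value, so a team-wide secret could be stored with vault write secret/concourse/myteam/db_password value=... and referenced in a pipeline as ((db_password)).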

Install & upgrade

Upgrade Concourse using Concourse?…

How exactly do we deploy and manage our production CD stack with 10+ open-source components in it?

First of all, we don’t have just a production CD stack. There is also a staging CD stack, a demo CD stack, and a couple of others too.

Initial installation of a new CD environment, as well as management of existing CD environments is done via something that we call infra Concourse. Infra Concourse has pipelines for managing production CD stack, staging CD stack, demo CD stack, etc. So, yes… we do have 4 Concourses, with one managing the rest of them :)

Infra Concourse — small subset of jobs, managing prod and staging CD environments

Deployment parameters for every environment and each of the 10+ components of our CD stack (Concourse, Grafana, Kibana, Elasticsearch, etc) are stored in Git. That enables us to manage our deployed CD environments “as code”.

Managing CD environments “as code”

This translates into the ability to do the following things in a GitOps way:

  • Install. Deploy all components to a new Kubernetes cluster.
  • Re-install (if ever required). If we want to blow away an existing environment and re-install it, it can be done pretty easily. Everything is in git and 100% repeatable, remember?
  • Change Configuration. Roll out a new monitoring rule or an alert? Change the corresponding file, commit/push it into Git, run pipeline.
  • Upgrade. Change values (upgrade to a newer chart version, upgrade to a new release of Concourse, Vault, etc), commit/push them into Git, run pipeline.
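
As an illustration of what a pipeline task behind these operations might run, here is a minimal sketch built around helm upgrade --install, which handles both first-time installs and upgrades. The function name, the environments/<env>/<component>/values.yaml layout, and the cd-<env> namespace naming are assumptions made for this example, not our actual repository structure:

```shell
# Hypothetical task script body: install or upgrade one component of a
# CD environment from values files kept in Git.
deploy_component() {
  # $1 = environment (e.g. staging), $2 = release name,
  # $3 = chart reference, $4 = chart version
  helm upgrade --install "$2" "$3" \
    --version "$4" \
    --namespace "cd-$1" \
    -f "environments/$1/$2/values.yaml"
}
```

Because the chart version and values file both come from Git, re-running the same commit should converge the environment to the same state, which is what makes install, re-install, and upgrade the same operation.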

Having a framework like this is very handy. So, for example, when Concourse 5.0 comes out in early 2019, it’s going straight to our staging CD environment through the Infra Concourse. And if all goes well, it’s going to be promoted into the production CD environment shortly after.

Kubernetes (Azure ACS and AKS)

Initially, we ran on Azure ACS, which was Microsoft’s beta Kubernetes service (not managed). When Azure went GA with Kubernetes, we followed and switched to Azure AKS, where Kubernetes masters are managed by the Azure platform. Right now we are on the latest k8s 1.11.5.

Things are fairly stable today, but we did run into several major issues over time. The ones worth pointing out are:

  1. Intermittent “TLS handshake timeout” error. It may come randomly from any endpoint behind Azure LB (e.g. if you use kubectl to talk to the Kubernetes API, fly to talk to Concourse, get a value from Vault, or just open an application running on k8s in your browser). This error doesn’t happen very often, but it does hurt when it happens. Impact example: a Concourse job is unable to look up a secret from Vault and fails (it tries to talk to Vault, but gets “TLS handshake timeout” from Azure LB as a reply)
  2. Intermittent network & VM issues (DNS stops working on AKS nodes, Azure AKS VMs requiring reboot to restore the service, etc). Impact example: existing pods can no longer talk to other pods/services in the same cluster, occasionally broken application services such as Consul.
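
For commands issued from your own scripts and tasks (kubectl, fly, vault), one common mitigation for intermittent errors like these is to retry with a growing delay. The helper below is a generic sketch under that assumption, not something specific to Azure or Concourse, and it does not help with secret lookups that Concourse itself performs:

```shell
# Run a command up to $1 times, sleeping between attempts and doubling
# the delay (starting at $2 seconds) after each failure.
retry() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      return 1                      # give up after the last attempt
    fi
    sleep "$delay"
    delay=$((delay * 2))
    n=$((n + 1))
  done
}

# Usage: try up to 5 times, waiting 1s, 2s, 4s, 8s between attempts.
#   retry 5 1 kubectl get nodes
```

Retries hide transient failures but not persistent ones, so it is worth keeping the attempt count low and alerting on commands that still fail.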

Monitoring & Alerting

We do have health monitoring and alerting set up for every CD environment. So when issues happen, we get Slack alerts and an operator can look into what went wrong:

Slack alerts for CD infrastructure

The bottom line is — even if you are consuming managed infrastructure from a major cloud provider, there will be infrastructure issues affecting uptime of your applications. Keeping your open-source CD stack up and running 24x7 will require time and effort from your DevOps/SRE team. Just something to keep in mind.

Final Words

I hope that operators who run Concourse-based CD will find this summary useful.

In subsequent posts we are also planning to share the details of our use case for CI/CD and describe what problems Concourse is solving for our end users.

In the meantime, if you have any questions, don’t hesitate to drop me a note at roman@aptomi.io.