Published in Concourse CI

Earning Our Wings:

Stories and Findings From Operating a Large-scale Concourse Deployment

  • Better understand the ‘operator’ persona and the needs of an operator when running a multi-tenant Concourse
  • Observe Concourse performance “at scale”
  • Build out a set of recommendations for operationalizing, monitoring and logging a Concourse installation
  • Identify common support issues and their solutions
  • Scaling too aggressively before monitoring is sufficient or an operator “god view” is in place
  • Not gathering enough debugging information to diagnose an incident
  • Backing up user data in case of an emergency

Day 1

Grafana dashboards on metrics.concourse.ci
blackbox_job: &blackbox_job
  name: blackbox
  release: concourse
  properties:
    blackbox:
      syslog:
        destination:
          transport: tls
          address: ((papertrail_log_destination))
jobs:
- *blackbox_job
  • The number of containers on any worker should not approach the 256 container limit
  • Worker volumes and disk usage should remain relatively flat
  • HTTP response duration should remain flat (and ideally below 100ms)
  • Goroutines should not be leaked and pile up on the ATCs
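
The four signals above lend themselves to a simple automated check. What follows is a minimal sketch in Python, assuming metric samples have already been pulled from whatever metrics store backs the dashboards. Only the 256-container limit and the 100ms target come from the list above. The function names, the 90% headroom factor, and the 10% growth tolerance are illustrative assumptions, not values used on Wings.

```python
from typing import Dict, List, Sequence

CONTAINER_LIMIT = 256      # per-worker container limit cited in the list above
HTTP_TARGET_MS = 100.0     # "ideally below 100ms"
HEADROOM = 0.9             # assumption: warn at 90% of the container limit
GROWTH_TOLERANCE = 0.10    # assumption: more than 10% growth counts as "not flat"


def is_flat(samples: Sequence[float], tolerance: float = GROWTH_TOLERANCE) -> bool:
    """Compare the mean of the second half of a window with the first half.

    Only upward drift is flagged; a falling series is treated as flat.
    """
    if len(samples) < 2:
        return True
    mid = len(samples) // 2
    first = sum(samples[:mid]) / mid
    last = sum(samples[mid:]) / (len(samples) - mid)
    if first == 0:
        return last <= 0
    return (last - first) / first <= tolerance


def check_health(
    containers_per_worker: Dict[str, int],
    disk_usage: Sequence[float],
    http_ms: Sequence[float],
    goroutines: Sequence[float],
) -> List[str]:
    """Return a list of human-readable warnings; an empty list means healthy."""
    warnings: List[str] = []
    for worker, count in sorted(containers_per_worker.items()):
        if count >= CONTAINER_LIMIT * HEADROOM:
            warnings.append(
                f"{worker}: {count} containers is close to the {CONTAINER_LIMIT} limit"
            )
    if not is_flat(disk_usage):
        warnings.append("worker disk usage is trending upward")
    if not is_flat(http_ms):
        warnings.append("HTTP response duration is trending upward")
    if http_ms and max(http_ms) >= HTTP_TARGET_MS:
        warnings.append(f"HTTP response duration reached {max(http_ms):.0f}ms")
    if not is_flat(goroutines):
        warnings.append("goroutine count is growing; possible leak on an ATC")
    return warnings
```

Calling `check_health({"worker-1": 100}, [50, 51, 50, 49], [40, 42, 41, 40], [1000, 1010, 990, 1005])` returns an empty list, because every series is steady and below its threshold. Splitting each window in half is a deliberately crude trend test; a real alerting rule would more likely live in the monitoring system itself.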

Day 2… and Beyond!

162 resources used by 2 tasks

Service-Level Objectives

Wings’ SLO Targets
Wings’ first month of SLO availability
  • Chrome is particularly bad at rendering the data provided by Elm. If you have really large build histories, the current workaround is to use Firefox (#1543)
  • Our poor performance was in large part due to inefficient queries against our Postgres database. We were constantly hitting our database connection limit on CloudSQL. After a lot of debugging, we were able to fix this issue in #1734
Hitting 3 out of our 5 SLO targets in the month of November 2017
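
Scoring a month against a set of SLO targets, as in the tally above, comes down to comparing each measured ratio with its target. Here is a minimal sketch in Python. The `SLO` class, its field names, and the choice to treat an empty window as met are all assumptions made for illustration; the actual Wings targets and measurements are not reproduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SLO:
    name: str      # hypothetical label for the objective
    target: float  # required fraction of good events, e.g. 0.99
    good: int      # events in the window that met the objective
    total: int     # all events in the window

    @property
    def attained(self) -> float:
        # Assumption: a window with no events counts as fully attained.
        return self.good / self.total if self.total else 1.0

    @property
    def met(self) -> bool:
        return self.attained >= self.target


def summarize(slos: List[SLO]) -> str:
    """Produce a one-line tally of how many targets were hit."""
    hit = sum(1 for slo in slos if slo.met)
    return f"Hit {hit} out of {len(slos)} SLO targets"
```

With placeholder numbers, `summarize([SLO("example-a", 0.99, 995, 1000), SLO("example-b", 0.95, 90, 100)])` reports one target hit out of two, because the second objective attains 0.90 against a 0.95 target.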

Key Takeaways
