Distributed Garbage Collection

Shashwathi Reddy
Jun 3, 2018 · 4 min read

Before diving into the weeds of Concourse architecture, let’s briefly take a look at container and volume lifecycles in Concourse. Going forward in this post I will refer to container and volume together as a “worker resource”. In the current architecture, there are multiple ways for worker resource creation to be triggered; like when jobs are started in pipelines or checking for a newer version of a resource. The ATC is responsible for managing worker resource lifecycle like creating, transitioning and destroying them. Workers are designed to be dumb and follow the orders issued by ATC. Removal of worker resource on worker and its reference(in ATC) is done as part of Garbage collection(GC) process in ATC.

GC in ATC includes cleaning up of finished builds, build logs, old resource cache, worker resource cleanup, transitioning workers state etc. One of the classic techniques used for GC is “Mark and Sweep”, it essentially marks the worker resources that are to be cleaned up and then runs another process to actually remove them. In this post we are going to dive deep into worker resource cleanup only. What are the conditions under which a worker resource should be marked for removal? Few reasons could be that containers are around past the expiration time, corresponding volumes of past builds that need not be cached, etc. Once the mark phase is done, ATC would connect to Garden server(container management server) and baggage claim server(volume management server) on each worker and then destroy worker resource iteratively.

During GC cycle, the number of outbound connections from ATC would proportionally increase with number of workers and churn of builds. One potential solution is to increase the number of instances/vms of ATC to distribute load. The problem with this approach is that GC synchronization requires mutex so ATC cannot reap the benefits of multiple instances.

Image for post
Image for post
Current Garbage collection process

What is the simplest thing that we could do to decrease the connections from ATC to worker instances?

One way would be to introduce a bulk destroy API on the worker for worker resources. We implemented a bulk destroy call on the baggage claim server and Garden server so ATC needs to make just one network call for Sweep phase. It was definitely a step forward in terms of reducing the number of network connections from ATC to worker.

We felt the present Concourse design was constricting us from making any further enhancements to GC. In current world, the ATC holds the desired state of the system (i.e., number of worker resources that needs to be created for builds or resource checks). It also holds information on what is the actual state of the system such as worker resources across all workers. The ATC is further responsible for transitioning from actual to desired state too. We spent a significant effort to move this central design system into more scalable and well balanced architecture. The first step towards this change was to make workers smarter. We enabled worker with process to periodically poll the ATC for what is the desired state and also report their actual status to ATC. This design will

  1. Makes worker responsible for maintaining the desired state.
  2. Reduce the amount of work to be done by ATC to maintain worker state.
  3. Make workers more autonomous.
Image for post
Image for post
Distributed Garbage collection process

We want to continue to make further enhancements on this path of work, and continue to give more power to the worker process; like defining a CRUD API for worker resources. ATC is currently still responsible for creating worker resources so hopefully (not so) far out in the future we can have worker resource APIs manage the lifecycle. This will also reduce the intertwined dependency of ATC on Garden server API and baggageclaim server API. It will also also pave the path for any future container/volume management systems to implement these worker resource APIs.

Success on Wings!

Image for post
Image for post
Database Queries by ATC (4 ATCs in this deployment) reduced by 40% after the deployment of GC changes
Image for post
Image for post
CPU Utilization of ATC vm reduced by 10% after deployment of GC changes.

Concourse CI

The Continuous Thing Do-er

Medium is an open platform where 170 million readers come to find insightful and dynamic thinking. Here, expert and undiscovered voices alike dive into the heart of any topic and bring new ideas to the surface. Learn more

Follow the writers, publications, and topics that matter to you, and you’ll see them on your homepage and in your inbox. Explore

If you have a story to tell, knowledge to share, or a perspective to offer — welcome home. It’s easy and free to post your thinking on any topic. Write on Medium

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store