Automated and Stateful Patching of Consul Clusters
On the path to no ops
Consistently patching a machine’s operating system is essential for protecting against the latest security vulnerabilities and bugs. In many large organizations, such as Capital One, each series of patches is applied as a new machine image. All that is required to update an application is to destroy the old machines and recreate the application using machines based on the new image. This is simple in a stateless system, since the machines can be deleted and re-created with the latest patch without any problems. However, if you have a database, or any information that you need to maintain, patching becomes much more challenging. In this case you have to save or move the data to a different location before updating the machine, and then retrieve the data afterward. This process is not especially problematic either, and it can be automated with some configuration.
However, the difficulty that many engineers now face is maintaining state while patching a distributed system, which in my case is a cluster of Consul servers running on Amazon’s Elastic Container Service (ECS). Consul is a service discovery and health checking tool that lets applications use DNS or HTTP to find the services they depend upon and routes them to healthy hosts. Furthermore, Consul is a distributed and highly available system built on a consensus protocol based on Raft.
What makes patching machines running a Consul cluster so challenging is that Raft requires a majority of servers to continuously run in order to keep the cluster’s state. So, if there are three Consul servers in a cluster, then at least two will need to be running at all times.
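The majority requirement is simple arithmetic. As a minimal sketch (the function names are illustrative and not part of Consul), the quorum size and the number of servers that can be lost for a given cluster size could be computed like this:

```python
def quorum(servers: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    if servers < 1:
        raise ValueError("a cluster needs at least one server")
    return servers // 2 + 1


def failure_tolerance(servers: int) -> int:
    """How many servers can be down while a majority is still running."""
    return servers - quorum(servers)


for size in (1, 3, 5):
    print(size, quorum(size), failure_tolerance(size))
```

For a three-server cluster this gives a quorum of two and a tolerance of one failure, which is why outdated servers have to be replaced one at a time.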
There are benefits to this setup. If one server becomes disconnected from the others, it can regain the correct state once it rejoins the cluster (since the majority of servers will override the minority). While this provides a fault-tolerant and highly available system, it can be difficult to update the underlying machines, since a majority of servers always needs to be running.
Typically, this is a very manual process. One new Consul server is created and added to the cluster before deleting an outdated one. This is repeated for each server until no outdated servers are left. In this walkthrough, I will demonstrate a process for automatically updating Consul’s underlying machines whenever new operating system updates become available.
- First, we will create a framework that uses Consul’s blocking calls in order to listen for events related to spinning up or removing machines.
- Next, we will go over the lock strategy used in order to guarantee one-time execution of actions.
- Finally, we will integrate what we have built with CloudWatch and Lambda to trigger the entire patching process whenever an updated Amazon Machine Image (AMI) is created and registered.
Building a Consul Blocking Framework
The first part of this walkthrough will go over the triggering and execution of the machine creation and deletion functions. The functions themselves are fairly basic.
- The creation function looks to see if there is an updated AMI. If there is, it creates a new machine with a new Consul server.
- The deletion function looks to see if there are extra machines running Consul servers. If there are, it deletes the oldest machines.
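The decision logic behind these two functions can be sketched as pure functions, kept separate from the cloud calls that would actually launch or terminate machines. The `Machine` type, the function names, and the idea of a `desired_count` are assumptions made for illustration; the original functions are not shown in this walkthrough.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class Machine:
    """Hypothetical record of one machine running a Consul server."""
    instance_id: str
    ami_id: str
    launched_at: datetime


def outdated(machines: List[Machine], latest_ami: str) -> List[Machine]:
    """Creation check: machines that are not running the latest image."""
    return [m for m in machines if m.ami_id != latest_ami]


def oldest_extras(machines: List[Machine], desired_count: int) -> List[Machine]:
    """Deletion check: the oldest machines beyond the desired cluster size."""
    extra = len(machines) - desired_count
    if extra <= 0:
        return []
    return sorted(machines, key=lambda m: m.launched_at)[:extra]
```

With three servers on an old image and one freshly created server, `oldest_extras(machines, 3)` would return only the single oldest machine, so a majority stays up while the cluster shrinks back to three.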
In order to avoid continuously calling the functions to check if they should be executed, I decided to have them run asynchronously using Consul’s blocking queries. This allows the functions to utilize long polling and ensures they are being executed only when a change occurs. This simplifies the logic and saves on processing power and complexity.
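To make the long-polling idea concrete, here is a minimal sketch of a blocking query loop against Consul’s HTTP API. It relies on the `index` and `wait` query parameters and the `X-Consul-Index` response header; the agent address and the `watch` helper are assumptions for illustration, and the fetch function is passed in so the loop can be exercised without a running agent.

```python
import json
import urllib.request
from typing import Any, Callable, Tuple

CONSUL_ADDR = "http://127.0.0.1:8500"  # assumption: a local Consul agent


def fetch_service(name: str, index: int, wait: str = "30s") -> Tuple[int, Any]:
    """One blocking call: returns when the service changes or `wait` elapses."""
    url = f"{CONSUL_ADDR}/v1/catalog/service/{name}?index={index}&wait={wait}"
    with urllib.request.urlopen(url, timeout=60) as response:
        new_index = int(response.headers["X-Consul-Index"])
        return new_index, json.loads(response.read())


def watch(fetch: Callable[[int], Tuple[int, Any]],
          on_change: Callable[[Any], None],
          rounds: int) -> int:
    """Call `on_change` only when the index moves; return the last index seen."""
    index = 0
    for _ in range(rounds):
        new_index, data = fetch(index)
        if new_index < index:
            index = 0  # index went backwards: start over from scratch
            continue
        if new_index != index:
            on_change(data)
        index = new_index
    return index
```

A call such as `watch(lambda i: fetch_service("consul-dashboard", i), print, rounds=10)` would print the service’s nodes once at start-up and then only when they change. A request that merely times out returns the same index and triggers nothing.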
Instead of building a blocking query for each of these functions separately, we will build a basic Consul blocking framework, so that in the future any number of Consul blocking queries can be incorporated easily and efficiently. The framework is built on Python’s asyncio library and uses decorators to create the Consul blocking queries. The decorator creates a coroutine based on the Consul blocking query and puts the result of that query into a queue. This decouples the events from the function executions, which matters if a function fails or times out: the event can be put back in the queue on failure, ensuring it will eventually be executed. In addition to the queue, there is also a handler, a coroutine that polls the queue for new events and calls the corresponding function.
The complete flow is shown in the diagram below.
Now it is possible to add any number of Consul blocking queries that correspond to various actions in Consul. Below is an example of how I set up the creation and deletion functions, which are called “spin up” and “spin down”. Spin up creates a coroutine that awaits the query consul.event.list(name=consul-rehydrate), which is triggered when an event called consul-rehydrate is fired. Spin down awaits consul.catalog.service(service=consul-dashboard), and it is triggered when any change occurs in the consul-dashboard service (which is related to the number of Consul servers).
Finally, the logic for each of these functions is simply added to the async definitions and is called by the handler when the respective event is triggered.
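A minimal sketch of such a framework might look like the following. The class and method names are hypothetical, and `fake_query` is a stand-in that reports one change and then stays silent; with a real asynchronous Consul client, each stand-in would be replaced by a coroutine wrapping the corresponding blocking call.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, Tuple

# A query takes the last seen index and resolves to (new_index, data).
Query = Callable[[int], Awaitable[Tuple[int, Any]]]
Action = Callable[[Any], Awaitable[None]]


class BlockingFramework:
    def __init__(self) -> None:
        self.routes: Dict[str, Tuple[Query, Action]] = {}

    def on(self, query: Query) -> Callable[[Action], Action]:
        """Decorator: run the function whenever `query` reports a change."""
        def register(func: Action) -> Action:
            self.routes[func.__name__] = (query, func)
            return func
        return register

    async def _watch(self, name: str, query: Query, queue: asyncio.Queue) -> None:
        index = 0
        while True:
            new_index, data = await query(index)  # waits for a change or timeout
            if new_index != index:
                await queue.put((name, data, 1))  # third field: attempt number
            index = new_index

    async def _handle(self, queue: asyncio.Queue, max_attempts: int) -> None:
        while True:
            name, data, attempt = await queue.get()
            try:
                await self.routes[name][1](data)
            except Exception:
                if attempt < max_attempts:
                    await queue.put((name, data, attempt + 1))  # back in the queue

    async def run(self, max_attempts: int = 3) -> None:
        queue: asyncio.Queue = asyncio.Queue()
        jobs = [self._watch(name, query, queue)
                for name, (query, _) in self.routes.items()]
        jobs.append(self._handle(queue, max_attempts))
        await asyncio.gather(*jobs)


def fake_query(data: Any) -> Query:
    """Hypothetical stand-in for a Consul blocking call: one change, then silence."""
    async def query(index: int) -> Tuple[int, Any]:
        if index == 0:
            return 1, data
        await asyncio.sleep(3600)
        return index, data
    return query


app = BlockingFramework()
log = []


@app.on(fake_query(["consul-rehydrate"]))  # stands in for the event list query
async def spin_up(events: Any) -> None:
    log.append(("spin_up", events))


@app.on(fake_query(["node-1", "node-2"]))  # stands in for the catalog service query
async def spin_down(nodes: Any) -> None:
    log.append(("spin_down", nodes))
```

Calling `asyncio.run(app.run())` would start one watcher per decorated function plus the handler. Because the watchers only enqueue events, a slow or failing action never blocks the queries themselves.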
One weakness of this method is that if the machine running this system fails, the events are not guaranteed to be handled. To add resiliency, this system should therefore run on multiple machines. However, we then run the risk of each function being run multiple times, which can cause too many machines to be created or destroyed. To avoid this, and get as close to one-time execution as we can, we will use Consul’s built-in locking functionality.
By running the blocking queries on multiple machines, we can achieve at-least-once execution. However, we want to avoid running the action multiple times for the same event, which is where locks come into play.
Locks can be very difficult to implement in a distributed system when the state and timing between machines may vary. Luckily, Consul has an easy-to-use locking mechanism built into its KV store. Each handler will attempt to take out a lock before performing the function related to the event. If no lock is currently taken, then the first handler will take it and update it with its unique session ID. All subsequent handlers will receive a lock-taken response and will not perform the action.
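As a rough sketch of that flow, the following uses Consul’s HTTP API directly: a session is created with `PUT /v1/session/create`, and the lock is requested with `PUT /v1/kv/<key>?acquire=<session>`, which answers `true` or `false`. The agent address, the `locks/<event id>` key layout, and the helper names are assumptions; the `put` function is injectable so the logic can be tried without a running agent.

```python
import json
import urllib.request
from typing import Any, Callable

CONSUL_ADDR = "http://127.0.0.1:8500"  # assumption: a local Consul agent
Put = Callable[[str, bytes], Any]


def http_put(path: str, body: bytes = b"") -> Any:
    """Send a PUT to the Consul HTTP API and decode the JSON reply."""
    request = urllib.request.Request(CONSUL_ADDR + path, data=body, method="PUT")
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read())


def create_session(put: Put = http_put, ttl: str = "30s") -> str:
    """Create a session that Consul invalidates if it is not renewed in time."""
    body = json.dumps({"TTL": ttl, "Behavior": "release"}).encode()
    return put("/v1/session/create", body)["ID"]


def try_lock(key: str, session_id: str, put: Put = http_put) -> bool:
    """Ask for the lock on `key`; only one session at a time gets `true`."""
    return put(f"/v1/kv/{key}?acquire={session_id}", session_id.encode()) is True


def run_once(event_id: str, action: Callable[[], None], put: Put = http_put) -> bool:
    """Run `action` only if this handler wins the lock for the event."""
    session_id = create_session(put)
    if not try_lock(f"locks/{event_id}", session_id, put):
        return False  # another handler already holds the lock
    action()
    return True
```

Each handler calls `run_once` with the same event ID; the first one to acquire the key performs the action and the rest return `False` without doing anything.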
However, if the first handler fails, there is no guarantee that the action has been executed. To ensure that it has, we can implement a receipt that is written on completion of an action. The other handlers then wait for the receipt before removing the event from the queue and moving on to the next event. One problem that can arise is that if the action fails and the receipt is never sent, the other handlers will keep waiting indefinitely. Therefore, we set a time to live (TTL) for the handler performing the action and for the handlers waiting for the receipt. If the handler fails to complete the action within its TTL, it is destroyed and the lock is free to be acquired by another handler.
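The receipt and TTL behavior can be illustrated with a small simulation. `InMemoryKV` is a hypothetical stand-in for Consul’s KV store and sessions, not its API, and a failing handler simply returns instead of being destroyed; the point is only to show how a waiting handler takes over when no receipt arrives.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Set


class InMemoryKV:
    """Hypothetical stand-in for Consul's KV store and sessions (not its API)."""

    def __init__(self) -> None:
        self.locks: Dict[str, str] = {}  # lock key -> handler holding it
        self.receipts: Set[str] = set()  # events whose action completed

    def acquire(self, key: str, holder: str) -> bool:
        return self.locks.setdefault(key, holder) == holder

    def release(self, key: str, holder: str) -> None:
        if self.locks.get(key) == holder:
            del self.locks[key]


async def handle_event(kv: InMemoryKV, holder: str, event_id: str,
                       action: Callable[[], Awaitable[None]],
                       ttl: float = 1.0, poll: float = 0.01,
                       attempts: int = 3) -> str:
    lock_key = f"locks/{event_id}"
    loop = asyncio.get_running_loop()
    for _ in range(attempts):
        if event_id in kv.receipts:
            return "already-done"
        if kv.acquire(lock_key, holder):
            try:
                # The action must finish within the TTL.
                await asyncio.wait_for(action(), timeout=ttl)
            except Exception:
                return "failed"  # this handler gives up, as if destroyed
            else:
                kv.receipts.add(event_id)  # the receipt other handlers wait for
                return "executed"
            finally:
                kv.release(lock_key, holder)  # the lock is freed either way
        deadline = loop.time() + ttl
        while loop.time() < deadline:
            if event_id in kv.receipts:
                return "already-done"
            if lock_key not in kv.locks:
                break  # holder went away without a receipt: try to take over
            await asyncio.sleep(poll)
    return "gave-up"
```

If three handlers receive the same event and the first one’s action raises, one of the remaining two acquires the freed lock, runs the action, and writes the receipt, while the other sees the receipt and moves on.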
The action is still not guaranteed to be executed exactly once, because it would fail if all three handlers were destroyed. If that happens, however, you probably have a bigger issue with the logic inside the action itself or with the cluster. You can also increase resiliency by adding as many additional handlers as you would like.
Lambda Firing Consul Event
The final part of this walkthrough focuses on triggering the creation of new Consul servers whenever an updated AMI is registered. This is accomplished by creating a new CloudWatch event based on the RegisterImage event. This event triggers a Lambda, which fires an event named consul-rehydrate to Consul.
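As a hedged illustration, a CloudWatch Events rule for this could use an event pattern along the following lines, shown here as a Python dictionary. It assumes the RegisterImage call reaches CloudWatch Events as an API call recorded by CloudTrail; the `is_register_image` helper is hypothetical and only shows how the Lambda might double-check what it received.

```python
import json

# Assumed event pattern for the rule (requires CloudTrail to record the call).
REGISTER_IMAGE_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventSource": ["ec2.amazonaws.com"],
        "eventName": ["RegisterImage"],
    },
}


def is_register_image(event: dict) -> bool:
    """Defensive check inside the Lambda: is this really a RegisterImage call?"""
    detail = event.get("detail") or {}
    return (
        event.get("source") == "aws.ec2"
        and detail.get("eventName") == "RegisterImage"
    )


print(json.dumps(REGISTER_IMAGE_PATTERN, indent=2))
```

The printed JSON is the form in which the pattern would be supplied when creating the rule.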
The tricky part here is how to fire an event to Consul in the ECS cluster without knowing the instance IP address or having to expose the Consul port. We can work around this by creating an ECS task that spins up a Consul container in the cluster, executes the fire event command, and then removes the container. The only variable required is the ECS cluster name, which is supplied as an environment variable to the Lambda function.
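A minimal sketch of such a Lambda is shown below. It uses boto3’s ECS `run_task` call with a container command override that runs `consul event -name=consul-rehydrate`. The environment variable name `ECS_CLUSTER`, the task definition name `consul-fire-event`, and the container name `consul` are all assumptions for illustration, as is the expectation that the container can reach a Consul agent in the cluster.

```python
import os

EVENT_NAME = "consul-rehydrate"


def build_run_task_args(cluster: str,
                        task_definition: str = "consul-fire-event",  # hypothetical
                        container: str = "consul") -> dict:          # hypothetical
    """Arguments for ECS run_task that make the container fire the event."""
    return {
        "cluster": cluster,
        "taskDefinition": task_definition,
        "count": 1,
        "overrides": {
            "containerOverrides": [
                {
                    "name": container,
                    "command": ["consul", "event", f"-name={EVENT_NAME}"],
                }
            ]
        },
    }


def lambda_handler(event, context, ecs=None):
    """Entry point: start a one-off ECS task in the configured cluster."""
    if ecs is None:
        import boto3  # imported lazily so the logic can be tested without AWS
        ecs = boto3.client("ecs")
    cluster = os.environ["ECS_CLUSTER"]  # assumed variable name
    response = ecs.run_task(**build_run_task_args(cluster))
    return {
        "started": len(response.get("tasks", [])),
        "failures": response.get("failures", []),
    }
```

Returning the `failures` list makes it visible in the Lambda’s result when ECS could not place the task, for example because the cluster has no free capacity.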
So far we have designed a flexible Consul blocking framework that allows any number of Consul queries to listen and wait for defined events. We initially created two listeners, one designed to spin up new machines when an updated AMI becomes available, and another to remove the outdated machines. We then added resiliency to this solution by using Consul’s built-in locking mechanism to add distributed locking. Finally, we integrated with Amazon’s CloudWatch and Lambda services to trigger the patching process whenever a new AMI is registered.
All that is left to do is to create a Docker image (with the Consul blocking framework and the creation and deletion functions mentioned earlier) and upload that image to a Docker registry. Finally, we would add three containers using this image to the ECS cluster and create the Lambda mentioned above, pointing it at the cluster. Now we have automatic patching for all of our machines running Consul! With proper monitoring, your cluster now requires little to no operations work to maintain and will stay up to date with the latest security patches and software. You should be able to use these steps and methodologies to patch the underlying machines in any distributed system that requires state to be kept, while minimizing the operations involved.
DISCLOSURE STATEMENT: These opinions are those of the author. Unless noted otherwise in this post, Capital One is not affiliated with, nor is it endorsed by, any of the companies mentioned. All trademarks and other intellectual property used or displayed are the ownership of their respective owners. This article is © 2018