Blue-Green deployments for Sidekiq workers in docker

Bernat Rafales · Published in Code & Wild · 6 min read · Feb 5, 2019

It’s quite common nowadays to do blue-green deployments. The way they work is quite straightforward. Whenever you need to deploy a new version of your code to production, you go through the following steps:

  • A new version of your production infrastructure is created, with the new code in it, while the old version remains untouched.
  • Once the new infrastructure is up and running, some process checks if it’s healthy. For web applications this usually means making an HTTP request to a specific endpoint that guarantees that the new version of the app is ready to take in requests.
  • If and only if the health checks pass on the new infrastructure, then the traffic gets redirected from the old infrastructure to the new one.
  • When all traffic is being routed to the new infrastructure, the old one gets torn down.

Step 1 of a blue-green deployment. New code is started separately

If you are using AWS Fargate, you can get blue-green deployments quite easily by using AWS application load balancers and target groups. By configuring your own health check endpoint, AWS will deal with all the complexity of the above steps, including the traffic routing, which can be tricky to implement yourself. This is great to ensure you get no downtime when deploying your code, which is a must if you aim to deploy often (and we do).
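As an illustration, the health check endpoint itself can be very small: something that returns a 200 once the app is ready to take traffic, and a 503 otherwise. Here is a minimal Rack-style sketch — `app_ready?` is a hypothetical stand-in for whatever real checks you'd run (database connectivity, pending migrations, and so on):

```ruby
# Hypothetical readiness predicate — in a real app this might check the
# database connection or whether migrations have run.
def app_ready?
  true
end

# A Rack-compatible endpoint for the load balancer's health check:
# 200 when the app is ready to take traffic, 503 otherwise.
HEALTHCHECK = lambda do |_env|
  if app_ready?
    [200, { 'Content-Type' => 'text/plain' }, ['OK']]
  else
    [503, { 'Content-Type' => 'text/plain' }, ['Service Unavailable']]
  end
end
```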

Step 2 of a blue-green deployment. New code is ready and starts taking traffic

However, these days not even “old fashioned” monolithic web applications consist of just web servers. Most of them will also have some sort of background job processing to handle work that is better done asynchronously (like sending emails). At Bloom & Wild we use Sidekiq. Usually downtime for background workers is slightly more acceptable than for customer-facing APIs. However, if some of those jobs are time critical, we want to ensure there’s always a worker processing them, even in the event of a deploy.

Step 3 of a blue-green deployment. All traffic is sent to the new code and the old code is removed

Turns out background job workers don’t expose any HTTP endpoint to the outside world, because their job is not to serve HTTP traffic, but to pull work from some sort of queue and get on with it. For cases like this, what AWS Fargate does during a deploy is slightly different:

  • A new version of the worker with the new code gets started.
  • Once the container has started, Fargate will consider the task ready to do its job.
  • After the new task is started, Fargate will give old containers a 30-second grace period to finish whatever they’re doing and shut down gracefully.
  • After the grace period, if any old containers are still around, they will be terminated.

The catch in this process is that a docker container having started doesn’t mean it’s ready to do its job. In our case, for example, some of the workers run on instances with a very small CPU allowance, because they mostly deal with I/O work like talking to third party APIs. So when a docker container is up and running, we still need to wait for a script to boot the whole Rails application, ensure migrations are up to date (to guarantee none of the new code runs if it depends on a migration), and then start Sidekiq itself. This takes time, especially on a host with limited CPU, so after the 30-second grace period the old workers get killed while the new ones are not yet ready to accept jobs, resulting in high latency for critical jobs.

Timeline of a worker deploy in AWS Fargate. There’s a risk of an interval where jobs are not being processed

Docker native health checks to the rescue

Fortunately for us, it turns out Fargate has support for native docker health checks. This means we can emulate within our Sidekiq workers the same functionality that application load balancers provide for web containers. Docker health checks let you define an arbitrary shell command that will be run inside the container: if the command exits with code zero, the health check is considered a success; any other exit code is a failure. To use these health checks on Fargate, you specify them as part of your task definition: the command you want to run, when you want it to start running, and a timeout and a number of retries before giving up.

For example, we could have a setup that runs the health check every 10 seconds, with 5 retries and a grace period of 60 seconds. This would wait a minute after the container has started, then run the health check command every 10 seconds. If the result is not a success, it would retry up to 5 more times and then flag the container as unhealthy. On the other hand, if a successful health check comes back within those attempts, the task is considered healthy. The nice thing is that Fargate will only tear down old containers once the new ones are healthy, avoiding downtime.

How do we know when a Sidekiq worker is actually up and running?

The docker health checks don’t solve the problem on their own, though. We still need a reliable way of knowing whether a Sidekiq worker is actually up and running and processing jobs. Checking if the Ruby process is running is not enough, because as we’ve explained, it can take a long time for our Rails app to boot.

Luckily, it turns out Sidekiq has built-in callbacks that let you run arbitrary Ruby code whenever the engine starts, receives the shutdown signal, or shuts down. This is perfect for our purposes. What we can do is hook into the started engine callback, and in there create a file in the container file system to indicate that we have reached that step. Then our docker health check can be as simple as checking if that specific file is present in the container. The moment it’s there, we know Sidekiq has started, we can then mark our container as healthy, and the old ones can go away!

Deploy of a new worker with health checks. There’s always a worker processing jobs 🎉

Show me the code!

Here is the code snippet you can use to enable this health check in your Sidekiq workers (it would usually be part of your Sidekiq initializer):
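A minimal version of that initializer might look like this. It's a sketch: the `tmp/pids/sidekiq_started` path matches the one our health check command looks for, the method names are ours, and the `defined?` guard just lets the snippet load outside a Sidekiq process:

```ruby
# config/initializers/sidekiq.rb — a minimal sketch
require 'fileutils'

SIDEKIQ_STARTED_FILE = 'tmp/pids/sidekiq_started'.freeze

def mark_sidekiq_started
  # Create the empty marker file the docker health check looks for.
  FileUtils.mkdir_p(File.dirname(SIDEKIQ_STARTED_FILE))
  FileUtils.touch(SIDEKIQ_STARTED_FILE)
end

def mark_sidekiq_stopped
  # Remove the marker so a stopping container is no longer reported healthy.
  FileUtils.rm_f(SIDEKIQ_STARTED_FILE)
end

if defined?(Sidekiq)
  Sidekiq.configure_server do |config|
    config.on(:startup)  { mark_sidekiq_started }
    config.on(:shutdown) { mark_sidekiq_stopped }
  end
end
```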

On startup, we use the FileUtils module to create an empty file in the tmp/pids folder with the name sidekiq_started.

And here is an example task definition for AWS Fargate that will work together with the above Sidekiq configuration (only the relevant bits are shown here):
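The fragment below shows just the health check portion of a container definition; the container name and the timing values (the 10-second interval, 5 retries and 60-second grace period discussed earlier) are illustrative:

```json
{
  "containerDefinitions": [
    {
      "name": "sidekiq-worker",
      "essential": true,
      "healthCheck": {
        "command": ["CMD-SHELL", "stat /app/tmp/pids/sidekiq_started || exit 1"],
        "interval": 10,
        "timeout": 5,
        "retries": 5,
        "startPeriod": 60
      }
    }
  ]
}
```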

The command we use as a health check is stat /app/tmp/pids/sidekiq_started || exit 1. The stat command will return a successful exit code if the file you give it as an argument can be found in the file system. The check assumes your Rails app lives in the /app folder inside the container.
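You can see the exit-code behaviour locally with a quick sketch (using a throwaway path in /tmp):

```shell
# stat exits 0 when the file exists, non-zero otherwise
touch /tmp/sidekiq_started
stat /tmp/sidekiq_started > /dev/null && echo healthy      # prints "healthy"
rm /tmp/sidekiq_started
stat /tmp/sidekiq_started > /dev/null 2>&1 || echo unhealthy   # prints "unhealthy"
```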

And this is how we achieved no downtime deployments at Bloom & Wild not only for our web facing applications, but also for our background workers in AWS Fargate.

Note: if you’d like to read more about how we migrated our Rails application to AWS Fargate, you may be interested in reading Migrating our Rails application to AWS Fargate.
