Sidekiq and long running jobs

We’re slowly migrating to SOA but still have to take care of many legacy Rails apps. They all use sidekiq for their background jobs. In our enterprise application, jobs may take a long time to complete, often much longer than the SOA-friendly 10 seconds. Quite often we run more than one app on a single machine, and each app spawns its own sidekiq process. We have to restart sidekiq on every deploy and whenever it consumes more RAM than allowed.

In the past, we used monit to monitor sidekiq’s pidfiles and sidekiqctl to stop it, but that approach had some drawbacks. For example, monit sometimes attached to the wrong process: sidekiq had crashed and left a stale pidfile, and another process had since taken that PID. Another problem was that long-running jobs were killed with SIGKILL. That is how sidekiqctl works: it sends SIGTERM, then simply waits 9 seconds and kills the process with SIGKILL. We could set sidekiqctl’s deadline_timeout option to a greater value, but monit would then sit and wait for the command to exit without doing anything useful.
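
The stop sequence described above can be sketched in shell roughly as follows. This is not sidekiqctl’s actual code, only an illustration of the TERM, wait, KILL pattern; the function name stop_with_deadline and its arguments are made up for this example.

```shell
# Hypothetical helper: send SIGTERM, poll for up to $2 seconds, then SIGKILL.
# Returns 0 if the process exited in time, 1 if it had to be killed.
stop_with_deadline() {
  pid="$1"
  deadline="$2"                               # seconds to wait before escalating
  kill -TERM "$pid" 2>/dev/null || return 0   # process is already gone
  i=0
  while [ "$i" -lt "$deadline" ]; do
    kill -0 "$pid" 2>/dev/null || return 0    # exited within the deadline
    sleep 1
    i=$((i + 1))
  done
  kill -KILL "$pid" 2>/dev/null               # a job still running dies here
  return 1
}
```

With a short deadline, any job that needs longer is lost. With a long one, the caller blocks in the polling loop for the whole time, which is the trade-off we ran into with monit.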

Things changed when relatively new monit versions introduced a new option: process matching by regexp. We can use it to solve all of these issues. Here is how sidekiq processes appear in the ps output:

sidekiq 3.2.1 application1 [5 of 5 busy]
sidekiq 3.2.1 application2 [1 of 20 busy]

And here is a stopping sidekiq process that no longer accepts new jobs (after invoking kill -TERM):

sidekiq 3.2.1 application2 [2 of 20 busy] stopping

All we have to do is monitor only the running processes, not the stopping ones. When deploying, we just send SIGTERM to the sidekiq process. Its process title no longer matches, so monit stops monitoring it and starts a new one, while the old process finishes its jobs and then exits peacefully. We use the -g option to tell sidekiq the name of our application so we can distinguish one app from another.

Here is an example monit config that monitors two sidekiq instances belonging to different applications:

# The pattern matches only a running process title (nothing after "busy]"),
# so a process in the "stopping" state is no longer monitored.
check process sidekiq_application1 matching "^sidekiq.*\ application1\ .*busy\]\ *$"
  # -g sets the tag shown in the process title; -t 3600 is the shutdown timeout in seconds
  start program = "/bin/bash -c 'cd /home/rails/public_html/application1/current && env PATH=/home/rails/.rbenv/shims:/home/rails/.rbenv/bin:$PATH HOME=/home/rails bundle exec sidekiq -g application1 -C config/sidekiq.yml -e production -d -t 3600'"
    as uid rails and gid rails
  # On excessive memory use, send SIGTERM (the pkill default) to the running process
  if totalmem is greater than 4 GB then exec "/usr/bin/pkill -f '^sidekiq.*\ application1\ .*busy\]\ *$'"

check process sidekiq_application2 matching "^sidekiq.*\ application2\ .*busy\]\ *$"
  start program = "/bin/bash -c 'cd /home/rails/public_html/application2/current && env PATH=/home/rails/.rbenv/shims:/home/rails/.rbenv/bin:$PATH HOME=/home/rails bundle exec sidekiq -g application2 -C config/sidekiq.yml -e production -d -t 3600'"
    as uid rails and gid rails
  if totalmem is greater than 8 GB then exec "/usr/bin/pkill -f '^sidekiq.*\ application2\ .*busy\]\ *$'"
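
To see why this pattern picks up running processes but skips stopping ones, it can be tried against the sample ps lines with grep. This is only a quick sketch: the helper name matching_titles is made up, and the pattern is written with plain spaces instead of backslash-escaped ones, which should be equivalent for grep -E.

```shell
# Process titles copied from the ps output shown earlier.
titles='sidekiq 3.2.1 application1 [5 of 5 busy]
sidekiq 3.2.1 application2 [1 of 20 busy]
sidekiq 3.2.1 application2 [2 of 20 busy] stopping'

# Hypothetical helper: print the titles that the pattern matches for one app tag.
matching_titles() {
  app="$1"
  # Nothing but optional spaces may follow "busy]", so "... stopping" is excluded.
  printf '%s\n' "$titles" | grep -E "^sidekiq.* ${app} .*busy\\] *\$"
}

matching_titles application1
matching_titles application2
```

For application2 only the running process is printed. The line ending in "stopping" is filtered out, which is what lets monit treat the old process as gone and start a fresh one.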

And here is the command to invoke from our deploy recipe:

pkill -f '^sidekiq.*\ application1\ .*busy\]\ *$'; monit reload
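
If several applications are deployed the same way, the one-liner can be wrapped in a small function. This is a sketch, not our actual recipe: restart_sidekiq, run and the DRY_RUN variable are made-up names, and the pattern again uses plain spaces on the assumption that they behave the same as the escaped ones.

```shell
# Hypothetical wrapper: print the command when DRY_RUN is set, otherwise execute it.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Hypothetical deploy helper: SIGTERM the running sidekiq of one app, then reload monit.
restart_sidekiq() {
  app="$1"
  pattern="^sidekiq.* ${app} .*busy\\] *\$"
  # pkill exits non-zero when nothing matched (e.g. on a first deploy); carry on anyway.
  run pkill -f "$pattern" || true
  run monit reload
}
```

Calling restart_sidekiq application1 with DRY_RUN=1 set prints the two commands instead of running them, which is handy for checking the pattern before a real deploy.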

Please note the monit reload part. This setup worked fine without monit reload for about two years, then suddenly stopped working after an OS upgrade. It turned out that somewhere between v5.11 and v5.16 monit started to cache PIDs. As a result, it doesn’t notice any change unless the PID changes, even when the regexp no longer matches. That’s why we added the monit reload command.