Sidekiq Apocalypse

Muhammad Arsalan Fayyaz
Published in HeyJobs Tech
Apr 24, 2023

Introduction

The last six months have been a roller-coaster ride for HeyJobs with Sidekiq and its configuration. It started when we noticed an incident in which nearly 450K+ jobs were enqueued and even the fastest jobs took more than 15 minutes to process. Then, in late December, the Sidekiq team notified us that our legacy pricing plan was being replaced by their new plans. The new pricing depends on the number of threads consumed, so it became really important to reduce the total number of threads we were using without having a major impact on the overall performance of the system.

In this article, I will briefly share the impact these incidents had and how we came up with ad-hoc solutions that not only improved the performance of the system but also kept our payment plan well within reach.

Implications

Numerous erratic behaviours started to occur, accompanied by long, seemingly random delays. We noticed severe latency spikes for some of our core queues (for example, the processing of our job adverts), which climbed a bit above 3 hours. For less critical queues, such as email processing, the delay kept increasing and reached 8 to 10 hours.

Core system functionalities started to update with a delay of ~3–6 hours, and a few functionalities were delayed by around ~8–10 hours. However, no major timeouts were experienced by either our users or the platform.

There were several root causes of this behaviour, as multiple things happened at once and put significant pressure on our Redis clusters. This resulted in huge memory usage spikes (up to 100%), which slowed down job processing and led to a further accumulation of unprocessed jobs that allocated even more memory, causing a snowball effect.

The following are the main pain points we noticed that needed to be addressed:

  1. We used a single Redis cluster for both our internal cache and the temporary payload storage of Sidekiq jobs, which is a known bad practice.
  2. We did not fully utilise the monitoring for our Sidekiq/Redis clusters.
  3. We used only a single Sidekiq cluster for the majority of our workloads across different teams.
  4. Some older issues that we had not noticed before also surfaced, scheduling around 200K+ unnecessary updates that consumed most of the Redis memory.
  5. We had a large number of scheduled jobs for one of our core queues residing in a common worker.

Initial Resolutions

Looking at the snowball effect across most of our resources, we quickly decided on a pre-emptive resolution: apply quick fixes first before moving on to permanent ones.

A few days before the actual incident, we had started pausing some of the Sidekiq queues in an attempt to relieve the load and allow the cluster to process the accumulated jobs, which ultimately did not help. On the day of the incident, we took some additional steps:

  1. We confirmed the assumption about the Redis load.
  2. We paused non-critical Sidekiq queues.
  3. We created a separate worker cluster for our core queues, which had accumulated most of the scheduled jobs.
  4. We increased the size of the Redis cluster memory and moved multiple Sidekiq queues/workloads away from the main cluster to the few specific ones we already had.
  5. We cleaned up the unnecessary scheduled jobs that had consumed so much of our memory (see the sketch after this list).
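
For reference, both the queue pausing and the scheduled-job clean-up can be driven through the Sidekiq API. The snippet below is a minimal sketch, assuming Sidekiq Pro (which provides queue pausing); the queue and worker class names are placeholders rather than our real ones.

```ruby
require 'sidekiq/api'

# Pause a non-critical queue so workers temporarily stop pulling from it.
# Queue#pause! is a Sidekiq Pro feature; "mailers" is a placeholder name.
Sidekiq::Queue.new('mailers').pause!

# Delete the unnecessary scheduled jobs that were hogging Redis memory.
# "UnnecessaryUpdateWorker" is a hypothetical class name.
Sidekiq::ScheduledSet.new.each do |job|
  job.delete if job.klass == 'UnnecessaryUpdateWorker'
end
```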

Overall, these prompt measures helped us resolve the incident, and we were able to process all of the jobs. We also noticed that the overall performance of the Sidekiq clusters improved significantly; they started processing jobs faster than before the incident. But the resolution did not end there. We needed additional long-term changes to ensure the system remains stable and operates smoothly over an extended period. To achieve this, we initiated a new project with the goal of enhancing the system as a whole, providing detailed ‘health’ monitoring, and reducing the impact of worker overload so that unrelated Sidekiq queues do not experience significant delays.

Long-Term Fixes

As mentioned above, we were running a limited number of workers, all of which used a single Redis cluster to store the queue payloads. The same Redis cluster was also used for the Rails cache. In addition, we needed to improve the overall monitoring of the Sidekiq queues.

We took the following measures to further improve performance.

  1. We had to enhance our monitoring. We did have Datadog Sidekiq monitoring, but it was not enough, as it gave us no visibility into queue size or latency. We also utilised an in-house Sidekiq monitor, but since it was a Sidekiq worker itself, it often failed to send alerts, leading to delayed or missed notifications. Consequently, we recognised the need to improve our monitoring capabilities.
  2. We also needed to redistribute the queues between the Sidekiq workers so that each development team could own a set of workers serving their queues.
  3. We decided to keep one “shared” worker to support the common or general queues.
  4. We adopted the best practice of using separate Redis clusters: one for the Rails cache and another for Sidekiq.

Monitoring

Our Datadog dashboard for Sidekiq displays cumulative figures for various job statuses, including scheduled, successful, enqueued, dead, and busy jobs. To improve on this, we began incorporating queue latency metrics (time from enqueue to finish) using the dogstatsd-ruby gem and implemented a new alert system that triggers when specific thresholds are met. We initially focused on critical queues, which now enables us to stay ahead of all Sidekiq-related changes.
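
As a rough illustration of how such metrics can be emitted with dogstatsd-ruby, here is a minimal sketch; the metric names, tags, and queue list are example assumptions rather than our actual configuration.

```ruby
require 'sidekiq/api'
require 'datadog/statsd'

statsd = Datadog::Statsd.new('localhost', 8125, namespace: 'sidekiq')

# Report size and latency (seconds since the oldest job was enqueued) per queue.
%w[job_adverts mailers default].each do |name|
  queue = Sidekiq::Queue.new(name)
  statsd.gauge('queue.size',    queue.size,    tags: ["queue:#{name}"])
  statsd.gauge('queue.latency', queue.latency, tags: ["queue:#{name}"])
end

# Report how many jobs are currently sitting in the scheduled set.
statsd.gauge('scheduled.size', Sidekiq::ScheduledSet.new.size)

# dogstatsd-ruby v5 buffers metrics, so flush before the process exits.
statsd.flush(sync: true)
```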

Number of scheduled jobs over a period of time, in this case for the last 5 minutes.

The infra team split the original setup, in which a single Redis cluster handled both Rails and Sidekiq, into two separate ElastiCache clusters: one for Sidekiq and one for the cache. We adjusted their configurations individually to better handle their respective workloads, and both clusters were properly optimised on the infra side.
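
On the Rails side, the split boils down to pointing Sidekiq and the cache store at different Redis endpoints. The following is a minimal sketch, assuming the two ElastiCache endpoints are exposed via environment variables (the variable names here are placeholders).

```ruby
# config/initializers/sidekiq.rb
# Point Sidekiq at its own dedicated Redis cluster.
Sidekiq.configure_server do |config|
  config.redis = { url: ENV.fetch('SIDEKIQ_REDIS_URL') }
end

Sidekiq.configure_client do |config|
  config.redis = { url: ENV.fetch('SIDEKIQ_REDIS_URL') }
end

# config/environments/production.rb
# The Rails cache lives on a separate Redis cluster of its own.
Rails.application.configure do
  config.cache_store = :redis_cache_store, { url: ENV.fetch('CACHE_REDIS_URL') }
end
```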

On the backend side, we set up separate workers with different configurations and assigned the relevant queues to each worker on a per-team basis. We also established proper queue priorities, giving the most resources and threads to the queues that need more parallelism, and increased the concurrency of a few of the major workers to bring down processing times, which eventually paid off.

Legacy Plan → New Plan: Reducing Threads

Finally, we received unexpected news from the Sidekiq team that our legacy pricing plan would be transitioned to a newer plan. Given that we sometimes had over 700 threads in use, this change would have resulted in a fivefold cost increase compared to what we were paying. To address this challenge, we took the following measures:

  1. We set up a dedicated Datadog dashboard for thread monitoring and tried to identify the core queues that consume more threads than the rest.
  2. We also measured the overall number of threads utilised over time to get more clarity on how many threads were actually being used.
  3. Based on these findings, we discovered that, on average, we were using only ~45–50% of our total thread capacity.

Graph shows the percentage usage of Sidekiq threads from the pool of reserved threads.

Graph shows the number of threads that we use from the reserved threads pool.

Graph shows the pool of reserved threads available.

This gave us a good picture of how many threads we could cut, and we also added new alerts that fire if thread usage exceeds 80–90% of the quota. With proper monitoring in place and the setup we had already built, we were able to reduce a significant chunk of the threads, which in turn saved a lot of money: in the end we were paying only around ~2.5x the price of the legacy plan, nearly half of what we could have been spending on the new plan.
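
To give a sense of how the utilisation numbers behind these graphs can be derived, here is a minimal sketch based on the Sidekiq API; the 80% threshold and the output format are just example values, and reporting the figures to Datadog would follow the same dogstatsd pattern shown earlier.

```ruby
require 'sidekiq/api'

processes = Sidekiq::ProcessSet.new

# Each Sidekiq process reports its configured concurrency (reserved threads)
# and how many of those threads are currently busy.
reserved = processes.sum { |process| process['concurrency'] }
busy     = processes.sum { |process| process['busy'] }

utilisation = reserved.zero? ? 0.0 : (busy.to_f / reserved * 100).round(1)
puts "Using #{busy}/#{reserved} threads (#{utilisation}%)"

# Example threshold check that could back a Datadog alert.
warn 'Thread usage above 80% of the reserved pool' if utilisation > 80
```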

Conclusion

Despite these incidents and plan changes occurring in quick succession, our team promptly devised a strategy to resolve the issues, facilitated by clear communication across different teams. Enhancing our monitoring capabilities has proven to be beneficial in the long term, and we now encourage everyone to incorporate all relevant metrics that could help us prevent similar incidents in the future.

Interested in joining our team? Browse our open positions or check out what we do at HeyJobs.
