Little Dark Age
Here is a tale of how we faced mysterious errors on our platform because of the ambiguous behaviour of Delayed Extensions in Sidekiq, and how we debugged and resolved the issue.
Prologue
Not so long ago, we took on a quest to improve our platform’s performance to provide a better experience to our customers. And so began the journey to handle our CPU and Memory spike issues.
Note: I am using Sidekiq 6.4 in my application.
Let’s take a moment to review the problem at hand. Consider a Sidekiq job in your application, as demonstrated in the following snippet:
class Job
  include Sidekiq::Worker

  def perform(file, email)
    request_time = Time.zone.now
    rows = file.read
    rows.each do |row|
      errors = validate(row)
      return if errors.present?
      Model.create!(row.merge(request_time: request_time))
    end
    SuccessNotification.delay(queue: "critical").send_update_to_users(request_time, email)
  end
end

class SuccessNotification
  def self.send_update_to_users(request_time, email)
    data = Model.where("request_time = ?", request_time)
    return if data.blank?
    send_email(email, data)
  end
end
The above job is fairly simple: it takes a file, validates each row, creates a record per row, and finally sends a single notification covering all the rows.
The problem is that when the row count goes north of 10,000 to 15,000, we noticed a sudden bloat in memory and a CPU spike.
We replaced the single job that processed all records with a fan-out workflow built on Sidekiq::Batch (a Sidekiq Pro feature).
The modified job looked something like this:
class Job
  include Sidekiq::Worker

  def perform(file, email)
    request_time = Time.zone.now
    rows = file.read
    batch = Sidekiq::Batch.new
    batch.on(:success, "Job#send_notifications", request_time, email)
    batch.jobs do
      rows.each_slice(50) do |row_chunk|
        Job.delay(queue: "default").create_data(row_chunk, request_time)
      end
    end
  end

  class << self
    def create_data(rows, request_time)
      rows.each do |row|
        errors = validate(row)
        return if errors.present?
        Model.create!(row.merge(request_time: request_time))
      end
    end

    def send_notifications(request_time, email)
      SuccessNotification.delay(queue: "critical").send_update_to_users(request_time, email)
    end
  end
end
class SuccessNotification
  def self.send_update_to_users(request_time, email)
    data = Model.where("request_time = ?", request_time)
    return if data.blank?
    send_email(email, data)
  end
end
To recap: earlier, we processed all rows in a single job and then triggered a notification job at the end. After the performance issues, we moved to an optimized workflow that creates a batch of many smaller jobs processing rows concurrently, joining them at the end with a callback that sends the email notification.
The onset of darkness for users
Everything was working great: users were uploading Excel sheets with lots of rows, yet we did not see any CPU/Memory alerts. However, something was wrong. You know how they say,
“A single crack in the wall can topple an entire fortress.”
Then we started noticing a sudden drop in notifications. The files and rows were processed properly on our application; however, the users were in complete darkness (lack of visibility) because they weren’t receiving any notifications for the rows being processed.
Hunting Down the Agent of Darkness
Do not worry my friends for the Bounty Hunter of Bugs came to our relief and he shall crush the living soul out of those pesky little creatures.
Why were notifications not being sent?
As you can observe, the parameters given to the SuccessNotification job contain a Time object, used to query the database for the records created during the run so that we could send emails accordingly. However, the query returned no objects despite them being present in the DB, and hence no emails were sent.
We have a “request_time” attribute for all records, irrespective of when they were created. Request Time refers to the “start of the process.”
Why was the query returning no records?
Upon carefully observing the logs of these jobs, the bounty hunter figured it out: the request_time being passed had changed. Not completely, the date was still correct, but the value had lost its data type and sub-second precision.
Argument Before Refactoring:
---
- !ruby/class 'SuccessNotification'
- :send_update_to_users
- - !ruby/object:ActiveSupport::TimeWithZone
    utc: 2024-03-15 06:53:13.831263941 Z
    zone: !ruby/object:ActiveSupport::TimeZone
      name: Asia/Kolkata
    time: 2024-06-21 12:23:13.831263941 Z
  - abc@example.com
Argument After Refactoring:
---
- !ruby/class 'SuccessNotification'
- :send_update_to_users
- - 2024-06-21 11:55:18 +0530
  - abc@example.com
Why was the time object modified to a String?
As it turns out, our greatest ally, Sidekiq Pro, serializes arguments differently for DelayedClass jobs and Sidekiq::Batch callbacks.
Since this was a notification job, in the previous implementation, where a single job processed all the rows, we would schedule the SuccessNotification job at the end. In the newer implementation, that logic was relocated to the on-success callback.
Sidekiq always stores arguments as strings and parses them at execution time, as is evident from the argument logs above.
While delaying a job, Sidekiq stringifies arguments using YAML.dump and parses them back with YAML.load.
However, for batch callbacks and all other workers, it uses Sidekiq.dump_json and Sidekiq.load_json.
JSON has no native Time type, so through that path the precision was lost and the TimeWithZone object was now simply a String.
Sidekiq internals (screenshots omitted): TimeWithZone is preserved through the DelayedClass (YAML) path, but becomes a String through the Batch callback (JSON) path.
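The difference can be reproduced in plain Ruby, without Sidekiq or Rails. In this sketch, a Time rounded to nanoseconds stands in for the ActiveSupport::TimeWithZone value from our logs, and we round-trip it through both serializers:

```ruby
require "yaml"
require "json"

# Nanosecond-precision Time stands in for the TimeWithZone value from the logs.
t = Time.now.round(9)

# DelayedClass path: the YAML round-trip rebuilds the original Ruby object.
from_yaml = YAML.unsafe_load(YAML.dump(t))
puts from_yaml.class  # => Time
puts from_yaml == t   # => true, sub-second precision intact

# Batch-callback path: the JSON round-trip keeps only Time#to_s, a plain String.
from_json = JSON.parse(JSON.generate([t])).first
puts from_json.class  # => String
puts from_json        # e.g. "2024-06-21 11:55:18 +0530", precision gone
```

One caveat: Sidekiq 6 itself calls YAML.load, but on Psych 4 (Ruby 3.1+) YAML.load is safe by default and rejects Time objects, so YAML.unsafe_load is used here to show the equivalent behaviour.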
Darkness was short-lived, for the Hunter now had his Prey!
Now, we can’t change how Sidekiq works. However, we realized that our existing query relied on brittle logic.
Our Previous query was as such:
Model.where("request_time = ?", request_time)
For context: notifications are sent per upload cycle, and the request_time is the same for every row in a single upload cycle, so request_time = <some time>
gave us the results.
But we realized that all rows in a single upload cycle could just as well share the same request_id. So, to fix our issue, we modified the job as such:
class Job
  include Sidekiq::Worker

  def perform(file, email)
    request_time = Time.zone.now
    request_id = SecureRandom.hex # Added a request id
    rows = file.read
    batch = Sidekiq::Batch.new
    batch.on(:success, "Job#send_notifications", request_id, email)
    batch.jobs do
      ...
    end
  end

  class << self
    ...

    def send_notifications(request_id, email)
      SuccessNotification.delay(queue: "critical").send_update_to_users(request_id, email)
    end
  end
end

class SuccessNotification
  def self.send_update_to_users(request_id, email)
    data = Model.where(request_id: request_id) # Modified query
    return if data.blank?
    send_email(email, data)
  end
end
Since we eliminated the Time object from the job arguments, our query now returns the desired records, and notifications are successfully sent to users again.
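A quick sanity check in plain Ruby shows why a random hex id is immune to the serializer difference: a String survives both the YAML and the JSON round-trip unchanged:

```ruby
require "securerandom"
require "yaml"
require "json"

request_id = SecureRandom.hex  # e.g. "f0e1d2..."

# Both serialization paths return the identical String.
puts YAML.unsafe_load(YAML.dump(request_id)) == request_id        # => true
puts JSON.parse(JSON.generate([request_id])).first == request_id  # => true
```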
Epilogue: It’s Not Over Yet!
Although the issue was fixed, because we simply removed the need to pass a Time object, we still did not know WHY Sidekiq has two different mechanisms for serializing arguments. So I raised a discussion on Sidekiq’s forum to get the answer.
The creator should answer our dilemmas…
https://github.com/sidekiq/sidekiq/discussions/6340
From this discussion and Sidekiq’s wiki, we realised that Sidekiq never intended to keep using YAML. As we saw at the beginning, we are using Sidekiq version 6, which still supports Delayed Extensions; however, Sidekiq has since discontinued them for multiple reasons, one of which was, and I quote:
“they use YAML to serialize arguments so the job payload can become very large easily if passing complex Ruby objects.”
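The size concern is easy to see in plain Ruby, too. In this hypothetical sketch, Row is a stand-in for a rich model object: YAML serializes the whole object graph, whereas passing only a primitive id (as Sidekiq’s best practices recommend) keeps the payload tiny:

```ruby
require "yaml"
require "json"

# Hypothetical rich object standing in for an ActiveRecord model.
Row = Struct.new(:id, :name, :payload)
row = Row.new(1, "example", "x" * 500)

yaml_size = YAML.dump([row]).bytesize        # whole object graph, > 500 bytes
json_size = JSON.generate([row.id]).bytesize # just "[1]", 3 bytes
puts yaml_size > json_size                   # => true
```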
Reference Wiki: Delayed extensions
References:
- Delayed extensions
- Batches
- Use of YAML in Delayed Extensions: https://github.com/sidekiq/sidekiq/blob/6-x/lib/sidekiq/extensions/action_mailer.rb
- Workflows: https://github.com/sidekiq/sidekiq/wiki/Batches#complex-workflows