Deleting millions of associated records in Rails

Tim Baker
Published in Specle Engineering
3 min read · Apr 7, 2017

tl;dr
If you’re trying to delete a record with millions of dependent associated records and can’t rely on database-level foreign keys (with ON DELETE CASCADE), you also can’t rely on dependent: :destroy or even dependent: :delete_all.

Consider destroying the troublesome child records in a loop using raw SQL before deleting the parent record.
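
In outline, the approach looks something like this (the table and column names here are illustrative, not our real schema):

# Keep deleting fixed-size chunks of children until none are left,
# then remove the parent. exec_delete returns the number of rows affected.
loop do
  deleted = ActiveRecord::Base.connection.exec_delete(
    "DELETE FROM child_records WHERE parent_id = #{parent.id} LIMIT 5000"
  )
  break if deleted.zero?
end
parent.delete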

********

Specle has a large MySQL database, and a keep-everything policy means we still store every ad uploaded since 2010, whether or not it was ever sent, along with its various dependencies. This adds up to tens of millions of records that are no longer needed and have not been accessed in years. The strain on the database was becoming obvious, so we decided to do something about it.

Specle Send (our ad send service) allows users to upload adverts, have them checked against our specifications database and send them to publishers. This product is the source of the majority of the records in the database, so we decided to delete all ad uploads that were more than 6 months old and had never been sent. About 65,000 records.

There was no reason for us to believe any of these records were anything special: the primary model has five dependent: :destroy has_many associations, and each parent should have no more than 20–30 children in any one association.
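
The post doesn’t include the model itself, but for context it was shaped roughly like this (the scope body and the second association name are my assumptions; audit_items and unsent_and_over_six_months_old appear later in the post):

class Insertion < ActiveRecord::Base
  # One of the five dependent: :destroy associations; the others are omitted.
  has_many :audit_items, as: :auditable, dependent: :destroy
  has_many :proofs, dependent: :destroy # hypothetical placeholder

  # Assumed shape of the scope used by the rake task below; the real
  # definition isn't shown, so the column names are guesses.
  scope :unsent_and_over_six_months_old, -> {
    where(sent_at: nil).where("created_at < ?", 6.months.ago)
  }
end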

Our first solution was a rake task making use of a new RecordRemover class that takes collections as arguments.

RecordRemover calls destroy on each record as we wanted to respect the children’s dependent: :destroy associations.

class RecordRemover
  def initialize(*collection)
    @collection = collection.flatten.compact
  end

  def remove!
    destroy_collection
  end

  private

  attr_reader :collection

  def destroy_collection
    collection.each do |record|
      print "\033[94m.\033[0m"
      record.destroy
    end
  end
end

Rake task:

namespace :destroy_insertions do
  desc "This task destroys unsent insertions created 6+ months ago"
  task unsent_and_more_than_6_months_old: :environment do
    insertions = Insertion.unsent_and_over_six_months_old
    RecordRemover.new(insertions).remove!
  end
end

I started the rake task and went to lunch.

On my return it had stopped, unfinished. Something had happened to knock over my reasonably specced MacBook Pro. After some digging it emerged that a Specle Send from 2012 had somehow acquired 40,000 children on one of its associations. Because the association was dependent: :destroy, we were instantiating every one of those children and calling destroy on each, which very quickly runs out of memory and crashes.

We thought this was a one-off, so we simply front-loaded the rake task to clear out the audit items in batches before deleting the insertion itself.

insertions.each do |insertion|
  insertion.audit_items.find_in_batches(batch_size: 2000) do |batch|
    RecordRemover.new(batch).remove!
  end
end

This again ran fine for a while, until an hour later I realised the rake task had been deleting the associated records of one particular Specle Send for quite some time.

This Specle Send had 1.6 million dependent: :destroy records, and further investigation identified 10 more parents with very large numbers of children. The affected Specle Sends had been stuck in a loop that created these monstrous associations, but as they were all historical it was not clear how this had occurred.
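
For anyone doing similar archaeology, a grouped count is a quick way to surface parents with runaway child counts. Something along these lines (the AuditItem model name and the threshold are assumptions on my part):

# Hypothetical query to spot Insertions with suspiciously many audit items.
# Returns a hash of auditable_id => count for anything over the threshold.
AuditItem
  .where(auditable_type: "Insertion")
  .group(:auditable_id)
  .having("COUNT(*) > ?", 10_000)
  .count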

Evidently the current solution was not appropriate for deleting the tens of millions of records we were now dealing with.

Our solution needed to be fast and must not lock the database table in question. Ultimately we devised a loop that builds SQL and executes it directly against the database.

namespace :destroy_insertions do
  desc "This task destroys unsent insertions created 6+ months ago"
  task unsent_and_more_than_6_months_old: :environment do
    insertions = Insertion.unsent_and_over_six_months_old
    insertions.each do |insertion|
      puts "deleting insertion ##{insertion.id}'s audit items"
      # Integer division; the 0..batches_count range gives one extra
      # iteration to catch the remainder.
      batches_count = insertion.audit_items.count / 4000
      (0..batches_count).each do
        puts "deleting batch of up to 4000"
        query = "DELETE FROM audit_items WHERE auditable_type='Insertion' AND auditable_id=#{insertion.id} LIMIT 4000"
        ActiveRecord::Base.connection.execute(query)
        sleep 0.1 # brief pause between deletes so other queries aren't starved
      end
    end
    RecordRemover.new(insertions).remove!
  end
end

It may not be as pretty or concise as writing pure Ruby or relying on the associations to handle the deletion, but it’s orders of magnitude faster.

Deleting 75k associated records using the RecordRemover in batches vs raw SQL:

require 'benchmark'

puts Benchmark.measure {
  insertion.audit_items.find_in_batches(batch_size: 2000) do |batch|
    RecordRemover.new(batch).remove!
  end
}
User CPU: 93.680000
System CPU: 10.500000
User+System CPU: 104.180000
Total elapsed real time: 219.673786
**********

puts Benchmark.measure {
  batches_count = insertion.audit_items.count / 4000
  (0..batches_count).each do
    puts "deleting batch of up to 4000"
    query = "DELETE FROM audit_items WHERE auditable_type='Insertion' AND auditable_id=#{insertion.id} LIMIT 4000"
    ActiveRecord::Base.connection.execute(query)
    sleep 0.1
  end
}
User CPU: 0.070000
System CPU: 0.000000
User+System CPU: 0.070000
Total elapsed real time: 4.358903
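
As a footnote, on Rails 5+ you can get much of the same effect without hand-rolled SQL by letting ActiveRecord batch the deletes itself, which also avoids interpolating ids into a query string. A rough equivalent of the inner loop (a sketch, not what we benchmarked):

# in_batches yields relations of up to `of` records; delete_all issues a
# single DELETE per batch without instantiating any ActiveRecord objects.
insertion.audit_items.in_batches(of: 4000) do |batch|
  batch.delete_all
  sleep 0.1 # brief pause between batches, as in the raw SQL version
end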

Tim Baker
Specle Engineering

A newbie Ruby developer @specle and Streatham enthusiast. Oh, and I cycled across China https://www.youtube.com/watch?v=iB88OXk0ipo.