Quickly Inserting Thousands of Records in Rails

I love ActiveRecord. It does so many things so well. Hiding horribly verbose SQL queries behind concise, readable ruby; magically knowing that the foreign key for has_many :media is medium_id; replacing arcane garbage like “PG::UniqueViolation: ERROR: duplicate key value violates unique constraint ‘index_users_on_email’” with the perfectly end-user-readable “Email address has already been taken”; the list goes on. Of course, the inevitable price for this abstraction is some performance overhead. Usually it’s not a bottleneck, but when you need to do something just slightly more complicated than usual, it can be. One instance of this we’ve run into at Handshake is the need to insert many rows into a join table.

The Problem

This comes up in a few places for us. Definitely the most extreme in terms of sheer number of records we need to create, is related to our user notification preferences. There are three tables involved in those — Users, NotificationReasons (e.g. “someone sent me a message”), andNotificationPreferences (which is essentially a join table between those two). This is straightforward enough under normal circumstances — create or destroy a NotificationPreference to turn notifications on or off for a given reason. When it gets to be a problem is when we create a newNotificationReason. Or specifically, when we create a new NotificationReason and want a certain set of users to have that notification on by default. We’ve now put ourselves in a position where we need to create one NotificationPreference per user, which is a lot if our “certain set of users” is, say, all students and alumni (about 1.4 million as of writing this).

Another example is our mass email system. We support users in the career services office sending mass emails to groups of students at their school, up to and including the entire student body. And if a user wants to email 10,000 students, we create a MassEmail record which is joined to those 10,000 Usersthrough a MassEmailUsers join model. As of now we create this in a web request, not in a worker, so it needs to happen as fast as possible.

Approaches via ActiveRecord

There are a few ways we could create these, but say we did something simple like:

users.each { |u| u.mass_email_users.create(mass_email_id: 42) }

I ran this in my test environment with 1000 users, and it took about 2.8 milliseconds per record. That doesn’t sound too bad, but multiply that by 10,000 and we have a user who’s waiting 28 seconds for their mass email to get built. Not only is it an awful user experience to wait that long for a page to load, we’re also blocking that entire process from serving other requests during that time (which would be a significant problem in our high-throughput app). Looking at our other use case with notification preferences, doing this 1.4 million times means it could take over an hour to turn these notifications on, and other migrations we may need to run will be blocked on that process. We definitely need to speed this up.

One of the reasons this takes so long is that ActiveRecord is running validations on each one of these records individually. We can get around that, if we’re certain that the records we’re creating are valid (and there’s no user-entered data here, so that should be something we can be certain of), by doing this instead:

users.each { |u| meu = u.mass_email_users.new(mass_email_id: 42); meu.save(validate: false) }

Timing on that one came out to about 1.75 milliseconds per record. Definitely an improvement, but it’s still a 17 second web request or a 40 minute migration. The real problem here isn’t validations; it’s that we’re creating all these records individually, and ActiveRecord is instantiating each of them as an object in memory. To get around that, we’ll have to roll up our sleeves and write some raw SQL.

SQL Fun

Thankfully, this sort of bulk operation is pretty much what SQL is built for. We can create all these records in a single SQL operation:

INSERT INTO mass_email_users (mass_email_id, user_id) VALUES (42,1),(42,2),...

And we execute that via ActiveRecord like so:

values = users.map { |u| "(#{mass_email.id},#{u.id})" }.join(",")
ActiveRecord::Base.connection.execute("INSERT INTO mass_email_users (mass_email_id, user_id) VALUES #{values}")

That’s great, but what if we don’t know for sure that all these things we’re creating are valid? Unfortunately, we’re out of luck if we want to support any arbitrary ActiveRecord-defined validations on all these, without going back to our original problem. However, the only validations we’re likely to have on a model like this are existence and uniqueness, which are pretty easy to check for ourselves before we do the actual insert. Existence just means filtering any nulls/blanks out of our input, and uniqueness can be guaranteed by removing duplicates in the input, and checking that we aren’t inserting something that already exists. This can be done with one additional SQL query, something like this:

def fast_join_insert(join_model, static_column, static_id, variable_column, variable_ids)
ids.reject!(&:blank?)
ids.uniq!
existing_ids = join_model.where("#{static_column} = #{static_id}").pluck(variable_column)
ids -= existing_ids
values = variable_ids.map{ |id| "(#{static_id},#{id})" }.join(",")
sql = "INSERT INTO #{join_model.table_name} (#{static_column},#{variable_column}) VALUES #{values}"
ActiveRecord::Base.connection.execute(sql)
end
fast_join_insert(MassEmailUser, "mass_email_id", mass_email.id, "user_id", users.pluck(:id))

I should also note that while this shouldn’t create invalid data, it’s also backed up by unique indices and non-null constraints on the database level. This prevents us from getting into a really bad state if multiple processes call this concurrently or something. We can also rely on the foreign key constraints there to be sure that these values we’re inserting point to actual users. So if you’re doing something like this, be sure you have those things in your DB, in addition to the analogous ActiveRecord validations (and really that’s just best practice anyway).

Results

Our actual implementation of this has quite a few more parameters, handles sanitizing things before interpolating them into the SQL strings, and ends up complex enough to deserve its own class rather than just a method. But this is still the basic pattern it follows. Using our actual FastJoinInsert (which we’re working on open sourcing; stay tuned) to create the 1000 records as before, the timing works out to 49 microseconds per row, so we could build the 10,000 student mass email in half a second and do the whole 1.4 million row insert in a bit over a minute. Definitely worth the effort!

One clap, two clap, three clap, forty?

By clapping more or less, you can signal to us which stories really stand out.