Upcoming Rails 6 bulk insert/upsert feature
ActiveRecord has had the update_all and delete_all methods for some time already. Starting with the upcoming Rails 6, the new methods insert_all and upsert_all are added. If you have ever used activerecord-import, similar functionality has now been merged into ActiveRecord itself.
A quick irb example for a table having a slug column with a unique index:
One row was created and one row was updated, all within a single SQL query executed by vanilla ActiveRecord. You can pass a collection of 2 elements or a collection of 1000 elements; it doesn’t matter, there will still be only one SQL query used to import them all. Say hi to bulk upsert in the Rails codebase. TADA!
There’s still one caveat making this not useful for all cases: you can’t specify which attributes to use during the update (for example, to skip the created_at attribute), but this is already being discussed and I believe it will be addressed before the Rails 6 release.
That’s all! Now you can go party hard if you’ve been waiting for this for a long time (as I have), or feel free to continue reading this article, which explains the new bulk insert/upsert feature in more depth.
Setup for local test
OK, so you’re still here. Let’s start with a simple inline ActiveRecord model to test the new methods.
You can clone my gist with the example code, bundle the dependencies, and optionally start PostgreSQL via Docker. I’m using PostgreSQL because not all features are available for other database adapters and versions. If you already have PostgreSQL running, you will probably need to change the connection details in setup.rb. If any command is unsupported for your database adapter or version, you should get a nice exception explaining the missing functionality.
Once everything is prepared, we can enter irb (via ruby -r./setup -S bundle console) to start testing the new feature.
A posts table having a unique slug column and the Rails model timestamps (created_at and updated_at) is created and ready for testing!
Let’s start with the classic way of creating records, so we can see some differences later.
Everything works as expected. We can create a unique post, and if we try to create it again it will either not be created or an exception will be raised (depending on whether the bang create! method is used).
Let me introduce you to insert and insert!
As you can see, a slightly different SQL INSERT query is generated, containing an ON CONFLICT clause. Still, our insert fails due to the missing created_at value. The newly introduced methods are analogous to update_all and delete_all: they work directly at the database level and do not trigger model callbacks or validations. The created_at and updated_at timestamps are normally populated in callbacks, so in this case we need to populate those columns manually.
As you can see (in result.rows), the insert finished successfully, but nothing was inserted, since a row with the same slug had already been created before. That’s the purpose of the ON CONFLICT DO NOTHING part of the SQL query. To raise an exception in this case instead, we can use the bang version of the insert method, insert!.
ON CONFLICT DO NOTHING was not added to the SQL query this time, and the expected exception was raised by the database.
Create or update? No, just “upsert”.
To create or update records based on a unique index, there is a new upsert method introduced for exactly this purpose. The first upsert call will create a new row; the second one will update it.
Sadly, it is not possible to specify which attributes to update when the row is already present. For now, all attributes are updated, including created_at: when an update is performed for an existing record, its created_at value is touched as well. That’s probably not what you expect. Fortunately, there is already a discussion open on GitHub about adding a parameter to specify which columns to update. Once that is implemented and merged, this problem should be gone.
Inserting or updating one row is cool, but there is actually no huge performance benefit there. What about inserting a bunch of records?
Let’s try to insert 1000 records the old “sequential” way. I’m using the Benchmark#measure method to track how long it takes.
1000 records were created locally in 16 seconds, in 1000 transactions executing 2000 SQL queries (one SELECT for the ActiveRecord uniqueness check and one INSERT to create the actual record).
With insert_all we can prepare the dataset in memory and create it with one query.
Boom, we have reduced the time needed to import from 16s to 0.4s. We have exchanged ActiveRecord callbacks and validations for performance. Once you’re aware of this trade-off, it’s a pretty good deal! You can run validations manually on your dataset, and you can rewrite your callbacks so they can be run on a collection right after the data is inserted.
If you need to ensure all rows were inserted, you can use the “bangified” version, insert_all!.
Since the same dataset was already imported in the previous example, it fails on the first entry, as expected.
And now the king of all the methods mentioned in this article: upsert_all. Do you have a dataset with a unique key that you would like to import into the database, letting the database do the hard work of merging all the data together efficiently? upsert_all is here to save your day!
The only caveat for now is that created_at will also be updated for already existing rows. But this should be addressed before the Rails 6 release, as I wrote in the upsert section.
Rails (ActiveRecord) is finally able to do efficient bulk imports/upserts (at least with the PostgreSQL adapter) on its own. It took some time (there are Rails issues more than 4 years old asking for this feature), but it is really easy and simple to use (which is not surprising in the Rails world).
If you’re curious how this was implemented, take a look at the original pull request. It is actually not a complex code change. It is also a good time to test this feature, report problems, and comment on the related issues before Rails 6 is released.
I can’t personally use this feature in my Rails projects yet, since we need more advanced control over bulk upserts, but I’ll do my best to contribute to making this feature better for everyone!