Optimizing bulk inserts

Leon Fayer
homo technologicus’ asylum
2 min read · May 19, 2015

In preparation for my talk for php[world], I was genuinely concerned that the topic was too basic for the vast majority of developers (I even purposely named the talk PHP performance 101). During the talk, I was really surprised to see that most people in the audience found the performance tips I prepared helpful, if not critical, in solving their current problems. I even had one person admit that they were still struggling with the n+1 problem. I guess admitting the problem is the first step to recovery. Because of this (somewhat unexpected) level of interest, I try to write more detailed posts about the problems I see in codebases I come across, giving people real examples of performance impact. This one is about minimizing connection overhead and the number of round trips to the database.

A couple of days ago, I was talking to a colleague over a few beers (the best way to talk tech, for those who don’t know). He was building a quick prototype to benchmark inserting a little over 12 million records into PostgreSQL. The major complaint was that, on his laptop, the process took more than two and a half hours to complete. Now, we can all agree that laptop benchmarks (or the general state of software readiness) are not a metric to be concerned with, but 2.5 hours for a relatively small data set seemed excessive to me. Looking at (read: listening to the description of) the code showed a pretty straightforward node.js/Pg implementation of a mass insert in a loop, a classic baseline for a throwaway script. With a couple of common performance improvement suggestions, we were able to reduce the run duration from 2.5 hours to under 15 minutes.

  • minimize number of connects
    Do not try to connect/disconnect with every transaction. It is not free. Many people don’t think about it, but a connection to the database (especially over a network) costs resources. It is usually a minimal overhead, but overhead nonetheless. If you connect and disconnect with every query, your overhead will grow in arithmetic progression (for a visualization, see slide 14). A connection-reuse sketch follows this list.
  • minimize number of transactions
    Continuing the theme of minimizing round trips, reduce the number of transactions sent to the database. Note: transactions, not queries. Wrapping multiple queries in a transaction allows you to submit many (read: thousands of) queries in one go. This can be accomplished either by turning the AutoCommit flag off and managing batch commits manually, or by using a BEGIN block to build your own batch transaction (see the batching sketch after this list).
  • maximize parallelism
    This applies more to the script than to the database itself. If transactions and batching are used properly, you can fork multiple processes inserting thousands (if not millions) of records at the same time in parallel. This, of course, is limited by CPU (don’t fork bomb the machine!), but even with a small number of parallel processes you significantly reduce the time to completion (see the last sketch after this list).
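
To make the first point concrete, here is a minimal sketch of connection reuse with node-postgres (pg), the library the original script used. The events table, its columns, and the shape of the records array are invented for illustration; the point is that the connection is opened once, reused for every insert, and closed only at the end of the run.

```javascript
// A sketch of connection reuse with node-postgres ("pg"); the events table,
// its columns, and the shape of `records` are made up for illustration.
const { Client } = require('pg');

async function insertAll(records) {
  const client = new Client({ connectionString: process.env.DATABASE_URL });
  await client.connect();                 // one connection for the entire run
  try {
    for (const r of records) {
      // every query reuses the already-open connection --
      // no per-row connect/disconnect overhead
      await client.query(
        'INSERT INTO events (id, payload) VALUES ($1, $2)',
        [r.id, r.payload]
      );
    }
  } finally {
    await client.end();                   // a single disconnect at the end
  }
}
```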
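
And a sketch of the second point: batching with an explicit BEGIN/COMMIT block. The 10,000-row batch size is an arbitrary choice, and the helper reuses the hypothetical client and table from the sketch above. Each transaction now carries thousands of inserts, so the per-transaction round trip is paid once per batch instead of once per row.

```javascript
// A sketch of manual batching with BEGIN/COMMIT; the batch size is arbitrary.
const BATCH_SIZE = 10000;

async function insertInBatches(client, records) {
  for (let i = 0; i < records.length; i += BATCH_SIZE) {
    const batch = records.slice(i, i + BATCH_SIZE);
    await client.query('BEGIN');
    try {
      for (const r of batch) {
        await client.query(
          'INSERT INTO events (id, payload) VALUES ($1, $2)',
          [r.id, r.payload]
        );
      }
      await client.query('COMMIT');       // one commit per batch, not per row
    } catch (err) {
      await client.query('ROLLBACK');     // discard the failed batch and bail out
      throw err;
    }
  }
}
```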
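
Finally, a sketch of the third point. Instead of forking separate OS processes, this version simply runs a handful of concurrent inserters over a pg connection pool, which gets a similar effect for an I/O-bound script. The worker count and pool size are arbitrary, and insertInBatches is the batching helper from the previous sketch.

```javascript
// A sketch of parallel inserts over a pg connection pool; the worker count is
// arbitrary -- keep it small enough not to overwhelm the CPU or the database.
const { Pool } = require('pg');

async function insertInParallel(records, workers = 4) {
  const pool = new Pool({ connectionString: process.env.DATABASE_URL, max: workers });
  const chunkSize = Math.ceil(records.length / workers);

  const jobs = [];
  for (let w = 0; w < workers; w++) {
    const chunk = records.slice(w * chunkSize, (w + 1) * chunkSize);
    jobs.push((async () => {
      // each worker checks out its own connection and runs its own batches
      const client = await pool.connect();
      try {
        await insertInBatches(client, chunk);
      } finally {
        client.release();
      }
    })());
  }

  await Promise.all(jobs);                // wait for every worker to finish
  await pool.end();
}
```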
There were additional performance improvements that could have been made, such as using pg.native instead of pg with node.js for another 20%-30% gain, but for a throwaway script 15 minutes was good enough. After all, premature optimization is the root of all evil.

