Optimization of SQL inserts into RDBMS

Fernando Soler David
Strands Tech Corner
4 min read · Sep 9, 2019

Here at Strands, we use a Spring Boot application that can read CSV files from a directory, aggregate and enrich records, and persist them into an RDBMS.

Image 1: Batch Insert Diagram

CSV files might have millions of rows, and each customer can generate as many files as needed. Consequently, the ability to process, aggregate and persist all the data as quickly as possible is highly sought after.

In this post, we are going to focus on how to optimize the persistence of the data into the RDBMS. On every insertion cycle, 1,000 records are read, enriched and persisted; so how can we speed up this process?

JPA

In the Spring ecosystem, the first thing to do would be to group all the records in a single collection. These records will be represented as Entities, which will then be mapped into an SQL table. Next, the Entity Manager can be used to persist the data into the database.

Image 2: EntityManager persisting in a collection.
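In code, a minimal sketch of that approach could look like this (the entity and service names are illustrative, not the actual ones used in our application):

import java.util.List;
import javax.persistence.EntityManager;
import javax.persistence.PersistenceContext;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class RecordPersister {

    @PersistenceContext
    private EntityManager entityManager;

    // EntityManager has no persist(Collection) overload,
    // so each entity is persisted individually inside the transaction.
    @Transactional
    public void persistAll(List<RecordEntity> records) {
        for (RecordEntity record : records) {
            entityManager.persist(record);
        }
        // The SQL statements are flushed when the transaction commits.
    }
}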

As you can see, the EntityManager doesn’t provide any method that accepts a collection as an argument, so each record has to be persisted individually. We want all the records to be inserted in one batch, but is that what actually happens at commit time?

Let’s find out by enabling Hibernate Statistics.
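In a Spring Boot application, the statistics can be enabled with the following property (assuming Hibernate is the JPA provider):

spring.jpa.properties.hibernate.generate_statistics=true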

Image 3: Hibernate stats with batch mode disabled

Unfortunately, if we don’t do anything else, JPA/Hibernate will execute a separate SQL statement for each of the records, which has an impact on performance. Luckily, the following property tells Hibernate to group those statements into batch inserts:

spring.jpa.properties.hibernate.jdbc.batch_size=25

Specifically, this property dictates that every batch insert should include up to 25 records. Let’s run the test again and check the Hibernate stats.

Image 4: Hibernate stats with batch mode enabled

Since we are processing 1,500 records, and every batch comprises 25 records, a total of 60 batch inserts will be executed.

While this is certainly an improvement (the process is now 18 times faster), the records still have not been inserted in a single SQL batch, which has an impact on the execution time.

Let's try again, this time with a batch size of 1,500 records, and see if we can improve performance.

Image 5: Hibernate Stats with a batch size of 1,500

Now that we have set up a batch size that matches the number of entities to be persisted, only one batch insert is going to be executed. As a result, the process is now 35 times faster than before, when 60 batch inserts had to be executed.

In conclusion, the closer the batch size is to the total number of entities, the more performance will improve.

However, the batch size of the entity manager cannot be modified dynamically: it is fixed in the configuration. This is a serious limitation in a scenario where the size of the collection of entities is different every time.

If this is our case, and only in those instances where performance is particularly important, we can drop down to the JDBC driver.


JDBC

When working with prepared statements, we can build batch inserts of any size dynamically. The only drawback is that we lose the wizardry of the ORM, where the SQL queries are built magically and entities are mapped to each of the columns.

If you are prepared to build a batch insert this way, below you can see an example of how to do it:

Image 6: JDBC Batch insert code
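For readers without access to the image, a minimal sketch of such a batch insert might look like this (the table, column and POJO names are illustrative):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;
import javax.sql.DataSource;

public class JdbcBatchInserter {

    private final DataSource dataSource;

    public JdbcBatchInserter(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    // CsvRecord is a hypothetical POJO with id, name and amount fields.
    public void batchInsert(List<CsvRecord> records) throws SQLException {
        String sql = "INSERT INTO records (id, name, amount) VALUES (?, ?, ?)";
        try (Connection connection = dataSource.getConnection();
             PreparedStatement statement = connection.prepareStatement(sql)) {
            connection.setAutoCommit(false);
            // The batch size is simply the size of the collection,
            // so it can be different on every execution.
            for (CsvRecord record : records) {
                statement.setLong(1, record.getId());
                statement.setString(2, record.getName());
                statement.setBigDecimal(3, record.getAmount());
                statement.addBatch();
            }
            // All the queued statements are sent to the database in one batch.
            statement.executeBatch();
            connection.commit();
        }
    }
}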

Working with JDBC has yet another benefit that JPA doesn’t provide: if an error takes place during the batch execution, a BatchUpdateException is thrown. Take a look at the documentation of this class:

Image 7: Javadoc from BatchUpdateException class

This is very useful when there is a constraint violation: one of the records causes the execution of the whole batch to fail. With JPA, the only solution is to save each of the entities individually until we find the one causing the SQLException. With the JDBC API, however, thanks to the BatchUpdateException we can figure out which record caused the failure, remove it from the collection, and execute the batch insert again.
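As a hedged sketch, that recovery logic could wrap the executeBatch() call from the example above (note that, as the Javadoc warns, whether the driver keeps processing the remaining commands after a failure is driver-dependent):

import java.sql.BatchUpdateException;
import java.sql.Statement;

// Inside batchInsert(), replacing the plain executeBatch() call:
try {
    statement.executeBatch();
    connection.commit();
} catch (BatchUpdateException e) {
    // One entry per queued statement; Statement.EXECUTE_FAILED marks
    // the commands that failed (if the driver continued after the error).
    int[] updateCounts = e.getUpdateCounts();
    for (int i = 0; i < updateCounts.length; i++) {
        if (updateCounts[i] == Statement.EXECUTE_FAILED) {
            // records.get(i) is the offending record: remove it and retry.
            System.err.println("Record at index " + i + " failed: " + e.getMessage());
        }
    }
}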

In this post, we have presented different approaches to reduce the number of round trips to the database while persisting data in batch mode.

In the end, given that performance was a critical requirement of this application, we decided to use the JDBC API, which gave us more control over the JDBC batch inserts.
