How to optimize database configuration for batching

Ibrahim HAMMANI
Decathlon Digital
Jan 12, 2023
Hibernate batching optimizations

Introduction

Have you ever needed to execute massive CRUD operations on your PostgreSQL database, only to realize that those operations take far too long and the database suffers?

Over the last few days, we worked on database optimizations for batch inserts that allow us to insert 49,043 records (39.5 MB of text data) in 2.5 seconds, while the same operation took almost 50 seconds before the optimizations; so we are at least 15 times faster.

Terminology

  • CRUD: Create/Read/Update/Delete; the four basic database operations
  • RDBMS: Relational Database Management System (PostgreSQL, MySQL, SQL Server…)
  • Bulk inserts: the process of inserting a huge number of rows into a database table at once (in one or many transactions).
  • Batching: allows us to send a group of SQL statements to the database in a single round trip; it aims to optimize network and memory usage, so instead of sending each statement by itself we send a group of statements.
  • JPA cascading: when a CRUD operation is performed on an entity, it is also performed on all of its related entities.

Important notes

  • In this article we will focus on inserts, since that is the main operation needed in our case.
  • We will use Spring Boot for the demo, so the configuration is written in YAML files.
  • Spring profiles are used in the demo project so we can compare the default configuration with the optimized one.
  • A working Spring project can be found in this GitHub repository.

Disclaimer

These optimizations are suitable for bulk operations, whether you run them as one big bulk or in batches; but this is not the best configuration for web applications or REST APIs. In those cases, you might want to stick to the default standard configuration.

SYSTEM DESIGN

Data model

The data model is quite simple: it has 3 tables, container, pallet and parcel.

Each container can have 0 or many pallets, and each pallet can have 0 or many parcels; this gives the diagram below.

Data model

Spring-boot application

In the GitHub project linked to this article, we use two Spring profiles: the default profile and a bulk profile.

  • default: contains the default Hibernate/PostgreSQL configuration, without any change.
  • bulk: contains all the improvements we made in order to optimize our configuration for bulk operations.

We have only one service, with its related repository, to CRUD our containers; whenever an operation is done on a given container, it is propagated to all the related pallets and parcels.

OPTIMIZATIONS

First step, use cascading

The first step to batching CRUD operations is to add cascading to the Java entities. This is not a real optimization by itself, but it is important so that all the secondary entities related to the main entity are created/updated/deleted in the same method call, which means in the same transaction. For example, if we call save on the main entity, all the secondary entities defined as fields in that entity will be saved as well.

To do so, we use the cascade annotation:

...
@Entity
@Table(name = "container")
public class Container implements Serializable {

    ...

    @OneToMany(mappedBy = "container", fetch = FetchType.EAGER)
    @Cascade(CascadeType.ALL)
    private List<Pallet> pallets = new ArrayList<>();

    ...

}

With this in place, we are asking Hibernate to save all the pallets whenever it saves a container, and the same goes for the other CRUD operations.
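As an illustration, here is a minimal sketch of what a cascaded save looks like from the calling side (the repository and the accessor names are assumptions based on the demo project, not an exact excerpt from it):

// Build a container with one pallet; only the container is saved explicitly.
Container container = new Container();
Pallet pallet = new Pallet();
pallet.setContainer(container);          // owning side of the relation
container.getPallets().add(pallet);

// Thanks to @Cascade(CascadeType.ALL), saving the container also
// persists the pallet (and, transitively, its parcels).
containerRepository.save(container);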

Second step, use sequences to generate IDs

What we did in the first step is very important; however, Hibernate cannot batch operations for entities whose IDs are generated with the GenerationType.IDENTITY strategy, nor for entities with SERIAL/BIGSERIAL IDs (in the case of a PostgreSQL database).

This is because Hibernate has to wait for the RDBMS to finish inserting the previous row and return its generated ID before it can build the insert statement for the next row.

To allow Hibernate to batch, we need to use the right ID generation strategy: sequences. This gives Hibernate the ability to generate IDs by itself.


...
@Entity
@Table(name = "container")
public class Container implements Serializable {

    @Id
    @GeneratedValue(
        strategy = GenerationType.SEQUENCE,
        generator = "container_sequence"
    )
    @SequenceGenerator(
        name = "container_sequence",
        sequenceName = "container_sequence",
        allocationSize = 300
    )
    @Column(name = "id")
    private Long id;

    ...

    @OneToMany(mappedBy = "container", fetch = FetchType.EAGER)
    @Cascade(CascadeType.ALL)
    private List<Pallet> pallets = new ArrayList<>();

    ...

}

NOTICE:

All the entities related to the main entity should have their ID generation strategy set to SEQUENCE as well; if not, Hibernate saves entities without using batches. In our case, after doing it for containers, we need to do it for pallets and parcels too.

Please don’t forget to create the sequences in your database before associating them with your entities.

Please bear in mind that you shouldn’t associate sequences with IDs at the database level if you want this configuration to work.

To be more explicit, don’t create tables as below:


/*
don't do this,
because PostgreSQL will create and associate a sequence to the ID
*/
CREATE TABLE container
(
    id serial NOT NULL,
    ...
);

/*
don't do this either,
because the RDBMS will use container_sequence to generate the IDs
*/
CREATE TABLE container
(
    id bigint NOT NULL DEFAULT nextval('container_sequence'),
    ...
);
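For contrast, here is a minimal sketch of DDL that works with the mapping above (assuming PostgreSQL; note the sequence increment matching the allocationSize of 300 used in the entity):

-- Hibernate fetches values from this sequence itself; the increment
-- should match the allocationSize (300) of the entity mapping.
CREATE SEQUENCE container_sequence INCREMENT BY 300;

-- Plain bigint ID: no serial, no DEFAULT, so the database never
-- generates IDs on its own.
CREATE TABLE container
(
    id bigint NOT NULL,
    ...
);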

Asking PostgreSQL to rewrite batched inserts

Hibernate sends multiple insert statements to the RDBMS at once, all within the same transaction, which is great. However, if we are using PostgreSQL we can go a little further in our optimizations by asking the PostgreSQL JDBC driver to rewrite those inserts into a single multi-value insert statement; this way, if we have 100 insert statements, they get rewritten into a single multi-value statement.

See example below:

-- before rewriting batched inserts
insert into container(...) values (...);
insert into container(...) values (...);
...
insert into container(...) values (...);


-- after rewriting batched inserts
insert into container(...) values (...), (...), ..., (...);

According to the PostgreSQL JDBC driver documentation, this can make inserts 2 to 3 times faster.

Enabling this configuration is very easy: we just need to add “?reWriteBatchedInserts=true” to the database connection URL. For example, this is my localhost connection:

jdbc:postgresql://localhost:5432/mastership?reWriteBatchedInserts=true

Ordering before batching

For Hibernate to batch more efficiently (i.e., to do more batching), especially in a concurrent environment, we will enable order_updates and order_inserts.

Ordering statements groups them by entity type, i.e., by target table, which lets Hibernate turn consecutive statements against the same table into batches instead of interleaving inserts for different tables.

spring:
  jpa:
    properties:
      hibernate:
        order_updates: true
        order_inserts: true
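To illustrate (schematically, not actual output): a cascaded save walks the object graph and may interleave inserts for different tables, which breaks batches apart; with ordering enabled, statements are grouped per table, so each group can be batched:

-- without ordering: inserts follow the object graph
insert into pallet (...) values (...);
insert into parcel (...) values (...);
insert into parcel (...) values (...);
insert into pallet (...) values (...);

-- with order_inserts enabled: statements grouped by table, batching-friendly
insert into pallet (...) values (...);
insert into pallet (...) values (...);
insert into parcel (...) values (...);
insert into parcel (...) values (...);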

Increasing the batch size

Hibernate buffers statements in a batch before sending them to the database; by increasing the batch size, we allow Hibernate to increase the number of statements it sends to the database in a single round trip.

Running more statements per batch results in fewer round trips and less time spent; it also optimizes the usage of the network.

To increase the batch size we use the property below:

spring:
  jpa:
    properties:
      hibernate:
        jdbc:
          batch_size: 4096

Using this configuration, Hibernate will be able to send up to 4096 statements to the database in a single batch; note that the batch size is only used for inserts, updates and deletes.
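Under the hood, Hibernate relies on the standard JDBC batch API. Here is a simplified sketch of what a batch amounts to at the driver level (the table and column names are illustrative only, not taken from the demo project):

// One PreparedStatement, many addBatch() calls, one executeBatch() round trip.
try (PreparedStatement ps = connection.prepareStatement(
        "insert into container (id, name) values (?, ?)")) {
    for (Container container : containers) {
        ps.setLong(1, container.getId());
        ps.setString(2, container.getName());
        ps.addBatch();       // buffered on the client side
    }
    ps.executeBatch();       // the whole group is sent to the database at once
}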

Increasing the fetch size

To do the same kind of thing for selects, we can use the fetch size parameter. The JDBC fetch size controls how many rows the driver retrieves from the database per round trip while reading a result set; increasing it decreases the number of round trips to and from the database.

spring:
  jpa:
    properties:
      hibernate:
        jdbc:
          fetch_size: 4096
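At the JDBC level, this property maps to Statement.setFetchSize. A simplified sketch (note that, to the best of our knowledge, the PostgreSQL driver only honors the fetch size when autocommit is off, so that it can use a server-side cursor):

// Ask the driver to pull rows from the server 4096 at a time
// instead of materializing the whole result set in one round trip.
connection.setAutoCommit(false);
try (PreparedStatement ps = connection.prepareStatement("select * from parcel")) {
    ps.setFetchSize(4096);
    try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            // process one row at a time
        }
    }
}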

Other optimizations/configurations

This is not an exhaustive list of all possible optimizations; you may want to check your RDBMS’s own optimization capabilities, and you may need to add indexes and follow the usual best practices.

In our case, we also tried to optimize the flushing time. Flushing is the synchronization of the state of your session with the state of your database; we tuned it with the configuration below:

spring:
  jpa:
    properties:
      org:
        hibernate:
          flushMode: COMMIT

This configuration wasn’t of great benefit in our case, but you can give it a try to see if it fits your needs; that’s why we leave it here.

Bonus optimization for REST APIs

In the GitHub project related to this post you will find a gzip profile that lets you use gzip over HTTP; this can help you optimize performance by reducing the network time of your HTTP requests.

Generating statistics

While working on these optimizations, you may need to check some statistics about the work Hibernate did; in that case, you can simply use the setting below:

spring:
  jpa:
    properties:
      hibernate:
        generate_statistics: true

The full configuration

spring:
  jpa:
    properties:
      hibernate:
        order_updates: true
        order_inserts: true
        jdbc:
          batch_size: 4096
          fetch_size: 4096
      org:
        hibernate:
          flushMode: COMMIT
  datasource:
    url: "jdbc:postgresql://localhost:5432/mastership?reWriteBatchedInserts=true"

Comparing statistics before and after optimization

Now let’s look at the statistics generated by Hibernate before and after adding the previous configurations. Note that we use a single save method call on a Spring Data repository to save a container with all its sub-entities, and that the save is triggered by an HTTP POST request. The request body contains 39.5 MB of JSON data (you can find the sample in the file src\main\resources\json\sample-container.json), and it results in inserting 49,043 entities.

Both tests are done with the cascading type set to ALL.

Note that we don’t compare networking time, and that Hibernate displays metrics about the current session only.

Before

Let’s first look at the Hibernate statistics using the default configuration.

Session Metrics {
1272500 nanoseconds spent acquiring 1 JDBC connections;
0 nanoseconds spent releasing 0 JDBC connections;
92831400 nanoseconds spent preparing 49207 JDBC statements;
18557329900 nanoseconds spent executing 49207 JDBC statements;
0 nanoseconds spent executing 0 JDBC batches;
0 nanoseconds spent performing 0 L2C puts;
0 nanoseconds spent performing 0 L2C hits;
0 nanoseconds spent performing 0 L2C misses;
21229826900 nanoseconds spent executing 1 flushes (flushing a total of 49043 entities and 223 collections);
0 nanoseconds spent executing 0 partial-flushes (flushing a total of 0 entities and 0 collections)
}

The most important numbers here are: the number of statements (49,207), the number of batches (0, because this configuration is not optimized for batching), and the flushing time. The flushing time is the time spent synchronizing the state of the entities in memory with their state in the database. We intentionally skip the JDBC connection acquisition time, since it was not a big deal.

After

And here are the Hibernate statistics for the optimized configuration:


Session Metrics {
872300 nanoseconds spent acquiring 1 JDBC connections;
0 nanoseconds spent releasing 0 JDBC connections;
6031200 nanoseconds spent preparing 168 JDBC statements;
103321900 nanoseconds spent executing 165 JDBC statements;
734107200 nanoseconds spent executing 14 JDBC batches;
0 nanoseconds spent performing 0 L2C puts;
0 nanoseconds spent performing 0 L2C hits;
0 nanoseconds spent performing 0 L2C misses;
1737581300 nanoseconds spent executing 1 flushes (flushing a total of 49043 entities and 223 collections);
0 nanoseconds spent executing 0 partial-flushes (flushing a total of 0 entities and 0 collections)
}

Side by side comparison

As you can see, after the optimizations we have only 165 executed JDBC statements grouped in 14 batches, which means only 14 round trips to the database over the network.
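Pulling the key numbers out of the two Session Metrics dumps above (nanoseconds converted to seconds):

Metric                        | Before     | After
------------------------------|------------|---------
JDBC statements executed      | 49,207     | 165
JDBC batches                  | 0          | 14
Time executing statements     | ~18.56 s   | ~0.10 s
Time executing batches        | 0 s        | ~0.73 s
Flush time (49,043 entities)  | ~21.23 s   | ~1.74 s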

In terms of performance, we are more than 15 times faster.

Conclusion

Batching operations is of great benefit in terms of performance when we do a lot of CRUD operations in one place (methods called consecutively to perform CRUD operations). Without batching, statements are sent to the RDBMS one by one, which adds a lot of latency; with batching, statements are sent to the RDBMS in groups, which dramatically decreases the total latency.

GitHub repository

Sources

Authors

  • Steve BURGHGRAEVE
  • Ibrahim HAMMANI

Special thanks

I want to thank everyone who helped with and reviewed this article; writing it wouldn’t have been possible without their help and reviews.
