Spring Data JPA — batching using Streams

Patrik Hörlin
Published in Predictly on Tech
Sep 29, 2021 · 5 min read

Spring Data JPA is great. It saves us time, letting us focus on what's important, while not sacrificing much in terms of usability and performance. I have now been using Spring Data JPA in a batch and ETL setting for more than a year and learned a lot in the process. Time to share some findings!

The first and most important finding when working with streams is that you can't just swap the return type in your repository interface from List to Stream! Doing so creates a number of unwanted side effects and often won't produce the result you expect.

In this article, I will use the MySQL Employees database via the Docker container genschsa/mysql-employees. The JPA model of the Employee entity is shown below.
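A minimal sketch of what that entity can look like, using the column names from the Employees schema (the exact mapping details, and the Salary and Title child entities, are assumptions):

import java.time.LocalDate;
import java.util.Set;
import javax.persistence.*;

@Entity
@Table(name = "employees")
public class Employee {

    @Id
    @Column(name = "emp_no")
    private Integer empNo;

    @Column(name = "first_name")
    private String firstName;

    @Column(name = "last_name")
    private String lastName;

    @Column(name = "hire_date")
    private LocalDate hireDate;

    // Sets rather than Lists -- fetch joining multiple Lists would trigger
    // a MultipleBagFetchException, see the note further down
    @OneToMany(mappedBy = "employee")
    private Set<Salary> salaries;

    @OneToMany(mappedBy = "employee")
    private Set<Title> titles;

    // getters and setters omitted
}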

Container performance issue

A quick note on the Docker container. Once started, the statistics of the database I used for this article were incorrect, leading to very poor performance in joins. If you want to replicate or test the findings of this article, start by checking the statistics with the following SQL statement:

select * from mysql.innodb_table_stats where database_name = 'employees';

Check that the value of n_rows is correct for each table, especially for salaries. If it says 0, run the following statement and then check again:

analyze table salaries;

Working with unbounded Result Sets

Spring Data JPA repositories in their default setup expect to return instances of either Optional or List, meaning that the entire result set is lifted into memory and mapped to your model objects. This works well for many use cases, but when the result set becomes very large (> 100,000 records) or its size isn't known in advance, memory becomes a problem. Enter Stream.

Creating and Closing a Stream

Creating a Stream is extremely easy thanks to Spring Data JPA, which supports this out of the box.

Returning a Stream of Employees; see below for crucial details
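A minimal sketch of such a repository, assuming a derived query on hireDate (the method and parameter names are illustrative):

import java.time.LocalDate;
import java.util.stream.Stream;
import org.springframework.data.jpa.repository.JpaRepository;

public interface EmployeeRepository extends JpaRepository<Employee, Integer> {

    // Spring Data JPA can return a Stream instead of a List out of the box
    Stream<Employee> findByHireDateBetween(LocalDate hireDateFrom, LocalDate hireDateTo);
}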

This can then be used in our code as follows

Reading and writing employees to file
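A sketch of what such a method can look like, assuming it lives in a service with the repository injected; line numbers are included so the notes below line up, and EmployeeMapper and writeLine are illustrative names:

1:  @Transactional(readOnly = true)
2:  public void exportEmployees(LocalDate hireDateFrom, LocalDate hireDateTo) {
3:      try (Stream<Employee> employees = employeeRepository.findByHireDateBetween(hireDateFrom, hireDateTo)) {
4:          employees.map(EmployeeMapper::map).forEach(this::writeLine);
5:      }
6:  }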

Line #1 — Any stream operation must be wrapped in a transaction
Spring will throw an exception otherwise

Line #3 — A stream is created by Spring Data JPA but must be closed by you
A stream keeps a cursor to the result set open, since it can't know when the result set has been consumed. If it isn't closed, you will run out of cursors in the database. Use try-with-resources or call close on the stream.

Line #4 — Forward operations only
A stream can only be consumed once

If we run the code it works and we get a Stream to consume, but it will take longer and longer to complete as we increase the gap between hireDateFrom and hireDateTo. Using SQL logging we can see that for each employee, a second and a third query is issued to collect all salaries and all titles for that employee. In our case, this is triggered via the call to EmployeeMapper::map. This is the classic N+1 problem and normal JPA and Hibernate behavior: child collections are loaded lazily when they are first accessed.

The problem we often face is that when we turn to Streams, we usually also need all or many of the entity's child collections populated.

Joining child collections

Child collections can be populated as part of the initial query, but special care must be taken to avoid pitfalls that come with switching to streams.

Eagerly loading child collections to a Stream
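A sketch of such a query, with line numbers included so the notes below line up. The usual Spring Data imports plus static imports of HINT_FETCH_SIZE and HINT_PASS_DISTINCT_THROUGH from org.hibernate.jpa.QueryHints are assumed, and the method name is illustrative:

 1:  public interface EmployeeRepository extends JpaRepository<Employee, Integer> {
 2:
 3:      @Query("""
 4:          select distinct e
 5:          from Employee e join fetch e.salaries
 6:          join fetch e.titles
 7:          where e.hireDate between :hireDateFrom and :hireDateTo
 8:          order by e.empNo
 9:          """)
10:      @QueryHints({@QueryHint(name = HINT_FETCH_SIZE, value = "" + Integer.MIN_VALUE),
11:                   // Integer.MIN_VALUE is MySQL specific, see the notes below
12:                   // false = use DISTINCT for deduplication only, don't send it to the database
13:                   @QueryHint(name = HINT_PASS_DISTINCT_THROUGH, value = "false")})
14:      Stream<Employee> streamByHireDateBetween(@Param("hireDateFrom") LocalDate hireDateFrom,
15:                                               @Param("hireDateTo") LocalDate hireDateTo);
16:  }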

Let’s start by examining the query.

Line #4 — Remove duplicates in stream via DISTINCT

By default, JPA and Hibernate return one instance for each row in the result set. Since we are now joining multiple tables, the id of an employee will appear many times. Even though each instance of an employee is identical, it will appear in the stream as many times as the join produces rows for that employee.

Line #5–6 — FETCH JOIN to eagerly load child collections

By eagerly loading these collections, JPA/Hibernate knows that it doesn't need to populate them again when they are accessed. Fetching multiple child collections works as long as they are represented with a Set in the entity and not with a List; if they are represented as Lists, Hibernate will throw a MultipleBagFetchException (see the note on performance from Hibernate expert Vlad Mihalcea).

Line #8 — ORDER BY to remove partial results

How does Hibernate know that an entity is ready to be emitted to the stream when each Employee results in multiple rows in the result set due to the join? Hibernate solves this by looking at the id attribute(s) of the entity: as soon as they change in the result set, an entity becomes available in our stream.

This means that if we don't order the result on the primary entity's id attribute(s), we will end up with partial results, since there is no guarantee that the database keeps rows relating to the same Employee together.

Instructing Hibernate with QueryHints

To make full use of streams in JPA, we need to provide additional information.

Line #10 — Enable streaming via HINT_FETCH_SIZE

This instruction is crucial to making streams in Spring Data JPA actually useful. Without it, the underlying JDBC driver will still load the entire result set into memory, even though we have declared the return type to be a Stream.

Even stranger, the required value differs between drivers. For MySQL it must be Integer.MIN_VALUE; for PostgreSQL it must be set but can be any positive integer, e.g. "500".
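On PostgreSQL, the fetch size hint on line #10 of the sketch above could instead be written with a plain batch size (500 is just an example value):

@QueryHint(name = HINT_FETCH_SIZE, value = "500")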

Line #13 — Remove duplicates in stream via HINT_PASS_DISTINCT_THROUGH

This instruction tells Spring Data JPA/Hibernate not to pass the DISTINCT keyword on to the database in the generated SQL. Instead, Hibernate interprets the DISTINCT in our JPQL as an instruction not to return the same entity once for each row in the result set, i.e. it works in conjunction with the DISTINCT behavior explained above.

Managing memory when consuming a Stream

Using the repository above, we are ready to consume and process each Employee via a Stream. However, if we invoke exportEmployees as above, we will eventually run into an OutOfMemoryError.

The reason for this is that even though we are streaming Employee entities from the database and have marked the query and transaction as read only, Hibernate keeps track of all entities in its persistence context. After going through a few thousand records, the heap will be full of these entities.

To resolve this, we have to handle each entity in the stream differently and, most importantly, tell Hibernate that it shouldn't keep tracking the entity in the persistence context once we are done with it.

Reading and detaching Employees from a Stream
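A sketch with the detach call added, again with line numbers so the note below lines up (EmployeeMapper and writeLine are illustrative names as before):

 1:  @PersistenceContext
 2:  private EntityManager entityManager;
 3:
 4:  @Transactional(readOnly = true)
 5:  public void exportEmployees(LocalDate hireDateFrom, LocalDate hireDateTo) {
 6:      try (Stream<Employee> employees =
 7:              employeeRepository.streamByHireDateBetween(hireDateFrom, hireDateTo)) {
 8:          employees.forEach(employee -> {
 9:              writeLine(EmployeeMapper.map(employee));
10:              // We are done with this entity, stop tracking it in the persistence context
11:              entityManager.detach(employee);
12:          });
13:      }
14:  }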

Line #11 — Remove entity from persistence context

With this instruction, we let Hibernate know that we are done with this instance and that it should no longer keep track of it. Once Hibernate releases the object from its persistence context, the JVM can free the memory during the next GC.

Conclusion

Spring Data JPA and Hibernate are great tools, but they were never really intended for batch or ETL workloads. With some careful setup they can be used with great success for these kinds of purposes, making it possible to reuse a lot of the awesome features in Spring Data JPA and Hibernate while rapidly processing millions of entities without causing runtime errors.
