Optimizing database inserts in Java: pointers for Spring and JPA users.

Because a queue is not always the solution :)

9 min read · Jan 20, 2019


I have met a lot of engineers over the years. And I'm not talking only about software engineers, but construction, agriculture, bio, and so on. We all share one thing in common: we solve problems with the tools that we know of. A student with zero years of experience can solve the same problems using for loops and if statements as the guy who has been on the bench since 1985 and will use functors and monads. The difference will always be in the quality of the solution: in how extensible, how readable, and how performant it is.

The reason for this overly long introduction, and the main focus of this article, is the part above that I put in bold: "the tools that we know of". You see, the value of experience in any profession comes from expanded knowledge. It is very hard to think about something that we have never thought about before, but it's quite easy to do the opposite: to think about something that has already crossed our mind. There is a linear dependency between how much we know and how much we are worth on the job market.

In this piece I want to expand your understanding of how inserts are done and what we can do to optimize them. This was a problem I encountered many years ago when I was still a rookie, and back then I totally over-engineered the solution. Fortunately I kept enough notes, so I can recreate the problem and this time solve it better.

Don’t trust tutorials

To protect the privacy of my work I will show fake entities, but the rest is real, I promise. The full code for this article can be found in my GitHub repo linked at the bottom of this page. The basic diagram is shown below:

And the classes (the parts of them that matter):

@Entity(name = "movie_separate_sequence")
@Table(name = "movie_separate_sequence")
public class Movie {

    @Id
    @GeneratedValue
    private Long id;

    @OneToOne(mappedBy = "movie", cascade = CascadeType.ALL,
              fetch = FetchType.LAZY, optional = false)
    private MovieDetails movieDetails;
    ...
}

@Entity(name = "movie_details_separate_sequence")
@Table(name = "movie_details_separate_sequence")
public class MovieDetails {

    @Id
    @GeneratedValue
    private Long id;

    @OneToOne(fetch = FetchType.LAZY, cascade = CascadeType.ALL)
    @JoinColumn(name = "movie_id")
    private Movie movie;

    @OneToMany(
            cascade = CascadeType.ALL,
            orphanRemoval = true
    )
    private List<Actor> actors = new ArrayList<>();
    ...
}

@Entity(name = "actor_separate_sequence")
@Table(name = "actor_separate_sequence")
@NoArgsConstructor
@AllArgsConstructor
@Data
public class Actor {

    @Id
    @GeneratedValue
    private Long id;
    ...
}

This setup, in various forms, can be found in Hibernate tutorials all over the net, and it has one basic advantage: it is simple. All the annotations translate to plain English, so we can understand what is going on. But do we really? Let's take a look at the log with SQL statements enabled:

What? 18 SQL statements were executed to insert 5 records. When I launched my performance test (more on that later), the target response time of below 50 ms was never met; the fastest response time was above 100 ms. And at around 20 concurrent users I actually started to receive timeouts. I had the database and my app on the same server, so if I were to add network latency things would be much worse. So what can we do about that? Well, first, if we have a truly one-to-one relationship there is no need for separate ids for movie and movie_details. We fix that by removing @GeneratedValue from the id field and adding @MapsId:

public class MovieDetails {

    @Id
    private Long id;

    @OneToOne(fetch = FetchType.LAZY)
    @MapsId
    private Movie movie;
    ...
}

This cuts the number of executed statements to 16. Still not great, but this is the first lesson about ids: aside from the fact that you don't have to go to the database to fetch the next id, you are also saving memory. Remember that id columns are often indexed, and for an index to be performant it has to sit in memory; and in order to sit in memory it has to fit there first.

What else can we do? Well, if you take a look at the listing above you will see that two statements are executed every time we need a new id. That's no good. Fortunately we can enable sequence caching. More on that in this blog post:

So now our Movie class looks something like this:

public class Movie {

    @Id
    @GenericGenerator(
            name = "movieSequenceGenerator",
            strategy = "enhanced-sequence",
            parameters = {
                    @org.hibernate.annotations.Parameter(
                            name = "optimizer",
                            value = "pooled"
                    ),
                    @org.hibernate.annotations.Parameter(
                            name = "initial_value",
                            value = "1"
                    ),
                    @org.hibernate.annotations.Parameter(
                            name = "increment_size",
                            value = "20"
                    )
            }
    )
    @GeneratedValue(
            strategy = GenerationType.SEQUENCE,
            generator = "movieSequenceGenerator"
    )
    private Long id;
    ...
}
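For intuition, the pooled optimizer works roughly like this: one round trip to the database sequence reserves a whole block of increment_size ids, which are then handed out from memory. Below is a simplified sketch of that idea (my own illustration, not Hibernate's actual optimizer code):

```java
// Simplified sketch of a pooled-style id allocator (illustration only, not
// Hibernate's PooledOptimizer): each call to the database sequence reserves
// a block of `incrementSize` ids, so most id requests are served from memory
// without a round trip.
class PooledIdAllocator {
    private final int incrementSize;
    private long sequenceValue = 0; // stand-in for the database sequence
    private long nextId = 0;        // next id to hand out
    private long hiMark = 0;        // exclusive upper bound of the current block
    int sequenceCalls = 0;          // how many "round trips" were made

    PooledIdAllocator(int incrementSize) {
        this.incrementSize = incrementSize;
    }

    long next() {
        if (nextId >= hiMark) {
            // Block exhausted: "call the sequence" once to reserve a new block.
            sequenceValue += incrementSize;
            sequenceCalls++;
            nextId = sequenceValue - incrementSize + 1;
            hiMark = sequenceValue + 1;
        }
        return nextId++;
    }

    public static void main(String[] args) {
        PooledIdAllocator ids = new PooledIdAllocator(20);
        for (int i = 0; i < 20; i++) ids.next();
        System.out.println(ids.sequenceCalls); // 1 sequence call for 20 ids
    }
}
```

With increment_size = 20, inserting 20 entities costs a single sequence round trip instead of 20.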

With that optimization, our inserts now look like this:

So we have already cut the number of executed statements in half. Now it is time to turn our eye to the @OneToMany relation between movie details and actors. As we can see, Hibernate also executes inserts into an intermediate table that holds the mapping of movie details id to actor id.

We don't need that; it's just how a one-way mapping is established. If we want to keep this as a one-way relation we can, but only partially: we will still execute six statements, it's just that three of them are updates. This is because of the flush order built into Hibernate: first all the inserts, then all the updates. If you are interested, more on this can be found here:
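For reference, that partial one-way variant would look something like this (a sketch reusing the field names from the example): a @JoinColumn on the collection removes the join table, but Hibernate still follows up the actor inserts with updates that set the foreign key.

```java
// One-way mapping without a join table (sketch): the foreign key lives in
// the actor table. Because of Hibernate's flush order, the actors are first
// inserted with a null movie_details_id and then fixed up with UPDATEs.
@OneToMany(cascade = CascadeType.ALL, orphanRemoval = true)
@JoinColumn(name = "movie_details_id")
private List<Actor> actors = new ArrayList<>();
```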

In order to avoid all of that we have to use a bidirectional @OneToMany. Our example is not ideal for this, as in reality we would want many actors assigned to many movies. But for the sake of argument let's say that is not the case, and that between movie details and actors there truly is a one-to-many relationship.

// MovieDetails.java
@OneToMany(
        mappedBy = "movieDetails",
        fetch = FetchType.LAZY,
        cascade = CascadeType.ALL,
        orphanRemoval = true
)
private List<Actor> actors = new ArrayList<>();

// Actor.java
@ManyToOne(fetch = FetchType.LAZY)
@JoinColumn(name = "movie_details_id")
private MovieDetails movieDetails;

After adding this code, the statement log looks like this:

Five inserts for five records; I guess we can't do much better. Well, we can add more hardware to the database: an SSD disk, more RAM, more CPU, but in the end we will hit a limit. And what then? The reason inserts are so expensive is that the database has to transition from one consistent state to another, and the writes have to be durable, meaning a power shutdown will not result in data loss. If we are not willing to compromise on that, there is little we can do.

So what is important?

On the other hand, if speed is essential, we have to pay for it with consistency. And this is really a question for the business: what will cost us more? For domains like hospitals or nuclear missiles we can't compromise. I mean, it's better for a patient's operation to be delayed 5 minutes than to cut off the wrong leg. It's not like you can send an email afterwards saying "sorry, here is a 50% discount on your next operation".

But we can send an email if we sell tickets to a movie and book the same seat twice. Here the situation is reversed: an anxious teenager will go to a different site if they have to wait 20 seconds for a transaction to be confirmed.

For the second situation we can implement a queue: basically, put the records into a list and then, in the background, write those records to the database. No real magic, but because all of that happens in memory it can be lightning fast.
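A minimal sketch of that idea in plain Java (my own illustration, not code from the project): callers enqueue records in memory and return immediately, while a background worker drains them in batches towards the database. The actual repository call is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Write-behind queue sketch: enqueue() is the fast in-memory path; a
// background thread would call drainBatch() in a loop and flush each batch
// to the database with a single batched insert.
class WriteBehindQueue<T> {
    private final BlockingQueue<T> queue = new LinkedBlockingQueue<>();
    private final int batchSize;

    WriteBehindQueue(int batchSize) {
        this.batchSize = batchSize;
    }

    // Fast path: returns immediately, no database round trip.
    void enqueue(T record) {
        queue.add(record);
    }

    // Drain up to batchSize records at once for a single batched write.
    List<T> drainBatch() {
        List<T> batch = new ArrayList<>();
        queue.drainTo(batch, batchSize);
        return batch;
    }

    public static void main(String[] args) {
        WriteBehindQueue<String> q = new WriteBehindQueue<>(3);
        q.enqueue("movie-1");
        q.enqueue("movie-2");
        q.enqueue("movie-3");
        q.enqueue("movie-4");
        System.out.println(q.drainBatch().size()); // 3
        System.out.println(q.drainBatch().size()); // 1
    }
}
```

In a real setup the drain loop would run on a scheduled background thread; the trade-off is exactly the one described above: an unexpected shutdown loses whatever is still sitting in the queue.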

Tests and results

I'm not going to bother you with all the intermediate stages, just the worst and the best results. As you can see, the difference is quite substantial.

On the horizontal axis you have the response time, and on the vertical axis the percentage of all requests that were fulfilled within that time. Because I slowly ramped up the users from 0 to 40, there was a point at which I started to receive timeouts, and 30% of requests timed out.

No optimizations done at all
All of the optimizations

So there you have it: the comparison of the two solutions. The funny thing is that the first one can be found in so many tutorials on the internet, in so many examples on Stack Overflow, that it is probably mindlessly copied into production code. We have to stop doing that: making changes that we do not fully understand. Because we might pay a price and not even know it.

Curious case of Spring CRUD repository

Let's say we have the following case: I have a mobile app that sends purchase orders from many, many clients. And as a form of optimization I want the client app to be able to generate an id for that sale that will be unique, and I want to use that id as the primary key in my database. Well, a randomly generated UUID is a perfect candidate for that. So I have to modify my entities:

@Id
@Column(columnDefinition = "BINARY(16)")
private UUID id;

@OneToOne(mappedBy = "movie", cascade = CascadeType.ALL,
          fetch = FetchType.LAZY, optional = true)
private MovieDetails movieDetails;

I had to switch optional to true (this is because of Hibernate's flush order, which I don't want to get into) and change the type of the column. There is something to notice here: a UUID takes more space in memory than a Long, and as I mentioned before, it is important to keep the index size small. Then I had my Spring repository:

@Repository
public interface MovieSameIdUUIDAssignedRepository extends CrudRepository<Movie, UUID> {
}

And service:

@Autowired
private MovieSameIdUUIDAssignedRepository repository;

@PostMapping("/sameIdUUIDAssigned")
public UUID createAll(@RequestBody Movie movie) {
    Movie saved = repository.save(movie);
    return saved.getId();
}

But when I called that service I got:

What? Why are there five selects? Well, as it turns out, there are two methods on EntityManager (the JPA interface that interacts with the persistence context): one is called merge(T entity) and the other is called persist(T entity). The main difference is that merge will first run a select and then, if the entity is found, execute an update, otherwise an insert. On the other hand, persist will always execute an insert, and if a record with that id is already in the database it will throw an error. When we use a Spring repository and have already assigned an id to the entity, it will always choose merge (below is part of the source code of the two Spring classes):

// AbstractEntityInformation.java
public boolean isNew(T entity) {

    ID id = getId(entity);
    Class<ID> idType = getIdType();

    if (!idType.isPrimitive()) {
        return id == null;
    }

    if (id instanceof Number) {
        return ((Number) id).longValue() == 0L;
    }
    ......
}

// SimpleJpaRepository.java
if (entityInformation.isNew(entity)) {
    em.persist(entity);
    return entity;
} else {
    return em.merge(entity);
}
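To make that decision visible, here is a tiny runnable re-creation of the isNew logic for a non-primitive id type (my own mimic, not Spring's classes): once the UUID is assigned client-side, isNew returns false and save() routes to merge(), which triggers the extra select.

```java
import java.util.UUID;

// Tiny mimic (not Spring's actual code) of the isNew decision for a
// non-primitive id type: an entity counts as "new" only while its id is null.
class IsNewCheck {
    static boolean isNew(Object id) {
        // Mirrors: if (!idType.isPrimitive()) return id == null;
        return id == null;
    }

    public static void main(String[] args) {
        System.out.println(isNew(null));              // true  -> em.persist(entity)
        System.out.println(isNew(UUID.randomUUID())); // false -> em.merge(entity)
    }
}
```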

But in this case we want to persist; we know that this record does not exist in the database. Well, the best thing we can do is to go around the repository, directly to the source:

@Autowired
private EntityManager entityManager;
...
entityManager.persist(movie);

And the log gets back to normal:

The cost of ignorance.

I also did a performance test for the case above, with a different setup: in order to show the cost of those additional selects I had to instantiate my database in the cloud. When the dataset was small and the database was local, the results were almost identical. But when network latency kicked in, this is what I saw:

Using repo
Using entity manager

Summary

To be honest, I have only one thing to say:

spring.jpa.show-sql=true
spring.jpa.properties.hibernate.format_sql=false

are your friends. Enable them when you develop, analyze the logs, and see if they make sense, because you might be paying a price for ignorance. I cured mine by reading Vlad's blog; a lot of what is in this article comes from his work.

Resources.

The gatling script

movieDetails.json

Java project used for testing


Piotr Szybicki, Programmer, Java Developer, ML Enthusiast