Persisting fast in a database: JDBC

In my previous post I explained how to persist information when you are forced to use JPA, and how to make it run as fast as possible. This time we will see how we can persist the same information directly with JDBC, without intermediaries.

Following the same structure as my previous post, I will show the code, the changes, and the results at each step. To simplify the examples, I have extracted the code that is repeated in all of them.

All source code is available in this GitHub repository.

Using JDBC directly

JDBC is the basic, standard interface for communicating between Java and any database, and it is part of Java SE.

I personally consider that, along with the Servlet specification, it was the key that turned Java into The Enterprise development tool of the late 1990s.

It is probably the most verbose option of all, requiring a lot of low-level boilerplate code. But it also has the least overhead, since it bypasses all framework code and talks to the driver directly.

The incorporation of try-with-resources into the language freed us from much of that boilerplate, but it hasn’t freed us from mapping attributes to column names.

PreparedStatement

Statement is the simplest way to execute insert statements, with the query and its parameters inlined in the same String. I discard it outright.

When you have to repeat the same statement many times, PreparedStatement is the first mechanism that lets us reduce execution time, since it reuses the query information between one invocation and the next.

But the main reason that should lead us to use PreparedStatement is to avoid SQL injection attacks.
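To illustrate the point, a minimal sketch (the `person` table and its columns are made up for the example): with a plain Statement the parameter travels inside the SQL text, so a hostile value can rewrite the query itself, while a PreparedStatement keeps the query and its values separate.

```java
public class InjectionDemo {

    // With Statement, the parameter is concatenated into the SQL text by hand:
    public static String unsafeQuery(String name) {
        return "SELECT * FROM person WHERE name = '" + name + "'";
    }

    public static void main(String[] args) {
        // A hostile value changes the meaning of the query itself:
        System.out.println(unsafeQuery("x' OR '1'='1"));
        // With PreparedStatement the query text is fixed and the value is sent
        // separately, so it can never be interpreted as SQL:
        //   PreparedStatement ps = connection.prepareStatement(
        //           "SELECT * FROM person WHERE name = ?");
        //   ps.setString(1, name);
    }
}
```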

1.- One record at a time

The simplest version of all, where one PreparedStatement is executed after another, with the autocommit mode set to true (after each operation, the database executes a commit):
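A sketch of this step (the `person` table, its columns, and the row representation are hypothetical, not the ones from the benchmark):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class OneRecordAtATime {

    // Hypothetical table and columns, for illustration only.
    public static final String INSERT_SQL =
            "INSERT INTO person (name, email) VALUES (?, ?)";

    public static void persist(Connection connection, List<String[]> rows)
            throws SQLException {
        connection.setAutoCommit(true); // the database commits after every execute
        try (PreparedStatement ps = connection.prepareStatement(INSERT_SQL)) {
            for (String[] row : rows) {
                ps.setString(1, row[0]);
                ps.setString(2, row[1]);
                ps.executeUpdate(); // one round trip and one commit per record
            }
        }
    }
}
```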

The result has left me totally confused. Compared with JPA in the equivalent “one record at a time” test, the improvement in MySQL seems very poor to me (a 3X improvement), while in Postgres it is spectacular (a 30X one). I don’t know the reason for this difference in behavior/performance, and I’m noting it down for further research.

2.- One record at a time, in transactions of 1000 elements

Opening and closing each transaction takes time, so if we disable autocommit we should save some of it. What happens if we manually execute a commit every 1000 elements?
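The change over the previous sketch is small (again with the hypothetical `person` table): we disable autocommit and take control of the transaction boundaries ourselves.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class ChunkedTransactions {

    public static final int COMMIT_EVERY = 1_000;
    // Hypothetical table and columns, for illustration only.
    public static final String INSERT_SQL =
            "INSERT INTO person (name, email) VALUES (?, ?)";

    public static void persist(Connection connection, List<String[]> rows)
            throws SQLException {
        connection.setAutoCommit(false); // we control the transaction boundaries
        try (PreparedStatement ps = connection.prepareStatement(INSERT_SQL)) {
            int count = 0;
            for (String[] row : rows) {
                ps.setString(1, row[0]);
                ps.setString(2, row[1]);
                ps.executeUpdate();
                if (++count % COMMIT_EVERY == 0) {
                    connection.commit(); // close one transaction every 1000 records
                }
            }
            connection.commit(); // commit the last, possibly partial, chunk
        }
    }
}
```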

Finally, MySQL starts to make some sense and gets a 12X improvement over the previous version, while Postgres improves by only 40%.

It’s clear that MySQL has a problem managing transactions (compared to Postgres), but I don’t know whether it is a problem in the JDBC driver or a general MySQL one.

3.- In batches of 1000 records

The next natural step is to persist the information using the addBatch and executeBatch methods offered by JDBC.

The addBatch method simply accumulates all the requests you make in memory, and waits for an executeBatch invocation to send all the information to the database.

Running in batch does not imply running transactionally. If you leave the connection in autocommit, after each statement in the batch the database will ensure the result is correctly persisted before moving on to the next one (along with the other guarantees configured in your isolation level). In autocommit mode, if the nth statement of a batch raises a constraint violation, the previously inserted rows will not be rolled back.

In this case, I will make a commit associated to each batch:
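A sketch of the batched version (same hypothetical `person` table as before): addBatch accumulates the statements and executeBatch sends them, with one commit per batch.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class BatchedInsert {

    public static final int BATCH_SIZE = 1_000;
    // Hypothetical table and columns, for illustration only.
    public static final String INSERT_SQL =
            "INSERT INTO person (name, email) VALUES (?, ?)";

    public static void persist(Connection connection, List<String[]> rows)
            throws SQLException {
        connection.setAutoCommit(false);
        try (PreparedStatement ps = connection.prepareStatement(INSERT_SQL)) {
            int count = 0;
            for (String[] row : rows) {
                ps.setString(1, row[0]);
                ps.setString(2, row[1]);
                ps.addBatch(); // accumulated in memory, no round trip yet
                if (++count % BATCH_SIZE == 0) {
                    ps.executeBatch();   // send the whole batch to the database
                    connection.commit(); // one transaction per batch
                }
            }
            ps.executeBatch(); // flush the last, possibly partial, batch
            connection.commit();
        }
    }
}
```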

Here again, MySQL disappoints us with only a marginal improvement. This confirms that the MySQL driver does not implement any kind of optimization for batch operations.

For Postgres, the change is very positive and gets a 4X improvement working in batches.

Unlike the case of JPA, here we don’t have to worry about primary key problems, and we can use each database engine’s sequence system.

4.- Rewrite batch statements

In this case, we can also use the optimizations offered by each JDBC driver in both MySQL and Postgres when executing batch operations.

As I already explained in my previous post, this optimization groups several insert statements and rewrites them into a single one, reducing the number of round trips to the server and the work it has to do to process them.

The main reasons this configuration is not enabled by default are that not all possible insert statements are supported, and that if one of the statements fails, the driver cannot tell you where the problem is within the rewritten query.

If none of these problems applies to you, I would modify the database connection settings NOW!

Activation is done over the connection URL by adding the rewriteBatchedStatements parameter in MySQL and the reWriteBatchedInserts parameter in Postgres. There is no need to modify any code, just the database connection:
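For example (host, port, and database name are placeholders):

```text
# MySQL (Connector/J)
jdbc:mysql://localhost:3306/testdb?rewriteBatchedStatements=true

# Postgres (pgJDBC)
jdbc:postgresql://localhost:5432/testdb?reWriteBatchedInserts=true
```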

Finally, we have found the right configuration for MySQL! This time it reaches a level similar to Postgres, with a 7X improvement.

Postgres, on the other hand, having already done its homework in other parts of the system, “only” gets a 70% performance improvement, still keeping some margin over MySQL.

Next step?

Here I run out of tricks to improve performance in the JDBC area.

There is still room for improvement in the low-level configuration of the database, where some small improvements can be achieved by changing its usual behavior.

As in the post about JPA, I have used 1000 as the batch size because it is a big, round number, but it would be worth trying different sizes to find the most suitable one for each database and each dataset to persist. The results will probably change slightly with another batch size.

In the next article we will see that we still have room to persist even faster from Java, taking advantage of certain features provided by JDBC drivers, and the databases they support.

Conclusion

We have seen how to improve performance when you can avoid dealing with JPA or similar frameworks, concluding that the fastest way is to use batch operations and take advantage of the batch rewrite optimization:

Clearly the only way to work efficiently with MySQL is to use rewriteBatchedStatements; compared with Postgres, MySQL is not very efficient in its communication with the database server. Can any of you shed some light on the subject? Unless I have made a mistake somewhere in the setup, I personally think MySQL’s performance is very poor if you don’t have the opportunity to take advantage of this optimization.

I have not yet tested the new version of the JDBC driver that Oracle has developed for the latest version of MySQL. It’s the only one that supports the newly released version 8 of MySQL, and it’s a complete rewrite of the driver. Have they improved anything? The day I try it, I’ll tell you.

Reconclusion

How different is the performance between JPA and JDBC? The best result of each option, face to face, is:

The improvement is around 50% in one case and 25% in the other; it is not the order of magnitude you might expect given the overhead of a tool like JPA.

Before extrapolating my results to your problem, I invite you to run the benchmarks with your data and your infrastructure and draw your own conclusions. They are likely to vary, whether because of the shape of the data, the server configuration, or simple network latency.

The configuration of these benchmarks is the same as in the previous post: