When to be reactive?

Bartosz Polnik
6 min read · Feb 26, 2018


Recently I stumbled upon code serving data directly from a database. The whole ResultSet was read into a list and returned from a Spring controller. This meant that all of the data had to be stored in memory, and it also meant that once we have enough data, we run into out-of-memory (OOM) errors. When exactly? Let's run an experiment: compare different ways of serving data with limited memory (512 MB of heap) and an increasing dataset size.

Prerequisites:

A PostgreSQL database with 1 million rows of data.

First approach. Returning data as list.

In this scenario, I wanted to return data in the simplest way. We read it into a list and return it from the controller. Behind the scenes, Jackson serializes it to JSON.

To find the threshold at which we can no longer read more data, I added an additional parameter to the controller: limit. Here's our mapping:
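The original gist is not reproduced here, but the endpoint could look roughly like this. A sketch only: `DataMapper`, `DataRow`, the method name and the path are my assumptions, not necessarily the original code.

```java
import java.util.List;

import org.springframework.http.MediaType;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class ListController {

    private final DataMapper dataMapper;

    public ListController(DataMapper dataMapper) {
        this.dataMapper = dataMapper;
    }

    @GetMapping(value = "/data", produces = MediaType.APPLICATION_JSON_VALUE)
    public List<DataRow> data(@RequestParam("limit") int limit) {
        // The whole result set is materialized into this list before
        // Jackson serializes it, so heap usage grows linearly with `limit`.
        return dataMapper.findData(limit);
    }
}
```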

The SQL and the mapper are simple, so I decided to omit their code.

Before we dive into the results, I need to mention that the default configuration of my ORM mapper (MyBatis with the PostgreSQL driver) tries to fetch all requested rows into memory at once, which renders the different approaches to avoiding OOM useless. I therefore configured MyBatis to read up to 1000 rows at a time:
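One way to express that limit, assuming annotation-based MyBatis mappers (the article's actual config may instead set `fetchSize` in an XML mapper or `defaultFetchSize` in mybatis-config.xml):

```java
import java.util.List;

import org.apache.ibatis.annotations.Mapper;
import org.apache.ibatis.annotations.Options;
import org.apache.ibatis.annotations.Param;
import org.apache.ibatis.annotations.Select;

@Mapper
public interface DataMapper {

    // fetchSize caps how many rows the driver pulls per round trip.
    // Note: the PostgreSQL driver honors fetchSize only when the
    // statement runs inside a transaction (autocommit off).
    @Options(fetchSize = 1000)
    @Select("SELECT * FROM data LIMIT #{limit}")
    List<DataRow> findData(@Param("limit") int limit);
}
```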

All tests presented here use this configuration.

Results

Up until 625 000 rows everything seemed fine: CPU usage was low, GC was low and response time was good. It took only the next 25 000 POJOs to double CPU usage from 20% to 40%; GC went from 0% to 5% of CPU time, and response time grew by 38% to 2.502 s. Memory finally ran out at 850 000 entities.

A short note on the Gatling results: I named my requests after the amount of data they carry, so the request named 50 000 carries data from 50 000 rows.

Second approach. Server sent events.

To implement this scenario, I looked for a way to have MyBatis return Observables. That meant a lot of debugging and a lot of trial and error. After what seemed like a quick 4 hours, I discovered that MyBatis can also return a Cursor, which is Iterable. That was the moment I felt at home. The mapper code becomes simple:
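A sketch of what such a mapper could look like; the query and the names are assumptions carried over from the earlier examples:

```java
import org.apache.ibatis.annotations.Mapper;
import org.apache.ibatis.annotations.Options;
import org.apache.ibatis.annotations.Param;
import org.apache.ibatis.annotations.Select;
import org.apache.ibatis.cursor.Cursor;

@Mapper
public interface DataMapper {

    // Cursor implements Iterable (and Closeable): rows are fetched
    // lazily while the underlying SqlSession stays open.
    @Options(fetchSize = 1000)
    @Select("SELECT * FROM data LIMIT #{limit}")
    Cursor<DataRow> streamData(@Param("limit") int limit);
}
```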

Let's move on to the conversion from Cursor to Observable, and digress for a moment about the server-sent events implementation in Spring.

In Spring, SSE is provided by SseEmitter. This class has two methods that matter to us: initialize and send.

Initialize is called by Spring. Once the request thread returns from the controller, it sets a handler that performs marshalling and sends data to the client. Before that moment, all invocations of the send method add messages to SseEmitter's internal buffer.

This is important for us because it means that we cannot fetch data on the request thread. Had we done that, we would have added all the data to the internal buffer before sending anything, which would create the very OOM problem we are trying to avoid.

The conclusion is that, unfortunately, we need to use thread pools.

There are two more things to note. First, we want a cold Observable, so that we don't skip any data between subscribing and reading from the database. Second, we need to manage transactions manually; otherwise the mapper returns an empty Cursor due to a closed SqlSession. All of the above is captured in the following piece of code:
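The missing snippet could be sketched like this, assuming RxJava 2 and the hypothetical names used above. `Observable.using` keeps the Observable cold: a fresh SqlSession (and thus transaction) is opened per subscriber, only at subscription time, and closed when the stream terminates.

```java
import io.reactivex.Observable;

import org.apache.ibatis.session.SqlSession;
import org.apache.ibatis.session.SqlSessionFactory;
import org.springframework.stereotype.Service;

@Service
public class DataStreamer {

    private final SqlSessionFactory sqlSessionFactory;

    public DataStreamer(SqlSessionFactory sqlSessionFactory) {
        this.sqlSessionFactory = sqlSessionFactory;
    }

    public Observable<DataRow> stream(int limit) {
        return Observable.using(
                // Resource: opened only when someone subscribes
                // (openSession() defaults to autocommit off).
                () -> sqlSessionFactory.openSession(),
                session -> Observable.fromIterable(
                        session.getMapper(DataMapper.class).streamData(limit)),
                // Disposer: closes the session, and with it the Cursor.
                SqlSession::close);
    }
}
```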

What’s left is our controller:
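A sketch of how the controller could tie the pieces together; `DataStreamer` is the hypothetical service from above, and the exact paths and scheduler are assumptions:

```java
import io.reactivex.schedulers.Schedulers;

import org.springframework.http.MediaType;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.servlet.mvc.method.annotation.SseEmitter;

@RestController
public class SseController {

    private final DataStreamer dataStreamer;

    public SseController(DataStreamer dataStreamer) {
        this.dataStreamer = dataStreamer;
    }

    @GetMapping(value = "/sse", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
    public SseEmitter sse(@RequestParam("limit") int limit) {
        SseEmitter emitter = new SseEmitter();
        // Subscribe off the request thread so data is pushed only after
        // Spring has initialized the emitter; each row becomes one event.
        dataStreamer.stream(limit)
                .subscribeOn(Schedulers.io())
                .subscribe(
                        row -> emitter.send(row, MediaType.APPLICATION_JSON),
                        emitter::completeWithError,
                        emitter::complete);
        return emitter;
    }
}
```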

You may have spotted two MediaTypes: one for the whole stream and one for the individual messages.

Also be cautious about timeouts. The default varies, but on Tomcat 8 you have only 30 s to finish your async processing. I set it to 5 minutes.
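One way to raise that limit globally is a sketch like the following (alternatively, a per-request timeout can be passed to the SseEmitter constructor):

```java
import java.util.concurrent.TimeUnit;

import org.springframework.context.annotation.Configuration;
import org.springframework.web.servlet.config.annotation.AsyncSupportConfigurer;
import org.springframework.web.servlet.config.annotation.WebMvcConfigurer;

@Configuration
public class AsyncConfig implements WebMvcConfigurer {

    @Override
    public void configureAsyncSupport(AsyncSupportConfigurer configurer) {
        // Raise the async request timeout to 5 minutes; the container
        // default (30 s on Tomcat 8) would abort longer streams.
        configurer.setDefaultTimeout(TimeUnit.MINUTES.toMillis(5));
    }
}
```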

That’s all about implementation.

Results

CPU and memory usage are constant with regard to dataset size. CPU lands at 20%. The most disheartening part of SSE is the response time: in many cases it is 10x slower than the first approach (returning data as a list).

Although I haven't included it in the screenshot, a full GC reclaimed all memory after the test, so there are no leaks. Or at least I haven't found any.

Third approach. Meet Jackson.

Jackson can stream data. Let’s use it!

We start in the same place as before; the mapper is unchanged.

Then we move on to converter from Cursor to Observable:

Almost the same as in the second approach (server-sent events), but now without a thread pool!

The whole streaming relies on controller:
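One way to wire this up is a sketch like the one below (the article's actual controller may differ): a StreamingResponseBody drains the cold Observable on the response-writing thread, and Jackson's streaming JsonGenerator writes each row to the response as it arrives from the cursor.

```java
import com.fasterxml.jackson.core.JsonGenerator;
import com.fasterxml.jackson.databind.ObjectMapper;

import org.springframework.http.MediaType;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.servlet.mvc.method.annotation.StreamingResponseBody;

@RestController
public class StreamingController {

    private final DataStreamer dataStreamer;  // hypothetical service from above
    private final ObjectMapper objectMapper;

    public StreamingController(DataStreamer dataStreamer, ObjectMapper objectMapper) {
        this.dataStreamer = dataStreamer;
        this.objectMapper = objectMapper;
    }

    @GetMapping(value = "/stream", produces = MediaType.APPLICATION_JSON_VALUE)
    public StreamingResponseBody stream(@RequestParam("limit") int limit) {
        // The lambda runs on a container-managed thread; rows are written
        // one at a time, so memory stays flat regardless of `limit`.
        return outputStream -> {
            try (JsonGenerator generator =
                         objectMapper.getFactory().createGenerator(outputStream)) {
                generator.writeStartArray();
                dataStreamer.stream(limit).blockingForEach(generator::writeObject);
                generator.writeEndArray();
            }
        };
    }
}
```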

Results

CPU and memory are constant with increasing dataset size. Memory usage is much lower than with server-sent events. Performance-wise, it is 3.5x slower than the first approach (returning data as a list).

Fourth approach. Spring 5.

The new version of Spring supports returning Observable/Flux from controllers. Per the docs, the MediaType defines the format of the response:

application/json: a Flux<User> is handled as an asynchronous collection and serialized as a JSON array with an explicit flush when the complete event is emitted.

application/stream+json: a Flux<User> will be handled as a stream of User elements serialized as individual JSON objects separated by new lines and explicitly flushed after each element. The WebClient supports JSON stream decoding, so this is a good fit for server-to-server use cases.

text/event-stream: a Flux<User> or Flux<ServerSentEvent<User>> will be handled as a stream of User or ServerSentEvent elements serialized as individual SSE elements, using JSON for data encoding by default, with an explicit flush after each element. This is well suited for exposing a stream to browser clients. The WebClient supports reading SSE streams as well.

No more manual conversion or additional libraries for Observables!
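In Spring 5 the whole third approach could shrink to something like this minimal sketch; `ReactiveDataRepository` is a hypothetical reactive source, not something from the article:

```java
import org.springframework.http.MediaType;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

import reactor.core.publisher.Flux;

@RestController
public class ReactiveController {

    private final ReactiveDataRepository repository;  // hypothetical

    public ReactiveController(ReactiveDataRepository repository) {
        this.repository = repository;
    }

    // With application/stream+json, each element is written and flushed
    // as its own JSON object on a separate line.
    @GetMapping(value = "/reactive", produces = MediaType.APPLICATION_STREAM_JSON_VALUE)
    public Flux<DataRow> data(@RequestParam("limit") int limit) {
        return repository.findData(limit);
    }
}
```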

Spring 5 seems to provide everything we can imagine in terms of reactivity. I haven’t tested it in practice, but maybe some other time!

Update: Further improvements

I decided to profile the application to better understand why streaming with Jackson was slow. The easiest way for me to do that was to use the tools bundled with the JDK, so I chose Java Mission Control. It has a great panel for inspecting hot methods, i.e. the methods where the CPU spends the most time.

It turned out that the problem lay in Jackson's configuration: by default, Jackson flushes the output after writing each value. You can disable this globally with:
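The relevant switch is Jackson's SerializationFeature.FLUSH_AFTER_WRITE_VALUES, which is on by default; one way to disable it globally, sketched as a Spring bean (the article's exact wiring may differ):

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.SerializationFeature;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class JacksonConfig {

    @Bean
    public ObjectMapper objectMapper() {
        // Stops Jackson from flushing the output stream after every
        // written value, letting writes batch up in the buffer instead.
        return new ObjectMapper()
                .disable(SerializationFeature.FLUSH_AFTER_WRITE_VALUES);
    }
}
```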

I repeated all the tests with flushing disabled, but it made a bigger difference only for the third approach (Jackson streaming), so I'm attaching additional results solely for that approach.

Results for Jackson streaming

CPU usage gradually increases until it reaches 18%, the smallest value we've seen across all tests with the same dataset size. Memory usage is also low. A comparison of response times shows a dramatic improvement: disabling flushing made the server respond more than 4x faster, better even than the first approach (returning data as a list), which was previously in the lead. From the application's perspective, it is clearly the best way to serve content from a database.

That’s all for today. Thanks for reading!

The whole code from this article is available in my GitHub repo: https://github.com/bartekbp/blog/tree/master/data-streaming.
