Is your hardware struggling in futile attempts to speed up responses by shuttling frequently used data from the database to the client? Are you tired of spending money on SSDs? It’s time to inject a cache into your application!
Let’s consider how we can design a cache system and implement a few options in Java.
What do you need to know about your data & requests?
- validity — how long you can keep your data in the cache
- durability — whether you can afford to lose some data
- size
- read & write frequency
- latency
- availability
- scalability
If you know your data and can estimate the requests, you can choose the best caching approach and achieve a high hit rate (when the most requested data is already in the cache), you know what to do on a cache miss (when the data is not found in the cache during a read or write operation), and you can properly set the TTL (time to live — when values need to be removed from the cache).
Read-Write data policies
There are three basic approaches: write-through, write-around, and write-back. We will also touch on a few more options.
1. Write Through
Reading data happens the same way in all policies:
1. Send a request through the cache to get the value.
2. If the value is found in the cache, return it.
3. If not, redirect the request to the DB.
4. Obtain the actual value from the DB.
5. Write it to the cache.
6. Return the response with the obtained value.
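A minimal sketch of this read path in Java (the Database interface here is a hypothetical placeholder for your storage layer; the write-policy classes below build on it):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical DB abstraction used in the sketches below.
interface Database<K, V> {
    V load(K key);
    void save(K key, V value);
}

// The common read path: check the cache first, fall back to the DB.
class ReadThroughCache<K, V> {
    protected final Map<K, V> cache = new ConcurrentHashMap<>();
    protected final Database<K, V> db;

    ReadThroughCache(Database<K, V> db) {
        this.db = db;
    }

    V get(K key) {
        V value = cache.get(key);          // 1-2: look in the cache
        if (value == null) {
            value = db.load(key);          // 3-4: redirect the request to the DB
            if (value != null) {
                cache.put(key, value);     // 5: write it to the cache
            }
        }
        return value;                      // 6: return the obtained value
    }
}
```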
The difference is in the write path.
When we use the write-through variant, we store the data in the DB and in the cache in parallel (or one after the other). The important point is that we send the response only when the data has been written to both places — the DB and the cache.
Pros: fast reads of recently written data, data durability.
Cons: slow writes, cache pollution.
A write-through cache suits most kinds of data that require durability and fast reads of recently used or written data.
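Continuing the sketch above, the write-through write path might look like this:

```java
// Write-through: same read path; the write hits both the DB and the cache,
// and we respond only after both have been updated.
class WriteThroughCache<K, V> extends ReadThroughCache<K, V> {
    WriteThroughCache(Database<K, V> db) { super(db); }

    void put(K key, V value) {
        db.save(key, value);    // durable write first, and we wait for it
        cache.put(key, value);  // then the cache; respond only after both succeed
    }
}
```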
2. Write Around
Data is written bypassing the cache: all write requests go directly to the DB, and the written data does not reach the cache (except for invalidating or updating an existing cache entry, if you care about cache–DB consistency).
Pros: data durability, no cache pollution.
Cons: slow writes (though possibly faster than write-through), slow reads of recently written data.
A write-around cache suits apps that don’t need to read recently written data. For example, storing call recordings in call centers: these records are unlikely to be listened to immediately.
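The write-around write path, in the same sketch:

```java
// Write-around: the write bypasses the cache; we only drop a stale entry
// so the next read fetches the fresh value from the DB.
class WriteAroundCache<K, V> extends ReadThroughCache<K, V> {
    WriteAroundCache(Database<K, V> db) { super(db); }

    void put(K key, V value) {
        db.save(key, value);   // straight to the DB
        cache.remove(key);     // invalidate for cache-DB consistency
    }
}
```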
3. Write Back
Data is written straight to the cache, and in parallel (or afterwards) it is written to the DB. The difference from the write-through approach is that here we don’t wait for an answer from the DB: as soon as the data is written to the cache, we send the response confirming the record, and writing to the DB may continue in the background.
Pros: fast writes, fast reads of recently written data.
Cons: no data durability, cache pollution.
Use the write-back treatment to achieve fast writes. Of course, you should keep in mind that your data might be lost (if something fails and the data was written to the cache but didn’t make it to the database in time). You can use it for recording not-so-important data — metrics, for example.
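And the write-back write path, again building on the same sketch (the single-threaded flusher is an illustrative choice):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Write-back: acknowledge right after the cache write; the DB write
// continues in the background (and may be lost on a crash).
class WriteBackCache<K, V> extends ReadThroughCache<K, V> {
    private final ExecutorService flusher = Executors.newSingleThreadExecutor();

    WriteBackCache(Database<K, V> db) { super(db); }

    void put(K key, V value) {
        cache.put(key, value);                      // fast path: cache only
        flusher.submit(() -> db.save(key, value));  // async flush to the DB
    }
}
```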
Things to think about
There are lots of other approaches and implementations, such as:
- write-invalidate — write-around with invalidating the value in the cache, if it is present
- write-update — write-around with updating the value in the cache, if it is present
- write-only — like write-back, but read operations don’t promote data to the cache
- when to flush dirty pages in write-back — immediately, after a checkpoint, or only after receiving a new value to record
- and so on.
Which data structure is most suitable?
Any guesses? Of course, it’s a hash table! It has O(1) access for both put and get operations (given a good hash function that provides a uniform distribution of hash values).
Eviction policies
But a hash table takes O(n) space. Therefore we need an eviction policy to remove values from the cache. It could be LRU, MRU, FIFO, LIFO, RR, ARC, etc.
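As a quick illustration of one of them, Java’s LinkedHashMap gives you a minimal LRU cache almost for free:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A minimal LRU cache: LinkedHashMap in access order evicts the eldest entry.
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruCache(int capacity) {
        super(16, 0.75f, true); // accessOrder = true -> LRU ordering
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict the least recently used entry
    }
}
```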
We’ll talk about them in one of the next articles.
How should I handle the requests?
It depends on your requirements and request frequency. The popular approaches are:
1. Thread per request.
If you don’t have a lot of requests, you can use this variant: create a new thread for every new request. It’s easy, but really wasteful in terms of server resources.
2. Thread pool.
As in the previous approach, you assign a thread to every request, but instead of creating a new one each time, you take an existing one from the pool. This saves a lot of resources. But if your system gets a huge number of requests, you will be forced to increase the number of threads in the pool, and the server will be overwhelmed.
3. Event driven.
a. All requests are queued in an event queue.
b. One thread spins forever in the event loop.
c. It takes the next request from the queue and hands it off to any free thread from the thread pool.
d. Then the event-loop thread goes back to the loop to pick up new events from the queue.
e. The threads from the thread pool perform the received task and fetch the data.
f. They put the value into a callback function.
g. Some thread executes this callback, and the response is sent back.
With this option, you regulate the stream of processed data and are ready to handle a huge number of requests.
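A toy sketch of such an event loop (the Request record and the lookUp method are illustrative assumptions):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

// One thread dequeues requests, a pool performs them,
// and the callback receives the result.
class EventLoopServer {
    record Request(String key, Consumer<String> callback) {}

    private final BlockingQueue<Request> queue = new LinkedBlockingQueue<>();
    private final ExecutorService pool = Executors.newFixedThreadPool(4);

    void start() {
        Thread loop = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    Request req = queue.take();              // take the next event
                    pool.submit(() -> {
                        String data = lookUp(req.key());     // perform the task
                        req.callback().accept(data);         // pass result to callback
                    });
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        loop.setDaemon(true);
        loop.start();
    }

    void submit(Request request) {
        queue.offer(request);  // clients enqueue requests here
    }

    private String lookUp(String key) {
        return "value-for-" + key; // stand-in for real cache/DB access
    }
}
```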
How do we make our system fault-tolerant and persistent?
The cache is stored in RAM for faster I/O. So what happens if the server restarts due to a fault? We need to make the cache persistent so we can retrieve all the data back.
1. Make snapshots
You can run an asynchronous background service that freezes the data in the cache and makes a snapshot on some kind of trigger (a regular interval, or after a certain number of changes). This service then stores the snapshot to the hard disk as a dump file, or into a DB, for example.
The disadvantage is that you can only restore your cache to the snapshot version, not to the state it had right before the server restart.
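A minimal snapshot sketch, assuming the cache keys and values are Serializable and a dump file named cache.dump:

```java
import java.io.*;
import java.util.Map;
import java.util.concurrent.*;

// Periodically serialize a copy of the cache map to disk; restore on startup.
class SnapshotService<K extends Serializable, V extends Serializable> {
    private final ConcurrentHashMap<K, V> cache;
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    SnapshotService(ConcurrentHashMap<K, V> cache) {
        this.cache = cache;
        scheduler.scheduleAtFixedRate(this::snapshot, 1, 1, TimeUnit.MINUTES);
    }

    private void snapshot() {
        try (ObjectOutputStream out =
                     new ObjectOutputStream(new FileOutputStream("cache.dump"))) {
            out.writeObject(new ConcurrentHashMap<>(cache)); // copy, then dump
        } catch (IOException e) {
            e.printStackTrace(); // in real code: log and alert
        }
    }

    @SuppressWarnings("unchecked")
    Map<K, V> restore() throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                     new ObjectInputStream(new FileInputStream("cache.dump"))) {
            return (Map<K, V>) in.readObject();
        }
    }
}
```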
2. Log reconstruction
Instead of a snapshot, you can log all operations on the cache (read, write, delete, update). The choice between sync and async logging depends on your system and requirements. You can also apply some kind of queuing here so as not to affect the app’s performance.
If the server goes down, you will have the log file, and all you have to do is reconstruct the last cache version by executing every instruction in this log file.
Why do we need to log read operations? It depends on your eviction policy. If you’re caching the most frequently read values, for example, you need to count reads, and therefore log them; if you don’t, you will get a wrong (not the same) version of the cache after reconstruction.
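A sketch of such an operation log with replay (the “OP key value” line format is an illustrative assumption):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.List;
import java.util.Map;

// An append-only operation log: every cache operation is written as one line,
// and on restart the log is replayed to rebuild the cache.
class OperationLog {
    private final Path logFile = Path.of("cache-ops.log");

    void append(String op, String key, String value) throws IOException {
        Files.writeString(logFile, op + " " + key + " " + value + "\n",
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    void replay(Map<String, String> cache) throws IOException {
        List<String> lines = Files.readAllLines(logFile);
        for (String line : lines) {
            String[] parts = line.split(" ", 3);
            switch (parts[0]) {
                case "PUT" -> cache.put(parts[1], parts[2]);
                case "DELETE" -> cache.remove(parts[1]);
                // "READ" entries matter only for frequency-based eviction
                default -> { }
            }
        }
    }
}
```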
3. Copy
Instead of logging or snapshotting, you can try to just copy your cache into a DB or some other storage. “Just” is easy to say, but hard to do. You can create an index by key in the table, but what do you do with the value? It could be a complicated object, and it’s hard to design an appropriate table for it or to serialize such a value completely.
4. Combined method (snapshot + log)
This simply combines the first and second approaches: you log all operations and occasionally make a snapshot. After making a snapshot, you can discard all instructions in the log file up to the snapshot’s timestamp. This saves space on the hard drive, accelerates the restore, and still lets you get the latest cache version after a failure.
And now — what about availability?
Let’s suppose we have two nodes, each with its own hash table, and we use consistent hashing in our cache system. What happens if node one goes down? Its requests will be sent directly to the database. Not cool, is it?
We can replicate all the data to another server (a mirror). If node one goes down, the client just keeps up the conversation with replica one. This method also gives better read performance, because the client communicates with both nodes (the node and its replica) and requests can be distributed between them. But keeping the data consistent between a node and its replica is problematic.
In the next approach, we can’t send requests to the slave node; communication happens only between the client and the master node. If the master goes down, the client starts dealing with the slave node.
Let’s implement the three basic policies in Java
Imagine we have an application that stores cars and people and connects them together, so that we know the car owners. We can just send a car number and get info about the car and its owner.
So, create the model classes: the Car (with the owner person’s id), the Person, and the CarOwner (to respond with the full person info, not just an id from the Car class).
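A minimal sketch of how these classes might look (the exact fields and accessors are assumptions):

```java
// Model sketches; each class lives in its own file in a real project.
class Car {
    private String number;   // the plate number, used later as the cache key
    private Long ownerId;    // reference to the owning Person

    String getNumber() { return number; }
    Long getOwnerId()  { return ownerId; }
    void setOwnerId(Long ownerId) { this.ownerId = ownerId; }
    // constructor and remaining accessors omitted
}

class Person {
    private Long id;
    private String name;

    Long getId() { return id; }
    // constructor and remaining accessors omitted
}

class CarOwner {
    private final Car car;
    private final Person person; // full person info, not just the id

    CarOwner(Car car, Person person) {
        this.car = car;
        this.person = person;
    }

    Car getCar() { return car; }
}
```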
Create DAO classes for them with an “embedded DB” and some methods:
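A DAO sketch with an in-memory map standing in for the embedded DB (the real project may differ):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.springframework.stereotype.Repository;

// An in-memory "embedded DB" for cars; the PersonDao is analogous.
@Repository
class CarDao {
    private final Map<String, Car> db = new ConcurrentHashMap<>();

    Car getByNumber(String number) {
        return db.get(number);
    }

    void save(Car car) {
        db.put(car.getNumber(), car);
    }
}
```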
Create services and add a delay to imitate a very complicated operation and a poor connection:
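A CarService sketch with such a delay (two seconds per call, matching the timings in the tests below; the delayed save is an assumption based on those timings):

```java
import org.springframework.stereotype.Service;

// A service sketch with an artificial delay imitating a slow DB call.
@Service
class CarService {
    private final CarDao carDao;

    CarService(CarDao carDao) {
        this.carDao = carDao;
    }

    Car getByNumber(String number) {
        delay();                        // imitate a heavy query
        return carDao.getByNumber(number);
    }

    void save(Car car) {
        delay();                        // the write is assumed slow as well
        carDao.save(car);
    }

    private void delay() {
        try {
            Thread.sleep(2000);         // two seconds per "DB" operation
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```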
Add the delay to the PersonService class in the getByName() and getById() methods too.
Why is the delay present only in these methods, and not in every method that touches the DB? I’ll explain that further on.
Create controllers for the car and the person:
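A controller sketch for cars (the person controller is analogous; the mapping paths are assumptions):

```java
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

// A simple REST controller exposing the car service.
@RestController
@RequestMapping("/car")
class CarController {
    private final CarService carService;

    CarController(CarService carService) {
        this.carService = carService;
    }

    @GetMapping("/{number}")
    Car getByNumber(@PathVariable String number) {
        return carService.getByNumber(number);
    }
}
```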
So, we want to cache two operations:
1. get a car and its owner info by car number (read operation)
2. connect the car and person (write operation)
Let’s create the CommonController:
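A sketch of what it might look like (the paths and parameter names are assumptions):

```java
import org.springframework.web.bind.annotation.*;

// The common controller delegates to whichever CommonService profile is active.
@RestController
@RequestMapping("/common")
class CommonController {
    private final CommonService commonService;

    CommonController(CommonService commonService) {
        this.commonService = commonService;
    }

    @GetMapping("/owner/{carNumber}")
    CarOwner getOwnerByNumber(@PathVariable String carNumber) {
        return commonService.getOwnerByNumber(carNumber);
    }

    @PostMapping("/owner")
    void setOwnerToCar(@RequestParam String carNumber,
                       @RequestParam String personName) {
        commonService.setOwnerToCar(carNumber, personName);
    }
}
```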
And CommonService interface:
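A sketch:

```java
// The contract for the two operations we want to cache.
interface CommonService {
    CarOwner getOwnerByNumber(String carNumber);
    void setOwnerToCar(String carNumber, String personName);
}
```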
then its implementation without a cache:
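A sketch of the no-cache implementation (PersonService is assumed to mirror CarService):

```java
import org.springframework.context.annotation.Profile;
import org.springframework.stereotype.Service;

// Every call goes straight to the services (and through their delays to the "DB").
@Service
@Profile("CommonServiceWithoutCache")
class CommonServiceWithoutCache implements CommonService {
    private final CarService carService;
    private final PersonService personService;

    CommonServiceWithoutCache(CarService carService, PersonService personService) {
        this.carService = carService;
        this.personService = personService;
    }

    @Override
    public CarOwner getOwnerByNumber(String carNumber) {
        Car car = carService.getByNumber(carNumber);              // ~2 s
        Person person = personService.getById(car.getOwnerId());  // ~2 s
        return new CarOwner(car, person);
    }

    @Override
    public void setOwnerToCar(String carNumber, String personName) {
        Car car = carService.getByNumber(carNumber);              // ~2 s
        Person person = personService.getByName(personName);      // ~2 s
        car.setOwnerId(person.getId());
        carService.save(car);                                     // ~2 s
    }
}
```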
Getting a car and its owner’s info consists of two operations with the DB:
1. get car by number
2. get person by id
Both could be cached if we create a hash map with key (String carNumber) and value (CarOwner carAndOwnerInfo). That’s why we added the delay to exactly these two operations: to show how a cache can improve your application’s latency. getOwnerByNumber completes in four seconds without the cache, or in about zero seconds with it.
Setting an owner to a car consists of three operations with the DB:
1. get car by number (could be retrieved from the cache)
2. get the person by name (couldn’t be)
3. set the owner to the car (could be written to the cache first with the write-back approach)
setOwnerToCar completes in six seconds without the cache, or in four or two seconds with it (depending on the write data policy).
Finally, let’s create a cache!
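A minimal sketch of the cache itself, a thread-safe map from car number to CarOwner:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// The cache: a thread-safe hash map from car number to CarOwner info.
class CarOwnerCache {
    private final Map<String, CarOwner> map = new ConcurrentHashMap<>();

    CarOwner get(String carNumber) {
        return map.get(carNumber);
    }

    void put(String carNumber, CarOwner value) {
        map.put(carNumber, value);
    }

    void invalidate(String carNumber) {
        map.remove(carNumber);
    }
}
```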
Now let’s create each option:
1. Write-around
To get the car & owner, we check whether the value exists in the cache; if it doesn’t, we get the value from the DB and write it to the cache.
To set the owner, we check for the car in the cache (as an optimization), get the person from the DB, set the owner, and then invalidate the cache record.
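A write-around sketch over that cache (the fields are kept protected so the next two policies can reuse the read path):

```java
import org.springframework.context.annotation.Profile;
import org.springframework.stereotype.Service;

// Write-around: reads populate the cache, writes go to the DB and invalidate it.
@Service
@Profile("CacheWriteAroundService")
class CacheWriteAroundService implements CommonService {
    protected final CarOwnerCache cache = new CarOwnerCache();
    protected final CarService carService;
    protected final PersonService personService;

    CacheWriteAroundService(CarService carService, PersonService personService) {
        this.carService = carService;
        this.personService = personService;
    }

    @Override
    public CarOwner getOwnerByNumber(String carNumber) {
        CarOwner cached = cache.get(carNumber);
        if (cached != null) {
            return cached;                            // hit: no DB calls at all
        }
        Car car = carService.getByNumber(carNumber);  // miss: go to the DB
        Person person = personService.getById(car.getOwnerId());
        CarOwner result = new CarOwner(car, person);
        cache.put(carNumber, result);                 // populate on read
        return result;
    }

    @Override
    public void setOwnerToCar(String carNumber, String personName) {
        Car car = findCar(carNumber);                 // a cached car saves ~2 s
        Person person = personService.getByName(personName);
        car.setOwnerId(person.getId());
        carService.save(car);                         // the write hits the DB only
        cache.invalidate(carNumber);                  // keep cache-DB consistency
    }

    protected Car findCar(String carNumber) {
        CarOwner cached = cache.get(carNumber);
        return cached != null ? cached.getCar() : carService.getByNumber(carNumber);
    }
}
```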
2. Write-through
The get operation is the same.
The set operation puts the new connection between the car & person into the DB and then into the cache.
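A sketch, reusing the write-around class and overriding only the set operation:

```java
import org.springframework.context.annotation.Profile;
import org.springframework.stereotype.Service;

// Write-through: same read path; the write goes to the DB and then to the cache,
// and we respond only after both are updated.
@Service
@Profile("CacheWriteThroughService")
class CacheWriteThroughService extends CacheWriteAroundService {
    CacheWriteThroughService(CarService carService, PersonService personService) {
        super(carService, personService);
    }

    @Override
    public void setOwnerToCar(String carNumber, String personName) {
        Car car = findCar(carNumber);
        Person person = personService.getByName(personName);
        car.setOwnerId(person.getId());
        carService.save(car);                            // DB first, and we wait
        cache.put(carNumber, new CarOwner(car, person)); // then the cache
    }
}
```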
3. Write-back
The get operation is again the same.
The set operation uses both optimizations: it gets the car from the cache and sets the owner in the cache first, and only then (after sending the response) writes it to the DB asynchronously.
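A sketch, again overriding only the set operation:

```java
import java.util.concurrent.CompletableFuture;
import org.springframework.context.annotation.Profile;
import org.springframework.stereotype.Service;

// Write-back: the cache is updated first and the response is sent immediately;
// the DB write continues in the background (and may be lost on a crash).
@Service
@Profile("CacheWriteBackService")
class CacheWriteBackService extends CacheWriteAroundService {
    CacheWriteBackService(CarService carService, PersonService personService) {
        super(carService, personService);
    }

    @Override
    public void setOwnerToCar(String carNumber, String personName) {
        Car car = findCar(carNumber);
        Person person = personService.getByName(personName);
        car.setOwnerId(person.getId());
        cache.put(carNumber, new CarOwner(car, person));        // cache first
        CompletableFuture.runAsync(() -> carService.save(car)); // async DB write
    }
}
```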
How can we check the improvements? Create a test!
What do we want to check? Write time, read time, and also read after write and write after read, because each write or read operation may fill the cache, and we can use that for better performance later.
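A sketch of one such timing test (the car number and person name are made up):

```java
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;

// Measures read-after-write time against whichever profile is active.
@SpringBootTest
class CacheTests {

    @Autowired
    private CommonService commonService;

    @Test
    void readAfterWriteTime() {
        commonService.setOwnerToCar("A123BC", "Alice");   // write first...
        long start = System.nanoTime();
        commonService.getOwnerByNumber("A123BC");         // ...then read it back
        System.out.println("read after write time: "
                + (System.nanoTime() - start) / 1_000_000_000.0);
    }
}
```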
Run the tests and look at the results.
Choose the profile in application.properties and run the tests:
mvn test -Dtest=CacheTests
or, in your favorite IDE (IntelliJ IDEA, say), just click run inside the CacheTests class.
Here are the results:
spring.profiles.active=CommonServiceWithoutCache
read time: 24.014894472
write time: 18.0088929550
read after write time: 24.017792753
write after read time: 18.008320701
This is the worst execution time, because every operation with the DB takes two seconds.
spring.profiles.active=CacheWriteAroundService
read time: 12.011299223
write time: 18.011621115
read after write time: 12.014920908
write after read time: 12.005370793
It’s better for reads. Write-after-read operations are faster thanks to the “look up cars in the cache” optimization.
spring.profiles.active=CacheWriteThroughService
read time: 12.012618517
write time: 18.011820782
read after write time: 0.01078917
write after read time: 12.006140848
It’s the best execution time for read-after-write operations, because all recently written data is in the cache (unlike with the write-around policy).
spring.profiles.active=CacheWriteBackService
read time: 12.011902266
write time: 12.011948607
read after write time: 0.01073972
write after read time: 6.005554598
This one has the best performance of all, but you could lose your data if your server suddenly goes down.
Conclusions.
Now you know how to implement a cache system, how to make it scalable and fault-tolerant with high availability, and which write policy to choose for better performance.
You can look at the full code in my GitHub project.