What Cache Strategy To Use

What Is a Cache?

A cache is temporary storage used to speed up performance. Sometimes we have data that is used frequently. For example, a function needs an interest rate based on the loan period (tenure). In the database, you might have a table like this (simplified):

table loan_interests, with columns tenure_month and interest_rate

If your logic needs to compute the total loan with this formula:

total_loan = principal_amount * (100 + interest_rate) / 100

where principal_amount is user input, then to get interest_rate you need to query the database:

SELECT interest_rate FROM loan_interests WHERE tenure_month = x

A database operation is expensive. You need to take a connection from the pool (assuming you use a database connection pool!) or, worse, create a new connection. Then there is network overhead. Those costs might be just a few milliseconds (for example, 10 ms). But if you need to process 200k records, the overhead for database queries alone will be at least 2 million (10 × 200k) milliseconds, which is roughly 33 minutes! And that is only for getting the interest rate, not the application logic itself.

This is where a cache can help. Instead of querying the database, put the data in the cache. A cache is usually a key-value pair, so in the sample case above, the key is tenure_month and the value is interest_rate.

The common cache flow is (in sequence):

  1. Find interest_rate (value) in the cache, based on tenure_month (key).
  2. If the key does not exist in the cache, query the database.
  3. Put the key and value from step #2 into the cache, so the next logic that uses the same tenure_month (key) will get the data from the cache (step #1).
  4. Expire (remove) the key after a certain time. This is usually set through the cache config, or pre-defined when setting the key (step #3).
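The flow above is often called the cache-aside pattern. Here is a minimal sketch in Java; the InterestRateCache class and the hard-coded DATABASE map are hypothetical stand-ins for the real loan_interests query, and expiry (step #4) is omitted for brevity:

```java
import java.math.BigDecimal;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class InterestRateCache {

    // hypothetical stand-in for the loan_interests table
    private static final Map<Integer, BigDecimal> DATABASE =
            Map.of(12, new BigDecimal("5.5"), 24, new BigDecimal("6.0"));

    private final Map<Integer, BigDecimal> cache = new ConcurrentHashMap<>();

    public BigDecimal findInterestRate(int tenureMonth) {
        // step 1: look in the cache first
        BigDecimal cached = cache.get(tenureMonth);
        if (cached != null) {
            return cached;
        }
        // step 2: cache miss, so query the "database"
        BigDecimal fromDb = DATABASE.get(tenureMonth);
        // step 3: store the result so the next call with the same key hits the cache
        if (fromDb != null) {
            cache.put(tenureMonth, fromDb);
        }
        return fromDb;
    }
}
```

Note that ConcurrentHashMap.computeIfAbsent can express steps #1–#3 in a single call, at the cost of holding a lock on the bucket while the loader runs.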

Cache Examples

When the word 'cache' is mentioned, some people immediately think of Redis. That is not wrong, but Redis is just one cache product (a distributed cache, to be exact). Redis can be helpful, but depending on the case, it might not be the most suitable choice.

There are application caches and distributed caches (e.g. Redis). As stated above, a cache is basically a key-value pair. You find a value based on a key, and when the key does not exist, you query the original data store and put the key-value pair into the cache.

Every modern programming language already has a data structure for key-value pairs: Java has Map, Python has dictionary, and C# also calls it a dictionary. This should be the fastest-performing cache, but it is very primitive. It will eat up memory, and you need to remove keys manually, while a proper cache will usually remove keys automatically after a configured period. Still, this is the fastest option, since it lives locally in the application memory itself.

Besides the primitive map (dictionary), there are in-memory caches whose backing implementation might be a primitive map, but which add cache capabilities (auto expiry, etc.). In Java, this is something like Caffeine.
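As a sketch of what such a library adds over a bare map (assuming the Caffeine 3.x API; the class name and values here are made up):

```java
import java.time.Duration;

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

public class CaffeineConfigExample {
    public static void main(String[] args) {
        // auto expiry and a capped key count: the features a primitive map lacks
        Cache<Integer, Double> cache = Caffeine.newBuilder()
                .expireAfterWrite(Duration.ofMinutes(30)) // remove entries 30 minutes after write
                .maximumSize(10_000)                      // evict entries once there are more than 10k keys
                .build();

        cache.put(12, 5.5);
        System.out.println(cache.getIfPresent(12)); // the cached rate, or null on a miss / after expiry
    }
}
```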

Then, if your application is a distributed microservice that needs to cache a response (JSON) across multiple deployment nodes, use something like Redis. Also, if your cache can be very large, so large that it would consume too much memory on a local application node, use Redis.

Cache is key-value!

Note one thing: a cache is a key-value pair, so don't use an array or list as a cache. It seemingly acts as a cache, but it is not one. From the sample above:

  1. Load all the table rows into an array / list.
  2. When the application logic requires interest_rate, iterate through the array / list to find the matching data, and return it.

Even though the array / list lives locally in memory, the time complexity (Big O) is different. Using a Java Map, the complexity is O(1). Using a list with n elements, finding a matching element has a complexity of O(n). That's a lot!
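To make the Big-O difference concrete, here is a small illustrative sketch (the class and method names are mine, not from the benchmark code): both lookups return the same value, but the map does a single hash lookup while the list scans elements until it finds a match.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LookupComparison {

    // O(1): a single hash lookup, independent of the number of entries
    static Double mapLookup(Map<Integer, Double> cache, int tenureMonth) {
        return cache.get(tenureMonth);
    }

    // O(n): scans the list until a row with a matching tenure is found
    static Double listLookup(List<double[]> rows, int tenureMonth) {
        for (double[] row : rows) {
            if ((int) row[0] == tenureMonth) {
                return row[1];
            }
        }
        return null;
    }

    public static void main(String[] args) {
        int n = 100_000;
        Map<Integer, Double> map = new HashMap<>();
        List<double[]> list = new ArrayList<>(); // each row: { tenure, rate }
        for (int i = 0; i < n; i++) {
            map.put(i, i * 0.01);
            list.add(new double[] { i, i * 0.01 });
        }

        // worst case for the list: the match is the last element
        long t0 = System.nanoTime();
        Double fromMap = mapLookup(map, n - 1);
        long mapNanos = System.nanoTime() - t0;

        t0 = System.nanoTime();
        Double fromList = listLookup(list, n - 1);
        long listNanos = System.nanoTime() - t0;

        System.out.println("map: " + mapNanos + " ns, list: " + listNanos
                + " ns, same value: " + fromMap.equals(fromList));
    }
}
```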

Performance Benchmark

Theoretically, the performance should be (fastest to slowest):

  1. in-memory primitive map / dictionary
  2. in-memory cache with primitive map / dictionary as implementation
  3. redis (due to network overhead)
  4. list. OK, this is the WRONG implementation, but just to give a view, I will include it in the benchmark.

The benchmark has the following criteria:

a. using 500k cache entries

b. using a list with 500k data items to check against the cache, all of which exist in the cache. We will just count how many items in this list exist in the cache (it should be 100%, i.e. 500k)

c. runs on the same machine: Java using Amazon Corretto 17, on a quad-core Intel i7 CPU at 1.8 GHz per core, with 16 GB RAM

d. for each benchmark iteration, the data is cleared out (e.g. removing data from the Docker Redis)

Benchmark Result

If you are just interested in the result, here it is, using 5 runs for each cache scenario.

Summary:

  • ConcurrentHashMap (primitive map) is the fastest, taking only 0.17 seconds to look up the data. If you need fast operation on bulk data with fewer features (no automatic key removal), for example where the cache lives as a local variable (or something that can be garbage collected by the JVM), ConcurrentHashMap is best.
  • Caffeine (backed by a map) is the second fastest, at around 0.35 seconds to look up the data. Use this if you need cache features (e.g. automatic key removal, limiting the number of keys) and the cache lives on the local application node but needs to persist for a long time, not as a local method variable.
  • Redis took 362 seconds, which is around 6 minutes. This seems to be the effect of JSON serialization / deserialization, so Redis is not always the best choice. For caching API responses behind an API gateway, or when the app deployment is distributed among machines (VMs / K8s worker nodes), Redis is a good fit, though.
  • The list? It took almost 2 hours for just one iteration, so I did not continue with this wrong case. And you shouldn't use this kind of implementation either.

Codes

The value class to be saved. Given the example above, in a real application this would be a JPA entity class.

package com.course.kafka;

import java.time.LocalDate;
import java.util.Objects;

public class MySomething {

    private String name;
    private String address;
    private LocalDate birthDate;

    public MySomething(String name, String address, LocalDate birthDate) {
        this.name = name;
        this.address = address;
        this.birthDate = birthDate;
    }

    @Override
    public int hashCode() {
        return Objects.hash(address, birthDate, name);
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj)
            return true;
        if (obj == null)
            return false;
        if (getClass() != obj.getClass())
            return false;
        MySomething other = (MySomething) obj;
        return Objects.equals(address, other.address) && Objects.equals(birthDate, other.birthDate)
                && Objects.equals(name, other.name);
    }

    @Override
    public String toString() {
        return "MySomething [name=" + name + ", address=" + address + ", birthDate=" + birthDate + "]";
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public String getAddress() {
        return address;
    }

    public void setAddress(String address) {
        this.address = address;
    }

    public LocalDate getBirthDate() {
        return birthDate;
    }

    public void setBirthDate(LocalDate birthDate) {
        this.birthDate = birthDate;
    }
}

This is the code for the primitive cache using ConcurrentHashMap.

package com.course.kafka;

import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.IntStream;

public class MapCache {

    public static void main(String[] args) {
        var cacheNumbers = 500_000;
        var dataAgainstCache = new ArrayList<MySomething>();
        var cache = new ConcurrentHashMap<String, MySomething>();
        var start = 0L;
        var end = 0L;
        var counterExistsInCache = new AtomicInteger();
        var counterNotExistsInCache = new AtomicInteger();

        IntStream.range(0, cacheNumbers).forEach(i -> {
            var name = "name-" + i;
            dataAgainstCache.add(new MySomething(name, "Some dummy address " + i, LocalDate.now()));
        });
        Collections.shuffle(dataAgainstCache);

        // populate cache
        start = System.currentTimeMillis();
        IntStream.range(0, cacheNumbers).forEach(i -> {
            var name = "name-" + i;
            var mySomething = new MySomething(name, "address-" + i, LocalDate.now());
            cache.put(name, mySomething);
        });
        end = System.currentTimeMillis();
        System.out.println("Populating cache took " + (end - start) + " ms");

        // validating cache
        start = System.currentTimeMillis();
        dataAgainstCache.forEach(d -> {
            var cacheResult = Optional.ofNullable(cache.get(d.getName()));
            if (cacheResult.isEmpty()) {
                counterNotExistsInCache.incrementAndGet();
            } else {
                counterExistsInCache.incrementAndGet();
            }
        });
        end = System.currentTimeMillis();
        System.out.println("Validating against cache took " + (end - start) + " ms");
        System.out.println("Exists : " + counterExistsInCache.get() + ", not exists : " + counterNotExistsInCache.get()
                + " total : " + (counterExistsInCache.get() + counterNotExistsInCache.get()));
    }
}

This is the code for the in-memory cache using Caffeine, where data expires after one hour. I use Caffeine 3.0.5 in build.gradle:

implementation 'com.github.ben-manes.caffeine:caffeine:3.0.5'

This is the caffeine code.

package com.course.kafka;

import java.time.Duration;
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.IntStream;

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

public class CaffeineCache {

    public static void main(String[] args) {
        var cacheNumbers = 500_000;
        var dataAgainstCache = new ArrayList<MySomething>();
        Cache<String, MySomething> cache = Caffeine.newBuilder().expireAfterAccess(Duration.ofHours(1)).build();
        var start = 0L;
        var end = 0L;
        var counterExistsInCache = new AtomicInteger();
        var counterNotExistsInCache = new AtomicInteger();

        IntStream.range(0, cacheNumbers).forEach(i -> {
            var name = "name-" + i;
            dataAgainstCache.add(new MySomething(name, "Some dummy address " + i, LocalDate.now()));
        });
        Collections.shuffle(dataAgainstCache);

        // populate cache
        start = System.currentTimeMillis();
        IntStream.range(0, cacheNumbers).forEach(i -> {
            var name = "name-" + i;
            var mySomething = new MySomething(name, "address-" + i, LocalDate.now());
            cache.put(name, mySomething);
        });
        end = System.currentTimeMillis();
        System.out.println("Populating cache took " + (end - start) + " ms");

        // validating cache
        start = System.currentTimeMillis();
        dataAgainstCache.forEach(d -> {
            var cacheResult = Optional.ofNullable(cache.getIfPresent(d.getName()));
            if (cacheResult.isEmpty()) {
                counterNotExistsInCache.incrementAndGet();
            } else {
                counterExistsInCache.incrementAndGet();
            }
        });
        end = System.currentTimeMillis();
        System.out.println("Validating against cache took " + (end - start) + " ms");
        System.out.println("Exists : " + counterExistsInCache.get() + ", not exists : " + counterNotExistsInCache.get()
                + " total : " + (counterExistsInCache.get() + counterNotExistsInCache.get()));
    }
}

This is the code using Redis on localhost (Docker based) and the Java Jedis library. Note that Redis cannot store a Java object directly, so we need to serialize / deserialize it. In this case, I serialize it into JSON using the Jackson library.

In build.gradle:

implementation 'redis.clients:jedis:4.1.1'
implementation 'com.fasterxml.jackson.core:jackson-databind:2.13.1'
implementation 'com.fasterxml.jackson.core:jackson-annotations:2.13.1'

This is the benchmark implementation.

package com.course.kafka;

import java.time.Duration;
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.IntStream;

import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;

import redis.clients.jedis.Jedis;

public class RedisCache {

    public static void main(String[] args) {
        var cacheNumbers = 500_000;
        var dataAgainstCache = new ArrayList<MySomething>();
        var cache = new Jedis("localhost", 6379);
        var start = 0L;
        var end = 0L;
        var counterExistsInCache = new AtomicInteger();
        var counterNotExistsInCache = new AtomicInteger();
        var expiryTime = Duration.ofHours(1).getSeconds();
        var om = new ObjectMapper();

        IntStream.range(0, cacheNumbers).forEach(i -> {
            var name = "name-" + i;
            dataAgainstCache.add(new MySomething(name, "Some dummy address " + i, LocalDate.now()));
        });
        // Collections.shuffle(dataAgainstCache);

        // populate cache
        start = System.currentTimeMillis();
        IntStream.range(0, cacheNumbers).forEach(i -> {
            var name = "name-" + i;
            var mySomething = new MySomething(name, "address-" + i, LocalDate.now());
            try {
                cache.setex(name, expiryTime, om.writeValueAsString(mySomething));
            } catch (JsonProcessingException e) {
                // ignored for benchmark purposes
            }
        });
        end = System.currentTimeMillis();
        System.out.println("Populating cache took " + (end - start) + " ms");

        // validating cache
        start = System.currentTimeMillis();
        dataAgainstCache.forEach(d -> {
            var cacheResult = Optional.ofNullable(cache.get(d.getName()));
            if (cacheResult.isEmpty()) {
                counterNotExistsInCache.incrementAndGet();
            } else {
                counterExistsInCache.incrementAndGet();
            }
        });
        end = System.currentTimeMillis();
        System.out.println("Validating against cache took " + (end - start) + " ms");
        System.out.println("Exists : " + counterExistsInCache.get() + ", not exists : " + counterNotExistsInCache.get()
                + " total : " + (counterExistsInCache.get() + counterNotExistsInCache.get()));
        cache.close();
    }
}

And this is the WRONG implementation, using an ArrayList. Never do this.

package com.course.kafka;

import java.time.LocalDate;
import java.util.ArrayList;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class ListCache {

    public static void main(String[] args) {
        var cacheNumbers = 500_000;
        var dataAgainstCache = new ArrayList<MySomething>();
        var cache = new ArrayList<MySomething>();
        var start = 0L;
        var end = 0L;
        var counterExistsInCache = new AtomicInteger();
        var counterNotExistsInCache = new AtomicInteger();

        IntStream.range(0, cacheNumbers * 2).forEach(i -> {
            var name = "name-" + ThreadLocalRandom.current().nextInt(cacheNumbers);
            dataAgainstCache.add(new MySomething(name, "Some dummy address " + i, LocalDate.now()));
        });

        // populate cache
        start = System.currentTimeMillis();
        IntStream.range(0, cacheNumbers).forEach(i -> {
            var name = "name-" + i;
            var mySomething = new MySomething(name, "address-" + i, LocalDate.now());
            cache.add(mySomething);
        });
        end = System.currentTimeMillis();
        System.out.println("Populating cache took " + (end - start) + " ms");

        // validating cache
        start = System.currentTimeMillis();
        dataAgainstCache.forEach(d -> {
            // O(n) scan for every element: this is why a list is the wrong cache
            var cacheResult = cache.stream().filter(f -> f.getName().equals(d.getName())).collect(Collectors.toList());
            if (cacheResult.isEmpty()) {
                counterNotExistsInCache.incrementAndGet();
            } else {
                counterExistsInCache.incrementAndGet();
            }
        });
        end = System.currentTimeMillis();
        System.out.println("Validating against cache took " + (end - start) + " ms");
        System.out.println("Exists : " + counterExistsInCache.get() + ", not exists : " + counterNotExistsInCache.get()
                + " total : " + (counterExistsInCache.get() + counterNotExistsInCache.get()));
    }
}


Official technology blog of BFI Agile Thought Community (BATC)

Timotius Pamungkas


A father, a husband, a tech enthusiast. In that particular order.
