Redis Fragmentation and Optimization

Abhishek Sinha
Published in Doubtnut · 4 min read · Feb 15, 2021

Redis is considered one of the best in-memory caching solutions. It's often placed in front of the database to cache the output of frequently used queries, accessed through containers. A container is basically a wrapper around a data source that abstracts it from the rest of the application: the caller doesn't know whether the data comes from the database, the cache, or some other source.
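The container idea above is essentially the cache-aside pattern. A minimal sketch, where the names (`get_user`, `fetch_from_db`) are hypothetical and a plain dict stands in for Redis:

```python
# Cache-aside "container" sketch: the caller only sees get_user and
# never knows whether the value came from the cache or the database.
cache = {}

def fetch_from_db(user_id):
    # placeholder for a real database query
    return {"id": user_id, "name": f"user-{user_id}"}

def get_user(user_id):
    key = f"user:{user_id}"
    if key in cache:                  # cache hit
        return cache[key]
    value = fetch_from_db(user_id)    # cache miss: fall through to the DB
    cache[key] = value                # populate the cache for next time
    return value
```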

Data caching is mainly done in 2 ways:

  • Cache each table independently, keyed by some id: Individual tables are cached independently, and if multiple tables need to be joined, each cache is read separately and the join is computed in the application layer. This is a complex approach and shifts much of the load from the database to the application.
  • Cache complex query output keyed by its input parameters: This is simpler, but keeps more load on the database, since the complex joins still run there on every cache miss.
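For the second approach, the cache key must be a deterministic function of the query's input parameters. One possible sketch (the key layout `q:<query>:<digest>` is just an illustrative convention):

```python
import hashlib
import json

def query_cache_key(query_name, params):
    """Build a deterministic cache key from a query's input parameters."""
    # sort_keys makes the key stable regardless of argument order
    blob = json.dumps(params, sort_keys=True)
    digest = hashlib.sha1(blob.encode()).hexdigest()[:12]
    return f"q:{query_name}:{digest}"
```

The same parameters always map to the same key, so repeated calls with identical inputs hit the cached query output.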

Whenever data needs to be cached in Redis, it must be serialized to a string value. The most common Redis data types used for caching are strings and hash tables. The two work quite differently in terms of storage.
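To see the payload difference, compare serializing an object to one JSON string versus splitting it into hash fields. This toy comparison counts only payload bytes and ignores Redis's own per-key and per-field overhead:

```python
import json

obj = {"id": 42, "name": "alice", "score": 7}

# Stored as a single string value (e.g. via SET): the whole JSON blob
as_string = json.dumps(obj)

# Stored as a hash (e.g. via HSET): one field per attribute, no JSON syntax
as_hash = {k: str(v) for k, v in obj.items()}

string_bytes = len(as_string)
hash_bytes = sum(len(k) + len(v) for k, v in as_hash.items())
# the string form carries extra braces, quotes, colons and commas
```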

Let’s first understand how data is stored in memory. Whenever a process needs to store data, it first requests a page from the page pool. A page is a contiguous block of memory that must be occupied to store the data; it’s the smallest unit of managed memory. Page sizes vary from 4 KB to 64 KB depending on the operating system and architecture.

Suppose the page size is 32 KB and some data of size 15 KB needs to be stored. Unfortunately, it will occupy the full 32 KB page, leaving 17 KB unused. The page is assigned to that particular PID and the mapping is stored in the page table. If the same process then wants to store another 5 KB of data, it can reuse the same page, now leaving 12 KB unused. Whenever data in a page is deleted, the allocator checks whether the process still holds other data in the same page before releasing the page back to the pool.
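The arithmetic from that example, written out:

```python
KB = 1024
PAGE = 32 * KB        # page size used in the example above

first = 15 * KB
unused_after_first = PAGE - first             # 17 KB wasted in the page

second = 5 * KB
unused_after_second = PAGE - first - second   # 12 KB still unused
```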

Picture 1

Whenever we cache data in Redis, it too requests pages from the page pool. This is where fragmentation, a big problem for Redis, comes in. Suppose we store 5 strings of 10 KB each, occupying 2 pages in total; call them pages A and B. Page A stores strings 1, 2 and 3 (3 x 10 KB) plus part of string 4 (2 KB). Page B stores the rest of string 4 (8 KB) and string 5 (10 KB), leaving 14 KB empty. Now we add string 6. It goes into page B, which already has enough room, leaving 4 KB unused. Next we delete strings 1, 3 and 5, leaving only 3 strings totalling 30 KB, which would easily fit in a single page — yet due to fragmentation, neither of the 2 pages is freed. Next we add a string of size 15 KB. Since neither page has enough contiguous free memory, a 3rd page must be requested, effectively causing more and more fragmentation.
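The walkthrough above can be checked with a simplified model of the free holes left in pages A and B after the deletions:

```python
KB = 1024
PAGE = 32 * KB

# Free holes after deleting strings 1, 3 and 5 (simplified model):
#   page A: two 10 KB holes where strings 1 and 3 used to be
#   page B: a 10 KB hole where string 5 was, plus the 4 KB tail
holes = [10 * KB, 10 * KB, 10 * KB, 4 * KB]

live_data = 3 * 10 * KB       # strings 2, 4 and 6 are still live
allocated = 2 * PAGE          # but both pages are still held
fragmentation_ratio = allocated / live_data   # ~2.13

new_string = 15 * KB
fits_in_existing_pages = any(h >= new_string for h in holes)
# False: no hole is large enough, so a 3rd page is needed
```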

Picture 2

Redis internally has a way of defragmenting memory when the fragmentation ratio exceeds a threshold, but it’s a costly process and may cause increased latency while it runs. The issue multiplies when a very large Redis cache holds hash tables with a large number of small string fields. Such hash tables occupy many pages due to the large number of fields and won’t release memory even when other keys are deleted, resulting in very high fragmentation. In extreme cases a cache can occupy 4x the memory it actually needs.
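That internal mechanism is active defragmentation, available since Redis 4.0 on jemalloc builds (it is off by default). It is controlled through `redis.conf`; the thresholds below are the documented defaults, shown here for illustration:

```
activedefrag yes
active-defrag-ignore-bytes 100mb
active-defrag-threshold-lower 10
active-defrag-threshold-upper 100
```

You can watch the effect via the `mem_fragmentation_ratio` field reported by `INFO memory`, which is roughly resident memory divided by logically used memory.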

So what’s the solution? It depends on the use case, but if you are choosing hash tables over strings, the most effective way to reduce fragmentation is to use many small hash tables instead of one large one. This is a pretty common mistake, and people end up paying more on infra for it. Yes, hash tables are more memory-optimized than strings, but they come with their own nuances, such as high fragmentation. Another option is to choose the serialization mechanism carefully so that each string value occupies less memory; Protobuf can be more efficient than JSON. The longer the string, the higher the fragmentation.
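Splitting one huge hash into many small ones is usually done by hashing each field name into a bucket. A possible sketch (the bucket count and key layout are arbitrary choices, not a Redis convention):

```python
import hashlib

BUCKETS = 1024   # number of small hashes; tune for your data set

def bucketed_key(base, field):
    """Map a field of one huge hash onto one of many small hashes."""
    h = int(hashlib.md5(field.encode()).hexdigest(), 16)
    return f"{base}:{h % BUCKETS}"
```

Instead of `HSET users <id> <payload>` on one giant hash, you would do `HSET bucketed_key("users", id) <id> <payload>`, spreading the fields across 1024 small hashes. As a side benefit, hashes that stay small enough can be kept in Redis’s compact listpack/ziplist encoding.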

So when to choose a string or a hash table? It’s pretty simple. If you have an object, it’s usually better stored as a hash table, since the overall size of an object is pretty small in most scenarios. You also avoid the serialization/deserialization CPU cost and the space taken by the extra characters of the serialized form, and each field can be fetched independently, reducing network load as well. But general key-value pairs should be stored as strings, since you can then manage the TTL of each key independently.
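The network-load point is easy to see by comparing payload sizes. With a hash, fetching one field (e.g. `HGET user:7 name`) transfers only that field, while a string value forces the whole serialized object across the wire on every read:

```python
import json

user = {"id": 7, "name": "bob", "bio": "x" * 500}

# As one string value (SET user:7 <json>): every read transfers the blob
string_payload = len(json.dumps(user))

# As a hash (HSET user:7 ...): HGET user:7 name transfers only one field
single_field_payload = len(user["name"])
```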

Bonus: A similar thing is observed with databases, except there the fragmentation is on disk.
