How to efficiently store and query hundreds of String Int dictionaries on one machine

Evgeniy Zuykin
Agoda Engineering & Design


This article covers the evolution of the data structures we use when querying the historical part of Agoda's in-house time-series database, and showcases example implementations of several on-disk database indexes.

The Problem

During data ingestion in the real-time part of our database, we intern (i.e., issue a stable int identifier for) all incoming strings. Let's call the result of this process a StringInt dictionary. Later, when querying the historical part, we first need to determine which id was assigned to each specific string from the query in order to operate on the ingested data. Then, the results need to be converted back to strings.

Here is an example of a request:

{
  "metric": "whitefalcon.messages.consumed",
  "start": "2022-04-05 11:27:21",
  "end": "2022-04-05 11:29:21",
  "tags": {
    "dc": [ "HK", "SG" ],
    "cluster": [ "A", "B" ]
  },
  "groupBy": [ "messageType" ]
}

To intern the query, we need to do a reverse (String -> Int) lookup, meaning we need to find the int id for a specific string. After interning, the request will look as follows:

{
  "metric": 5962,
  "start": "2022-04-05 11:27:21",
  "end": "2022-04-05 11:29:21",
  "tags": {
    3: [ 12, 14 ],
    15: [ 52, 315 ]
  },
  "groupBy": [ 411 ]
}

After executing the query, we need to do a forward (Int -> String) lookup to get the actual strings back from the ints before we can serve the response to the client.

To process each query, we might need to combine the results of querying multiple partitions. Each partition has its own dictionary, since partitions are created at different times and by different ingesting instances.

Hence, we can formulate the problem as follows — how do we organize multiple dictionaries in our application so that both direct and reverse lookups are efficient?

Note: this problem could potentially be solved with KV libraries like RocksDB/LevelDB, SQLite, or even a separately deployed distributed system (although network latency would dramatically reduce overall performance), but today we will discuss only handcrafted solutions.

Throughout the article, the code examples will be written in Scala.

Let’s create a contract for our data structure:
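
Below is a possible shape for that contract, as a minimal sketch (the exact trait in the real code base may differ): direct lookup resolves an id to its string, and reverse lookup resolves a string to its id, or NullId when it is absent.

trait StringIntDictionary {
  val NullId: Int = -1              // returned when a string is not in the dictionary

  def direct(id: Int): String       // Int -> String lookup
  def reverse(string: String): Int  // String -> Int lookup, NullId if absent
}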

Since ids are issued sequentially starting from 0, the dictionary is stored as a string array, where each string's cell index is its id. On disk it is laid out as follows:

+-------------------+----------------+---------------+------
| number of entries | string1 length | string1 bytes | ....
+-------------------+----------------+---------------+------

The number of strings in each dictionary varies a lot depending on multiple factors (partition time, partition size, etc.), as does the content of the strings themselves. For simplicity, let's imagine all dictionaries contain strings of 20 alphanumeric symbols, with around 100k strings each, which sums up to ~2 MB of raw data on disk per dictionary.

Let’s define our constraints:

  • App instance size — 8 CPU cores + 16 GB RAM.
  • Workload — 1kk (one million) direct lookups and 3k reverse lookups per second.
  • Total number of dictionaries — ~10k.
  • ~1.5k dictionaries are required for 90% of queries.
  • All dictionaries are immutable.

In-memory hash-maps

Let's start with the straightforward approach — load the dictionary fully into memory from disk. This makes direct lookup fast, since we just need to access an array cell; for reverse lookup, we simply aggregate a Map[String, Int].
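
A minimal sketch of such an implementation (class and field names are illustrative; the original listing is not reproduced here):

class InMemoryDictionary(strings: Array[String]) {
  // eagerly built reverse index: String -> Int
  private val reverseMap: Map[String, Int] = strings.zipWithIndex.toMap

  def direct(id: Int): String = strings(id)                           // array access
  def reverse(string: String): Int = reverseMap.getOrElse(string, -1) // -1 = NullId
}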

For those who are unfamiliar with Scala strings.zipWithIndex.toMap does the following:

  1. It maps the string array ["hello", "world"] into the tuple sequence [("hello", 0), ("world", 1)]
  2. toMap function aggregates created sequence with indexes into Map[String, Int]
Flamegraph for dictionary creation

For each request, an additional array of 100k tuples plus the actual reverse map will be allocated. The reverse map will hold boxed integers, meaning 100k more objects (we could mitigate this by increasing the Integer cache size, but what if our dictionary had 5kk strings?). In total, it sums up to several hundred KB of allocations just to convert our string array into a queryable data structure.

We will keep our CPU busy creating the dictionary, and later the GC (Garbage Collector) will burn the rest of the CPU. And all of this is just for one dictionary. What if you need to read about 100 of them for each query?

We could improve this behavior by implementing the following changes:

  1. Use primitive collections, for example koloboke, to avoid boxing of primitives.
  2. Rewrite the fancy monad chain into a plain old for-loop that does not require excessive allocations (a sketch of both changes follows this list).
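
Here is a rough sketch of both changes, assuming the Koloboke library is on the classpath (its API is used here from memory and may need small adjustments): a plain loop builds an Object -> int map without intermediate tuples or boxed Integers.

import com.koloboke.collect.map.hash.{HashObjIntMap, HashObjIntMaps}

def buildReverseMap(strings: Array[String]): HashObjIntMap[String] = {
  val map = HashObjIntMaps.newMutableMap[String](strings.length)
  var i = 0
  while (i < strings.length) {
    map.put(strings(i), i)   // primitive int value: no boxing, no tuples
    i += 1
  }
  map
}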

Although these changes improve the CPU workload, they don't drastically change the picture. After seeing 80% of the flame graph busy with dictionary creation, the next idea is simple — let's cache the dictionaries. Sounds good: we will not burn our CPU, and we will use our RAM instead.

For caching, let's use an LRU policy, otherwise we will run out of memory sooner or later. Considering our app instance size, we can dedicate around 10% of it to the cache and start evicting the least recently used entries once the cache size exceeds 1.6 GB. This lets us keep the dictionaries used by most requests and evict the ones needed only by the least popular queries.

With the current implementation, each dictionary (under our constraints) takes around 9 MB (String instances, the raw strings array, the reverse-lookup hashmap). That means we can hold just a bit more than 150 dictionaries on each instance at a time. That's not enough — we will see a lot of repeated cache eviction, since we need many more dictionaries to serve most requests.

Luckily for us, we've observed an interesting pattern — you don't need all the strings of a dictionary on each request. Usually, you need fewer than a hundred distinct lookups.

It changes a lot. Now we can create our Map[String,Int] lazily, saving us ~60% of RAM compared to the previous solution.

Now we can re-implement our dictionary in the following way:
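
A sketch of such a lazy dictionary (names are illustrative): it keeps raw UTF-8 bytes instead of String instances and caches reverse-lookup results in a TrieMap.

import java.nio.charset.StandardCharsets.UTF_8
import java.util.Arrays
import scala.collection.concurrent.TrieMap

class LazyHeapDictionary(strings: Array[Array[Byte]]) {
  private val NullId = -1
  private val reverseCache = TrieMap.empty[String, Int]   // thread-safe, lock-free

  def direct(id: Int): String = new String(strings(id), UTF_8)

  def reverse(string: String): Int =
    reverseCache.getOrElseUpdate(string, bruteForce(string.getBytes(UTF_8)))

  // O(N) scan, executed at most once per distinct string thanks to the cache
  private def bruteForce(bytes: Array[Byte]): Int = {
    var i = 0
    while (i < strings.length) {
      if (Arrays.equals(strings(i), bytes)) return i
      i += 1
    }
    NullId
  }
}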

Here we see multiple things changed:

  1. The strings array no longer holds String instances; instead, we store the raw bytes read from disk, so we don't allocate memory for String objects.
  2. On reverse lookup, we use a brute-force search to find the required string. We aren't really worried about the O(N) search complexity, because it will be done at most once per string.
  3. We use a TrieMap (a lock-free Scala analogue of Java's ConcurrentHashMap) to cache the search results and make the class thread-safe.

Now our implementation requires ~4 MB: even though we store raw bytes, we keep a separate inner array for each entry, which costs us an extra ~1.5 MB compared to the raw on-disk version. We can now cache up to 400 dictionaries at once, which is better but still not enough, since our eviction rate would still be too high.

Storing strings on disk

Since we can't keep the raw data in memory, the only choice is to leave it on disk and apply some indexing on top.

Important — for simplicity and readability, the code listings will omit some details, like caching of lookup results in a TrieMap, pooling of opened RandomAccessFile instances to make them thread-safe, etc.

Let's create a base class for disk-based dictionaries:

Upon obtaining a new array of strings, we create a new file and do the following (a sketch of this base class follows the list):

  1. Write the bytes of all incoming strings to the file, remembering the byte offset of each string.
  2. To get the length of a string, we subtract its offset from the offset of the next string.
  3. On direct lookup, use RandomAccessFile to access a specific string by seeking to its offset and reading its length in bytes.
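
A minimal sketch of this base class under the assumptions above (names are illustrative; offsets is an int array indexed by the original string id, with one extra sentinel entry marking the end of the file):

import java.io.{File, FileOutputStream, RandomAccessFile}
import java.nio.charset.StandardCharsets.UTF_8

abstract class DiskDictionary(file: File, strings: Array[Array[Byte]]) {
  protected val NullId = -1

  // offsets(i) = start of string i; offsets(strings.length) = end of file,
  // so the length of string i is offsets(i + 1) - offsets(i)
  protected val offsets: Array[Int] = {
    val out = new FileOutputStream(file)
    val offs = new Array[Int](strings.length + 1)
    var position = 0
    var i = 0
    while (i < strings.length) {
      offs(i) = position
      out.write(strings(i))
      position += strings(i).length
      i += 1
    }
    offs(strings.length) = position
    out.close()
    offs
  }

  def direct(id: Int): String = {
    val raf = new RandomAccessFile(file, "r")   // pooled in the real code
    try {
      raf.seek(offsets(id))
      val bytes = new Array[Byte](offsets(id + 1) - offsets(id))
      raf.readFully(bytes)
      new String(bytes, UTF_8)
    } finally raf.close()
  }

  def reverse(string: String): Int   // implemented by each indexing strategy below
}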

Now, direct lookup looks simple and solid, but what about reverse lookup? Let’s go through multiple solutions and discuss their Pros and Cons.

BruteForce

The simplest idea is to perform a sequential scan over the whole file content on each reverse lookup, until a string matches the searched one or the file ends.
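
A sketch of this approach on top of the base class above (imports repeated for completeness):

import java.io.{File, RandomAccessFile}
import java.nio.charset.StandardCharsets.UTF_8
import java.util.Arrays

class BruteForceDiskDictionary(file: File, strings: Array[Array[Byte]])
    extends DiskDictionary(file, strings) {

  override def reverse(string: String): Int = {
    val searched = string.getBytes(UTF_8)
    val raf = new RandomAccessFile(file, "r")
    try {
      var i = 0
      while (i < offsets.length - 1) {
        val bytes = new Array[Byte](offsets(i + 1) - offsets(i))
        raf.readFully(bytes)                       // sequential read, no seek needed
        if (Arrays.equals(bytes, searched)) return i
        i += 1
      }
      NullId
    } finally raf.close()
  }
}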

This implementation requires only about 400KB of memory, since it stores just a simple integer array, which means we can operate with tons of dictionaries at a time. But compared to the previous approach, there is a big disadvantage — the difference between a brute-force search in the heap and one on disk is tremendous. That means we need to optimize the reverse lookup somehow.

Binary search

One of the search techniques we could apply is binary search, but to use it we need to order our strings somehow. For that purpose we will use the java.util.Arrays.compare method, which compares two byte arrays lexicographically. We still need a way to randomly access our strings by their initial index in the array, so we will maintain a mapping from the sorted string index to the initial string index.
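
A sketch of this index (names are illustrative): sortedIds holds the original ids ordered lexicographically by their string bytes, and each binary-search iteration seeks to and reads one string from disk.

import java.io.{File, RandomAccessFile}
import java.nio.charset.StandardCharsets.UTF_8
import java.util.Arrays

class BinarySearchDiskDictionary(file: File, strings: Array[Array[Byte]])
    extends DiskDictionary(file, strings) {

  // position in sorted order -> original string id
  private val sortedIds: Array[Int] =
    strings.indices.sortWith((a, b) => Arrays.compare(strings(a), strings(b)) < 0).toArray

  override def reverse(string: String): Int = {
    val searched = string.getBytes(UTF_8)
    val raf = new RandomAccessFile(file, "r")
    try {
      var low = 0
      var high = sortedIds.length - 1
      while (low <= high) {
        val mid = (low + high) >>> 1
        val id = sortedIds(mid)
        raf.seek(offsets(id))                          // 1 seek per iteration
        val bytes = new Array[Byte](offsets(id + 1) - offsets(id))
        raf.readFully(bytes)                           // 1 read per iteration
        val cmp = Arrays.compare(searched, bytes)
        if (cmp == 0) return id
        else if (cmp < 0) high = mid - 1
        else low = mid + 1
      }
      NullId
    } finally raf.close()
  }
}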

Hence, our search complexity has improved from O(N) to O(log2 N), and our memory usage is still reasonable, jumping from 400KB to 800KB since we added one more O(N)-space Array[Int].

Let's get some performance numbers. In each benchmark, we look up a random string from the initial list (so the id is always present). To keep the benchmarks fair across methods, we use a random generator with a fixed seed, so it always generates data in the same order, meaning the same strings and ids are looked up.

Benchmark               Score       Error  Units
bruteForceDisk              2.582 ±     0.330  ops/s
diskBinarySearch        37412.364 ±  2205.060  ops/s

On 1kk strings, binary search is much better than brute force, but can we improve its performance further? Let's check our flame graph:

We can clearly see that RandomAccessFile.seek() is the most expensive operation, with RandomAccessFile.read() being the second most costly. Reading bytes from disk is less expensive than moving the file descriptor's current offset. For 1kk strings, we do on average log2(1000000) ≈ 20 iterations, each doing 1 seek and 1 read, meaning ~20 disk seek calls to look up one value. Is there any way we could reduce the number of seek calls?

Simplified B-Tree

The B-Tree index is one of the most common indexes in relational databases. In short, this kind of tree allows each node to have more than two children. Since we now have higher branching, our search complexity drops from log2 N to logM N, because at each step we discard (M-1)/M of the candidate strings. And most importantly — we will have fewer disk seeks!

Sounds promising, but how do we build it? The procedure for building a balanced B-Tree is really sophisticated. Luckily for us, each dictionary is immutable, so we can cut corners by simply splitting our list into a multilayered tree. Here is how we do it (a sketch follows the list):

  1. Define a recursive function with a left (L) border and a right (R) border.
  2. If L becomes bigger than R, stop.
  3. If the number of elements between L and R is less than the branching factor, write node information about just these elements, without any children.
  4. Otherwise, calculate the number of child nodes for the current node.
  5. Split the L to R range into a branching-factor number of sub-ranges.
  6. For each sub-range, repeat from step 2.
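
One possible realization of this splitting, shown as an in-memory tree built over the sorted ids before any serialization (the on-disk layout is discussed next; the real implementation may differ in details):

// keys are original string ids, ordered by their string bytes
case class Node(keys: Array[Int], children: Array[Node])

def buildTree(sortedIds: Array[Int], branching: Int): Node = {

  def build(l: Int, r: Int): Node = {                   // inclusive borders
    val count = r - l + 1
    if (count <= branching)                             // small range: leaf node
      Node(sortedIds.slice(l, r + 1), Array.empty)
    else {
      // split [l, r] into at most `branching` sub-ranges; the last element of
      // each sub-range becomes a key, the rest of it becomes that key's child
      val keys = Array.newBuilder[Int]
      val children = Array.newBuilder[Node]
      val step = math.ceil(count.toDouble / branching).toInt
      var from = l
      while (from <= r) {
        val to = math.min(from + step - 1, r)
        keys += sortedIds(to)
        children += (if (to - 1 >= from) build(from, to - 1) else Node(Array.empty, Array.empty))
        from = to + 1
      }
      Node(keys.result(), children.result())
    }
  }

  build(0, sortedIds.length - 1)
}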

Now let's discuss how we serialize the node information. We will keep the relationships between tree nodes (child node ids) on disk; otherwise, they would take too much memory. The original string id is stored as a value in each node. Information about each string's start position on disk and its total byte length is saved into arrays for each node.

First, we write the number of children the node has. For each child, we write the original string id and the child node id (or -1), and then, for each entry, we write the string bytes. During this serialization, we fill in the information about each string's offset and length to keep the ability to randomly access string bytes by id.

Now that we have a tree, we need to query it somehow. The idea is pretty simple (a sketch follows the list):

  1. Cache the root of the tree, since it will always be used.
  2. Compare word bytes with each entry string. If strings are equal, return the entry’s string id.
  3. If word bytes are lexicographically less than current entry bytes, dive into the child node of current entry.
  4. Seek to the disk offset of the child tree node and read it fully into memory. Repeat from step 2 until the required string is found or there is no child node left to descend into.
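
A sketch of this lookup over the in-memory Node structure from the previous sketch (in the real index, step 4 seeks to and reads each child node from disk instead of following an object reference; stringBytes is a hypothetical accessor returning a string's bytes by its id):

import java.util.Arrays

def search(root: Node, searched: Array[Byte], stringBytes: Int => Array[Byte]): Int = {
  var node = root
  while (node != null) {
    var next: Node = null
    var i = 0
    var descended = false
    while (i < node.keys.length && !descended) {
      val cmp = Arrays.compare(searched, stringBytes(node.keys(i)))
      if (cmp == 0) return node.keys(i)          // exact match: return the original id
      else if (cmp < 0) {                        // dive into this entry's child
        if (node.children.nonEmpty) next = node.children(i)
        descended = true
      } else i += 1                              // keep scanning the node's entries
    }
    node = next
  }
  -1 // NullId: not found
}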

Whew! That was pretty intense. Was it worth it? Yes! Memory consumption of this data structure is 820KB, just 20KB more than the binary search approach. That's because most of the information is stored on disk; compared with binary search, we add only around 2·logM(N) arrays of space, which is very efficient.

And what about performance?

Benchmark               Score       Error  Units
bruteForceDisk              2.582 ±      0.330  ops/s
diskBinarySearch        37412.364 ±   2205.060  ops/s
diskBTree              103985.625 ±  10465.966  ops/s

On 1kk strings, the B-Tree is about three times faster than binary search — definitely worth the effort! But can we do better if we spend a bit more memory?

Hashmap

Initially, we kept a HashMap of String -> Int, but we can't do that anymore, since we can't store the strings in memory. Instead of storing a string as a key, we will store its hash, and instead of a string id, we will store a disk offset. But since any hash function can have collisions, we will map each hash to a list of offsets. Hence, for a reverse lookup we need to do the following (a sketch follows the list):

  1. Calculate the hash of the searched string.
  2. If we don't have such a hash, return NullId.
  3. Otherwise, get the string offset(s) stored for that hash.
  4. Seek the file position to an offset and read that string's bytes from disk.
  5. Compare the bytes from disk with the bytes of the searched word — if they are equal, return the string id; otherwise try the next offset for this hash, and return NullId if none match.
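
A sketch of this index (names are illustrative; here the map values are string ids, which index into the offsets array of the base class, while the real implementation stores disk offsets directly and would use a primitive int map to save memory):

import java.io.{File, RandomAccessFile}
import java.nio.charset.StandardCharsets.UTF_8
import java.util.Arrays

class HashDiskDictionary(file: File, strings: Array[Array[Byte]])
    extends DiskDictionary(file, strings) {

  // hash of the string bytes -> ids of all strings sharing that hash
  private val hashToIds: Map[Int, Array[Int]] =
    strings.indices
      .groupBy(i => Arrays.hashCode(strings(i)))
      .map { case (hash, ids) => hash -> ids.toArray }

  override def reverse(string: String): Int = {
    val searched = string.getBytes(UTF_8)
    hashToIds.get(Arrays.hashCode(searched)) match {
      case None => NullId
      case Some(candidates) =>
        val raf = new RandomAccessFile(file, "r")
        try {
          // check every candidate with this hash until the bytes match
          candidates.find { id =>
            raf.seek(offsets(id))
            val bytes = new Array[Byte](offsets(id + 1) - offsets(id))
            raf.readFully(bytes)
            Arrays.equals(bytes, searched)
          }.getOrElse(NullId)
        } finally raf.close()
    }
  }
}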

That's how we improve our search complexity from O(log N) to amortized O(1) by storing more meta information. However, we will need 2.5MB of heap space per dictionary, which is only about half of what storing the whole dataset in memory as raw bytes would take.

And how much better is performance?

Benchmark               Score       Error  Units
bruteForceDisk              2.582 ±      0.330  ops/s
diskBinarySearch        37412.364 ±   2205.060  ops/s
diskBTree              103985.625 ±  10465.966  ops/s
diskHash               370910.145 ±  57589.846  ops/s

It’s almost 4 times faster than BTree! But how can we improve memory consumption?

Hash buckets

What if we limited the number of entries in the hashmap? Instead of storing a map of hash -> offset, we can split all strings into buckets by taking the hash of their bytes modulo the number of buckets (which can be picked heuristically based on the total number of strings). We can also play with the expected number of elements per bucket — the bigger it is, the fewer buckets we have and the less memory overhead, but the more data we have to iterate through on each lookup.
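
A sketch of the bucketed index (names are illustrative): only the bucket -> candidate-ids mapping is kept in memory. For brevity, each candidate is read individually here, whereas the real layout presumably keeps a bucket's strings contiguous on disk so one seek and one read cover the whole bucket.

import java.io.{File, RandomAccessFile}
import java.nio.charset.StandardCharsets.UTF_8
import java.util.Arrays

class BucketedHashDiskDictionary(file: File, strings: Array[Array[Byte]], bucketCount: Int)
    extends DiskDictionary(file, strings) {

  // buckets(b) = ids of all strings whose hash falls into bucket b
  private val buckets: Array[Array[Int]] = {
    val grouped = strings.indices.groupBy(i => Math.floorMod(Arrays.hashCode(strings(i)), bucketCount))
    Array.tabulate(bucketCount)(b => grouped.getOrElse(b, Vector.empty[Int]).toArray)
  }

  override def reverse(string: String): Int = {
    val searched = string.getBytes(UTF_8)
    val candidates = buckets(Math.floorMod(Arrays.hashCode(searched), bucketCount))
    val raf = new RandomAccessFile(file, "r")
    try {
      candidates.find { id =>
        raf.seek(offsets(id))
        val bytes = new Array[Byte](offsets(id + 1) - offsets(id))
        raf.readFully(bytes)
        Arrays.equals(bytes, searched)
      }.getOrElse(NullId)
    } finally raf.close()
  }
}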

Let's measure performance and memory consumption for several different bucket sizes:

Elements per bucket   Score (ops/s)   Memory
10                       298102.291   1.4 MB
20                       249440.313   1.3 MB
50                       224918.313   1.2 MB
100                      165778.721   1.2 MB
200                      108862.231   1.1 MB

We can see that the decrease in memory consumption stalls after 50 elements per bucket, while throughput keeps dropping significantly, because we are reading too many strings in one bucket lookup (the read becomes more expensive than the seek). Hence, the sweet spot for our workload is 20–50 elements per bucket.

Now we can write down the final benchmark list:

Benchmark               Score       Error  Units
bruteForceDisk              2.582 ±      0.330  ops/s
diskBinarySearch        37412.364 ±   2205.060  ops/s
diskBTree              103985.625 ±  10465.966  ops/s
diskBucketedHash       224918.313 ±  42852.302  ops/s
diskHash               370910.145 ±  57589.846  ops/s

As we can see — all of the approaches have their Pros and Cons:

  • In-memory map is simple but very RAM heavy.
  • BTree has a very good memory/performance ratio but is pretty complicated.
  • Hash is really fast but requires more RAM than the other disk-based approaches.

As for the approach we chose in our system — so far it has been the binary search approach, which we might change to the bucketed hash or the B-Tree.

I am confident that there are more efficient and more sophisticated ways to solve our problem. If you know about them, please reach out to me :)

All code is available on my GitHub, in case you want to play with it.

Also check out the article by my colleague Sergei, where he describes how we improved the ingestion rate in the real-time part of our database by implementing a ConcurrentStringInterner.

Cheers!
