Application of ML: Building indices?

Introduction

Ameya
Sep 3, 2018 · 10 min read

In this post, we will cover an exploratory paper from Google that proposes a novel approach to building more space- and time-efficient data structures, such as B-Tree indices, by applying machine learning techniques to the underlying data. Traditionally, indices such as B-Trees or hash maps are built with static heuristics and assumptions, and those static heuristics don't take the underlying data distribution into account. A B-Tree might not be the most efficient data structure if one knew that the keys ranged from 1 to 100M and the records were of fixed size; in that case, an index that simply uses the key itself as an offset would do better. This example is simplistic but illuminating: it is not always possible to understand all data distributions manually, a priori, and hand-build optimized indices that perform well for every type of data.

The paper poses an interesting question: would similar improvements be possible if data structures were built by learning the data distribution rather than being fixed statically beforehand? The paper focuses on read-heavy workloads and doesn't advocate replacing these data structures just yet, but works towards understanding whether continuous functions describing the underlying data might make for more efficient data structures, especially at large scale.

Intuition of the idea — Are indices models?

Generally, whenever a fast lookup is needed, computer science has relied on indices. In databases, B-Trees are used for range queries, hash maps for point queries, and Bloom filters for set membership.

If we focus on B-Trees for a moment, a B-Tree traversal can be thought of as a model that predicts the disk page that might contain the key you are looking for. The initial traversal of the tree (from the root to a leaf) deterministically predicts a starting position in the sorted array, and the sequential search from that starting position minimizes the time to find the actual key (see figure "a" below). If a machine learning model could learn the overall pattern of the keys, it could similarly predict where in the sorted array the key lies (see figure "b" below). Likewise, a Bloom filter can be thought of as a classifier that predicts 1 if the key is present and 0 if it is absent.

A B-Tree predicts the page in which a key may be found. A learned index similarly uses a model to predict the page where the key might be, within some error bounds.

Another trend that might make ML-based data structures more compelling, as outlined by Jeff Dean in his SysML talk, is that Moore's law is plateauing, so workloads that can exploit vectorized instructions such as SIMD, or GPUs, can extract more from the hardware. Look at the blue line in the chart below, which shows single-core performance flattening out.

https://www.karlrupp.net/wp-content/uploads/2015/06/40-years-processor-trend.png

Range index (B-Tree) models and CDFs

B-Tree-like indices are beneficial for querying a range of data. Such indices are built on top of sorted datasets, and the sorted data is divided into pages. Because of space constraints, each page, rather than each record, gets a key in the index, so a B-Tree maps a key to a page on disk. It can therefore be thought of as a regression tree with a minimum error of zero (the first key in the page) and a maximum error of the page size (the last key in the page). One can potentially build these indices as ML models, such as regressions or neural nets, as long as the error stays within the same limits.

The machine-learning-based model can be trained by making a prediction for each key and recording how far off the prediction was. As long as this error stays within the same bounds, the model provides semantic guarantees equivalent to those of a B-Tree. In addition, since the data is sorted, one can quickly correct for errors with a local search, just as with B-Trees. For key insertions a B-Tree needs to be rebalanced; similarly, an ML model would need to be retrained. Since neural nets are good at learning a wide variety of data distributions, they can be used to build efficient data structures for various data access patterns. The challenge is balancing the complexity of the models against accuracy and semantic guarantees. The paper provides some good calculations comparing the cost of B-Trees to that of an ML model. ML models can also perform much better with the help of the SIMD instruction sets available on most hardware, or on increasingly common GPUs.

Let's assume for the time being that we care about simple, dense, in-memory indices over fixed-size records. These are pretty common in a lot of systems. In addition, most of the research in this paper is on integer and real-valued keys rather than strings; modeling strings needs further research.

Range Index models as CDF models:

A cumulative distribution function (CDF) captures the data distribution: given a dataset and a threshold, F(threshold) is the probability that a value drawn from the dataset is less than or equal to that threshold. Say we have recorded the weights of 100 people and modeled a CDF F(x) over them. If we pick 250lb as the threshold, then F(250) gives the probability that a person's weight is less than or equal to 250lb; in this case it might be something like 95%.

If we extrapolate this idea to range indices, what we are looking for is: given a key, what is its position in the dataset? In a sorted dataset, this becomes: given a key, what fraction of the keys in the dataset are less than or equal to it? For example, suppose we model a CDF over keys ranging from 10 to 1M and then look up the position of key 999. Picking 999 as the threshold, the CDF gives us the probability of any key in the dataset being less than or equal to 999; in this case it might be 10% or even smaller. So the position estimator becomes p = F(key) * N, where N is the number of keys. The CDF predicts that, in the sorted array, the key is located about 10% of the way through the dataset; one can then move a little backward or forward to find the actual position. For training purposes, a regression model can learn the data distribution by minimizing the least-squares error.
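To make the position-estimator idea concrete, here is a minimal sketch, not the paper's implementation: a single least-squares linear fit stands in for the learned CDF, positions are predicted as F(key) * N, and the prediction is corrected with a binary search bounded by the model's observed min/max error. The synthetic lognormal keys are purely illustrative.

```python
import numpy as np

# Synthetic sorted, unique integer keys standing in for the indexed data.
keys = np.unique(np.random.lognormal(mean=10, sigma=1, size=100_000).astype(np.int64))
N = len(keys)
positions = np.arange(N)

# Fit position ~ a * key + b: a linear stand-in for F(key) * N.
a, b = np.polyfit(keys, positions, deg=1)

def predict(key):
    return int(np.clip(a * key + b, 0, N - 1))

# Record how far off the model is over the data: these become the error bounds.
preds = np.clip(a * keys + b, 0, N - 1).astype(np.int64)
errors = positions - preds
min_err, max_err = int(errors.min()), int(errors.max())

def lookup(key):
    """Predict a position, then binary-search only within the error bounds."""
    p = predict(key)
    lo, hi = max(p + min_err, 0), min(p + max_err + 1, N)
    i = lo + np.searchsorted(keys[lo:hi], key)
    return i if i < N and keys[i] == key else None

assert lookup(keys[1234]) == 1234      # found within the guaranteed bounds
assert lookup(keys[0] - 1) is None     # a key that is not in the dataset
```

Because the error bounds are measured over the indexed keys themselves, the final search range is guaranteed to contain any key that is actually present, mirroring the B-Tree's page-level guarantee.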

A naive index:

The authors initially built a naive learned index, but it turned out to be significantly more expensive than a B-Tree. The main conclusion was that neural nets and regression models generalize well at a coarse granularity, but as you zoom in, more irregularities appear. See the figure below, where the function appears very smooth overall, yet the "last mile" patterns are quite spread out.

A smooth CDF at coarse granularity, but a wide variety of patterns at a finer granularity

An experimental Index Generation Framework

To address the bottlenecks mentioned above, the paper builds a new Learning Index Framework (LIF) that can generate index code from a given configuration. It builds regression models itself and uses TensorFlow for neural nets. It then extracts the weights learned by the network and generates efficient C++ code that can be used as the index.

To solve the last-mile accuracy problem mentioned above, the framework builds a hierarchy of models, the Recursive Model Index (RMI). At stage l there are M_l models. Each stage predicts a position for the key, and each model refines the position predicted by the model in the previous stage, so every stage iteratively improves the prediction of the one before it. For example, with millions of keys, stage 1 might predict key positions to within about ten thousand, and the next stage will narrow those ten thousand down to hundreds, and so on. The recursion can terminate with a simple model trained to some mean-squared-error target.

Each stage consists of multiple models, and each one narrows down the range of positions for the given key by delegating to a model in the next stage. Note that in this scheme ranges can overlap: Model 2.2 may refer to both Models 3.2 and 3.3, while Model 2.1 may also refer to Model 3.2.
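A rough sketch of a two-stage RMI under the same assumptions as the previous snippet: simple linear models at both stages (the paper allows neural nets at the top), with each second-stage leaf keeping its own error bounds for the final search. The class name, model counts, and key data are illustrative, not from the paper.

```python
import numpy as np

class TwoStageRMI:
    """Sketch of a two-stage Recursive Model Index over a sorted, unique key array."""

    def __init__(self, keys, num_leaf_models=1000):
        self.keys, self.N, self.M = keys, len(keys), num_leaf_models
        pos = np.arange(self.N)

        # Stage 1: a single coarse model over the whole key space.
        self.root = np.polyfit(keys, pos, deg=1)
        root_pred = np.clip(np.polyval(self.root, keys), 0, self.N - 1)

        # Stage 1 only routes each key to one of M leaf models in stage 2.
        leaf_of = np.minimum((root_pred * self.M / self.N).astype(int), self.M - 1)

        # Stage 2: each leaf refines positions for its own slice of keys and
        # records min/max error bounds, which bound the final local search.
        self.leaves = []
        for m in range(self.M):
            mask = leaf_of == m
            if mask.sum() < 2:                    # empty/tiny leaf: search everything
                self.leaves.append((self.root, -self.N, self.N))
                continue
            coeffs = np.polyfit(keys[mask], pos[mask], deg=1)
            pred = np.clip(np.polyval(coeffs, keys[mask]), 0, self.N - 1).astype(int)
            err = pos[mask] - pred
            self.leaves.append((coeffs, int(err.min()), int(err.max())))

    def lookup(self, key):
        root_pred = np.clip(np.polyval(self.root, key), 0, self.N - 1)
        coeffs, emin, emax = self.leaves[min(int(root_pred * self.M / self.N), self.M - 1)]
        p = int(np.clip(np.polyval(coeffs, key), 0, self.N - 1))
        lo, hi = max(p + emin, 0), min(p + emax + 1, self.N)
        i = lo + np.searchsorted(self.keys[lo:hi], key)
        return i if i < self.N and self.keys[i] == key else None

keys = np.unique(np.random.lognormal(10, 1, 200_000).astype(np.int64))
rmi = TwoStageRMI(keys)
assert rmi.lookup(keys[4321]) == 4321
```

Since routing is deterministic, a key is always sent to the same leaf at lookup time as at build time, so the leaf's recorded error bounds still cover it.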

Another advantage of this approach is that the top of the hierarchy can use complex neural nets while the bottom uses simple linear regressions, which achieves a better space-time tradeoff. In such hybrid models, if a model's errors are too broad it can simply be replaced by a B-Tree, so in the worst case virtually all intermediate models become B-Trees. This choice can be facilitated by measuring the error of each model, roughly as in the sketch below.
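A small sketch of that error-driven fallback, where MAX_ERROR and build_leaf are hypothetical names and a plain sorted-array binary search stands in for a real B-Tree:

```python
import numpy as np

MAX_ERROR = 128   # hypothetical per-leaf error budget, in positions

def build_leaf(leaf_keys, leaf_positions):
    """Train a leaf model; if its worst-case error is too large, fall back."""
    coeffs = np.polyfit(leaf_keys, leaf_positions, deg=1)
    pred = np.polyval(coeffs, leaf_keys).round().astype(int)
    err = leaf_positions - pred
    if err.max() - err.min() > MAX_ERROR:
        # Errors too broad: replace this leaf with a classic structure
        # (binary search over the leaf's keys stands in for a B-Tree here).
        return ("btree", np.asarray(leaf_keys), np.asarray(leaf_positions))
    return ("model", coeffs, int(err.min()), int(err.max()))
```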

Some of the experimental results in the paper are quite impressive. In some cases, a B-Tree takes 50MB of storage while the learned index takes only 0.15MB, and lookups are 2.7 times faster than with the B-Tree. The experiments use a two-stage RMI with 10K or 50K models at the second stage.

Point Index — Hash map

Traditional hash maps are implemented using hash functions. Depending on the quality of the hash function, there can be collisions, and the number of collisions considerably affects both the storage and the lookup performance of the hash map. In a hash map the keys are not stored in sorted order; keys are simply mapped to some position.

The paper proposes learning the CDF and building a hash function from it: one can scale the CDF to get the hash of a given key. Say there are N keys in the dataset and the intended size of the hash map is M, where M < N. Then a potential hash function is H(key) = F(key) * M. If the learned model predicted the CDF perfectly, there would be no collisions.

A traditional hash map with many free slots and many collisions, versus a learned hash map with better utilization of the table and fewer collisions.

Consider a dataset with the keys 10, 15, 15, 17, 20. The CDF values for the distinct keys 10, 15, 17, and 20 are 1/5, 3/5, 4/5, and 5/5 respectively (the duplicate 15 shares the same value). If we wanted a hash map of size 3, then H(15) would be 3/5 * 3; the others can be calculated similarly. As you can see, the hash map implementation itself is decoupled from this: one can use probing or chaining, whichever works. This mechanism only provides a way to learn hash functions that minimize collisions and improve the overall utilization of the table.
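A small sketch of that calculation, using an empirical CDF in place of a trained model; the function names and table size are illustrative:

```python
import numpy as np

keys = np.array([10, 15, 15, 17, 20])
M = 3                                    # intended hash table size

sorted_keys = np.sort(keys)
N = len(sorted_keys)

def F(key):
    # Empirical CDF: fraction of keys <= key (side="right" counts duplicates).
    return np.searchsorted(sorted_keys, key, side="right") / N

def learned_hash(key):
    # Scale the CDF into [0, M); a real system would replace F with an RMI.
    return min(int(F(key) * M), M - 1)

for k in [10, 15, 17, 20]:
    print(k, F(k), learned_hash(k))
# 10 -> F=0.2 -> slot 0;  15 -> F=0.6 -> slot 1;  17 -> F=0.8 -> slot 2;  20 -> F=1.0 -> slot 2
```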

This scheme does not always help. If the dataset follows a uniform distribution, then all keys are equally likely and the learned function is trivial to model with a simple linear regression, since the CDF of a uniform distribution is a straight line. In such cases the hash function above is not very advantageous; it will perform about the same as an ordinary hash function.

In terms of implementation, the RMI mentioned above can again be used to generate the hash function. For the experiments, the authors used a two-stage RMI with 100K models at the second stage. One of the big issues with large key spaces is the memory footprint of hash maps; while the model-based approach turned out to be a little slower, the memory savings were large, in the range of 20 to 60%.

The overall conclusion seems to be that, depending on the dataset and the scenario, learned hash functions may be worth considering.

Existence Index or Set membership — Bloom filters

Bloom filters are traditionally used to check set membership with a memory-optimized structure. They guarantee zero false negatives: if a Bloom filter reports that a key is absent, the key is definitely not in the set. If it reports that a key is present, there can be false positives, and further checking is needed.

Classification problem: One can think of a Bloom filter as a binary classifier that predicts 1 for every key in the dataset and 0 for every non-key. A very naive example: if our set consisted of all integers from 0 to N, then f(x) = (0 ≤ x ≤ N) could return 1 for every key very efficiently. For more complex data, one could train an RNN with a sigmoid output, since this is binary classification. The sigmoid output f(x) can be treated as a probability and a threshold can be set: anything above the threshold is considered present, and anything below it is not. To preserve the zero-false-negative guarantee mentioned above, an overflow Bloom filter is created for the keys that fall below the threshold. It is important to note that this mechanism requires some awareness of non-keys.
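A minimal sketch of the classifier-plus-overflow idea, assuming a generic scoring model in place of the RNN the paper mentions (model() here is any function returning a probability-like score in [0, 1], and a plain set stands in for the overflow Bloom filter):

```python
class LearnedBloomFilter:
    def __init__(self, model, keys, threshold=0.5):
        self.model = model
        self.threshold = threshold
        # Keys the model scores below the threshold would become false
        # negatives, so they go into a conventional overflow filter
        # (represented here by a plain set for simplicity).
        self.overflow = {k for k in keys if model(k) < threshold}

    def might_contain(self, key):
        # No false negatives: every inserted key either clears the threshold
        # or was placed in the overflow structure at build time.
        return self.model(key) >= self.threshold or key in self.overflow

# Toy usage with a hypothetical score function for keys clustered around 1000.
keys = range(950, 1050)
model = lambda x: 1.0 if 950 <= x < 1050 else 0.1
lbf = LearnedBloomFilter(model, keys, threshold=0.5)
assert all(lbf.might_contain(k) for k in keys)   # zero false negatives
```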

Learning hash functions for bloom filters: A neat idea here is that what we want from a Bloom filter hash is many collisions among keys and many collisions among non-keys, but very few collisions between keys and non-keys. This can be done with the model mentioned above. Consider M as the size of the bitset. If we use the same f(x) from the last section and take f(x) * M as the hash, then the high-probability inputs (the keys) map to the higher positions in the bitset, while the low-probability non-keys map to the lower positions.
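A sketch of that mapping, reusing the same hypothetical score function as above; because the lookup recomputes the exact bit each key was hashed to, there are still no false negatives:

```python
M = 64
bits = [False] * M

def bit_index(score):
    return min(int(score * M), M - 1)

keys = range(950, 1050)
f = lambda x: 0.9 if 950 <= x < 1050 else 0.05   # hypothetical learned score

for k in keys:                        # build: set the bit each key hashes to
    bits[bit_index(f(k))] = True

def might_contain(x):                 # lookup recomputes the same bit
    return bits[bit_index(f(x))]

assert all(might_contain(k) for k in keys)
assert not might_contain(5)           # a low-scoring non-key lands in the empty low region
```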

From the experiments in the paper, the Bloom filter size is reduced by roughly 15% when using the model-based approach.

Conclusions

I thought this was a pretty cool paper that outlines a novel approach to building data structures using ML techniques. More work certainly needs to happen before they become commonplace, but this is a very exciting application of ML.
