A Machine Learning Approach to Databases Indexes

Krisha Mehta
Computers, Papers and Everything
5 min read · May 5, 2018

The paper for today brings in fresh ideas and makes us wonder how versatile machine learning algorithms can be.
Databases are used almost everywhere, and anyone familiar with them knows that they are expected to give exact answers, with no approximations. In this paper, the authors offer a machine learning approach to improve database indexing. This is interesting because machine learning is a stochastic method, yet here it is used to obtain exact answers from a database.
A database query normally performs one of the three functions below:

  1. Range queries, which retrieve all records within a given range. For example, finding the number of users who visited a store over a weekend.
  2. Single-value queries, such as looking up a particular item in a store.
  3. Existence checks, which test whether a record is in the database at all. For example, checking whether an item is present in the store.

When we pass a query to our database, a different indexing method is used to obtain the answer depending on the type of query. The three indexing methods are as follows:

  1. For range queries, a B-Tree index is used.
  2. To look up a record by a single key, hash maps are used.
  3. To determine whether a record exists in the database, an existence index such as a Bloom filter is used.

While a lot of work has been done to optimize these methods over the years, little of it leverages the distribution of the data being queried.

From the abstract of the paper:

In this paper, we demonstrate that these critical data structures are merely models, and can be replaced with more flexible, faster, and smaller machine learned neural networks. Further, we demonstrate how machine learned indexes can be combined with classic data structures to provide the guarantees expected of database indexes.

One of the biggest disadvantages of these methods is that they assume a worst-case distribution of the data. In most real-world scenarios, the data does not follow a known pattern for which a customized indexing system could be built by hand. This is where a machine learning approach can help: the main aim of any machine learning model is to identify patterns in the data it is given. The authors propose to bring this capability into database indexing. They suggest using a machine learning model to learn the patterns, correlations, etc. of the data and to automatically synthesize an index structure, a.k.a. a learned index, that leverages these patterns for significant performance gains. They then explain how B-Trees, hash maps, and Bloom filters can each be replaced with learned indexes.

Let us look deeper:

Range Queries:

In this case, the data is stored in sorted order and an index is built to find the starting point of the range. The process is straightforward: find the starting position of the range, then scan through the entries until the end point of the range is reached. A B-Tree is used to carry out this function. (If you want to understand how B-Trees work, this video is quite helpful.) B-Trees are used because they are memory efficient, but processing a node of a B-Tree takes time. At their core, B-Trees are models of the form f(key) → pos. With a page size of 100, each node that is processed narrows the search space by a factor of 100, i.e., a precision gain of 100 per node. Traversing a single node of size 100, assuming it is in the cache, takes approximately 50 cycles to scan the page. The goal is to learn f(x) = y from the data available during training such that the precision gain per step is greater than 100 and the time taken is less than 50 cycles. A regression model with squared error is used to predict the position of the starting point (i.e., y). Rather than one large neural network, a hierarchy of models is trained, which is both more accurate and cheaper to execute; a sketch follows below.
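As a rough illustration of this hierarchy (the paper calls it a recursive model index), here is a minimal sketch in Python. The two-stage layout, the choice of linear submodels, and the class and method names are my own illustrative assumptions, not the authors' exact architecture:

```python
import numpy as np

class TwoStageLearnedIndex:
    """Minimal sketch of a two-stage learned index over sorted keys.

    Stage 1: a single linear model routes a key to a stage-2 submodel.
    Stage 2: each submodel predicts the key's position with a linear
    fit trained only on the keys routed to it.
    """

    def __init__(self, keys, num_submodels=16):
        self.keys = np.sort(np.asarray(keys, dtype=float))
        positions = np.arange(len(self.keys), dtype=float)
        self.num_submodels = num_submodels

        # Stage 1: linear fit of key -> position over the whole dataset.
        self.root = np.polyfit(self.keys, positions, deg=1)

        # Stage 2: partition keys by the root's prediction and fit one
        # small linear model per partition.
        routed = self._route(self.keys)
        self.submodels = []
        for i in range(num_submodels):
            mask = routed == i
            if mask.sum() >= 2:
                self.submodels.append(
                    np.polyfit(self.keys[mask], positions[mask], deg=1))
            else:
                self.submodels.append(self.root)  # too few keys: reuse root

    def _route(self, keys):
        # Normalize the root's predicted position to roughly [0, 1),
        # then pick one of the num_submodels buckets.
        frac = np.polyval(self.root, keys) / len(self.keys)
        return np.clip((frac * self.num_submodels).astype(int),
                       0, self.num_submodels - 1)

    def predict(self, key):
        """Approximate position of `key` in the sorted key array."""
        sub = self.submodels[self._route(np.array([key]))[0]]
        pos = int(round(np.polyval(sub, key)))
        return min(max(pos, 0), len(self.keys) - 1)
```

Each stage narrows the search the way a B-Tree node would, but evaluating a linear model is just a couple of multiply-adds rather than a page scan.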

The ML model predicts an approximate position for the starting point of the range. An exact answer can then be recovered by computing the maximum error the model makes over the keys and using a classic search technique, such as binary search, within that error window to locate the exact position.
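Building on the sketch above, the exact-lookup step might look like this; `build_error_bound` and `lookup` are hypothetical helper names, and the error bound is computed the simplest possible way, by checking every training key once:

```python
import bisect

def build_error_bound(index):
    # Worst-case |predicted - actual| position over all training keys,
    # computed once after training.
    return max(abs(index.predict(k) - i) for i, k in enumerate(index.keys))

def lookup(index, key, max_err):
    pos = index.predict(key)
    lo = max(pos - max_err, 0)
    hi = min(pos + max_err + 1, len(index.keys))
    # Classic binary search, restricted to the model's error window.
    i = bisect.bisect_left(index.keys, key, lo, hi)
    if i < len(index.keys) and index.keys[i] == key:
        return i
    return None
```

Because the window [pos - max_err, pos + max_err] is guaranteed to contain every trained key, the result is exact even though the model itself is approximate.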

Single Value Queries:

Point indexes, or hash maps, are used when a query asks for a single value. Hash maps work by applying a function h such that h(x) → pos. Traditionally, a good hash function is one for which there is no correlation between the distribution of the keys and the positions they map to. The authors consider placing n keys into an array with m slots. As the birthday paradox suggests, two keys will often be mapped to the same slot; in that case, a linked list of keys forms at that position. This has a few problems: the authors explain that collisions are costly because (1) latency increases when walking the linked list of collided items, and (2) many of the m slots in the array go unused. Generally, latency and memory are traded off against each other by tuning m. If we know the distribution of the data, however, an optimal hash would produce no collisions and no empty slots even when m = n. Hence, a model is trained to learn the cumulative distribution function (CDF) of the keys, and that model is used as the hash function; a sketch follows below.
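A minimal sketch of that idea, assuming the empirical CDF of a sample of keys stands in for a trained model (the function name and the plain sorted-array representation are my own illustrative choices):

```python
import numpy as np

def make_cdf_hash(train_keys, m):
    """Learned hash sketch: scale the empirical CDF of the keys into
    m slots, so roughly equal numbers of keys land in each slot."""
    sorted_keys = np.sort(np.asarray(train_keys, dtype=float))

    def h(x):
        # Empirical CDF: fraction of training keys <= x, then scaled
        # to a slot index in [0, m - 1].
        cdf = np.searchsorted(sorted_keys, x, side="right") / len(sorted_keys)
        return min(int(cdf * m), m - 1)

    return h
```

However skewed the key distribution is, h spreads the keys nearly uniformly over the m slots, which is exactly what minimizes collisions when m = n.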

Existence Indexes:

Existence indexes determine whether a particular key is in our dataset, for example to confirm that it is present before retrieving data from cold storage. This task can therefore be treated as a classification problem: a value either exists in the database or it does not. Bloom filters have long been used to answer this question.

From the paper: A Bloom Filter as a classifier

While Bloom filters guarantee the absence of false negatives, their results can contain false positives, and again no prior knowledge of the data distribution is assumed. The authors frame this as a binary classification task in which keys present in the dataset are labeled ‘1’ and absent keys are labeled ‘0’. Simple RNNs and CNNs are trained to make the prediction, and an overflow Bloom filter is used to avoid false negatives; a sketch follows below.
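A sketch of this two-stage structure follows. The `score` interface on the model and the `add`/`in` interface on the overflow filter are assumptions on my part; any trained classifier and any classic Bloom filter implementation would do (a plain Python set works as a stand-in for the overflow filter when experimenting):

```python
class LearnedBloomFilter:
    """Sketch: a learned classifier screens keys first; true keys the
    model misses at build time go into a small overflow Bloom filter,
    which restores the no-false-negatives guarantee."""

    def __init__(self, model, keys, threshold, overflow_filter):
        self.model = model               # classifier: score(key) -> [0, 1]
        self.threshold = threshold
        self.overflow = overflow_filter  # classic Bloom filter (or a set)
        # Any true key the model would reject must go into the
        # overflow filter, so no false negatives are possible.
        for k in keys:
            if self.model.score(k) < self.threshold:
                self.overflow.add(k)

    def might_contain(self, key):
        # "Yes" from either stage. False positives remain possible;
        # false negatives cannot occur for the inserted keys.
        return self.model.score(key) >= self.threshold or key in self.overflow
```

The memory win comes from the overflow filter only having to cover the keys the model gets wrong, which is a small fraction of the dataset when the classifier is accurate.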

Initial results show that this approach can outperform B-Trees by up to 44% in speed while saving over two-thirds of the memory. For point indexes, the learned index yields a 78% space improvement because it causes fewer collisions, and thus fewer linked-list overflows, while being only slightly slower to search (63 ns versus 50 ns). The total existence index (RNN model plus spillover Bloom filter) uses 1.07 MB, a 47% reduction in memory.
