Towards building the next generation database query engine
What if yesterday’s database engines are not capable of solving today’s problems?
In the 1970s the relational database was born. Today, these still form the backbone of our information systems, with typical Fortune 500 companies running thousands of SQL databases. Over half of enterprise companies are beginning to explore machine learning, what if this is because their current databases are inadequate for the company’s informational needs?
Relational databases emerged from engineers building simple data-driven applications. The same patterns would emerge: e.g. we have users, and they purchase items, and those purchases have credit-card transactions, and we need to link all of these pieces of data together. By providing support for these relationships in the database, e.g. a purchase belongs to a user, the engineer could avoid writing the same basic software functions in every program.
Aside: This same argument equally applies to today’s graph databases, as they operate on very similar principles of relational databases.
Relational databases have been wildly successful, forming an essential piece of almost any application. With this success has brought a rich deluge of data into database systems. Relational databases are great at supporting the developer defined symbolic relationships in the data (e.g. purchase belongs to user), but have barely any support for the noisy, sparse, probabilistic relationships that arise within the data itself (e.g. users with higher disposable income tend to make more purchases).
This limitation is reflected in query languages (e.g. SQL) themselves. They are famously unfriendly for non-technical business users, so much so that entire teams of data analysts, BI experts and data scientists are drafted in to help non-technical employees access their data. This is no surprise given a simple query such as “get the second highest salary” translates into:
SELECT DISTINCT Salary FROM Employee e1 WHERE 2=Select COUNT(DISTINCT Salary) FROM Employee e2 WHERE e1.salary<=e2.salary
A new approach
At Octavian, we have been working on a new generation of database query engine. It addresses the above limitations by taking a different approach. At its core, it’s very different from current database query engines:
- It accepts natural language (e.g. English) instead of SQL for queries
- It represents data as a mixture of sparse features instead of as items from fixed categories
- It learns multi-step deep algorithms from examples
I’ll explain each of these aspects and how they are beneficial.
It accepts natural language (e.g. English) instead of SQL for queries
Natural language interfaces have recently taken off in the form of smart speakers — Amazon reports selling over 20 million Echos, and 50% of US consumers plan to buy a smart speaker in the next year. Using natural language makes systems accessible to a broad range of people with low barriers to entry.
We’re experimenting with bringing the same benefits to the world of business data.
It represents data as a mixture of sparse features instead of as items from fixed categories
In the real world, nothing fits into neat boxes. Words have many meanings. Sentences can be ambiguous. Concepts and thoughts are related to others, in many different nuanced ways. Fall-leaves, tobacco and leather seem to go together, but why exactly?
Our data representation supports and embraces this deep interconnectedness. We achieve this by representing data as mixtures of sparse features (i.e. many dimensional vectors). These representations are created using learned embeddings and learned transformation functions.
This allows the query engine to better use the nuance of the query’s words to find relevant data. It allows it to aggregate and filter data based on learned sub-categories, of which membership is not binary.
It learns multi-step deep algorithms from examples
Many times in life, we can specify the inputs and outputs, however working out how to get between them is hard (for example, try writing a series of rules to tell if a photo is of a hotdog). The great gift of machine learning techniques is that can work out the middle part in the right circumstances.
Classical algorithms, the ones readily implemented in traditional database query engines, are very rigid. Each step must be a clear-cut decision with easily specified inputs. Contrast this to a learned algorithm, where each step of the algorithm can incorporate many weak signals to work out what to do next. Furthermore, it can do many different sub-steps in parallel, weaving a much more complex solution than could be written by an engineer.
This is like comparing how many people cook in the kitchen versus a recipe book: we measure ingredients by eye, choose ingredient combinations by feel and cook it until it smells and looks good. We improvise. None of this is captured by a recipe book’s set steps and cooking time.
How to build it
Such a radical departure from how current query engines work requires a similar departure in the underlying technology.
We’re using a neural network as the core of the query engine. We present the database information as tables of data and adjacency matrices (e.g. an array of a connections) to the neural network, and let it process the data and query to produce a result.
The network processes the query through an RNN and learned word embedding. This provides both an array of query tokens and also an overall query vector.
The data is then processed through a network reminiscent of the Transformer architecture. After first applying learned embeddings to the data, it is then passed through a series of hierarchical attention systems. These allow the network to leverage task-specific sub-networks and to combine earlier calculations together to form complex aggregates.
You can see a basic working example in this recent article.
Whilst there is too much detail to cover in this high-level overview, some of the sub-networks include (for our graph-processing network):
- Node property recall
- Edge (i.e. relationship) recall
- Using previous step’s output as addressing instructions for the above
- Iterative message passing
- Recalling previous step’s output and transforming them in a range of ways
In summary, we believe that this architecture could present the future evolution of databases. It is able to answer queries just like an Alexia or Siri can, for real non-technical business users. It is able to perform the sort of fuzzy operations that real world questions require.
Building this technology requires a radical re-thinking of the foundations of databases, a lot of innovation and problem solving.