Introducing Snel: a columnar database featuring Just-In-Time query compilation for fast on-line analytics

Cristian Adamo
Grandata Engineering
Oct 18, 2017 · 8 min read


We have built an efficient, embeddable relational database engine featuring Just-In-Time (JIT) compilation of queries and column-oriented storage. It is designed for fast on-line analytics workloads and is based on the paper Efficiently Compiling Efficient Query Plans for Modern Hardware by Thomas Neumann.

We called it Snel, which stands for “SQL Native Execution for LLVM”.

Background

At Grandata we work with telco and financial data, so we have to deal with tons of data. We run a lot of algorithms over this data in batch fashion on our Spark cluster, leaving the data ready to be used by our on-line analytics platform, Social Universe. This is where Snel comes into the picture: it allows users to perform complex real-time analysis over massive datasets. Social Universe has strict response-time constraints, meaning that results must be delivered within a matter of milliseconds.

Motivation

Back in 2012 we relied on MySQL, but we found that the kind of queries we executed put the database under too much pressure and took a long time to complete (at least ~1 minute). This was too long for a user to wait, so we started to look for alternatives. At that time there were almost no alternative databases that could handle our load, so we ended up building an in-memory engine called VTor (Virtual Table of Vectors), the predecessor of Snel. This engine had all the queries predefined in code, was written in C, and was therefore really fast.

We started to realize that we needed more flexibility in queries without losing performance, as well as some aggregations not supported in standard SQL, like histograms and percentiles, among others. After some research we came across the paper Efficiently Compiling Efficient Query Plans for Modern Hardware, and based on our experience building VTor we started building a prototype on top of its ideas.

Snel

Snel is an embeddable relational database that relies on the LLVM infrastructure to compile queries and uses SQLite as its main interface. It has been battle-tested in production and is currently used in multiple systems and products at Grandata, wherever we need to support real-time analytics with low latency over massive datasets.

Snel tries to minimize memory accesses by keeping values in CPU registers for as long as possible. Reading a value from main memory takes on the order of 100 ns, while reading a value from a register takes about 1 ns, roughly 100x faster. Of course we cannot eliminate memory accesses, but we can amortize their cost by working in registers as much as we can. In this way we exploit spatial and temporal data locality, improve branch prediction, and reduce function calls (method dispatch) so we do not break the CPU’s execution pipeline, and ultimately this scheme speeds up the whole system.
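To make the idea concrete, here is a rough illustration (not Snel code) of the difference between accumulating into a local variable, which the compiler can keep in a register, and accumulating through a pointer, which forces a memory read and write on every iteration:

#include <cstddef>
#include <cstdint>

// Running total lives in a local variable; the compiler can keep it in a
// CPU register for the whole loop and touch memory only for the column reads.
int64_t sum_in_register(const int64_t *col, size_t n) {
    int64_t total = 0;
    for (size_t i = 0; i < n; ++i) {
        total += col[i];
    }
    return total;
}

// Running total lives behind a pointer; every iteration reads and writes
// memory, defeating the register-resident accumulation described above.
void sum_through_memory(const int64_t *col, size_t n, int64_t *total) {
    *total = 0;
    for (size_t i = 0; i < n; ++i) {
        *total += col[i];
    }
}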

Currently, with Snel databases on the order of thousands of columns, millions of rows, and terabytes of data, our queries complete within milliseconds.

Snel currently supports the following key features:

  • Subset of SQL Language and some custom aggregations.
  • Predicate pushdown evaluation.
  • Column-oriented storage.
  • Just-In-Time query compiler through LLVM.
  • SQL interface through SQLite command line tool and JDBC driver.
  • Spark interface.

SQL Language + Custom Aggregations

Snel only supports a subset of SQL Language:

  • Simple WHERE clauses over columns, with AND and OR constraints, and with sub-queries (WHERE column IN (SELECT …))
  • JOIN, GROUP BY, ORDER BY and LIMIT clauses.
  • Aggregations: COUNT, COUNT(DISTINCT), SUM, AVG, MIN and MAX.

It also features some custom aggregations, such as:

  • Histogram and fixed-range histogram computation.
  • Two-Dimensional Histograms, very useful to compute maps.
  • Percentile.

Having these custom aggregations optimized inside Snel lets us execute them in real time over massive amounts of data without response-time penalties.
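To give an idea of what such an aggregation boils down to, here is an illustrative fixed-range histogram over a single column in plain C++ (a simplified sketch, not Snel’s actual implementation):

#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative fixed-range histogram: values in [lo, hi) are spread over
// `buckets` equally sized bins; values outside the range are ignored.
std::vector<int64_t> fixedRangeHistogram(const double *col, size_t n,
                                         double lo, double hi, size_t buckets) {
    std::vector<int64_t> counts(buckets, 0);
    const double width = (hi - lo) / static_cast<double>(buckets);
    for (size_t i = 0; i < n; ++i) {
        if (col[i] >= lo && col[i] < hi) {
            size_t b = static_cast<size_t>((col[i] - lo) / width);
            if (b >= buckets) b = buckets - 1;  // guard against rounding at the upper edge
            counts[b] += 1;
        }
    }
    return counts;
}

Because the whole aggregation is a single tight loop over one contiguous column, it benefits directly from the register- and cache-friendly execution described above.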

Predicate pushdown evaluation

In a typical query evaluation, Snel pushes the predicates down into the query plan tree, so that filters are applied as early as possible and values flow from the bottom to the top of the tree. For instance, consider the plan for the following SQL query:

SELECT COUNT(*) FROM table1 WHERE int8col1 > 3

In the resulting plan, the predicate is applied first, right after the table scan operator, which minimizes the data passed through each node of the tree. This improves query response times by orders of magnitude.

Column-oriented Storage

Contrary to row-oriented storage, where each row is stored with all its columns contiguously, one row after another (Figure 1), in column-oriented storage all the values of a given column are stored contiguously on disk (Figure 2).

The column-oriented layout is ideal for read-intensive applications, since we can read only the columns we need. As each column can be viewed as a giant array of values, we end up exploiting data locality, which makes this layout optimal for aggregation operations and also well suited to compression. On the other hand, it is not optimal for insertions, so we designed Snel with immutable columns by default, although we also support column updates as an alternative. Snel also features plain-encoding indexing: depending on the column’s data type, we store each value/rowId pair.
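A minimal sketch of the difference, with made-up column names (this illustrates the two layouts rather than Snel’s on-disk format):

#include <cstdint>
#include <vector>

// Row-oriented: each record's fields are contiguous (array of structs).
struct Row {
    int64_t user_id;
    int64_t amount;
    int32_t country;
};
int64_t sumAmountRows(const std::vector<Row> &rows) {
    int64_t total = 0;
    for (const Row &r : rows) {
        total += r.amount;  // every iteration also drags user_id/country into cache
    }
    return total;
}

// Column-oriented: each column is one contiguous array (struct of arrays).
struct Columns {
    std::vector<int64_t> user_id;
    std::vector<int64_t> amount;
    std::vector<int32_t> country;
};
int64_t sumAmountColumns(const Columns &cols) {
    int64_t total = 0;
    for (int64_t v : cols.amount) {  // only the bytes of `amount` are read
        total += v;
    }
    return total;
}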

Just-In-Time Query compiler

Each query plan is compiled to LLVM IR (Intermediate Representation) code, which is similar to assembly in its essence. Instead of having a lot of function calls and method dispatches, which break the CPU’s execution pipeline, the code ends up living in a single function that has all the logic to process the query inlined. In traditional databases, operators are functions, similar to this:

function tableScan() {
    rows = open(database)
    for (row in rows) {
        yield { row['amount'] }
    }
    yield Empty
}

function greaterThanFilter(rows, predicate) {
    for (row in rows) {
        if (row['amount'] > predicate) {
            yield row
        }
    }
    yield Empty
}

function count(rows) {
    count = 0
    for (row in rows) {
        count += 1
    }
    return count
}

result = count(greaterThanFilter(tableScan(), 1000))

In our scheme, the entire query is inlined. Therefore, the previous query will be compiled to this:

function query() {
    rows = open(database)
    count = 0
    for (row in rows) {
        if (row['amount'] > 1000) {  // <- greater-than filter inlined here
            count += 1               // <- count inlined here
        }
    }
    return count
}

These ideas come from Neumann’s paper, which aims to maximize data locality and branch predictability, keeping the query execution pipeline as busy as possible.

Based on the previous scheme, we built a prototype of a Sum Operator in C++ using LLVM to emphasize these concepts: a table scan, a filter, and the aggregation operator all embedded in the same generated function. The full prototype is in the repository linked at the end of this post.

Note: that code is a proof of concept, so there is a lot of room for further optimization.
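As a rough illustration of the approach, the sketch below (simplified, not Snel’s actual code) uses LLVM’s IRBuilder to emit a single function containing the scan loop, the filter, and the running sum, with the loop counter and the sum kept in PHI nodes, i.e. in registers:

#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/IR/Verifier.h"

using namespace llvm;

// Emits: int64_t sum_gt(const int64_t *col, int64_t n, int64_t threshold)
// Scan, filter and aggregation are generated into one loop; the loop counter
// and the running sum live in PHI nodes, i.e. in registers.
Function *emitSumGreaterThan(Module &M) {
    LLVMContext &Ctx = M.getContext();
    IRBuilder<> B(Ctx);
    Type *I64 = Type::getInt64Ty(Ctx);

    FunctionType *FT =
        FunctionType::get(I64, {PointerType::getUnqual(I64), I64, I64}, false);
    Function *F = Function::Create(FT, Function::ExternalLinkage, "sum_gt", M);
    auto Args = F->arg_begin();
    Value *Col = &*Args++, *N = &*Args++, *Threshold = &*Args++;

    BasicBlock *Entry = BasicBlock::Create(Ctx, "entry", F);
    BasicBlock *Loop  = BasicBlock::Create(Ctx, "loop", F);
    BasicBlock *Body  = BasicBlock::Create(Ctx, "body", F);
    BasicBlock *Exit  = BasicBlock::Create(Ctx, "exit", F);

    B.SetInsertPoint(Entry);
    B.CreateBr(Loop);

    // Loop header: i and sum as PHI nodes.
    B.SetInsertPoint(Loop);
    PHINode *I = B.CreatePHI(I64, 2, "i");
    PHINode *Sum = B.CreatePHI(I64, 2, "sum");
    I->addIncoming(ConstantInt::get(I64, 0), Entry);
    Sum->addIncoming(ConstantInt::get(I64, 0), Entry);
    B.CreateCondBr(B.CreateICmpSLT(I, N), Body, Exit);

    // Loop body: load col[i], apply the filter, conditionally add to sum.
    B.SetInsertPoint(Body);
    Value *Ptr = B.CreateGEP(I64, Col, I);
    Value *V = B.CreateLoad(I64, Ptr, "v");
    Value *Keep = B.CreateICmpSGT(V, Threshold, "keep");
    Value *NewSum = B.CreateSelect(Keep, B.CreateAdd(Sum, V), Sum, "sum.next");
    Value *NextI = B.CreateAdd(I, ConstantInt::get(I64, 1), "i.next");
    I->addIncoming(NextI, Body);
    Sum->addIncoming(NewSum, Body);
    B.CreateBr(Loop);

    B.SetInsertPoint(Exit);
    B.CreateRet(Sum);

    verifyFunction(*F);
    return F;
}

Handing the resulting module to one of LLVM’s JIT engines (MCJIT or ORC) yields a native function pointer that can be called directly to execute the query.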

Running the generator produces the textual IR for the query function, which the LLVM infrastructure then compiles down to native machine code.

Experiments analysis & Conclusions

Given all of the optimizations we have discussed so far, we ran a set of experiments to show the execution differences between native implementations and code-generated ones.

In this case, we ran four implementations of our Sum Operator described above:

  • C++ with LLVM’s IR code generation.
  • Native C++ code.
  • Native Scala code.
  • Scala with JIT Java code generation using the Janino compiler.

Each implementation ran 10 times over a column with 1 billion rows, using mmap to read the column file in order to improve read performance.

To be fair in our analysis, none of the implementations had any optimizations applied, either from the compiler or in the code. That is because the C++ and LLVM implementations have easy-to-apply optimizations, while optimizing the Scala and Java JIT implementations is nontrivial.

Note: these tests were run on a MacBook Pro with a 2.7 GHz Intel Core i5 and 16 GB of 1867 MHz DDR3 memory.
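For reference, the native C++ variant has essentially the following shape (a simplified sketch rather than the exact benchmark code; the file name and the raw int64 layout are assumptions): the column file is mapped into memory with mmap and aggregated in one tight loop.

#include <cstdint>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    // Assumed layout: the column file is a raw array of int64 values.
    const char *path = "column.bin";
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }
    size_t n = static_cast<size_t>(st.st_size) / sizeof(int64_t);

    // Map the column file so the OS pages it in as the loop scans it.
    void *base = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }
    const int64_t *col = static_cast<const int64_t *>(base);

    int64_t sum = 0;
    for (size_t i = 0; i < n; ++i) {
        sum += col[i];
    }
    printf("sum = %lld\n", static_cast<long long>(sum));

    munmap(base, st.st_size);
    close(fd);
    return 0;
}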

At first glance, we could say that the Java generated-code version is the clear winner, but a key point to look at is the first iteration (execution number 1), where the query hits the compilers for the first time. In the JVM implementations, the first run takes a while, since the JVM is still running the bytecode produced by the compiler, which is not very efficient yet. After that, the timings fluctuate over time because the JVM’s profiler identifies frequently called methods (HotSpot) and keeps re-optimizing the generated bytecode as it runs (more on that in this post). The LLVM and C++ implementations, on the other hand, are more stable over time thanks to their ahead-of-time compilation.

Given that all these tests ran the same query over and over again, the most important execution time to look at is number 1, where the clear winner is the LLVM implementation. In real-world systems we would put a cache in front of the database to handle subsequent executions of the same query, because reading a cached result is far faster than re-executing even Java-generated code.

LLVM also has more room for optimization, both from the code and from the compiler side. For example, the LLVM infrastructure provides a wide range of optimizations, called passes; a pass traverses some portion of a program to either collect information or transform it, and applying passes reduces response times. In Snel, we found that moving operations around and adding passes could improve speed by 1–4x.
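As a small sketch of what wiring up a few standard passes looks like with the legacy pass manager (this mirrors LLVM’s Kaleidoscope tutorial rather than Snel’s actual pipeline):

#include "llvm/IR/LegacyPassManager.h"
#include "llvm/IR/Module.h"
#include "llvm/Transforms/InstCombine/InstCombine.h"
#include "llvm/Transforms/Scalar.h"
#include "llvm/Transforms/Scalar/GVN.h"

using namespace llvm;

// Runs a handful of standard optimization passes over a generated query function.
void optimizeQueryFunction(Module &M, Function &F) {
    legacy::FunctionPassManager FPM(&M);
    FPM.add(createInstructionCombiningPass()); // peephole / algebraic simplifications
    FPM.add(createReassociatePass());          // reassociate expressions
    FPM.add(createGVNPass());                  // eliminate redundant loads and computations
    FPM.add(createCFGSimplificationPass());    // merge and simplify basic blocks
    FPM.doInitialization();
    FPM.run(F);
}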

A note regarding the Java generated-code implementation: it is much easier to implement and read than LLVM IR generation, and even though there are no trivial optimizations to apply, it could still be a good alternative to LLVM.

You can find the code used for this analysis, along with the C++ LLVM example and the Java implementations, here: https://github.com/GranData/databases-pocs.

Final thoughts & future work

Writing LLVM IR code is tedious, hard, and error-prone, and writing efficient IR is even harder, although we can leverage the LLVM compiler to optimize for us in most cases. We could also explore SIMD instructions to exploit data-level parallelism, which could yield a large speedup.

Nowadays there are more open-source alternatives that we think could meet our low-latency constraints, and we are open to analyzing them as potential replacements. We are also working on a different alternative to this model, one that does not require dealing with low-level code such as LLVM IR. Some of these alternatives are JIT-compiled Java, as shown above, or even C++ code generation.

Snel is still under development and is not publicly available yet, but we hope to improve the code base, add multi-tenant support over a distributed network, and add more cool features, with the goal of open-sourcing it in the future.

We want to give special thanks to the members of the team that made this project possible: Alejo Sanchez and Marcelo Mottalli.
