Vectorized and Compiled Queries — Part 1

Tilak Patidar
4 min read · Oct 7, 2018


While assessing databases or data-computation engines, you have probably stumbled upon terms such as vectorization support, compiled queries, or query-specific code generation. But what do they mean? Do they actually matter? Let’s find out.

Before going into the what, let’s ask the why first. Why are vectorized or compiled queries needed? To answer that, let’s look at what came before them: the volcano iterator model . . .

Volcano iterator model

Imagine a volcano: it is like a cone, or at least that is how most kids draw it, including me :). The one below is definitely not drawn by a kid.

As the shape of a volcano suggests, there is a lot of lava (data) at the base, and only a little of it fumes out of the top.

Here is the non-fun wiki definition:

The Volcano model (originally known as the Iterator Model) is the ‘classical’ evaluation strategy of an analytical DBMS query: Each relational-algebraic operator produces a tuple stream, and a consumer can iterate over its input streams. The tuple stream interface is essentially: ‘open’, ‘next’ and ‘close’; all operators offer the same interface, and the implementation is opaque. Each ‘next’ call produces a new tuple from the stream, if one is available. To obtain the query output, one “next-next-next”s on the final RA operator; that one will in turn use “next”s on its inputs to pull tuples allowing it to produce output tuples, etc. Some “next”s will take an extremely long time, since many “next”s on previous operators will be required before they emit any output. Example: SELECT max(v) FROM t; may need to go through all of t in order to find that maximum.

In easier words for the brain:

It is a chain of iterators, and data flows through it when the topmost iterator calls next() on the iterator below it, which propagates .next() calls all the way down to the bottom-most iterator. Each iterator might apply a predicate or some other operator. And, as the volcano picture suggests, the data usually shrinks as it flows up the iterator chain.
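To make the chain concrete, here is a minimal sketch of that interface in C++. The class names (Operator, Scan, Filter) and the single-integer “row” are made up for illustration, not taken from any particular engine; the point is only the open/next/close shape and the way every call on the top operator pulls rows up from below, as if answering SELECT max(v) FROM t WHERE v > 10.

```cpp
#include <algorithm>
#include <climits>
#include <cstdio>
#include <optional>
#include <vector>

using Row = int;  // one integer column stands in for a full row

// Every operator exposes the same open/next/close interface.
struct Operator {
    virtual void open() = 0;
    virtual std::optional<Row> next() = 0;  // one row per call, empty when exhausted
    virtual void close() = 0;
    virtual ~Operator() = default;
};

// Leaf: scans an in-memory "table" one row at a time.
struct Scan : Operator {
    std::vector<Row> table;
    std::size_t pos = 0;
    explicit Scan(std::vector<Row> t) : table(std::move(t)) {}
    void open() override { pos = 0; }
    std::optional<Row> next() override {
        if (pos < table.size()) return table[pos++];
        return std::nullopt;
    }
    void close() override {}
};

// Filter: keeps pulling from its child until a row satisfies the predicate.
struct Filter : Operator {
    Operator& child;
    explicit Filter(Operator& c) : child(c) {}
    void open() override { child.open(); }
    std::optional<Row> next() override {
        while (auto row = child.next())
            if (*row > 10) return row;  // predicate: v > 10
        return std::nullopt;
    }
    void close() override { child.close(); }
};

// The consumer "next-next-nexts" on the topmost operator to compute max(v).
int main() {
    Scan scan({3, 42, 7, 99, 11});
    Filter filter(scan);
    filter.open();
    int max_v = INT_MIN;
    while (auto row = filter.next())   // each call propagates down the chain
        max_v = std::max(max_v, *row);
    filter.close();
    std::printf("max(v) = %d\n", max_v);  // prints 99
}
```

Notice that every single row costs one virtual next() call per operator in the chain; keep that in mind for what follows.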

What is wrong with it?

Well, at first glance nothing, until people realized that it causes a lot of instruction cache misses.

What are instruction cache misses?

What is this madness you just volunteered yourself into?

Not a food chain, just how the data is guzzled down to the registers for operations

So, consider a database that uses the volcano iterator model. Your data resides on disk, from where the disk manager reads it and writes it into main memory at the request of the buffer manager. From main memory it is then loaded into the cache memory (L1 and L2), and from the caches it is moved into registers, where the operations are actually performed. Crucially, both the data and the assembly instructions have to be loaded into the cache memory.

So, you have the instructions and the data residing in the cache. But what if you have too many instructions? Your query was big, and so was the generated execution code, which translated into more assembly instructions. And generic query-execution code results in more branching, and thus more useless instructions in the cache. But hang on a second.

What do you mean by generic query-execution code?

Imagine you are writing code that supports all kinds of queries and all data types. Writing such code requires the right abstractions and also a lot of if-then branches to handle the various data types.

Example: Writing a ‘greater than’ operator for your query execution engine.

The greater-than operator needs to know the data type: dates, strings, and ints each call for different behaviour and generate different assembly instructions.

Remember the good ole’ switch statement? It means loading more instructions than necessary.

Branching statements, in general, lead to loading instructions that are not needed, because those instructions will never be executed if their condition is not satisfied.
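As a rough illustration (the Value and TypeTag types below are hypothetical, not taken from any real engine), a “generic” greater-than has to branch on the data type every single time it is called:

```cpp
#include <string>

enum class TypeTag { Int, Date, String };

// A runtime value carries a type tag plus storage for each supported type.
struct Value {
    TypeTag tag;
    long i;          // used for Int and Date (days since epoch)
    std::string s;   // used for String
};

// One "generic" operator handles every type, branching on the tag each time.
bool greater_than(const Value& a, const Value& b) {
    switch (a.tag) {
        case TypeTag::Int:    return a.i > b.i;
        case TypeTag::Date:   return a.i > b.i;
        case TypeTag::String: return a.s > b.s;
    }
    return false;
}

// In a row-at-a-time engine this switch is evaluated once per row, e.g.:
//   while (auto row = child.next())
//       if (greater_than(row->col, constant)) emit(*row);
```

Only one branch ever runs for a given column, yet the instructions for all of them take up space in the instruction cache.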

What if I don’t write generic code that supports everything? What if I write just enough code for my own query? What if my query execution engine generates just enough code for my own query?

Those were the kinds of questions that must have crossed the minds of the people who first wondered whether compiled queries were possible.

And what if these instruction cache misses occur for every single row?

After all, the volcano model operates on one row at a time.

Well, then we have two alternatives:

  • Compiled queries: generate instructions that are specific to the query and contain less branching.
  • Vectorized queries: let the cache misses happen, but not for every row; pay the cost once per batch of rows instead (see the sketch after this list).
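As a quick preview of the second idea, here is a rough sketch (the names and the batch shape are made up for illustration; part 2 will treat vectorization properly) contrasting a per-row predicate with a per-batch one:

```cpp
#include <vector>

// Row-at-a-time: this tiny predicate, plus all the operator plumbing around
// it, is re-entered once per row.
bool passes(int v) { return v > 10; }

// Batch-at-a-time: one call filters a whole batch in a tight loop, so the cost
// of bringing the operator's instructions into the cache is paid once per
// batch instead of once per row.
std::vector<int> filter_batch(const std::vector<int>& batch) {
    std::vector<int> out;
    out.reserve(batch.size());
    for (int v : batch)
        if (v > 10) out.push_back(v);  // same predicate, applied in bulk
    return out;
}
```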

So, now we have an idea of why compiled queries and vectorized queries became necessary. That is it for now; we will discuss vectorized queries in part 2, followed by compiled queries in part 3.

If you have anything to add or you spotted an error in my post, do not hesitate to contact me; I’ll be happy to make changes and learn a few things.
