Vectorized and Compiled Queries — Part 2

Tilak Patidar
4 min read · Oct 7, 2018


Let’s resume our investigation of vectorized and compiled queries. In part 1 of this series, we examined the Volcano iterator model, the precursor of both vectorized and compiled queries. We saw how it suffered from instruction cache misses, and that the two alternatives available to solve this are vectorized and compiled queries. In this part of the series, we will discuss vectorized queries.

Vectorized Queries

So, how do vectorized queries tackle the problem of instruction cache misses discussed in part 1 of this series?

Vector instructions apply an instruction on a vector of data

Here is a non-fun wiki definition:

Vectorization is the term for converting a scalar program to a vector program. Vectorized programs can perform one operation on multiple pairs of operands at once, whereas a scalar program can only operate on a single pair of operands at a time.

The idea is to take advantage of Data Level Parallelism (DLP) within an algorithm via vector processing, i.e., processing batches of rows together.

In easier words:

Operate on a vector (batch) of records using a single instruction stream.

By a single instruction stream, we mean that you need not load the same instructions repeatedly for each row. Load them once and operate on a batch of records. Thus, instruction cache misses are reduced, because the operator’s code executes once per batch of rows instead of once per row.

So, instead of reducing the branching and unnecessary abstractions in general query execution code, vectorization tackles instruction cache misses by batching the rows. It does not eliminate the problem, but it ameliorates it.

Vectorization and Columnar Formats

Vectorization techniques are popular for columnar formats. There are certain advantages that columnar formats provide which facilitate better performance of vectorized queries:

  • Columnar formats load only the columns required by the query. This means less I/O when reading from disk, and more useful data can be loaded into the cache.
  • Applying an operator to a vector of column values is fast because it runs in a tight loop.
  • Data availability in the cache increases because unused columns are never loaded. Having the data the query actually needs laid out side by side, like elements of an array, reduces data cache misses.

Applying vectorization techniques to a row-by-row file format would mean loading the unnecessary columns into main memory, projecting out the column of interest, and only then applying vectorization to it. This is why vectorization techniques are popular with columnar formats.

Vectorization techniques

Vectorization exploits SIMD (Single Instruction, Multiple Data) instructions, which execute a single instruction on multiple data elements.

For example, one AVX-512 SIMD instruction can process sixteen 32-bit integers at a time, thereby achieving Data Level Parallelism.

Addition of numbers using vectorization on SIMD vs scalar mode

Loop Unrolling

This technique exploits the fact that the data is now a vector being processed in a loop. To avoid the overhead of the loop itself (the counter increments and the jump instructions in the generated assembly), the body of the loop can be unrolled, that is, expanded, so that fewer iterations are needed to compute the result for the whole vector.

To reduce the number of loops the same loop is unrolled by 4

Loop unrolling is typically a compiler optimization; unrolling loops by hand results in less readable code. One downside of loop unrolling is that the binary size increases, since the number of instructions grows. Loop unrolling is often applied to nested loops where the inner loop has a constant iteration count.

Loop pipelining

Loop pipelining allows the operations inside a loop to execute concurrently, interleaving instructions from different iterations. The major benefit of this pipelining is that it reduces CPU stalls.

What is a CPU stall? Well, each instruction goes through a cycle of fetch, decode, execute, and write-back. The fetch stage can result in a cache miss, due to which a stall is experienced in the execution cycle. Running operations in a pipelined manner mitigates this stall: while one operation waits on a fetch, independent operations from other iterations keep running.

The ideal case of loop pipelining

Summary

Vectorization is running instructions on a vector (batch) of data.

Vectorization does not solve the instruction cache issue by removing unnecessary branching and abstractions from query execution code; rather, it batches the data into vectors to reduce the number of instruction cache misses.

Using columnar formats with vectorization yields excellent performance: increased data availability in the cache, only the required columns loaded into main memory, and fast execution of an operator over a vector in a tight loop.

Vectorization exploits SIMD (Single Instruction, Multiple Data) instructions, which execute a single instruction on multiple data elements.

Further performance gains could be achieved using techniques such as loop unrolling and loop pipelining with vectorization.

That is it for now; we will discuss compiled queries in part 3.

If you have anything to add or you spotted an error in my post, do not hesitate to contact me, I’ll be happy to make changes and learn a few things.
