SQL Query Engines — Intro and Benchmark

Tal Joffe
Nov 1 · 7 min read

Intro

In this post I will follow up on the introduction to big data SQL query engines from a previous post, focusing on the technologies we used or tested at NMC. I will get a bit more technical about the relevant technologies, share our insights, and ultimately describe which query engine we concluded is best for our use case.

Motivation

At NMC we have multiple Data Science teams that run analytical SQL queries on our big data. We recently finished a benchmark that aimed to find the SQL query engine that would serve them best in terms of execution time and reliability, while also being easy to manage and scale, since we plan to have even more data in the near future.

The benchmark mainly targeted this use case:

  • Mix of complex and simple queries
  • Users only use SQL
  • Queries often scan 30 or 90 days of data (>100 TB)
  • Execution time should be quick, though not necessarily interactive (under one minute)

SQL Query Engines

As I described in a previous post, query engines provide the same kind of functionality data warehouses supply, but with a slightly different approach — separating storage from compute, with an emphasis on elasticity and scalability. Query engines can also be used as part of the data lake paradigm, as a tool for extracting data from it.

So how do they work? Glad you asked…

General architecture

Every engine has its own architecture, but to simplify, I will present a general one.

Computation is performed by a cluster: a master node is in charge of translating SQL queries into an execution plan and distributing that plan's tasks among the worker nodes. A metadata store holds table definitions, functions, etc., which the “planner” uses to interpret the SQL.
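To make this concrete, here is a minimal, hypothetical sketch of that flow in plain Python (not any real engine's API): the “planner” consults a metadata store to split a query into per-file tasks, fans them out to workers, and merges the partial results.

```python
# Toy illustration of the coordinator/worker architecture (not a real engine).
from concurrent.futures import ThreadPoolExecutor

# "Metadata store": maps table names to the files holding their data.
CATALOG = {
    "events": ["events/part-0", "events/part-1", "events/part-2"],
}

# Fake per-file row counts standing in for actual file contents.
FAKE_ROWS = {"events/part-0": 10, "events/part-1": 20, "events/part-2": 30}

def plan(table):
    """The 'planner' uses the metadata store to turn a query into tasks."""
    return [("scan_and_count", f) for f in CATALOG[table]]

def worker(task):
    """Each worker executes one task on its slice of the data."""
    _, filename = task
    return FAKE_ROWS[filename]

def execute_count(table):
    """Coordinator: plan, fan out to workers, merge partial results."""
    tasks = plan(table)
    with ThreadPoolExecutor(max_workers=3) as pool:
        partials = list(pool.map(worker, tasks))
    return sum(partials)  # final aggregation, e.g. SELECT COUNT(*) FROM events

print(execute_count("events"))  # 60
```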

Now hold on since we are going to dive in a bit…

I will explain the different components bottom up — first the storage and file formats and then some architectures of specific engines we tested.

Technologies

Storage

In our test we looked at two storage options: HDFS and S3. Although they are not the only storage solutions out there, they are certainly the most common ones for big data query engines, and the only relevant ones for our use case.

HDFS

HDFS is probably best in terms of performance (read/write latency), but it can be expensive and requires some maintenance (especially if you install your own cluster rather than use a managed service such as AWS EMR).

S3

Amazon’s S3 cloud storage solution is virtually unlimited and very reliable. It also has some nice features such as encryption, but most importantly, when working with AWS cloud servers (EC2), its performance is pretty close to local HDFS.
You should note that S3 is not a file system and behaves a little differently from HDFS. One common example: you can’t write results to a temp folder and then move them to the destination (the default behaviour of many engines), since renaming a folder in S3 translates to copying all of the files in the path (which usually takes a long time).
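A rough illustration of why a folder “rename” is expensive on S3: objects live in a flat key space, so renaming a prefix means copying every object under it and deleting the originals. This is a simulation with an in-memory dict, not the real S3 API:

```python
# Simulating S3's flat key space with a dict: key -> object bytes.
store = {
    "tmp/run-1/part-0": b"aaa",
    "tmp/run-1/part-1": b"bbb",
    "final/other": b"zzz",
}

def rename_prefix(store, src, dst):
    """'Renaming' a folder = one copy + one delete PER OBJECT under the prefix."""
    operations = 0
    for key in [k for k in store if k.startswith(src)]:
        store[dst + key[len(src):]] = store[key]  # copy the whole object
        del store[key]                            # then delete the original
        operations += 2
    return operations

ops = rename_prefix(store, "tmp/run-1/", "final/run-1/")
print(ops)            # 4: two copies + two deletes
print(sorted(store))  # ['final/other', 'final/run-1/part-0', 'final/run-1/part-1']
```

With millions of objects under a prefix, those per-object copies are what make the “write to temp, then rename” pattern so slow on S3.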

In most of our use cases storing data on S3 was the most cost-effective approach.

File formats

There are many available file formats, but I will focus on four common ones — CSV, JSON (JSON records), Avro and Parquet.

Bottom line — if your data access is row-oriented (you don’t filter by columns often), use Avro; if your data access is column-oriented (the schema has many columns but most queries only care about a few of them), use Parquet (or another columnar format that works best with your stack, e.g. ORC with Hive/Presto).
In our use case, columnar access is by far the best solution.
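The row-versus-column distinction can be shown in a few lines of plain Python: with a columnar layout, a query that touches one column reads only that column’s values, whereas a row layout forces it through every field of every record. (This is a toy model of the idea, not the actual Avro or Parquet encodings.)

```python
records = [
    {"user": "a", "country": "US", "spend": 10},
    {"user": "b", "country": "IL", "spend": 20},
    {"user": "c", "country": "US", "spend": 30},
]

# Row-oriented (Avro-like): whole records stored one after another.
row_store = records

# Column-oriented (Parquet-like): one contiguous array per column.
col_store = {col: [r[col] for r in records] for col in records[0]}

# SELECT SUM(spend): the columnar layout touches 3 values, while the
# row layout must scan all 9 fields just to pick out "spend".
total = sum(col_store["spend"])
values_read_columnar = len(col_store["spend"])    # 3
values_read_row = sum(len(r) for r in row_store)  # 9
print(total, values_read_columnar, values_read_row)  # 60 3 9
```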

Check out Format Wars by SVDS if you want to read more about data formats.

File optimisation

After choosing the storage solution and file formats we are not done yet. There are still possible optimisations that can greatly affect performance:

  1. Compression format (GZIP/SNAPPY/LZO) — some offer better compression, some better read/write (usually it depends on file format and query engine technology)
  2. File sizing — prefer larger files to many small files
  3. Partitioning (partition, hash partition, bucketed table) — probably the biggest optimisation consideration, if done correctly could reduce query input data significantly and thus improve performance
  4. Pre-sorting data — Could improve both compression and query performance

The main tradeoff here is between write and read latencies; pre-sorting data, for instance, can be very costly, and not all queries will benefit from it.
Another tradeoff is between storage and performance — partitioning the data can increase its size considerably (for us it almost tripled the data size) but improves performance significantly (in some cases 10x or even better).
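To show what partitioning buys us, here is a sketch of Hive-style date partitioning (one dt=YYYY-MM-DD directory per day — a common convention; the bucket and paths are hypothetical). Pruning a “last 30 days” query means the engine lists and reads only those directories instead of the whole dataset:

```python
from datetime import date, timedelta

# Hive-style partition layout: one directory per day of data.
all_partitions = [
    f"s3://bucket/events/dt={date(2019, 1, 1) + timedelta(days=i)}/"
    for i in range(365)
]

def prune(partitions, start, end):
    """Keep only the partitions whose dt= value falls in [start, end]."""
    def dt_of(path):
        return date.fromisoformat(path.split("dt=")[1].rstrip("/"))
    return [p for p in partitions if start <= dt_of(p) <= end]

# A "last 30 days" query touches 30 of 365 partitions; the rest are never read.
recent = prune(all_partitions, date(2019, 12, 2), date(2019, 12, 31))
print(len(all_partitions), len(recent))  # 365 30
```

This is the mechanism behind the speedups mentioned above: for queries over 30 or 90 days, the input shrinks to a fraction of the total data before a single byte is scanned.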

Query engines

There are many SQL query engines, some of which target really specific use cases (like BlinkDB, which aims at time-bound approximate responses) and some of which take a more general approach (e.g. Impala).

In our benchmark we only looked at four technologies — Hive, Presto, Spark (open source) and Snowflake (commercial SaaS) — since they had the best fit for our stack.

Hive

Originally a platform for letting users run MR (MapReduce) jobs via SQL without knowing Java or the Hadoop framework.
Developed by Facebook and released as open source in 2008, it is probably the most mature and widely used (especially for ETLs) SQL query engine.
It does not support ANSI SQL but rather HiveQL, which is pretty close to it.
Nowadays the MR execution can be replaced with Tez or Spark to improve latency.

Presto

Presto was also developed by Facebook and released as open source in 2013. It was aimed at allowing interactive queries at petabyte scale (as a supplement to Hive).
Unlike Hive, all processing is in memory and pipelined across the network between stages. This avoids unnecessary I/O and associated latency overhead. The pipelined execution model runs multiple stages at once, and streams data from one stage to the next as it becomes available.
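The pipelined model can be sketched with Python generators: each stage consumes rows from the previous one as soon as they are produced, so nothing is materialized to disk between stages. (A toy analogy, not Presto’s actual execution code.)

```python
def scan():
    """Stage 1: produce rows one at a time (think: reading table splits)."""
    for i in range(10):
        yield {"id": i, "value": i * 2}

def filter_stage(rows):
    """Stage 2: starts working as soon as stage 1 yields its first row."""
    for row in rows:
        if row["value"] >= 10:
            yield row

def aggregate(rows):
    """Stage 3: streaming aggregation over whatever arrives."""
    total = 0
    for row in rows:
        total += row["value"]
    return total

# All three stages run as one pipeline; no intermediate results hit disk.
result = aggregate(filter_stage(scan()))
print(result)  # 10 + 12 + 14 + 16 + 18 = 70
```

Contrast this with classic MapReduce, where each stage writes its full output to disk before the next stage starts — that I/O is exactly the latency Presto’s design avoids.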

Presto was widely adopted in the industry (notably by Netflix, Airbnb and Dropbox) and has many connectors (Cassandra, Hive, Kafka, MongoDB, MySQL, PostgreSQL, Redis and more) and clients (ODBC, C, Java, Node.js, PHP, Python, R, Ruby). This allows you to easily integrate Presto with existing data sources in the organization (you can also select data from multiple data sources in the same query) and with existing applications.

Spark SQL

Spark SQL is a component on top of Spark Core that, in this context, provides SQL language support, with command-line interfaces and an ODBC/JDBC server.
So basically you can write SQL queries that will be executed by Spark Core as if you had programmed a Spark job in Scala (or Java or Python).

Spark follows the Hive syntax and also supports using the Hive metadata store.
Besides very good performance, the main strength of Spark SQL is the ability to use SQL as another API for Spark Core, and thus to integrate existing SQL queries into a Spark application.

Snowflake

Unlike the three previous technologies, which are open-source, JVM-based engines, Snowflake is a SaaS product that employs a similar architecture but provides a complete solution, including a Web UI that abstracts away all the needed functionality.

Another couple of nice features that come out of the box with Snowflake are:

  • Caching — Snowflake has smart caching that saves not only query results but also the underlying data, which improves the performance of similar queries, not just identical ones
  • Spilling to S3 — when memory and disk run out, the engine will spill back to S3 to avoid crashing
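The two caching levels can be sketched with plain Python dicts (the real mechanism is internal to Snowflake, so this is only an analogy): a result cache serves identical queries instantly, while a data cache lets similar queries skip re-reading from storage.

```python
result_cache = {}   # query text -> final result (serves identical queries)
data_cache = {}     # table name -> raw rows already fetched from storage
STORAGE = {"sales": [5, 10, 15]}
storage_reads = 0

def run_query(sql, table):
    """Toy executor: identical queries hit the result cache; similar
    queries over the same table at least reuse the cached raw data."""
    global storage_reads
    if sql in result_cache:
        return result_cache[sql]     # identical query: no work at all
    if table not in data_cache:
        storage_reads += 1           # only a cold table touches storage
        data_cache[table] = STORAGE[table]
    result = sum(data_cache[table])  # pretend every query is a SUM(...)
    result_cache[sql] = result
    return result

run_query("SELECT SUM(x) FROM sales", "sales")    # cold: reads storage
run_query("SELECT SUM(x) FROM sales", "sales")    # identical: result cache
run_query("SELECT SUM(x)+0 FROM sales", "sales")  # similar: data cache only
print(storage_reads)  # 1
```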

Bottom line

As we learned from our benchmark, and from countless other benchmarks that tested the same technologies but all got different results (from each other as well), there is no such thing as the “best query engine”.
As I hope you get by now, there are many possible ways to set up a solution, and many of the elements depend on other factors in your organization’s stack.

Generally speaking, my conclusion from the benchmark was that among the open-source tools, Spark SQL delivers the best solution for ad-hoc (but not interactive) SQL queries over Parquet data.
Hive was very slow, and Presto only performed well for simple queries (without multiple joins or aggregations).
I also tried Hive on Tez, which was similar in performance to Spark but slower in most cases, and did not have the reliability benefit that Hive on MR had over Spark.

Another issue was configuration: on all the open-source engines (maybe less so with Hive), the cluster had to be configured just right (memory, disk and other more specific parameters), otherwise very complex queries could crash (this was most evident with Presto).
For ad-hoc queries, the complexity is hard to forecast, so that was definitely something to consider.
A possible solution is to just have a very big cluster that can handle anything, but we also cared about cost-effectiveness.

When comparing with Snowflake, Spark was initially much better, until Snowflake introduced a feature called “Data Clustering”, a kind of smart data partitioning, which actually made Snowflake perform the best of all the engines.

Combining that with Snowflake’s robustness and its ease of use for data analytics made it the winner of our benchmark (for this use case).

If you got this far and have questions about something I wrote here, feel free to contact me at tal.joffe@nielsen.com.

NMC-TechBlog

A publication by the Nielsen Marketing Cloud Engineering team, where we talk about what we do and how we do things

Written by Tal Joffe

Been doing software since 2009 in various positions and disciplines, from video streaming and network security to big data and web development.
