Testing Presto as an ad-hoc SQL engine for RDR data

Lessons learned from testing Presto as a SQL engine on top of our “data lake”

As part of the NMC (Nielsen Marketing Cloud) architecture, most of our data will be stored in a repository called the RDR (Raw Data Repository).
In short, the RDR is inspired by the data-lake concept. It will contain most of NMC’s unfiltered data and will be implemented as Parquet files on AWS S3.

The RDR will have many features, but here I will discuss querying it with ad-hoc SQL queries — internally by our Data Science and Ops teams, and eventually by the many different applications using the NMC infrastructure. Currently we use data-warehouse solutions (on-premise: IBM Netezza; in the cloud: Snowflake) that require us to load data from S3 into the warehouse, run queries there, and, if needed, export the results back to S3.

What Presto brings to the table is querying our Parquet files directly on S3 (no additional ETL), so it seemed a good solution to investigate.

Since there is plenty of information about Presto online, I will focus on my insights and not on Presto itself.

The Test

The setup for the test:

  1. Where — AWS EMR running Presto on 18 r3.2xlarge instances (memory-optimized, since Presto does all of its computing in memory).
  2. What — The data was actual data we use in our system (one day of data from 2 different schemas) and was about 1TB in size.
  3. How — The SQL queries were also real-life queries we use internally in eXelate.

The results of the tests I ran:

Optimizations

Presto’s initial results were up to 3 times slower than the results above. 
After investigating many possible optimizations, here is a summary of the most interesting ones, by category:

1) Parquet file

  • Files should be at least dozens of MB in size (up to 1GB)
  • Pre-sorting the data improves query time and also saves disk space
  • Partitioning can improve performance (use MSCK REPAIR TABLE in Hive to create a table from partitioned folders)
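To illustrate the partitioning point, here is a rough sketch of how a partitioned Parquet layout on S3 can be registered through Hive (the bucket, table, and column names are hypothetical, for illustration only):

```sql
-- Hypothetical external table over Parquet files laid out as
-- s3://my-bucket/rdr/events/dt=2017-01-01/... (one folder per day)
CREATE EXTERNAL TABLE events (
  user_id STRING,
  segment STRING,
  ts      BIGINT
)
PARTITIONED BY (dt STRING)
STORED AS PARQUET
LOCATION 's3://my-bucket/rdr/events/';

-- Register the existing dt=... folders as partitions:
MSCK REPAIR TABLE events;
```

Once the partitions are registered, a query that filters on the partition column (e.g. WHERE dt = '2017-01-01') only reads the matching folders instead of scanning the whole table.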

2) Query optimizations:

  • COUNT(DISTINCT col1) performs poorly (because it is not distributed) — using approx_distinct(col1) improves performance considerably
  • When joining, the bigger table should come first (since the second table is distributed among all the nodes)
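The two query-level points above can be sketched like this (table and column names are hypothetical):

```sql
-- Slow: an exact distinct count is hard to distribute.
SELECT count(DISTINCT user_id) FROM events;

-- Much faster: an approximate distinct count, at the cost of a
-- small standard error.
SELECT approx_distinct(user_id) FROM events;

-- Join order: larger table first; the second (right-hand) table is
-- the one distributed across the worker nodes.
SELECT e.user_id, s.name
FROM big_events e          -- large table first
JOIN small_segments s      -- smaller table second
  ON e.segment = s.id;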

3) Presto configurations:

  • After much tweaking, these are the configurations I used (explanations in the # comments):
config.properties:  
# accommodate memory-intensive queries
query.max-memory=900GB
query.max-memory-per-node=25GB
# distribute joined tables across all nodes
query.initial-hash-partitions=18
# Parquet is columnar
optimizer.columnar-processing=true
# rewrite aggregation queries that are insensitive to the cardinality of the input
optimizer.optimize-metadata-queries=true
hive.properties:  
hive.parquet-optimized-reader.enabled=true
hive.parquet-predicate-pushdown.enabled=true
jvm.config:  
-Xmx45g (raised to match the increase in query.max-memory-per-node)

Generally speaking, Presto’s configuration doesn’t need much managing (also according to the official documentation), so most of the optimization lies in the way the data is stored, in the configuration and setup of the cluster itself (e.g. using memory-optimized servers), and, as with any SQL engine, in the queries themselves.

Conclusion

Presto is great but does not provide a full solution. While researching, I learned that most firms use it alongside Hive (or a similar robust engine): Presto serves interactive queries that are not heavy on memory, while Hive crunches the more intensive, less time-bound queries (e.g. multiple joins on large tables).

I am not saying Presto cannot perform those queries; it can. The problem is that it would take a very large cluster to run them, which would be problematic cost-wise.

Another important issue was that Presto’s performance wasn’t as good as I expected, and worse than the data-warehouse solutions we currently use (a bit faster on simple queries, but up to 4 times slower on complex queries for the same compute cost).

This was strange to me, since everywhere I read about Presto it was described as lightning fast. My assumptions, after a lot of research, are that:

  1. Presto is Java-based, so solutions like Snowflake and Impala, which are closer to native code, will have an advantage in performance.
  2. Presto works best with ORC; its Parquet support is still not optimized (although it is on Presto’s roadmap).
  3. Presto excels when running on very large clusters (Facebook and Netflix use >200 servers in production), and perhaps on a medium-sized cluster it will not shine.

Putting speed aside, Presto has many features and integrations and continues to grow, which makes it a very powerful tool. 
In terms of administration, companies like Amazon, Qubole and Teradata help run Presto without much trouble and also provide supplementary tools. 
Many companies also like Presto because it is not only open source but also written in Java, with a good structure and documentation, which makes it easier to customize.

What’s next?

The next step will be to use a much larger data set (90 days instead of 1 day) to see whether Presto performs better than other tools at that scale, and also to test additional technologies for the more complex queries.
