Facebook PrestoDB — Full Review
Should you consider using Presto to query your Hadoop?
This article is meant to answer 3 questions we asked ourselves:
- Can PrestoDB be a good alternative to Apache Impala?
- Is Presto better than Impala?
Through this article I’ll examine PrestoDB according to the parameters I find relevant and in the summary, I’ll answer each of the questions above.
What is Presto?
Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes.
It was developed in Facebook and they use it in production as well as Airbnb, Dropbox, Netflix, Uber, AWS (EMR & Amazon Athena), and more big names.
There is a nice documentation in the official website of Presto, but its not as detailed as the documentation of Impala in the Cloudera website. The code is much less documented than the code of Impala.
Since Presto is used in many big internet companies — there is a pretty good community out there about it:
- 7.2k stars on Github (compared to 2k for Impala)
- 83k Google results (compared to 186k for Impala)
- 3.2k Stackoverflow questions (compared to 2.6k for Impala)
- #126 on db-engines popularity ranking (compared to #33 for Impala)
We can still see that while Presto is more popular on github, Impala has much more google results and a much better popularity ranking according to db-engines (whice considers Google Trends, the number of job offers related, linkedin mentions, tweets and tech discussion).
Update (thanks to James Taylor’s comment): Impala is an Apache project so activity happens on the Apache dev and user mailing lists. The project is only mirrored to Github, so it’s not typical that users would star the mirror.
Currently, the only reliable enterprise support for Presto is by a company named Starburst who also contributed to the code of Presto and are recommended in the official Presto website.
There are 2 main components in Presto: Coordinator and Worker.
Unlike Apache Impala, it’s not common to use more than 1 coordinator per cluster, although from what I’ve checked it’s possible to configure multiple coordinators.
The architecture is simple:
The coordinator is made of 3 components: Parser, Planner and Scheduler.
Its pretty similar to Impala as they both based on Google Dremel.
Presto has connectors that provide Metadata API for the parser, Data location API for the scheduler and Data stream API for the workers in order to perform queries above multiple data sources. A single query can read data from multiple sources — that’s the main advantage of Presto, it’s super-pluggable.
Hadoop Integration — Hive Connector
The Hive connector allows querying data stored in HDFS and mapped with Hive Metastore.
It’s obvious but the execution engine is not Apache Hive or MapReduce, it’s Presto.
The supported file types are: ORC, Apache Parquet, Avro, RCFile, SequenceFile, JSON, Text.
It supports Apache Hadoop 2.x and CDH 4 and 5. But Cloudera doesn’t support Presto in the Cloudera Manager (because they have Impala).
Because Presto is so pluggable, it is possible to configure multiple Hadoop clusters to query from.
Metadata Caching Mechanism
Presto has a metadata caching mechanism that is very poorly documented and not really discussed on the web. I’ll write here what I found and how I understood it.
Presto’s metadata cache is in the coordinator, so it doesn’t suffer cache consistency problem if user only changes objects via Presto. However, if user also changes objects in the metastore via Hive, it suffers from inconsistency until a refresh occurs.
Presto doesn’t have a REFRESH statement like Impala has, instead there are 2 parameters in the Hive connector properties file:
It means that for each entry in the cache, a refresh occurs every X time and after Y time it’ll be evicted from the cache. In release 0.174 the default value of those parameters was set to 0 to “disable cross-transaction metadata caching”.
The fact that there’s a refresh interval and a cache timeout has its pros and cons:
- No need for a metadata refresh management like we have in Impala.
- If the refresh interval is configured to be a small value, the metadata cache will probably not suffer inconsistency issues.
- The cache-ttl parameter is a great solution to our current metadata explosion (topic size) issue in Impala.
- A refresh may be heavy and unnecessary refreshes can be a bottleneck, maybe performing the refreshes ourselves is a more efficient solution.
- In Impala 2.7 a refresh can be on a specific partition and that way its much lighter than a refresh on an entire table. I don’t know if the refresh in Presto is efficient.
- The cache-ttl parameter may cause metadata that is frequently used to be evicted from cache. But it may got fixed in release 0.154 and after each query the cache timeout is being reset. Which means cache entries might never expire (which is good if it’s because they’re being used frequently)
The Presto metadata cache adopts a Guava based LRU cache which has a default expiration of 1h and default max entry of 10,000 (tunable). If we want to learn more about Presto’s metadata cache mechanism, we can read about the Guava cache.
As I said, the documentation is poor on this subject and even the 2 parameters I’ve talked about are not documented in the Presto website, I found them in discussions on the web.
Presto has 2 parquet readers, the first one was developed by the Presto team and the second one added in release 0.138 was developed in Uber.
In an Uber blog post about Presto and Apache Parquet, they say that they’ve chosen Parquet as their columnar format to achieve high performance. But the Presto’s Parquet reader was not leveraging all the advantages of Parquet and they wrote their own one.
The Uber Parquet reader is not enabled by default and can be enabled through the config property: hive.parquet-optimized-reader.enabled=true
The improvements of the new Parquet reader are:
- Nested column pruning
- Columnar reads
- Predicate pushdowns
- Dictionary pushdowns
- Lazy reads
The new parquet reader of Presto is anywhere from 2–10x faster than the original one.
To enable Parquet predicate pushdown there is a configuration property: hive.parquet-predicate-pushdown.enabled=true
In an AWS conference we’ve been to, the recommended file format for Presto was Parquet, so I guess Presto is overall pretty optimized for the Parquet format.
Currently the only connector that supports statistics is the Hive connector.
Presto leverages the table statistics of Hive if available, and there is no way to compute statistics in Presto itself (unlike Impala). Collecting table statistics is done through Hive.
The Parquet format has column-level statistics in its foster and the new Parquet reader is leveraging them for predicate/dictionary pushdowns and lazy reads.
Presto can support data locality when the workers are installed on the same machines as the HDFS data nodes. If you set`node-scheduler.network-topology=flat`, Presto will try to schedule splits on the machines that have the data. See https://prestosql.io/docs/current/admin/properties.html#node-scheduler-network-topology
But it’s important to understand that data locality as a concept is getting much less relevant with the bandwidth today (10GbE). In my opinion, compute and storage should be completely separated, at least when it comes to ad-hoc queries. So if we would use Presto it would be in a separated cluster that is dedicated to Presto.
To query Presto, one can use JDBC/ODBC, which means every sql client and programming language can query Presto.
But the specific tools I think analysts should use to query Presto are:
- DataGrip — great JetBrains product we already use to query Impala.
- Airpal — A web-based query interface tool built on top of Presto by Airbnb. It’s the equivalent of Hue for our use-case and as for what I’ve seen — it’s pretty user-friendly.
As mentioned in the ‘Hadoop Integration’ section, Cloudera doesn’t support Presto in their Cloudera Manager. Therefore, if we want to use Presto we need to use a proper administration tool.
The 2 main tools I’ve found for Presto administration are:
- The Presto Web UI: Presto provides a web interface for monitoring and hanging queries. It basically gives the same abilities and shows the same data as the Impala web UI (at the coordinator level).
- Presto Admin: it’s a tool that Presto recommends for installing and managing Presto on a cluster. It provides easy-to-use commands to:
- Install and uninstall Presto across your cluster
- Configure your Presto cluster
- Start and stop the Presto servers
- Gather status and log information from your Presto cluster
Presto can be an alternative to Impala. The reason is simple: it’s an MPP engine designed for the exact same mission as Impala and has many major users including Facebook, Airbnb, Uber, Netflix, Dropbox, etc. It’s main advantage over Impala is the fact that it is super-pluggable.
But in terms of performance over Hadoop — Impala in its newest version may perform better than Presto.
After everything I’ve read on the web about Presto, I’m convinced we should properly test it in a semi-prod environment and see its actual performance.