Presto In Production

Hao Gao
Published in Hadoop Noob
Nov 3, 2017 · 3 min read

So what’s Presto?

“Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes.”

Challenges

The most important problem we want to solve here is data visibility. We already have data sitting on S3 and HDFS, and we want to bring that data to everyone with minimum effort. When we started looking, we wanted to find a tool that fits our needs:

  1. Data exploration. We have already implemented our ETL pipelines using Flink and Spark, and we have our warehouse in both S3 and HDFS. Before Presto, we would usually open a PySpark shell and write Spark SQL. That works well for people on the Data Team since we all know Spark, but people who don’t know Spark, or programming at all, have no clue how to access the data. We were tired of writing scripts to export the data to Postgres just so people could access it through a SQL client.
  2. Troubleshooting data pipelines. Every day, we work on bringing more data into our warehouse, and that data usually flows between several systems. For example, we have a Kafka topic that logs all events. We then annotate the records and split them into different topics, and for each sub-topic we use Flink to persist the data into HDFS. We try our best to test the pipelines, but no one writes bug-free code. So when we have a discrepancy in the warehouse, it is hard to debug, since you need to pull the data from different data sources and figure out where exactly the bug is.
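When a discrepancy shows up, the first step is usually comparing record counts for the same partition across systems. A minimal sketch of that check, where the catalog and table names (`kafka.events.raw`, `hive.warehouse.events`) and the `dt` partition column are hypothetical stand-ins for our real ones:

```python
# Sketch of a cross-system discrepancy check. The catalog/table names
# (kafka.events.raw, hive.warehouse.events) and the dt partition column
# are hypothetical -- substitute your own.

def count_query(table: str, dt: str) -> str:
    """Build a Presto query counting one day's records in `table`."""
    return f"SELECT count(*) FROM {table} WHERE dt = '{dt}'"

def compare_counts(run_query, dt: str) -> dict:
    """Run the same count against source and sink and report the gap.

    `run_query` is any callable that executes a Presto query and returns
    a single integer, e.g. a thin wrapper around a PyHive cursor.
    """
    source = run_query(count_query("kafka.events.raw", dt))
    sink = run_query(count_query("hive.warehouse.events", dt))
    return {"source": source, "sink": sink, "missing": source - sink}
```

Because Presto federates both catalogs, the two counts come back through one connection instead of two separate export jobs.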

Presto Setup

The diagram above shows our current Presto setup.

  1. Our data warehouse is on S3 and HDFS, and we maintain the external table mappings in the Hive metastore. Once a new partition of a table is created on S3 or HDFS, we add that partition to the Hive metastore. Because we use Flink streaming to ingest the data into HDFS, we can achieve very low latency in the data warehouse.
  2. Presto can bring warehouse, Postgres, and Kafka data together. That makes it much easier to troubleshoot data discrepancies, because we can track a record across the different systems. Presto supports pluggable connectors that provide data to query, so whenever we add another data source, we can easily bring it into Presto.
  3. Presto supports different kinds of clients, so we can use our favorite BI tools. Because it supports JDBC and ODBC, it is very easy to connect Tableau, Zeppelin, and DBeaver to Presto. There are also Python drivers such as PyHive, so you can use Presto programmatically in code, for example to turn a Presto result into a pandas DataFrame.
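Point 3 in practice looks roughly like the sketch below. The host name, port, and table name are illustrative assumptions, not our real endpoints:

```python
# Minimal PyHive sketch: run a query on Presto, get back a pandas DataFrame.
# Host, port, catalog, and table name are assumptions for illustration.

def fetch_dataframe(query: str, host: str = "presto.internal", port: int = 8080):
    """Execute `query` on Presto and return the result as a DataFrame."""
    import pandas as pd          # pip install pandas
    from pyhive import presto    # pip install 'pyhive[presto]'

    conn = presto.connect(host=host, port=port, catalog="hive", schema="default")
    return pd.read_sql(query, conn)

# Example: df = fetch_dataframe("SELECT * FROM events LIMIT 100")
```

From there the result is an ordinary DataFrame, so the whole pandas toolbox is available without any export step.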

Scale Up/Down

Our Presto cluster is busy during work hours and less busy at night, so we give it more resources when it is busy. We leverage Mesos’s Marathon framework to achieve this: we run Presto under Marathon, and if we want to scale up, we just add more instances. At night, we scale down by reducing the number of Presto workers. Marathon will also restart a worker or the coordinator if it dies.
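Scaling under Marathon boils down to changing the `instances` field of the app through Marathon’s REST API (`PUT /v2/apps/{appId}`). A sketch of that call, where the Marathon URL and the app id `/presto/worker` are assumptions for illustration:

```python
# Sketch of scaling the Presto worker app through Marathon's REST API.
# The Marathon URL and the app id ("/presto/worker") are assumptions.
import json
import urllib.request

def scale_request(marathon_url: str, app_id: str, instances: int) -> urllib.request.Request:
    """Build a PUT /v2/apps/{app_id} request that sets the instance count."""
    body = json.dumps({"instances": instances}).encode()
    return urllib.request.Request(
        url=f"{marathon_url}/v2/apps{app_id}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )

# To apply: urllib.request.urlopen(scale_request("http://marathon.internal:8080",
#                                                "/presto/worker", 10))
```

Scaling the coordinator works the same way, just against a different app id.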

We are also experimenting with metrics-based scaling. In the future, when CPU or memory usage is tight, we will scale up automatically, and when the cluster is idling, we will take the resources back.

Performance

We did some benchmarking on Presto, but performance was not the most important factor for us, and there are plenty of Presto performance benchmarks available online.
