Why our Team chose Presto — A Distributed SQL Query Engine for Big Data

99P Labs
4 min read · Jan 8, 2021

At 99P Labs, our mission is to enable sustainable mobility research and data-driven innovation, which is why we have opened our diverse mobility data sets to our research community through the 99P Labs Developer Portal. These data sets are by no means lightweight: they are large and complex, with engineering challenges under the hood that need to be solved.

This post is not meant as a debate over our technology choices but as a way to share our journey, learnings, and experience. Our journey started with building data pipelines that fed Postgres, and we then built our data APIs for consumption using Postgres as the data source.

Very quickly we started running into limitations. In summary, they were:

· Management of data scale

· Processing issues

· Query bottlenecks

To address some of these early issues, we limited the shape and size of the data by implementing features such as serving hot data only, geofencing the data, and restricting query types. Our early users immediately gave us feedback, which was really useful for validating our assumption that these short-term countermeasures were not going to fly in the long term. So we quickly got back to the drawing board to brainstorm ideas. Our team surveyed the Big Data technology landscape and landed on Presto, particularly from our use cases' point of view. We use other technologies such as Hive and Spark, but the ability to do interactive analytics over the data sets we have made this Open-Source technology well worth exploring.

The first thing we did to understand the mechanics of Presto was to devise our query experiments. We identified different types of queries, namely:

· Show/Give me all data

· Filter

· Needle in a haystack

· Joins

We chose a data set (~150 million rows) and ran these queries over Postgres, Hive, and Presto. Very quickly we learned that Postgres was very effective for the first query type, i.e., Show/Give me all data (e.g., select * from <table_name>). With indexing and other tuning capabilities it performed well, though as the data scaled, performance started degrading. We then ran this type of query on Hive and Presto, and as you can see in the graph below, Postgres does really well against these other technologies:
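The four query types above can be sketched in miniature. The snippet below uses an in-memory SQLite database as a stand-in (not Postgres, Hive, or Presto), with a hypothetical schema and made-up rows, purely to illustrate the shape of each query class we tested:

```python
import sqlite3

# Hypothetical tables and rows, for illustration only.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE trips (trip_id INTEGER, city TEXT, distance_km REAL)")
cur.execute("CREATE TABLE cities (city TEXT, region TEXT)")
cur.executemany("INSERT INTO trips VALUES (?, ?, ?)",
                [(1, "Columbus", 12.5), (2, "Detroit", 3.1), (3, "Columbus", 48.0)])
cur.executemany("INSERT INTO cities VALUES (?, ?)",
                [("Columbus", "Midwest"), ("Detroit", "Midwest")])

# 1. Show/Give me all data: a full table scan.
all_rows = cur.execute("SELECT * FROM trips").fetchall()

# 2. Filter: rows matching a predicate on a column.
columbus = cur.execute(
    "SELECT * FROM trips WHERE city = ?", ("Columbus",)).fetchall()

# 3. Needle in a haystack: a highly selective lookup of one row.
needle = cur.execute(
    "SELECT * FROM trips WHERE trip_id = ?", (2,)).fetchall()

# 4. Join: combining tables on a shared key.
joined = cur.execute(
    "SELECT t.trip_id, c.region FROM trips t JOIN cities c ON t.city = c.city"
).fetchall()
```

At 150 million rows the engines diverge sharply on types 2-4, which is where the comparison below becomes interesting.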

The struggle comes when you actually want to execute queries like filtering and needle in a haystack, which matter a lot when you provide APIs or do analytics. For the Hive and Presto tests, we used the same data as parquet-formatted files partitioned by date. We learned that Presto, unlike Postgres, would filter out the partitions that were not needed and only look at the data in question. This immediately gave us exponential performance gains. The other thing that contributed to performance is parquet's columnar storage format. The column values are stored in contiguous blocks, which means there is no padding between blocks of data. This makes for better compression and faster querying: you still have to compare values, but you spend far less time seeking the location of each value in memory or on disk. The combination of parquet and Presto makes for one fast query.
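The partition-pruning idea can be shown with a minimal pure-Python sketch. This is not Presto's implementation; it is a toy model, with made-up partition keys and rows, of how a date-partitioned layout lets a filter on the partition column skip whole partitions without reading them:

```python
from datetime import date

# Toy date-partitioned layout: one bucket of rows per day, mirroring
# parquet files partitioned by date. Keys and rows are hypothetical.
partitions = {
    date(2020, 11, 10): [("trip-1",), ("trip-2",)],
    date(2020, 11, 11): [("trip-3",)],
    date(2020, 11, 12): [("trip-4",), ("trip-5",)],
}

def query_with_pruning(partitions, cutoff):
    """Scan only partitions whose key passes the filter (as a pruning
    engine would); return matching rows and how many partitions were read."""
    scanned = 0
    rows = []
    for part_date, part_rows in partitions.items():
        if part_date > cutoff:  # prune: skip partitions that cannot match
            scanned += 1
            rows.extend(part_rows)
    return rows, scanned

rows, scanned = query_with_pruning(partitions, date(2020, 11, 11))
# Only the 2020-11-12 partition is touched; the other two are never read.
```

With 150 million rows spread across many daily partitions, touching only the partitions a filter can match is where the "exponential" gains come from.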

A comparison of a filter query (e.g., select <column_name> from <table_name> where <column_name> > '11-11-2020') shows Presto's results against the other technologies we experimented with:

The switch to Presto has been great for our use case. Our users can now access our entire data sets instead of working within the short-term features and restrictions we had built around the data, and our filtering query times have improved greatly. As a team, we continue to tune and improve this query engine, and we are excited about how it speeds up the innovation process. We're really big fans of the Open-Source community and would like to thank the members of the Presto community who helped answer our questions along the way.

If you haven't already signed up for our Developer Portal, we encourage you to do so [Link], and if there are any specific questions or opportunities you would like to discuss with us, please click here. We look forward to hearing your feedback and ideas.
