Benchmarking SQL engines for Data Serving: PrestoDb, Trino, and Redshift

Anton Peniaziev
Apr 12 · 11 min read

In the business of external data enrichment for data science, the main focus is on the ability to provide a fast and scalable way to aggregate, join and match large datasets received from data providers with the customer’s internal data. Enriched datasets derive more features, which are leveraged by customer data science teams, resulting in higher AUC and better predictive power.

This performance comparison has arisen from the very specific engineering problem of enrichments with foot traffic data, collected from mobile devices over a time span of half a year. However because this dataset is large in volume (4.5B records) and uniform (20–25M per day), it provides a generic real-world example for performance testing without adding extra complexity and variables like data-skew or complex relations. The data itself is the result of Spark job ETL and consists of parquet files — 1 file per partition, partitioned by day (~250MB each file), which resides on Amazon S3. A delta-lake table is metadata created in order to enable tabular access via the AWS Glue catalog.

Objective: Given a user dataset with a date column in the format YYYY%MM%%dd and coordinates in the form of H3 geo-hash, enrich their data with records from the 4.5B dataset with schema (local_day, h310), where local_day is the date in the same format as the user’s data, and h310 is Uber’s H3 geospatial index, easily calculated from decimal coordinates.

The user’s inquiry results in a dataset received via an API call, which could range from 1 to 1M rows, and will be referred to as query_table. Our base 4.5B dataset has unique h3’s per day, and half a year’s worth of data, which will be referred to as foottraffic_daily_visitation. The following query performs the required operation and will be benchmarked:

WITH SELECT_QUERY AS (
SELECT
local_day, h310, total_visitors
FROM
sqlperformance.foottraffic_daily_visitation
)
SELECT
SELECT_QUERY.*
FROM
SELECT_QUERY SELECT_QUERY
JOIN
{query_table} TMP_TABLE
ON
SELECT_QUERY.local_day = date(TMP_TABLE.date) AND SELECT_QUERY.h310 = TMP_TABLE.h310
WHERE
SELECT_QUERY.local_day IN ({dates_predicate})

A note regarding the dates_predicate part — it’s a predicate pushdown, supported both in Presto and Redshift, which will significantly reduce the number of partitions scanned in the join. The user’s query is received dynamically and is not known in advance. The solution will include on the fly creation of temporary query_table, but since the content is known before the join query creation we can retrieve dates_predicate and ingest them inside.

The Setup

PrestoDB
We will use EMR 6.2.0 distribution which is Presto 0.238.3. Additional nice-to-have apps are Ganglia 3.7.2 for load monitoring and Hue 4.8.0 for interactive querying. The hardware is 1Xr5.4xlarge, 6Xm5.2xlarge. The JSON configuration we’ve used is:

[
{
"classification":"presto-connector-memory",
"properties":{
"memory.max-data-per-node":"256MB"
},
"configurations":[

]
},
{
"classification":"presto-connector-hive",
"properties":{
"hive.parquet.use-column-names":"true",
"hive.metastore.glue.datacatalog.enabled":"true",
"hive.s3-file-system-type":"PRESTO"
},
"configurations":[

]
},
{
"classification":"presto-config",
"properties":{
"query.max-run-time":"15m",
"task.concurrency":"16",
"query.max-history":"1000",
"experimental.reserved-pool-enabled":"false",
"query.max-concurrent-queries":"12"
},
"configurations":[

]
}
]

The most important part is hive properties that would enable us to utilize the glue catalog. presto-connector-memory and presto-config are presto application-specific properties. When deploying Presto yourself those are defined in memory. properties and config. properties files respectively, but for EMR deployment you need to specify them in configuration json. To better understand how to translate Presto configurations into EMR json refer to this guide. There are also a few ways to reconfigure a running Presto cluster:

There’s a great deal of performance tuning advice in the Presto Definitive Guide book, but for this benchmark, we will stick to out-of-the-box, close to default values.

Trino

Previously known as PrestoSQL, Trino is an open-source version of PrestoDB. It has a vibrant community and a very helpful Slack workspace.

We will use the same EMR version and hardware, but check PrestoSQL 343 application instead of PrestoDB. There are also slight syntactic differences in the json configuration:

[
{
"classification":"prestosql-connector-memory",
"properties":{
"connector.name":"memory",
"memory.max-data-per-node":"256MB"
},
"configurations":[

]
},
{
"classification":"prestosql-connector-hive",
"properties":{
"hive.parquet.use-column-names":"true",
"hive.metastore":"glue",
"hive.s3-file-system-type":"PRESTO"
},
"configurations":[

]
},
{
"classification":"prestosql-config",
"properties":{
"query.max-run-time":"15m",
"task.concurrency":"16",
"query.max-history":"1000",
"experimental.reserved-pool-enabled":"false",
"http-server.process-forwarded":"true",
"query.max-concurrent-queries":"12"
},
"configurations":[

]
}
]

Pay attention to “connector.name”:”memory” which is necessary to enable in-memory table creation (turned on by default in PrestoDb). Also “http-server.process-forwarded”:”true” will help to forward the authorization to Trino in case you are connecting through a proxy.

For both Presto and trino you can access Web-UI which contain useful information about queries, query plans, and runtimes:
<master DNS>:8889/ui

Redshift

The hardware chosen is: 2 nodes ra3.4xlarge. In order to query data on S3 we’ll use the Spectrum feature, which will require adding the permission for S3 and AWS Glue. If you are an IAM user, create a role with the following policies:AmazonDMSRedshiftS3Role, AmazonRedshiftQueryEditor, GlueFullAccessPolicy.

Next, create a cluster with a database named dev and port 5439, then in cluster permission choose the IAM role you’ve created and add it.

In order to enable client connection you need to make sure the cluster can receive TCP requests. If you work above VPC — go to your Security Group settings, check Inbound Rules and add a rule: Type Redshift, Protocol TCP, Port Range 5439, source , then add a specific IP or a wildcard. Sometimes in different subregions the cluster could still be inaccessible, so to fix it go to cluster actions and enable public accessibility.

Now, in order to enable querying S3 you need to create an external schema like so:

create external schema sqlperformance
from data catalog
database ‘sqlperformance’
region ‘eu-west-1’
iam_role ‘arn:aws:iam::<iam role you've created>’;

Python Clients

For this benchmark, we use Pyhive client for both Presto and Trino and psycopg2 for Redshift. A special note has to be made when measuring queries’ runtime with python clients — a query isn’t finished until the client returns ack. That is, be cautious when running queries that induce a lot of data to be transferred through the network. To eliminate this bottleneck we’ve surrounded all our join queries by count(*). Another valid technique is CTAS (create table as). CTAS into the blackhole connector has the advantage of no additional overhead. In spite of all the precautions, it’s always good to run queries from CLI to make sure runtimes are similar to those we receive in python.

Benchmarking

When comparing runtimes in different conditions it’s important not to adjust too many parameters at a time. The great Andrew NG explains the concept of orthogonalization in one of his courses.

In order to minimize the possible effects of caching mechanisms and sporadic network delays each test was run 10 times with randomized unique dates in each run, and the runtime was averaged.

The above experiment is a fixed amount of dates (partition) and a rising amount of rows in each query (10k, 100K, 1 million). All engines are almost indifferent to the amount of rows and Redshift is far behind. Since our use-case requires mostly computational ability out of the engine and not necessarily flexibility to connect to various data lakes (the Connectors architecture is the most valuable feature of Presto/Trino!), we can assume our ETL has transitioned from S3 to writing into Redshift disks directly. In order to simulate that let’s create a local foootraffic_daily_visitation table on the Redshift cluster:

CREATE TABLE foootraffic_daily_visitation
sortkey(local_day)
AS
SELECT * FROM sqlperformance.foottraffic_daily_visitation

sortkey is similar to partitioning in hive. This query ought to take about 12 minutes so don’t run it from an online query editor which has a limitation of 10 minutes. Then we run the same experiment, but now without Spectrum:

Redshift performance improves significantly. In fact spectrum’s performance on the hive table on S3 is similar to performance with an on-disk local table without sortkey specified.

To strengthen the claim that tools are indifferent to the number of rows let’s run a wider range:

Of course, the number of rows means more data, and even though the join runtime doesn’t seem to be influenced, insertion of query_table and data retrieval will be much more affected. Let’s measure the whole cycle — insert+join:

It’s worth mentioning that the query_table insertion method is similar for all tools on the client side — bulk insertion of 10k rows at a time, but under the hood. Presto and Trino have memory connectors which allow us to insert a temporary table into cluster RAM, when in Redshift we write those bulks on SSD. That explains that while Redshift remained faster, its relative slowdown is ~10x whereas Presto’s is about 2x when going from 10K to 100K rows.

Now let’s fix the row number and turn another knob — dates number:

The number of partitions queried seems to be the only factor for join runtime.

In production, our platform serves dozens of users and processes. System response on concurrently running queries should provide an insight into whether initial cluster size is sufficient, and at which point scaling out is necessary. We’ll run the same query size with different random dates simultaneously:

The n_conn suffix stands for n concurrent running instances of a query. As one of the Presto developers states in this video, Presto will always try to utilize all the available machines and resources for each query. One may ask why we don’t observe more than a 2x increase in runtime when doubling the amount of concurring queries, but the answer is of course that 100% utilization is never really achieved.

Simplifying the query

One could argue that for smaller inquiries it’s an unnecessary complication to include join. In fact our initial motivation to query a big table by joining it with temp-table was that there’s a limitation for an sql-query size. Let’s then measure the following where query:

SELECT
local_day, h310, total_visitors,
FROM
sqlperformance.foottraffic_daily_visitation
WHERE
1=0 OR local_day = date '2020-10-15' AND h310 = '8a26e0cb344ffff' OR local_day = date '2020-10-03' AND h310 = '.

The Redshift’s slowdown is especially concerning. If we measure Redshift separately we’ll see it definitely struggles to optimize the where clause and scan only necessary partitions:

In Presto the problem doesn’t seem to be that noticeable. However, each tool may have its own quirks, for example when performing initial benchmarking we’ve tested join with static query_tables that were preloaded in S3 instead of dynamically creating them with random dates. Remember {dates_predicate}? In order to get it, we’ve simply added select distinct(date) from static_query_table. We expected the same predicate pushdown to happen as we have in production, however Trino had some issues with it:

As this post is being written there’s still an ongoing investigation in Trino’s Slack with experienced engineers from Starburst trying to get to the root of the problem.

Pricing

current monthly benchmark costs:

Presto/Trino — $2400

Redshift —$5192

Let’s try to equal costs, by choosing a smaller Redshift cluster: 3 ra3.xlplus nodes which will cost $2596 a month.

The previous dates rising test:

The concurrent test:

As we observe, Redshift’s performance dropped, but it still outperforms Trino and Presto at roughly the same price. In fact, the price could be even less due to different Redshift cluster policies: they charge for compute and storage separately and $2596 is an estimated price when you use almost all the cluster’s 96TB. Plus, great cost reduction could be achieved by committing to use the cluster for 1 or 3 years and paying upfront:

Conclusion

Data serving is a special business case, which demands real-time low-latency on small queries, the ability to scale and withstand abrupt peak loads, and high concurrency. When comparing different engines it’s important to distinguish between pure technical performance on specific queries, the tool’s ability to optimize sql queries submitted by non-technical staff, scaling, and how it all translates into business value. For growing startups a hybrid approach using different engines for different parts of the serving platform could be a viable option, allowing flexibility while building up engineering staff experience with those tools over time.

Next Steps

To introduce metrics that represent the potential peak load on the engine, considering the maximal time that the query can remain queued, and the minimal time needed for provisioning of additional resources.

Anton Peniaziev is an experienced data and machine learning engineer who is a member of the expert data team of Explorium. Explorium offers the industry’s first automated External Data Platform for Advanced Analytics and Machine Learning. Explorium empowers data scientists and business leaders to drive decision-making by eliminating the barrier to acquiring and integrating the right external data and dramatically decreasing the time to superior predictive power. Learn more at www.explorium.ai

Explorium.ai

Thoughts and ideas from the data and data science experts of The External Data Platform

Explorium.ai

Explorium offers a first of its kind end-to-end automated external data platform. Our unique solution automatically connects and integrates the most relevant external data signals out of thousands of sources to optimise and accelerate your ML investment and advanced analytics.

Anton Peniaziev

Written by

Explorium.ai

Explorium offers a first of its kind end-to-end automated external data platform. Our unique solution automatically connects and integrates the most relevant external data signals out of thousands of sources to optimise and accelerate your ML investment and advanced analytics.

Medium is an open platform where 170 million readers come to find insightful and dynamic thinking. Here, expert and undiscovered voices alike dive into the heart of any topic and bring new ideas to the surface. Learn more

Follow the writers, publications, and topics that matter to you, and you’ll see them on your homepage and in your inbox. Explore

If you have a story to tell, knowledge to share, or a perspective to offer — welcome home. It’s easy and free to post your thinking on any topic. Write on Medium

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store