10 Tips For Presto Query Performance Optimization

Seals_XYZ
4 min readSep 4, 2021

1. Filter by partition column

Large fact tables are usually stored as lots of files and directories, and partitioned by a date column such as date_str. Most queries only need to read the the data from the last few days. However, the Presto will read all the data for that table if the user has not specified the partition column as the filter condition. In this way, the Presto will have bad performance.

WRONG:

select count(*)
from pin.trips
where request_timestamp >= CURRENT_DATE - INTERVAL '3' DAY

CORRECT:

select count(*)
from pin.trips
where date(detestr) >= CURRENT_DATE - INTERVAL '3' DAY

2. Limit query results

If you only want to see sample data, please restrict the results to a small number of rows by limit syntax. Otherwise, the queries will return large results and will lead to unnecessary system resource consumption.

select * from pin.trips where datestr='2020-10-11' limit 100;

3. Reduce the number of group by columns

Reducing the number of group by columns can help to save memory usage significantly. Below query will take up lots of memory for shuffling.

select c1, c2, c3, c4, max(c5)
from pin.trips
where datestr='2021-07-08'
group by c1, c2, c3, c4;

4. Using the approximate when possible

Usually, we like to use the count distinct to calculate the daily active user. e.g.

select count(distinct user_id) from
pin.trips where datestr='2021-10-01'

However, to get the count distinct value, we have to fit all non-distinct tuples into the memory of a single worker node, which requires a lot of memory and memory usage increases very quickly with the number of columns in the result set. In this case, the approximate method comes to help.

select approx_distinct(user_id)
from
pin.trips where datestr='2021-10-01'

The above approximate sql runs much faster and uses very little memory when compared to the previous query. approx_distinct returns an approximate count of matching values, standard error of 2.3%.

For details, please check here.

5. Replace multiple LIKE clauses by regexp_like

Presto query optimizer is unable to improve queries with many LIKE clauses, and the query execution can be very slow. Using the regex_like to replace multiple LIKE clauses chained with OR will have better performance.

WRONG:

select c1
from pin.trips
where user_id like '%usr1%'
or user_id like '%usr2%'
or user_id like '%usr3%'
or user_id like '%usr4%'

CORRECT:

select c1
from pin.trips
where user_id regexp_like(user_id, 'usr1|usr2|usr3|usr4');

6. Putting the large fact tables earlier in join order

Some Presto versions do not optimize join order of tables, in this case, putting the large fact tables earlier in join order can help to improve performance.

WRONG:

select b.country_name, sum(a.spend)
from small_dim_table b
join large_fact_table a
on a.city_id=b.city_id
where a.datestr='2021-06-05' group by 1;

CORRECT:

select b.country_name, sum(a.spend)
from large_fact_table a
join small_dim_table b
on a.city_id=b.city_id
where a.datestr='2021-06-05' group by 1;

7. UNION ALL instead of UNION

UNION operation will remove the duplicate records. Just like the distinct, the UNION will require large amounts of memory on a single worker node.

WRONG:

select c1, c2 from pin.table1
union
select c1, c2 from pin.table2

CORRECT:

select c1, c2 from pin.table1
union all
select c1, c2 from pin.table2

8. No expression in the join column

If the join condition is an expression, the join operation could be very slow.

WRONG:

select a.datestr, b.name
from table1 a
join table2 b
on a.col1= cast((b.col1*1000 + b.col2*100) as varchar)

CORRECT:

select a.datestr, b.name
from table1 a
join
(select name, cast((b.col1*1000 + b.col2*100) as varchar) col3 from table2) b
on a.col1= b.col3

9. USE VALUES() for side data

WRONG:

SELECT
CASE WHEN id in (10,20) then 'val1'
WHEN id in (30,40) then 'val2'
WHEN id in (50) then 'val3'
ELSE 'default'
END as val
from pin.trips
where datestr='2021-06-05';

with geo as (
select 10 as city_id, 'sf' as city_name, 'center' as geo_nameunion allselect 20 as city_id, 'sd' as city_name, 'center' as geo_name)

The above query will lead to a complex query plan and will lead to exceptions if the case statement has lots of options. We can use the VALUES() operator to improve the performance.

CORRECT:

with vals as (
select * from (
values(10, 'val1'),
values(20, 'val1'),
values(30, 'val2'),
values(40, 'val2'),
values(50, 'val3'),
) as t (id, val))select if (t.val is not null, t.val, 'default') as val
from pin.trips t
left join vals on t.id = vals.id
where datestr='2021-06-05';
with geo as (select * from (values(10, 'SF', 'CENTER'),
values(20, 'SD', 'CENTER'),
) as g (city_id, city_name, geo_name))

10. CROSS JOIN

Avoid joins that generate more outputs than inputs. Please pre-aggregate your data if there is a cross join.

Each row from the first table is joined to every row in the second table. We may have m*n rows if we have m rows in the left table and n rows in the right table. In this way, we probably have the OOM issue.

You will have cross join for below use cases.

  1. using cross join syntax
  2. Join without on condition will also result in CROSS join. The same as JOIN with conditional ON clause.
select * from table1 t1
join table t2 on
case when t1.c1 = 'x'
then t1.a=t2.b
else t1.a=t2.c
end

Recommended solutions

  1. Use the group by to pre-aggregate the data
  2. use the window function which is more efficient than a cross join.

--

--