LISTAGG and Memory Usage in Redshift

John Mastro
Feb 21

In Redshift, the return type of LISTAGG is varchar(65535), which can cause
large aggregations that use it to consume a lot of memory and spill to disk during processing.

Background

LISTAGG is a convenient way to aggregate a group of values by concatenating
them together. It supports DISTINCT, a delimiter, and a within-group ordering.

We’ll go through a few examples of using LISTAGG, then talk about a
situation we ran into where it resulted in much higher memory usage than you might expect without knowing its implementation details.

Given this example table:
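(The original table isn’t reproduced here; as a minimal sketch, the table name and column names below come from the queries that follow, while the sample rows are hypothetical.)

```sql
-- Hypothetical sample data; only the names (fruits, some_id, fruit)
-- are taken from the queries below.
create table fruits (
  some_id integer,
  fruit   varchar(32)
);

insert into fruits values
  (1, 'apple'),
  (1, 'banana'),
  (2, 'cherry'),
  (2, 'cherry'),
  (2, 'mango');
```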

Here are a few examples using LISTAGG:

-- Simplest form: concatenate each group's values with no delimiter
select
  some_id,
  listagg(fruit) as fruit_list
from
  fruits
group by
  some_id
order by
  some_id;

-- With a delimiter between values
select
  some_id,
  listagg(fruit, ', ') as fruit_list
from
  fruits
group by
  some_id
order by
  some_id;

-- With a within-group ordering, so each list is sorted
select
  some_id,
  listagg(fruit, ', ') within group (order by fruit) as fruit_list
from
  fruits
group by
  some_id
order by
  some_id;

-- With DISTINCT, so duplicate values appear only once per list
select
  some_id,
  listagg(distinct fruit, ', ') within group (order by fruit) as fruit_list
from
  fruits
group by
  some_id
order by
  some_id;

We don’t use LISTAGG very often, since concatenated strings are usually not a
friendly format to work with in the context of a relational database, but we do reach for it when it’s the most useful way to aggregate data and present it to a business user. It’s definitely worth knowing about, but it uses a lot of memory, and therefore probably isn’t appropriate for large
aggregations.

Why does LISTAGG use a lot of memory?

The return type of LISTAGG is varchar(65535). In our case, the results of our
LISTAGG expressions were around 10-50 characters long, nowhere near 65,535 characters. The good news is that Redshift stores varchar values at their actual length rather than padding them out to the declared maximum. In other words, having over-large varchar columns won’t necessarily blow up your disk usage.

However, during processing, Redshift needs the full width in memory, and this is where we ran into a problem.

In our case, we had a pivot query that returned 1,091,055 rows and
originally included two LISTAGG expressions. At one byte per character, that
would be 2 LISTAGG expressions times 1,091,055 rows times 65,535 bytes, which is 143,004,578,850 bytes, or just over 143 gigabytes.
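The same worst-case arithmetic can be written as a quick estimate query; this is a sketch, assuming the fruits table from the examples above and a query with two LISTAGG expressions:

```sql
-- Rough worst-case estimate: each LISTAGG expression may need the
-- full 65535 bytes per output row during processing.
select
  count(distinct some_id) * 65535 * 2 as estimated_bytes
from
  fruits;
```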

What were the symptoms?

In the best case, the query exhausted the available memory, started spilling
data to disk, exhausted the disk, and then failed with a “Disk Full” error. At
worst, it contributed to the cluster freezing and becoming unresponsive until
rebooted.

What are the alternatives?

There’s no “LISTAGG, but without the high memory usage” to point to, so the
best alternative will depend on the circumstances. Possibilities include performing the aggregation on demand for smaller result sets, or in a BI tool.
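As a sketch of the on-demand approach: restricting the query to a narrow slice before LISTAGG runs keeps the number of 65535-byte output rows small. The literal id below is a hypothetical parameter supplied at query time.

```sql
-- Aggregate on demand for one id instead of the whole table;
-- the id value is a hypothetical query-time parameter.
select
  some_id,
  listagg(fruit, ', ') within group (order by fruit) as fruit_list
from
  fruits
where
  some_id = 1
group by
  some_id;
```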

We’ll continue to use LISTAGG from time to time, but we’ll keep the expected
size of the result set in mind when deciding whether it’s appropriate.

Ro Data Team Blog

Ro Data Team Blog: data analytics, data engineering, data science
