LISTAGG and Memory Usage in Redshift

In Redshift, LISTAGG returns a varchar(65535), which means large aggregations that use it can consume a lot of memory and spill to disk during processing.

Background

LISTAGG is a convenient way to aggregate a group of values by concatenating
them together. It supports DISTINCT, a delimiter, and a within-group ordering.

We’ll go through a few examples of using LISTAGG and then talk about a situation we ran into where it used much more memory than you might expect if you don’t know how it’s implemented.

Given this example table:
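(A minimal sketch of what fruits could look like: the some_id and fruit columns are the ones the queries below reference, and the sample rows are made up for illustration.)

create table fruits (
    some_id integer,
    fruit   varchar(20)
);

insert into fruits values
    (1, 'apple'),
    (1, 'banana'),
    (2, 'cherry'),
    (2, 'cherry'),
    (2, 'grape');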

Here are a few examples using LISTAGG:

-- Basic usage: concatenate the fruit values in each group
select
    some_id,
    listagg(fruit) as fruit_list
from
    fruits
group by
    some_id
order by
    some_id;

-- With a delimiter between the concatenated values
select
    some_id,
    listagg(fruit, ', ') as fruit_list
from
    fruits
group by
    some_id
order by
    some_id;

-- With a within-group ordering of the concatenated values
select
    some_id,
    listagg(fruit, ', ') within group (order by fruit) as fruit_list
from
    fruits
group by
    some_id
order by
    some_id;

-- With DISTINCT to drop duplicate values before concatenating
select
    some_id,
    listagg(distinct fruit, ', ') within group (order by fruit) as fruit_list
from
    fruits
group by
    some_id
order by
    some_id;

We don’t use LISTAGG very often, since concatenated strings are usually not a friendly format to work with in a relational database, but we do reach for it when it’s the most useful way to aggregate data and present it to a business user. It’s definitely worth knowing about, but it uses a lot of memory, so it probably isn’t appropriate for large aggregations.

Why does LISTAGG use a lot of memory?

The type of LISTAGG is varchar(65535). In our case, the results of our LISTAGG expressions were around 10-50 characters long, nowhere near 65,535 characters. The good news is that Redshift stores varchar columns as variable-length data, so only the bytes a value actually uses end up on disk. In other words, an over-wide varchar column won’t necessarily blow up your disk usage.
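One way to see the declared width for yourself is to materialize a LISTAGG result into a table and look up its column type in pg_table_def. This is just a sketch against the fruits example, and it assumes the new table lands in a schema on your search_path (pg_table_def only lists tables it can see there):

-- Materialize a LISTAGG result so its column type shows up in the catalog
create table fruit_lists as
select
    some_id,
    listagg(fruit, ', ') as fruit_list
from
    fruits
group by
    some_id;

-- fruit_list is reported as character varying(65535)
select "column", type
from pg_table_def
where tablename = 'fruit_lists';

drop table fruit_lists;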

However, during processing, Redshift needs the full width in memory, and this is where we ran into a problem.

In our case, we had a pivot query that returned 1,091,055 rows and
originally included two LISTAGG expressions. At one byte per character, that
would be 2 LISTAGG expressions times 1,091,055 rows times 65,535 bytes, which is 143,004,578,850 bytes, or just over 143 gigabytes.
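If you want to double-check that arithmetic, here it is as a query. The product is larger than a 4-byte integer can hold, so casting to bigint keeps the multiplication from overflowing:

select 2::bigint * 1091055 * 65535 as worst_case_bytes;
-- 143004578850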

What were the symptoms?

In the best case, the query exhausted the available memory, started spilling
data to disk, exhausted the disk, and then failed with a “Disk Full” error. At
worst, it contributed to the cluster freezing and becoming unresponsive until
rebooted.

What are the alternatives?

There’s no “LISTAGG but without the high memory usage” to point to, so the best alternative will really depend on the circumstances. Possibilities include performing the aggregation on demand for smaller result sets, or doing the concatenation downstream in a BI tool.
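As a sketch of the first option, the aggregation can be run against the fruits example for just the group a user is currently looking at, instead of across every row; the specific some_id value here is hypothetical:

-- Aggregate on demand for a single group rather than the whole table
select
    some_id,
    listagg(fruit, ', ') within group (order by fruit) as fruit_list
from
    fruits
where
    some_id = 42
group by
    some_id;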

We’ll continue to use LISTAGG from time to time, but we’ll keep the expected
size of the result set in mind when deciding whether it’s appropriate.