In Redshift, the return type of LISTAGG is varchar(65535), which can cause large aggregations using it to consume a lot of memory and spill to disk during processing.
LISTAGG is a convenient way to aggregate a group of values by concatenating them together. It supports DISTINCT, a delimiter, and a within-group ordering.
We’ll go through a few examples of using LISTAGG and then talk about a situation we ran into where it resulted in much higher memory usage than you might expect without knowing its implementation details.
Given this example table:
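A minimal sketch of what such a table might look like. The table name and columns (fruits, some_id, fruit) are inferred from the queries below; the rows themselves are purely illustrative:

```sql
-- Hypothetical example table; the rows are illustrative only.
create table fruits (
    some_id int,
    fruit   varchar(20)
);

insert into fruits values
    (1, 'apple'),
    (1, 'banana'),
    (2, 'cherry'),
    (2, 'cherry');
```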
Here are a few examples using LISTAGG:
select some_id, listagg(fruit) as fruit_list
from fruits
group by some_id
order by some_id;

select some_id, listagg(fruit, ', ') as fruit_list
from fruits
group by some_id
order by some_id;

select some_id, listagg(fruit, ', ') within group (order by fruit) as fruit_list
from fruits
group by some_id
order by some_id;

select some_id, listagg(distinct fruit, ', ') within group (order by fruit) as fruit_list
from fruits
group by some_id
order by some_id;
We don’t use LISTAGG very often (concatenated strings are usually not a friendly format to work with in a relational database), but we do use it sometimes when it’s the most useful way to aggregate data and present it to a business user. It’s definitely worth knowing about, but it uses a lot of memory, and therefore probably isn’t appropriate for large result sets.
Why does LISTAGG use a lot of memory?
The return type of LISTAGG is varchar(65535). In our case, the results of our LISTAGG expressions were around 10-50 characters long, nowhere near 65,535 characters. The good news is that Redshift optimizes varchar columns when storing them by omitting the implicit trailing blanks. In other words, having over-large varchar columns won’t necessarily blow up your disk usage.
However, during processing, Redshift needs the full width in memory, and this is where we ran into a problem.
In our case, we had a pivot query that returned 1,091,055 rows and originally included two LISTAGG expressions. At one byte per character, that would be 2 LISTAGG expressions times 1,091,055 rows times 65,535 bytes, which is 143,004,578,850 bytes, or just over 143 gigabytes.
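The arithmetic above is easy to check directly (a quick sketch of the worst-case footprint, nothing Redshift-specific):

```python
# Worst-case in-memory footprint of the LISTAGG columns, assuming one byte
# per character and the full declared varchar(65535) width per value.
n_listagg_exprs = 2
n_rows = 1_091_055
varchar_width = 65_535  # bytes per value

total_bytes = n_listagg_exprs * n_rows * varchar_width
print(total_bytes)  # 143004578850 bytes, just over 143 gigabytes
```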
What were the symptoms?
At best, the query exhausted the available memory, started spilling data to disk, exhausted the disk, and then failed with a “Disk Full” error. At worst, it contributed to the cluster freezing and becoming unresponsive until it was rebooted.
What are the alternatives?
There’s no “LISTAGG but without the high memory usage” to point to, so the best alternative will depend on the circumstances. Possibilities include performing the aggregation on demand for smaller result sets, or in a BI tool.
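As a sketch of the first option, aggregating on demand can be as simple as filtering down to the groups a particular report needs before applying LISTAGG, rather than aggregating every group at once (the where clause here is hypothetical):

```sql
-- Aggregate only the groups a specific report needs,
-- instead of running LISTAGG across the whole table.
select some_id,
       listagg(fruit, ', ') within group (order by fruit) as fruit_list
from fruits
where some_id in (1, 2)  -- hypothetical filter for the report at hand
group by some_id
order by some_id;
```

This keeps the row count of the LISTAGG result set, and therefore the worst-case memory footprint, bounded by the filter rather than by the size of the table.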
We’ll continue to use LISTAGG from time to time, but we’ll keep the expected size of the result set in mind when deciding whether it’s appropriate.