LISTAGG and Memory Usage in Redshift

John Mastro
Feb 21, 2019 · 3 min read

In Redshift, LISTAGG’s return type is varchar(65535), which can cause large aggregations that use it to consume a lot of memory and spill to disk during processing.


Background

LISTAGG is a convenient way to aggregate a group of values by concatenating
them. It supports DISTINCT, a delimiter, and a within-group ordering.

We’ll go through a few examples of using LISTAGG and then talk about a
situation we ran into where it resulted in much higher memory usage than you might expect without knowing any of its implementation details.

Given this example table:

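The exact rows are made up for illustration – any small table with a some_id column and a fruit column will do:

create temp table fruits (
  some_id int,
  fruit varchar(20)
);

insert into fruits values
  (1, 'apple'),
  (1, 'banana'),
  (1, 'banana'),
  (2, 'cherry'),
  (2, 'apple');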

Here are a few examples using LISTAGG:

select
  some_id,
  listagg(fruit) as fruit_list
from fruits
group by some_id
order by some_id
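
With the made-up sample rows above, the output would look something like this. With no delimiter argument the values are simply run together, and without WITHIN GROUP the order inside each list isn’t guaranteed:

 some_id | fruit_list
---------+--------------------
       1 | applebananabanana
       2 | cherryapple
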
select
  some_id,
  listagg(fruit, ', ') as fruit_list
from fruits
group by some_id
order by some_id
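
Passing ', ' as the second argument adds a separator between the values. With the sample rows (and still no guaranteed order within each list):

 some_id | fruit_list
---------+-----------------------
       1 | apple, banana, banana
       2 | cherry, apple
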
select
  some_id,
  listagg(fruit, ', ') within group (order by fruit) as fruit_list
from fruits
group by some_id
order by some_id
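
Adding WITHIN GROUP makes the order inside each list deterministic. With the sample rows:

 some_id | fruit_list
---------+-----------------------
       1 | apple, banana, banana
       2 | apple, cherry
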
select
  some_id,
  listagg(distinct fruit, ', ') within group (order by fruit) as fruit_list
from fruits
group by some_id
order by some_id
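
Adding DISTINCT drops the duplicate banana value. With the sample rows:

 some_id | fruit_list
---------+----------------
       1 | apple, banana
       2 | apple, cherry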

We don’t use LISTAGG very often – concatenated strings are usually not a
friendly format to work with in a relational database – but we do reach for it when it’s the most useful way to aggregate data and present it to a business user. It’s definitely worth knowing about, but it uses a lot of memory and therefore probably isn’t appropriate for large
aggregations.

Why does LISTAGG use a lot of memory?

The return type of LISTAGG is varchar(65535). In our case, the results of our
LISTAGG expressions were around 10-50 characters long – nowhere near 65,535 characters. The good news is that Redshift stores varchar values at their actual length rather than padding them out to the declared maximum on disk. In other words, having over-large varchar columns won’t necessarily blow up your disk usage.
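
One way to see the declared type for yourself is to materialize a LISTAGG result and inspect the column – a quick sketch, assuming the fruits table from earlier (the table name fruit_list_check is made up, and pg_table_def only shows tables in schemas on your search path):

create table fruit_list_check as
select
  some_id,
  listagg(fruit, ', ') as fruit_list
from fruits
group by some_id;

select "column", type
from pg_table_def
where tablename = 'fruit_list_check';
-- fruit_list should be reported as character varying(65535)

drop table fruit_list_check;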

However, during query processing, Redshift allocates memory based on the full declared width, and this is where we ran into a problem.

In our case, we had a pivot query that returned 1,091,055 rows and
originally included two LISTAGG expressions. At one byte per character, that
would be 2 LISTAGG expressions times 1,091,055 rows times 65,535 bytes, which is 143,004,578,850 bytes, or just over 143 gigabytes.
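
The arithmetic is easy to sanity-check (the cast to bigint just keeps the multiplication from overflowing a 4-byte integer):

select 2::bigint * 1091055 * 65535 as bytes_needed;
-- 143004578850 bytes, i.e. just over 143 GB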

What were the symptoms?

At best, the query exhausted the available memory, started spilling data to
disk, exhausted the disk, and then failed with a “Disk Full” error. At worst,
it contributed to the cluster freezing and becoming unresponsive until it was
rebooted.

What are the alternatives?

There’s no “LISTAGG, but without the high memory usage” to point to, so the
best alternative really depends on the circumstances. Possibilities include
performing the aggregation on demand for smaller result sets (sketched below), or doing the concatenation in a BI tool.
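
For instance, rather than building the concatenated lists for every group in one pass, roughly the same query can be run on demand for just the group a user is looking at – a sketch against the earlier fruits table, with a hypothetical filter:

select
  some_id,
  listagg(fruit, ', ') within group (order by fruit) as fruit_list
from fruits
where some_id = 1  -- hypothetical: limit the aggregation to the rows being requested
group by some_id;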

We’ll continue to use LISTAGG from time to time, but we’ll keep the expected
size of the result set in mind when deciding whether it’s appropriate.
