LISTAGG and Memory Usage in Redshift

John Mastro
Feb 21

In Redshift, the type of LISTAGG is varchar(65535), which means large aggregations that use it can consume a lot of memory and spill to disk during processing.

Background

LISTAGG is a convenient way to aggregate a group of values by concatenating them together. It supports DISTINCT, a delimiter, and a within-group ordering.

We’ll go through a few examples of using LISTAGG and then talk about a situation we ran into where it resulted in much higher memory usage than you might expect without knowing any of its implementation details.

Given this example table:
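A minimal version of such a table might look like this (hypothetical schema and sample rows, chosen to match the queries below):

create table fruits (
  some_id int,
  fruit varchar(32)
);

insert into fruits values
  (1, 'apple'),
  (1, 'banana'),
  (2, 'cherry'),
  (2, 'apple');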

Here are a few examples using LISTAGG:

select
  some_id,
  listagg(fruit) as fruit_list
from
  fruits
group by
  some_id
order by
  some_id;

select
  some_id,
  listagg(fruit, ', ') as fruit_list
from
  fruits
group by
  some_id
order by
  some_id;

select
  some_id,
  listagg(fruit, ', ') within group (order by fruit) as fruit_list
from
  fruits
group by
  some_id
order by
  some_id;

select
  some_id,
  listagg(distinct fruit, ', ') within group (order by fruit) as fruit_list
from
  fruits
group by
  some_id
order by
  some_id;
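If you want to experiment with this style of aggregation outside Redshift, SQLite's group_concat is a runnable analogue of the delimiter form (it lacks WITHIN GROUP ordering, so the order of values inside each group isn't guaranteed). A sketch using hypothetical sample data:

```python
import sqlite3

# In-memory database with hypothetical sample rows mirroring the fruits table.
conn = sqlite3.connect(":memory:")
conn.execute("create table fruits (some_id int, fruit text)")
conn.executemany(
    "insert into fruits values (?, ?)",
    [(1, "apple"), (1, "banana"), (2, "cherry")],
)

# group_concat(expr, delimiter) plays the role of listagg(expr, delimiter).
rows = conn.execute(
    "select some_id, group_concat(fruit, ', ') "
    "from fruits group by some_id order by some_id"
).fetchall()
print(rows)
```

Unlike LISTAGG, group_concat's result is an ordinary text value, so it doesn't carry a 65,535-byte declared width.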

We don’t use LISTAGG very often – concatenated strings are usually not a friendly format to work with in the context of a relational database – but we do use it sometimes, when it’s the most useful way to aggregate data and present it to a business user. It’s definitely worth knowing about, but it uses a lot of memory, and therefore probably isn’t appropriate for large aggregations.

Why does LISTAGG use a lot of memory?

The type of LISTAGG is varchar(65535). In our case, the results of our LISTAGG expressions were around 10-50 characters long – nowhere near 65,535 characters. The good news is that Redshift stores varchar values at their actual length rather than padding them out to the declared width. In other words, having over-large varchar columns won’t necessarily blow up your disk usage.

However, during processing, Redshift needs the full width in memory, and this is where we ran into a problem.

In our case, we had a pivot query that returned 1,091,055 rows and originally included two LISTAGG expressions. At one byte per character, that would be 2 LISTAGG expressions times 1,091,055 rows times 65,535 bytes, which is 143,004,578,850 bytes, or just over 143 gigabytes.
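The arithmetic behind that figure is straightforward to check:

```python
# Worst-case in-memory footprint of the two LISTAGG columns in the pivot query.
rows = 1_091_055
listagg_exprs = 2
bytes_per_value = 65_535  # declared width of varchar(65535), one byte per character

total_bytes = listagg_exprs * rows * bytes_per_value
print(f"{total_bytes:,} bytes")   # 143,004,578,850 bytes
print(total_bytes / 10**9)        # just over 143 GB
```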

What were the symptoms?

At best, the query exhausted the available memory, started spilling data to disk, exhausted the disk, and then failed with a “Disk Full” error. At worst, it contributed to the cluster freezing and becoming unresponsive until rebooted.

What are the alternatives?

There’s no “LISTAGG but without the high memory usage” to point to, so the best alternatives will really depend on the circumstances. Possibilities could include performing the aggregation on demand for smaller result sets, or in a BI tool.
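As one illustration of the first option, the concatenation can be done client-side after fetching the ungrouped rows, where memory is bounded by the actual string lengths rather than a declared column width. A sketch using itertools.groupby on hypothetical (some_id, fruit) rows, assumed already sorted by some_id as groupby requires:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows, as they might come back from
# "select some_id, fruit from fruits order by some_id, fruit".
rows = [(1, "apple"), (1, "banana"), (2, "cherry")]

# Join each group's fruits with ", ", mirroring listagg(fruit, ', ').
fruit_lists = [
    (some_id, ", ".join(fruit for _, fruit in group))
    for some_id, group in groupby(rows, key=itemgetter(0))
]
print(fruit_lists)  # → [(1, 'apple, banana'), (2, 'cherry')]
```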

We’ll continue to use LISTAGG from time to time, but we’ll keep the expected size of the result set in mind when deciding whether it’s appropriate.

Ro Data Team Blog

Ro Data Team Blog: data analytics, data engineering, data science
