Removing duplicates in a PostgreSQL array

Moof Mayeda
Jul 17, 2017 · 4 min read

Our Ruby on Rails app uses a PostgreSQL database and most of the time, we can run the queries we need using the methods provided by Rails. But I recently had a request for a report of all of our songs, with their genres and keywords, that I had to write in raw SQL using Heroku dataclips.

A song has many versions

The songs report needed to display a bunch of information for each song, but listing out the genres in particular was tricky. What made it tricky? Well, each song has many versions. And each of those versions has a list of genres associated with it stored as a varchar (for example, Blues & Jazz, RnB & Soul, Vintage” in any order). So a different version of the same song might have a genre_list of “RnB & Soul, Beats, Vintage”. In the final report, the song’s genres should be “Blues & Jazz, RnB & Soul, Vintage, Beats”: a de-duped amalgamation of both versions’ genres. So I needed to gather up all the genres associated with each version of the song and remove the duplicates. In Ruby, this would be easy to do with versions.map{|v| v.genre_list.split(', ')}.flatten.uniq. Now normally in SQL this would be the job of a the DISTINCT keyword. But DISTINCT doesn’t see the genres as lists, per se, just as strings, which are in fact distinct from one other.

Step 1: Getting a useable list of genres

So the first step was to get the list of genres for each version in a more useable form (that is to say, not as a string). I turned it into an array using STRING_TO_ARRAY, providing a comma as the delimiter.

STRING_TO_ARRAY(versions.genre_list, ', ')

Step 2: Unnesting (here’s where it gets interesting!)

There isn’t really an easy way to de-dup an array of strings. There is an intarray extension that works for integers, but that doesn’t help here. If only there was a way to use DISTINCT now that we’ve got the genres all in one place…

DISTINCT works on columns, not on an array. So how can we turn that array on its side? This is where UNNEST comes into play. UNNEST expands the array into a set of rows for each version. Remember, these are still being grouped by song_id.

song_id         genre
______________ ______________
1 "Blues & Jazz"
1 "RnB & Soul"
1 "Vintage"
1 "RnB & Soul"
1 "Beats"
1 "Vintage"

You’re probably wondering what the whole query looks like at this point. Here it is:

SELECT
song_id,
genre
FROM
versions v,
UNNEST(STRING_TO_ARRAY(v.genre_list, ', ')) AS genre

Step 3: DISTINCT comes back into the picture

If we use what we’ve got as a subquery, we can use DISTINCT on column genre (Mind blown)!!! Remember, the goal is to have a column in this report that is a comma-separated list of de-duped genres. To use our running example, the genres column for Song ID 1 would be Blues & Jazz, RnB & Soul, Vintage, Beats.

The great news is that STRING_AGG can be used in combination with DISTINCT to simultaneously get only distinct values in column genre and also collapse them into one comma-separated string, like so STRING_AGG(DISTINCT genre, ', ') .

SELECT
song_id, STRING_AGG(DISTINCT genres.genre, ', ') AS genres
FROM (
SELECT
song_id,
genre
FROM
versions v,
UNNEST(STRING_TO_ARRAY(v.genre_list, ', ')) AS genre
) AS genres
GROUP BY genres.song_id;

By this point I could return a table listing the genres by song, with a song_id column and a genres column.

song_id         genres
______________ ______________
1 "Blues & Jazz, RnB & Soul, Vintage, Beats"
2 "Folk, Rock"
3 "Dance, Electronic, Pop, Rock"
4 "Ambient, Beats, Rock"
1 "Blues & Jazz"

Step 4: The final touches

The final step was to join this to the songs table as another subquery so that I could include other song-related columns, like title. The following has been edited for clarity:

SELECT
s.id,
s.title,
genres_by_song.genres
FROM songs s
LEFT JOIN (
SELECT
song_id,
STRING_AGG(DISTINCT genres.genre, ', ') AS genres
FROM (
SELECT
song_id,
genre
FROM
versions v,
UNNEST(STRING_TO_ARRAY(v.genre_list, ', ')) AS genre
) AS genres
GROUP BY genres.song_id
) AS genres_by_song
ON s.id = genres_by_song.song_id
GROUP BY s.id, genres_by_song.genres

Lessons Learned

When I was tasked with creating this report, I thought it would be simple. I didn’t realize how difficult it would be just because of how the data was stored in the database (as a varchar) instead of something more list-like. In the end, though, I had fun figuring out the right order and combination of data transformation and translocation to make it all work. Sometimes the only way to make it work is two steps forward, one step back, and that’s ok.

*Updated on August 31, 2017. Special thanks to Torsten Grust, who pointed out that the use of UNNEST in a SELECT clause is deprecated.

MarmoLabs

Shared discoveries, patterns, and how-to’s from the Marmoset software team to you.

Moof Mayeda

Written by

Organizer turned software engineer. I read+write about programming and social justice.

MarmoLabs

MarmoLabs

Shared discoveries, patterns, and how-to’s from the Marmoset software team to you.

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade