Removing duplicates in a PostgreSQL array

Moof Mayeda
Jul 17, 2017 · 4 min read

Our Ruby on Rails app uses a PostgreSQL database and most of the time, we can run the queries we need using the methods provided by Rails. But I recently had a request for a report of all of our songs, with their genres and keywords, that I had to write in raw SQL using Heroku dataclips.

A song has many versions

The songs report needed to display a bunch of information for each song, but listing out the genres in particular was tricky. What made it tricky? Well, each song has many versions. And each of those versions has a list of genres associated with it stored as a varchar (for example, Blues & Jazz, RnB & Soul, Vintage” in any order). So a different version of the same song might have a genre_list of “RnB & Soul, Beats, Vintage”. In the final report, the song’s genres should be “Blues & Jazz, RnB & Soul, Vintage, Beats”: a de-duped amalgamation of both versions’ genres. So I needed to gather up all the genres associated with each version of the song and remove the duplicates. In Ruby, this would be easy to do with versions.map{|v| v.genre_list.split(', ')}.flatten.uniq. Now normally in SQL this would be the job of a the DISTINCT keyword. But DISTINCT doesn’t see the genres as lists, per se, just as strings, which are in fact distinct from one other.

Step 1: Getting a useable list of genres

So the first step was to get the list of genres for each version in a more useable form (that is to say, not as a string). I turned it into an array using STRING_TO_ARRAY, providing a comma as the delimiter.

STRING_TO_ARRAY(versions.genre_list, ', ')

Step 2: Unnesting (here’s where it gets interesting!)

There isn’t really an easy way to de-dup an array of strings. There is an intarray extension that works for integers, but that doesn’t help here. If only there was a way to use DISTINCT now that we’ve got the genres all in one place…

DISTINCT works on columns, not on an array. So how can we turn that array on its side? This is where UNNEST comes into play. UNNEST expands the array into a set of rows for each version. Remember, these are still being grouped by song_id.

song_id         genre
______________ ______________
1 "Blues & Jazz"
1 "RnB & Soul"
1 "Vintage"
1 "RnB & Soul"
1 "Beats"
1 "Vintage"

You’re probably wondering what the whole query looks like at this point. Here it is:

SELECT
song_id,
genre
FROM
versions v,
UNNEST(STRING_TO_ARRAY(v.genre_list, ', ')) AS genre

Step 3: DISTINCT comes back into the picture

If we use what we’ve got as a subquery, we can use DISTINCT on column genre (Mind blown)!!! Remember, the goal is to have a column in this report that is a comma-separated list of de-duped genres. To use our running example, the genres column for Song ID 1 would be Blues & Jazz, RnB & Soul, Vintage, Beats.

The great news is that STRING_AGG can be used in combination with DISTINCT to simultaneously get only distinct values in column genre and also collapse them into one comma-separated string, like so STRING_AGG(DISTINCT genre, ', ') .

SELECT
song_id, STRING_AGG(DISTINCT genres.genre, ', ') AS genres
FROM (
SELECT
song_id,
genre
FROM
versions v,
UNNEST(STRING_TO_ARRAY(v.genre_list, ', ')) AS genre
) AS genres
GROUP BY genres.song_id;

By this point I could return a table listing the genres by song, with a song_id column and a genres column.

song_id         genres
______________ ______________
1 "Blues & Jazz, RnB & Soul, Vintage, Beats"
2 "Folk, Rock"
3 "Dance, Electronic, Pop, Rock"
4 "Ambient, Beats, Rock"
1 "Blues & Jazz"

Step 4: The final touches

The final step was to join this to the songs table as another subquery so that I could include other song-related columns, like title. The following has been edited for clarity:

SELECT
s.id,
s.title,
genres_by_song.genres
FROM songs s
LEFT JOIN (
SELECT
song_id,
STRING_AGG(DISTINCT genres.genre, ', ') AS genres
FROM (
SELECT
song_id,
genre
FROM
versions v,
UNNEST(STRING_TO_ARRAY(v.genre_list, ', ')) AS genre
) AS genres
GROUP BY genres.song_id
) AS genres_by_song
ON s.id = genres_by_song.song_id
GROUP BY s.id, genres_by_song.genres

Lessons Learned

When I was tasked with creating this report, I thought it would be simple. I didn’t realize how difficult it would be just because of how the data was stored in the database (as a varchar) instead of something more list-like. In the end, though, I had fun figuring out the right order and combination of data transformation and translocation to make it all work. Sometimes the only way to make it work is two steps forward, one step back, and that’s ok.

*Updated on August 31, 2017. Special thanks to Torsten Grust, who pointed out that the use of UNNEST in a SELECT clause is deprecated.

Moof Mayeda

Written by

Organizer turned software engineer. I read+write about programming and social justice.

MarmoLabs

MarmoLabs

Shared discoveries, patterns, and how-to’s from the Marmoset software team to you.

Moof Mayeda

Written by

Organizer turned software engineer. I read+write about programming and social justice.

MarmoLabs

MarmoLabs

Shared discoveries, patterns, and how-to’s from the Marmoset software team to you.

Medium is an open platform where 170 million readers come to find insightful and dynamic thinking. Here, expert and undiscovered voices alike dive into the heart of any topic and bring new ideas to the surface. Learn more

Follow the writers, publications, and topics that matter to you, and you’ll see them on your homepage and in your inbox. Explore

If you have a story to tell, knowledge to share, or a perspective to offer — welcome home. It’s easy and free to post your thinking on any topic. Write on Medium

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store