Removing duplicates in a PostgreSQL array

Published in

MarmoLabs

4 min readJul 17, 2017

Our Ruby on Rails app uses a PostgreSQL database and most of the time, we can run the queries we need using the methods provided by Rails. But I recently had a request for a report of all of our songs, with their genres and keywords, that I had to write in raw SQL using Heroku dataclips.

A song has many versions

The songs report needed to display a bunch of information for each song, but listing out the genres in particular was tricky. What made it tricky? Well, each song has many versions. And each of those versions has a list of genres associated with it stored as a varchar (for example, “Blues & Jazz, RnB & Soul, Vintage” in any order). So a different version of the same song might have a genre_list of “RnB & Soul, Beats, Vintage”. In the final report, the song’s genres should be “Blues & Jazz, RnB & Soul, Vintage, Beats”: a de-duped amalgamation of both versions’ genres. So I needed to gather up all the genres associated with each version of the song and remove the duplicates. In Ruby, this would be easy to do with versions.map{|v| v.genre_list.split(', ')}.flatten.uniq. Now normally in SQL this would be the job of a the DISTINCT keyword. But DISTINCT doesn’t see the genres as lists, per se, just as strings, which are in fact distinct from one other.

Step 1: Getting a useable list of genres

So the first step was to get the list of genres for each version in a more useable form (that is to say, not as a string). I turned it into an array using STRING_TO_ARRAY, providing a comma as the delimiter.

STRING_TO_ARRAY(versions.genre_list, ', ')

Step 2: Unnesting (here’s where it gets interesting!)

There isn’t really an easy way to de-dup an array of strings. There is an intarray extension that works for integers, but that doesn’t help here. If only there was a way to use DISTINCT now that we’ve got the genres all in one place…

DISTINCT works on columns, not on an array. So how can we turn that array on its side? This is where UNNEST comes into play. UNNEST expands the array into a set of rows for each version. Remember, these are still being grouped by song_id.

song_id         genre
______________  ______________1               "Blues & Jazz"
1               "RnB & Soul"
1               "Vintage"
1               "RnB & Soul"
1               "Beats"
1               "Vintage"

You’re probably wondering what the whole query looks like at this point. Here it is:

SELECT
  song_id,
  genre
FROM
  versions v,
  UNNEST(STRING_TO_ARRAY(v.genre_list, ', ')) AS genre

Step 3: DISTINCT comes back into the picture

If we use what we’ve got as a subquery, we can use DISTINCT on column genre (Mind blown)!!! Remember, the goal is to have a column in this report that is a comma-separated list of de-duped genres. To use our running example, the genres column for Song ID 1 would be Blues & Jazz, RnB & Soul, Vintage, Beats.

The great news is that STRING_AGG can be used in combination with DISTINCT to simultaneously get only distinct values in column genre and also collapse them into one comma-separated string, like so STRING_AGG(DISTINCT genre, ', ') .

SELECT
  song_id, STRING_AGG(DISTINCT genres.genre, ', ') AS genres
FROM (
  SELECT
    song_id,
    genre
  FROM
    versions v,
    UNNEST(STRING_TO_ARRAY(v.genre_list, ', ')) AS genre
) AS genres
GROUP BY genres.song_id;

By this point I could return a table listing the genres by song, with a song_id column and a genres column.

song_id         genres
______________  ______________1               "Blues & Jazz, RnB & Soul, Vintage, Beats"
2               "Folk, Rock"
3               "Dance, Electronic, Pop, Rock"
4               "Ambient, Beats, Rock"
1               "Blues & Jazz"

Step 4: The final touches

The final step was to join this to the songs table as another subquery so that I could include other song-related columns, like title. The following has been edited for clarity:

SELECT
  s.id,
  s.title,
  genres_by_song.genres
FROM songs s
LEFT JOIN (
  SELECT
    song_id,
    STRING_AGG(DISTINCT genres.genre, ', ') AS genres
  FROM (
    SELECT
      song_id,
      genre
    FROM
      versions v,
      UNNEST(STRING_TO_ARRAY(v.genre_list, ', ')) AS genre
  ) AS genres
  GROUP BY genres.song_id
) AS genres_by_song
ON s.id = genres_by_song.song_id
GROUP BY s.id, genres_by_song.genres

Lessons Learned

When I was tasked with creating this report, I thought it would be simple. I didn’t realize how difficult it would be just because of how the data was stored in the database (as a varchar) instead of something more list-like. In the end, though, I had fun figuring out the right order and combination of data transformation and translocation to make it all work. Sometimes the only way to make it work is two steps forward, one step back, and that’s ok.

*Updated on August 31, 2017. Special thanks to Torsten Grust, who pointed out that the use of UNNEST in a SELECT clause is deprecated.