The many faces of DISTINCT in PostgreSQL

3 powerful uses of DISTINCT in PostgreSQL.

Haki Benita
Statuscode
7 min readMay 11, 2017

--

For a better reading experience, check out this article on my website.

I started my programming career as an Oracle DBA. It took a few years but eventually I got fed up with the corporate world and I went about doing my own thing.

When I no longer had the comfy cushion of Oracle enterprise edition I discovered PostgreSQL. After I gotten over not having proper partitions and MERGE statement (aka UPSERT) I found some nice unique features in PostgreSQL. Oddly enough, a lot of them contained the word DISTINCT.

Duplication is scary (The shining, 1980)

DISTINCT

I created a simple Employee table with name, department and salary using mock data from this site:

What is DISTINCT?

SELECT DISTINCT eliminates duplicate rows from the result.

The simplest use of distinct is, for example, to get a unique list of departments:

(easy CS students, I know it’s not normalized…)

We can do the same thing with group by

but we are talking about DISTINCT.

DISTINCT ON

A classic job interview question is finding the employee with the highest salary in each department.

This is what they teach in the university:

Legal has two employees with the same high salary. Depending on the use case, this query can get pretty nasty.

If you graduated a while back, you already know a few things about databases and you heard about analytic and window functions, you might do this:

The result is the same without the duplicates:

Up until now, this is what I would have done.

Now for the real treat, PostgreSQL has a special nonstandard clause to find the first row in a group:

This is wild! Why nobody told me this is possible?

The docs explain DISTINCT ON:

SELECT DISTINCT ON ( expression [, …] ) keeps only the first row of each set of rows where the given expressions evaluate to equal

And the reason I haven’t heard about it is:

Nonstandard Clauses
DISTINCT ON ( … ) is an extension of the SQL standard.

PostgreSQL does all the heavy lifting for us. The only requirement is that we ORDER BY the field we group by (department in this case). It also allows for “grouping” by more than one field which only makes this clause even more powerful.

IS DISTINCT FROM

Comparing values in SQL can result in three outcomes — true, false or unknown:

The result of comparing NULL with NULL using equality (=) is UNKNOWN (marked as NULL in the table).

In SQL 1 = 1 and NULL IS NULL but NULL != NULL.

It’s important to be aware of this subtlety because comparing nullable fields might yield unexpected results.

The full condition to get either true or false when comparing nullable fields is:

And the result:

This is the result we want but it is very long. Is there a better way?

PostgreSQL implements the SQL standard for safely comparing nullable fields:

PostgreSQL wiki explain IS DISTINCT FROM:

IS DISTINCT FROM and IS NOT DISTINCT FROM … treat NULL as if it was a known value, rather than a special case for unknown.

Much better — short and verbose.

How other databases handle this?

  • MySQL — a special operator <=> with similar functionality.
  • Oracle — Provide a function called LNNVL to compare nullable fields (good luck with that…).
  • MSSQL — Couldn’t find a similar function.

ARRAY_AGG (DISTINCT)

ARRAY_AGG was one of the major selling points of PostgreSQL when I was transitioning from Oracle.

ARRAY_AGG aggregates values into an array:

I find ARRAY_AGG useful mostly in the CLI for getting a quick view of the data, or when used with an ORM.

PostgreSQL took it the extra mile and implemented the DISTINCT option for this aggregate function as well. Using DISTINCT we can, for example, quickly view the unique salaries in each department:

We can immediately see that everyone in the support department are making the same salary.

How other databases handle this?

  • MySQL — Has a similar function called GROUP_CONCAT.
  • Oracle — Has an aggregate function called ListAgg. It has no support for DISTINCT. Oracle introduced the function in version 11.2 and up until then the world wide web was filled with custom implementations.
  • MsSQL — The closest I found was a function called STUFF that accepts an expression.

Take away

The only take away from this article is that you should always go back to the basics!

--

--

Haki Benita
Statuscode

Full Stack Developer, Team Leader, Independent. More from me at https://hakibenita.com