A clever trick— finding duplicates

Basic and intermediate SQL topics with the cosmos

Karen Warmbein
3 min readJul 12, 2020
Image credit: CESAR @ ESA

Doubles, twins, and duplicates exist in this world — in fact, the universe, too. Binary and multiple star systems exist. For example, Mizar, a star in the Big Dipper’s, handle is a quadruple star system. Can you imagine seeing multiple sunrises and sunsets?

Take this a step further — suppose you have a dataset and want to check if it has duplicate records. This is sometimes referred to as de-duplicating your data. How is this accomplished for a SQL RDBMS table? Read on to find out a clever way to attack this problem!

The data behind the examples

I am going to use a database table that I created called solar_system_20. Here’s some metadata about the table’s columns:

  • ss_name — the unique name of the Solar System object
  • diameter— the equatorial width of the object (in miles)
  • classification — the type of object in the record (like star or planet)
  • ss_location — the location of the object in the Solar System

This table has facts about the 20 largest objects in the Solar System. I’ve queried this table multiple times for various SQL tutorials:

These tutorials show hands-on examples of aggregate functions, subqueries, aliases, and the GROUP BY and HAVING commands in SQL.

Let’s view this table with a simple SELECT query to obtain the first five records.

Ok, now we have a visual understanding of this table.

Finding duplicates

An important tool to have in you SQL toolbox is to be able to identify duplicates in a table. This can be as simple as looking for records that have the same name or id. Or, it can be more complex, like below.

Suppose we want to find the classifications in solar_system_20 that have more than one record. For example, with this query we can find all the classifications and records.

SELECT  classification
, count(*)
FROM solar_system_20
GROUP BY classification

Result:

We can use an aggregate function with HAVING to limit the output and find duplicate classifications — meaning, the classifications with more than one record.

SELECT  classification
, count(*)
FROM solar_system_20
GROUP BY classification
HAVING count(*) >= 2;

Result:

Concatenate Strings

By the way, another cool and unique tool is the ability to concatenate strings in the table. Take a look at this.

SELECT
CONCAT_WS(' -- ', ss_name, classification) ss_info
FROM solar_system_20
ORDER BY classification;

Result:

This query combined all the record’s names with it’s classification and returned a view with just this new combination named ss_info.

There you have it, folks! More SQL tools. For your reference, I have a series of SQL tutorials that cover basic and intermediate queries. Take a look!

Like these SQL posts? Have a question? Let’s chat in the comments!

--

--