A clever trick— finding duplicates
Basic and intermediate SQL topics with the cosmos
Doubles, twins, and duplicates exist in this world — in fact, the universe, too. Binary and multiple star systems exist. For example, Mizar, a star in the Big Dipper’s, handle is a quadruple star system. Can you imagine seeing multiple sunrises and sunsets?
Take this a step further — suppose you have a dataset and want to check if it has duplicate records. This is sometimes referred to as de-duplicating your data. How is this accomplished for a SQL RDBMS table? Read on to find out a clever way to attack this problem!
The data behind the examples
I am going to use a database table that I created called solar_system_20.
Here’s some metadata about the table’s columns:
ss_name
— the unique name of the Solar System objectdiameter
— the equatorial width of the object (in miles)classification
— the type of object in the record (like star or planet)ss_location
— the location of the object in the Solar System
This table has facts about the 20 largest objects in the Solar System. I’ve queried this table multiple times for various SQL tutorials:
These tutorials show hands-on examples of aggregate functions, subqueries, aliases, and the GROUP BY
and HAVING
commands in SQL.
Let’s view this table with a simple SELECT
query to obtain the first five records.
Ok, now we have a visual understanding of this table.
Finding duplicates
An important tool to have in you SQL toolbox is to be able to identify duplicates in a table. This can be as simple as looking for records that have the same name or id. Or, it can be more complex, like below.
Suppose we want to find the classifications in solar_system_20
that have more than one record. For example, with this query we can find all the classifications and records.
SELECT classification
, count(*)
FROM solar_system_20
GROUP BY classification
Result:
We can use an aggregate function with HAVING
to limit the output and find duplicate classifications — meaning, the classifications with more than one record.
SELECT classification
, count(*)
FROM solar_system_20
GROUP BY classification
HAVING count(*) >= 2;
Result:
Concatenate Strings
By the way, another cool and unique tool is the ability to concatenate strings in the table. Take a look at this.
SELECT
CONCAT_WS(' -- ', ss_name, classification) ss_info
FROM solar_system_20
ORDER BY classification;
Result:
This query combined all the record’s names with it’s classification and returned a view with just this new combination named ss_info
.
There you have it, folks! More SQL tools. For your reference, I have a series of SQL tutorials that cover basic and intermediate queries. Take a look!
- Black holes, planets, and SQL
- Touring the Solar System with SQL
- Searching for moons with SQL
- Classifying black holes with SQL
- Joining Constellations with SQL
- Solar System classifications with SQL
- Many (to many) relations among the stars
- Creating tables in SQL
- A window to the solar system
Like these SQL posts? Have a question? Let’s chat in the comments!