What are Duplicate Records in SQL

Published in

365 Data Science

5 min readAug 27, 2018

Duplicate records in SQL, also known as duplicate rows, are identical rows in an SQL table. This means, for a pair of duplicate records, the values in each column coincide. Usually, we will retrieve duplicate data, when we are joining tables. So, if you want to learn a thing or two about the syntax of the JOINS, take a look at the linked tutorial.

Side note: We will be working with the ‘ employees’ database, so if you don’t have it on your computer, feel free to check out how to download it in the following article.

Generally, duplicate rows are not always allowed in a database or a data table. The regular practice of many companies is to clean their data from such occurrences. However, they are sometimes encountered, especially in new, raw, or uncontrolled data. In this tutorial, we will show you a way to handle them in your queries.

Creating Duplicate Records in SQL

As a matter of fact, our data is clean; it does not have any duplicate rows. Therefore, we can insert one row in the “ dept_manager_dup “ table that is identical to one of the others:

INSERT INTO dept_manager_dup 
VALUES('110228', 'd003', '1992-03-21', '9999-01-01');

The last value is ‘9999–01–01’ because we want the contract to be with an indefinite end.

We will also create a duplicate row in the other table. Let it be the ninth department, “ Customer Service “. The code should be:

INSERT INTO departments_dup 
VALUES('d009', 'Customer Service');

Results from the Code

Let’s check what happened with the two tables first. We can achieve that by writing:

SELECT 
    * 
FROM 
    dept_manager_dup 
ORDER BY dept_no ASC;

The table with the data about the managers looks like it hasn’t changed. However, the employee numbered 110228 appears there twice, as shown in the picture above.

And how about the “ departments_dup” table? We find it in the same way we did before; the only difference is that the Customer Service department is shown twice, as you can see in the picture below.

Different Output

Fine, the situation is clear. Now, look at the picture below — it is a snipped photo of the output of 20 rows, when the initial state of the two tables was used, without duplicates.

How do you expect the new output to change regarding the one from the picture? Do you think the number of presented rows will be lower, equal, or higher?

Let’s check the output of the same query, this time with the two tables containing duplicate records.

We obtained 25 rows. Where do you think these additional 5 rows came from?

Why We Have More Rows

First, we don’t have just one, but two employees with number 110228; hence, two times the department name “ Human Resources “ is appended against them. This gives us a single additional record compared to what we had earlier.

The remaining 4 extra records are at the end of this result set. The “ departments_dup” table contains a duplicate row for department number 9, “ Customer Service “. That’s why every time department number 9 from the first table is matched with department number 9 from the second table, two records are displayed.

In the picture above, you can see each of these employees twice: 111784, 111877, 111939, and 111692!

How to Handle Duplicate Records in SQL

Whether having duplicate values is a mistake in the database or you just don’t want to see duplicate rows in query results, here is how to handle duplicates:

GROUP BY the field that differs most among records. In our case, this will be the “ emp_no” column. This will stack all rows that have the same “ emp_no “. By writing the following code, we should retrieve the initial output, with no duplicate values:

SELECT 
m.dept_no, m.emp_no, d.dept_name 
FROM 
dept_manager_dup m 
JOIN departments_dup d ON m.dept_no = d.dept_no 
GROUP BY emp_no 
ORDER BY dept_no;

Keeping an Eye on Duplicate Data

This is an important tool in your arsenal. You cannot allow yourself to assume there are no duplicate rows in your data, especially if it contains millions of rows! It would be good to get used to grouping the joins by the field that differs most among records.

So, if you were thinking that there was only one type of a JOIN in SQL, let me prove you wrong. Dive into the next tutorial to see what a LEFT JOIN is.

Next Tutorial: SQL LEFT JOIN

Originally published at https://365datascience.com