Side Effects of SQL Joins in One-to-Many Relationships

4 min readAug 21, 2019

I told my friend, who is a data analyst turned data PM, I was learning SQL and how to join tables and his first response was “did you learn about side effects of using joins with certain table relationships?” We did not explicitly learn about this, so I decided to dive into it more and share my findings.

“The nature of joins when the relationship is 1 to many means that you can duplicate some records and you sometimes have to keep that in mind based on your query” — my smart friend who is a SQL pro

Let’s do a quick refresher on SQL joins using our classic Book and Author example. Authors can have many books, and a book belongs to an author (for this example), so this is a one-to-many relationship.

Our authors table has an id primary key, name, and age:

authors
id   |   name              |   age
1    |   James Patterson   |   72
2    |   Dan Brown         |   55
3    |   JK Rowling        |   54

Our books table has an id primary key, title, page_count, and author_id, which is a foreign key referencing an authors id:

books
id    |   title                    |   pages      |   author_id
1     |   Along Came a Spider      |   435        |   1
2     |   The Da Vinci Code        |   454        |   2
3     |   The Lost Symbol          |   528        |   2
4     |   HP & Sorcerer's Stone    |   309        |   3
5     |   HP & Chamber of Secrets  |   341        |   3

A foreign key is used to form a relationship between two tables, where the foreign key references the primary key of another table. Here, we are saying that each book belongs to a specific author.

We can (inner) join these tables with the following SQL:

SELECT *
FROM authors
JOIN books
ON authors.id = books.author_id

So now our table looks like:

id| name       | age| id| title                  | pages | author_id
1 | JPatterson | 72 | 1 | Along Came a Spider    | 435   | 1
2 | Dan Brown  | 55 | 2 | The Da Vinci Code      | 454   | 2
2 | Dan Brown  | 55 | 3 | The Lost Symbol        | 528   | 2
3 | JK Rowling | 54 | 4 | HP & Sorcerer's Stone  | 309   | 3
3 | JK Rowling | 54 | 5 | HP & Chamber of Secrets| 341   | 3

Now that we understand how to join tables with a one-to-many relationship, what side effects should we be aware of?

Let’s work through some examples using aggregate functions on our authors and books tables:

SELECT COUNT(*)
FROM authors         # => 3SELECT AVG(age)
FROM authors         # => 60.33333SELECT SUM(age)
FROM authors         # => 181SELECT COUNT(*)
FROM books           # => 5SELECT AVG(pages)
FROM books           # => 413.4SELECT SUM(pages)
FROM books           # => 2067

Now let’s use those same queries on our joined table:

SELECT COUNT(*)
FROM authors
JOIN books
ON authors.id = books.author_id     # => 5SELECT AVG(age)
FROM authors
JOIN books
ON authors.id = books.author_id     # => 58SELECT SUM(age)
FROM authors
JOIN books
ON authors.id = books.author_id     # => 290SELECT AVG(pages)
FROM authors
JOIN books
ON authors.id = books.author_id     # => 413.4SELECT SUM(pages)
FROM authors
JOIN books
ON authors.id = books.author_id     # => 2067

Huh?! Why don’t they all match up? Our queries using COUNT, AVG, and SUM functions on the authors side of our joined table don’t match up to the same queries above on just the authors table. But our queries using aggregate functions on the books side of the table do match up between the books table and the joined table.

Since each author can have many books, we may have multiple rows for the same author in our joined table, and then trying to COUNT, SUM, or AVG any of the authors columns will cause those authors to be counted multiple times. So we can’t trust any of the aggregate functions used on the authors columns (the “one” side of the relationship).

Each book belongs to an author, so each book will only ever have one row in our joined table, which is why aggregate functions still work as expected on the books columns (the “many” side of the relationship).

Okay, I get it, but how can I get around this?

There are a few things we can do.

We can add DISTINCT within our aggregate functions, which takes only the distinct/unique values:

SELECT COUNT(DISTINCT authors.id)
FROM authors
JOIN books
ON authors.id = books.author_id     # => 3SELECT AVG(DISTINCT age)
FROM authors
JOIN books
ON authors.id = books.author_id     # => 60.33333SELECT SUM(DISTINCT age)
FROM authors
JOIN books
ON authors.id = books.author_id     # => 181

Be careful using DISTINCT if you have duplicate rows and multiple authors of the same age, since then we would want that age to be counted multiple times for each author that is that age, but not count it multiple times for each of their books. In other words, we want the “dupes” to be counted in some instances but not others.

We can also use a subquery to first get each author.id from our joined table, which gives us (1, 2, 2, 3, 3). Then we are finding the rows from the authors table where the id is within our subquery and aggregating those ages:

SELECT AVG(age)
FROM authors
WHERE id IN 
(SELECT authors.id
FROM authors
JOIN books
ON authors.id = books.author_id)     # => 60.33333SELECT SUM(age)
FROM authors
WHERE id IN 
(SELECT authors.id
FROM authors
JOIN books
ON authors.id = books.author_id)     # => 181

Key takeaway — be careful using aggregate functions in a one-to-many relationship on the “one” side of the relationship.

Happy joining!

Sources

More dangerous subtleties of JOINs in SQL - petralia

Baron Schwartz gives two great examples of just how common - yet profoundly consequential - semantic errors in SQL can…

alexpetralia.com

Aggregate Functions Gone Bad and the Joins Who Made Them that Way

If you're using SQL some of the first things you probably learned about were joins and aggregate functions (such as…

looker.com

https://dzone.com/articles/dangers-sql-joins-aggregate