Making the Most of Your Data: An Introduction to SQL Joins
Joins are the key feature of relational databases. They allow you to tap into information across tables by exploiting their relationships. They’re a way of returning a new set of data made up of columns from multiple tables (or just one in the case of a self join).
We use joins extensively at FlexMR, for example, our authorisation system, our algorithms for matching participants to research tasks, for extracting data for use in our analysis tools, and several of our GDPR compliance processes all involve intricate joins.
Using object relational mapping (ORM) tools can negate a lot of the need to write joins by hand, but there are times where the ORM doesn’t directly support what you’re trying to do; or, it does support it but does so in a sub-optimal way.
The ability to write your own Structured Query Language (SQL) queries is a valuable skill for software developers that work with relational databases.
Following Along
The example queries in this article make use of the excellent Chinook database. This is a ready-made schema and dataset that can be imported into a range of different databases.
You can import this into the database of your choice, and experiment with these queries yourself. Using the SQLite version with DB Browser for SQLite is a quick way to get going, however SQLite does not support RIGHT JOIN.
Joining the Tables
Joins are about combining columns based on matching data between tables. For example, if you wanted to retrieve a list of Album
rows with each Track
you would join on the AlbumId
column that is in both tables.
The type of join to use depends on which table’s data is most relevant to you, and how you want the results to be returned.
Note: the terms ‘Left’ and ‘Right’ refer to the tables in your queries. You SELECT FROM
the ‘Left’ table, and JOIN
the ‘Right’ table.
INNER JOIN
An ‘INNER JOIN’ will return rows from both the Left and Right tables, but only when there is a match between them. They are only used when you need data from both tables as no data will be returned unless there is related data between them.
In this example, we want to see a set made up of Album
rows and their corresponding Track
rows.
This returns:
Notice that there is a returned row for each Track
row. If there were no Track
records for a given AlbumId
then that Album
would not appear in these results. There must be a match between the two tables for an INNER JOIN to return anything.
LEFT (OUTER) JOIN
One quick note: there is no difference between a LEFT JOIN
and a LEFT OUTER JOIN.
You can use either interchangeably. If you’re mixing different joins in a single query you might find it helps readability to include OUTER
, but it is not required.
A LEFT JOIN
will return all records from the Left table, regardless of if there are any matching rows in the Right table. Any columns you SELECT
from the Right table that don’t have any data will be returned as NULL.
The Playlist
table makes for a good example. There is a one-to-many relationship between Playlist
and PlaylistTrack
, although not all of the Playlist
rows have corresponding PlaylistTrack
rows.
In this query, we SELECT
every column from both tables.
This returns 8719 rows, of which the first 20 look like this:
As we are more interested in the Playlist
records, we can filter this down to returning just one row per Playlist
by adding a GROUP BY.
Which returns each Playlist
along with the first matched PlaylistTrack
record.
Note the NULL
results for the two ‘Movies’ and ‘Audiobooks’ rows.
This is good data but, with a few tweaks, we can turn it into great information. In this next query, we’re going to:
- Show the name of the first track in the playlist with an additional join against the
Track
table. - Show how many tracks are in each playlist by using
COUNT
- Not show any of the Id columns by tweaking the initial
SELECT
Which returns the following:
RIGHT (OUTER) JOIN
A RIGHT JOIN
will return all rows from the Right table. As with a LEFT JOIN
any missing data is returned as NULL.
In practice, this type of join is rarely used. A RIGHT JOIN
can almost always be replaced with a LEFT JOIN
by reversing the order of the tables. For example,
SELECT * FROM Playlist LEFT JOIN PlaylistTrack
Is functionally the same as:
SELECT * FROM PlaylistTrack RIGHT JOIN Playlist
To keep your SQL portable between databases, it is recommended to use LEFT JOIN
rather than RIGHT JOIN.
Self JOIN
A Self JOIN is used when you want to create a data set comprised of data from more than one row on a single table.
When using a self join, you must set and use an alias for each reference to the table.
The Employee
table lists each employee and has the key ReportsTo
to reference the row of their boss. By using a self join we can list each employee along with the name of their boss.
This returns the following:
Unfortunately, ‘Andrew Adams’ is missing from the data. This is because he does not have a boss (he’s the General Manager) and so he was excluded from the results of the INNER JOIN.
We can change that INNER JOIN
to a LEFT JOIN
to ensure he is included. To avoid a NULL
value in the ‘Boss’ column, we can use a COALESCE
to provide a more appropriate value. To make the data easier to read, we can order the results by the ReportsTo
column.
Self joins can be used together with joins to other tables. This next query expands on the previous one by joining to the Customer
table to get the number of customers each employee has. As it’s an INNER JOIN
, only employees with customers will be returned. Swapping this for a LEFT JOIN
would ensure that all Employee
rows are returned, with 0
as their ‘Number of Customers’ value.
Knowing how and when to use LEFT, INNER, and Self Joins grants you a lot of power to exploit the data in your database. There are other join types to explore, such as CROSS and FULL joins; but they have much narrower use-cases and aren’t always supported across different databases.
With a bit of practice, joining tables together to find the data that you want becomes intuitive. Once you’ve mastered joins, you can combine them with other database features and functions to extract really interesting information from your data.