Mastering SQL Subqueries: A Comprehensive Guide

Unlock the power of SQL Subqueries — from basics to nested and correlated subqueries. Learn how to effectively analyze complex data directly within your database.

Manu Mulaveesala
12 min readMay 11, 2023
Stream of light particles emanating from a central narrow hallway space (representing immensity of data)
Photograph by Joshua Sortino (Unsplash)

Have you ever found yourself overwhelmed by the complexity of managing and manipulating databases? Maybe you have dabbled in SQL (Structured Query Language), but have not gotten to the point where you feel you could actually solve real business problems with your SQL queries. The leap from Beginner-level understanding of SQL and SQL mastery is not so far as one might think. This SQL article series will be about enabling people to make that leap of mastering SQL in a simple-to-understand, comprehensive way.

Today, our topic is one of those things that separates the SQL-beginners from the SQL elite: Subqueries. One of SQL’s most robust features is the ability to use subqueries, which are like queries within queries, significantly enhancing your data handling capabilities. This article will delve into various aspects of SQL subqueries, including basic subqueries, nested subqueries, correlated subqueries, and subqueries in the FROM clause.

Meme of Yoda from Star Wars that says: “Learn SQL, we must”

Basic Subqueries

In SQL, a subquery is a query that is nested inside a main query. Subqueries can return individual values or a list of records; these values can then be used by the main query for comparison purposes. A basic subquery is a standalone query that can run independently and is embedded within another query. It’s a self-contained unit that produces a result set, which is then used by the outer query. Essentially, it’s like going to the market to fetch specific items (subquery), then using those items to cook a meal (outer query).

They’re used to answer complex questions that need data from multiple tables, and they can be a powerful tool when combined with operators like IN, EXISTS, ANY, ALL, and others.

The basic structure of a subquery looks like this:

Basic Structure of the general form of a subquery.
Subqueries represented Diagrammatically — Query is a Circular/Elliptical space with Subquery as another space inside the larger Query space. The diagram also mentions that the outer circle can be the Main query, or an outer query and inner circle is Subquery or “inner” query. There may be many levels of this. Text inside image: “The subquery passes its final value to the main query.”
A visual guide to a subquery as Data spaces. Source.

Let’s use an example with a simple data set of one of the most basic (and common) types of relational databases. Imagine we have two tables: Orders and Customers. The Orders table contains all the orders made by customers, while the Customers table contains all the customer details.

OurOrders table contains individual order information and looks like this:

Orders table

OurCustomerstable just matches customer names to their respective customer ID and looks like this:

Customers table

Here’s how we can modify the basic structure of a subquery shown above to perform this task:

The subquery (the part within the parentheses) is executed first.

(SELECT CustomerID FROM Customers WHERE Name = 'Alice') fetches the CustomerID for Alice and the outer query uses that to find all the products ordered by Alice. The CustomerID selected here will result in a resultingCustomerIDof 1 since we can see that this is the corresponding ID for Alice.

Therefore the value1 will then be the result for the subquery and the full query would then essentially “reduce” to the following:

This main query gets executed next. It retrieves the OrderID, Product, and Quantity from the Orders table where the CustomerID is the same as the one returned by the subquery (which is Alice’s CustomerID = 1).

The ‘operator’ in the original query can be any comparison operator like =, >, <, ≥, <=, etc., depending on the specific condition you want to apply.

In summary, this SQL query is using a subquery to retrieve records from a table based on a condition that relies on information from another table.

Just a cautionary note, always ensure that your subquery returns a single value when using comparison operators like =, >, <, etc. If the subquery might return more than one result, consider using operators like IN, ANY, or ALL, again depending on the specific requirements of your query.

Why Subqueries are Useful

In the previous example, it would seem that we could have just used the reduced query and “hard-coded” the CustomerID = 1 , but the reality is that we may not know this information going in. In programming, we always want to try and include the most “general” or “robust” means of coding. If we did not know the customer ID beforehand (or rather, wanted to directly search for the appropriate ID, based on a name like “Alice,” our subquery comes in really handy! In the real world, we would probably add a further level of abstraction so that the resultant tables are automatically generated programmatically for any name, not just for “Alice.”

Nested Subqueries

Nested subqueries are subqueries that contain another subquery. They are often used when the outer query’s results depend on the result of a subquery that itself depends on another subquery. The innermost subquery runs first, then its result is used in the next outer subquery, and so on, until the final outer query can be processed.

The nesting of subqueries can run several layers deep. Picture it as a set of Russian nesting dolls. Each nested subquery is a doll within the outer doll, and you cannot reach the innermost doll without opening the outer ones. In the same vein, the innermost query cannot execute without the outer ones.

Meme of Leonardo Decaprio from Inception squinting his eyes: “What If I Told You, You Could Query A Query?” — refers to the nature of Nested Subqueries (A subquery within a subquery)

Case Study: Nested Subquery for Top Customers

Sometimes in a business context, we want to understand which customers are our most “loyal” or most “frequent” customers. We may want to compare some customers in relation to other customers or to our customer pool using statistical measures.

Suppose that we wanted to understand which of our customers are our “top customers” defined as those customers who make an “above average” number of orders. How could we use SQL to isolate the customer information for the customers that have purchased more items than the “average” customer?

First, we have to define, “What is an average customer?” In this case, we have a limited amount of information, so the best “metric” to judge customer loyalty would be the quantity purchased. Therefore we will want to take the AVG(Quantity) across all our orders, while keeping track of which Customers made those purchases.

Let’s take a sneak peek at the final query and then work backward to break it down:

In this case, we actually have two sets of parentheses nested (or layered) within one another. Therefore this is a “subquery within a subquery” or a nested subquery. In order to understand how the SQL interpreter handles this type of command, recall the following principle:

Innermost subqueries are evaluated before outer queries.

Inner Queries Before Outer Queries

Therefore, the best way to think about this query is to understand the logic of the query:

------ 1)Inner Query 
------ Get the Average Quantities
----2)Outer Query
---- Get the Customer IDs of customers WHERE they buy more than average
--3)Outermost Query
-- Get the Customer names of the respective CustomerIDs

Now that we’ve gotten the basic parts of the components of our query nailed down, we’ll want to order the query in such a way that we can get these components going from Outermost to Innermost. Remember that the interpreter will still interpret the logic as we provided above, but the query itself will need to be as follows:

In this example, the innermost subquery calculates the average number of orders per customer. The next outer subquery uses this average to find customers who have placed more than the average number of orders. The outermost query then gets the names of these customers.

This was a complex query, but don’t worry if you didn’t get it all at once. Take your time to analyze each part, and remember that the innermost query executes first, and its result is used by the outer queries.

Correlated Subqueries

A correlated subquery is a type of SQL subquery where the inner query depends upon the simultaneous execution of the outer query. In simpler terms, the inner query may be executed once for each row processed by the outer query. This interdependence distinguishes correlated subqueries from their uncorrelated counterparts, in which the inner query can be run independently once and its result used by the outer query. In a correlated subquery, the subquery relies on the outer query for its values. This means that the subquery executes once for every row processed by the outer query.

Case Study: Employee Salaries

Imagine you are helping out a friend in the HR department pull some numbers and he asks, “Can you list the names of the employees who earn more than the average salary within their respective departments?” How would you consider approaching this problem with a correlated subquery?

This query does the following:

  1. Goes through each employee in the “Employee” table, aliasing it as “e1”.
  2. Checks if the salary of each employee is greater than the average salary of all employees who are in the same department (determined by matching the “DepartmentID” of the outer query’s employee “e1” with the “DepartmentID” of the inner query’s employees “e2”).
  3. If an employee’s salary is greater than the average salary in their department, their name will be included in the final result set.

Is this the only way to answer the question above? Certainly not! Here is another way we can tackle this going back to our nested subqueries, but by adding in a JOIN clause.

  1. In the subquery, we calculate the average salary (AvgSalary) for each department (DepartmentID), grouping the data by department.
  2. We then join this subquery (e2) with the original Employee table (e1) on the DepartmentID field, which matches each employee with the average salary for their department.
  3. In the WHERE clause, we filter out the employees whose salary is greater than the average salary of their department (e1.Salary > e2.AvgSalary).
  4. Finally, we select the EmployeeName from the result.

An Analogy for Subqueries: The Exam Stack

A student fills out a testing bubble sheet (Image is for our analogy about exams and subqueries vs correlated subqueries)
Photo by Nguyen Dang Hoang Nhu on Unsplash

Imagine you’re a teacher grading a stack of exams. A regular subquery would be like having a list of the average scores for each subject. You could use this list (the result of the subquery) to easily find out which students scored above average in their subjects.

A correlated subquery, on the other hand, would be like grading each paper, and for each question, comparing the student’s answer to the answers given by all other students who took the same exam. For each student (each row in the main query), you have to go through the entire stack of exams (execute the subquery) to get the information you need.

Subquery Correlation and Performance

Unlike the subqueries we’ve seen so far, a correlated subquery cannot be run independently of the outer query because it refers to columns in the outer query. For each record processed by the outer query, the subquery is executed anew with the current values of the outer query’s columns.

The “correlation” occurs because the subquery uses information from the outer query and the subquery executes once for each row retrieved by the outer query. This correlation between the inner and outer queries allows the inner query to return different results for each row processed by the outer query.

It’s essential to note that while correlated subqueries can be very powerful, they can also be slower than standard subqueries or joins, because they may need to execute once for each record returned by the main query. Hence, they should be used carefully and it is important to account for performance while using them. Therefore it is always best practice to limit the number of rows first and test your performance before moving on to running your correlated subqueries on a large dataset.

For this reason, the alternate version of the Employee salary example provided that includes a JOIN clause actually would be expected to perform faster than a correlated subquery all things kept equal. However, the correlated subquery is a little bit easier to read, so it really depends on the size of the dataset you may be dealing with and whether production efficiency or code readability is more important.

Subqueries in the FROM Clause

Finally, subqueries can be used in the FROM clause of an SQL statement.

Sometimes, it’s useful to treat the result of a subquery as though it were a table itself. SQL allows us to do this by including a subquery in the FROM clause of a SELECT statement.

Let’s suppose we have a Sales table with columns Region, Year, and Revenue, and we want to find the region with the highest total revenue for each year. We could solve this with a subquery in the FROM clause:

In this example, the subquery groups the sales by year and region and calculates the total revenue for each group. The outer query then takes this result and finds the maximum total revenue for each year.

Subquery: Yearly Sales by Region

First, we have to get the Yearly sales by adding up the total revenue for the whole year and stratifying the results ( GROUP BY ) by Year and Region .

Main query: Max Total Revenue for each year

Now that we have the yearly sales figures for each year and region, it is really a matter of finding the maximum values within this new resulting table, as shown by the query above. So, overall, this query returns a table with each year from the sales data, and for each year, the highest total revenue that was achieved in any region.

Think of getting subqueries in the FROM clause simply as a means of changing the table that we are SELECT ing from!

All the Flavors of Subqueries

Subqueries are queries nested inside other queries. They can be used in various clauses like SELECT, FROM, WHERE, and HAVING. They enable complex database queries and can be used to perform operations that cannot be done in a single query.

Types of Subqueries:

Hierarchical Diagram of Types of Subqueries: Single Row, Multiple Row, Nested, Multiple Column, Correlated, Uncorrelated
  • Single Row Subqueries: These are subqueries that return only one row from the inner SELECT statement. They are often used with operators like =, >, <, >=, <=, or <>. For instance:
SELECT column_name(s)
FROM table_name
WHERE column_name = (
SELECT column_name
FROM table_name
WHERE condition);
  • Multiple Row Subqueries: These subqueries return more than one row from the inner SELECT statement. They are often used with operators like IN, ANY, or ALL. For example:
SELECT column_name(s)
FROM table_name
WHERE column_name IN (
SELECT column_name
FROM table_name
WHERE condition);
  • Nested Subqueries: These are subqueries that exist within another subquery. They allow multiple layers of filtering and extraction. For instance:
SELECT column_name(s)
FROM table_name
WHERE column_name IN (
SELECT column_name
FROM table_name
WHERE column_name IN (
SELECT column_name
FROM table_name
WHERE condition));
  • Multiple Column Subqueries: These subqueries allow multiple columns to be returned in the SELECT clause of the subquery. This can be used when you need to compare more than one column result with the outer query. For example:
SELECT column_name1, column_name2
FROM table_name
WHERE (column_name1, column_name2) IN (
SELECT column_name1, column_name2
FROM table_name
WHERE condition);
  • Correlated Subqueries: These are subqueries that refer to a column in a table in the outer query. They are executed once for each candidate row in the outer query. For instance:
SELECT column_name(s)
FROM table_name t1
WHERE column_name operator (
SELECT column_name
FROM table_name t2
WHERE t1.column_name = t2.column_name);

Correlated vs Uncorrelated Subqueries:

  • Uncorrelated subqueries: These can be run independently of the outer query. They are executed once for the entire query. Remember that they (inner subqueries) can run independently of the outer (main) query.
  • Correlated subqueries: These depend on data from the outer query. They are executed once for each row processed by the outer query. These will often have a reference like WHERE t1.column_name = t2.column_name to the outer query making them non-independent (correlated).
An image that has 3 triangles mentioning some types of operators that support the SQL subquery: 1) WHERE Clause 2) HAVING Clause 3) FROM Clause

Practical Applications of Subqueries:

Subqueries can be used to find aggregate values over certain partitions of data.

Examples in the real world:

  • Finding the maximum revenue per year across multiple regions
  • Identifying employees who earn above the average salary in their respective departments.
  • Identifying customers who appear to be purchasing more frequently than the rest of the customer base

Tying it All Together: To Subquery or Not to Subquery?

Meme from Napoleon Dynamite: “My SQL Query worked, so I guess you could say things are getting pretty serious”
My SQL “Subquery” worked and so things are getting pretty serious.

Remember that it is important to remember that not every SQL problem statement requires the same hammer of a “subquery.” Sometimes there may be alternative solutions using JOINs, window functions, or grouping sets that offer better performance or readability. As you expand your SQL skill set and continue to practice, you’ll begin to be able to not only see alternate solutions but be able to test these various solutions for yourself. There is more than one way to slice the SQL cat!

Keep at it and you will see how these seemingly complex and challenging queries gradually become second nature.

Remember, SQL is a tool, and like any tool, it’s all about using the right feature at the right time for the right job. Subqueries are a powerful part of the SQL toolset, enabling complex data analysis directly within the database. So the next time you’re working with SQL, remember these insights into subqueries and harness their full potential.

Enjoyed this piece? Click the Follow button to get my latest programming and data science guides and tutorials straight to your Medium feed!

Thanks for Reading!

--

--