Introduction to SQL: An Intermediate Guide — Part 2 of a 3 Part Series

Andy Sawyer
9 min readApr 28, 2024

--

This article aims to provide a stepped guide to SQL, and while it can be read in isolation, extends on the basic introduction found here. SQL is indispensable in roles such as Data Analyst and Data Engineer, offering a way to query data, configure database structures, and more. We will explore intermediate SQL functionalities including eliminating duplicates, combining results, CTE’s, sub-queries, and window functions. Let’s dive into the intermediate level queries and build a solid foundation in SQL.

Our Simple Database

First things first, if you’ve skipped the basics article, let’s take a look at what our database looks like, and get a copy set up to play with. It’s only four tables:

  1. employee;
  2. customer;
  3. department; and
  4. sale.

These four tables are related to one another (hence the term ‘relational database’). Below is the Entity Relationship Diagram (ERD) for the database:

Entity Relationship Diagram

Setting Up Your Database

I’m going to use a website called SQLiteOnline. It’s free, and it allows you to set up a basic database and have a play with it. You don’t need to set up an account, but make sure you’ve selected PostgreSQL from the left hand side:

Screenshot from website

You’ll find copies of the four files that create and populate the four tables in my Github repository here. You need to copy/paste the contents of each file in the create_table/ folder into the website, and hit the ‘run’ button:

Screenshot from website

You’ll end up with the four tables listed, as shown in my first screenshot.

Eliminating Duplicates

There are times when you might have a table, and you want to see what unique values exist in a column. There might be millions of records, but only a handful of unique values. For this, you can use the DISTINCT clause. This comes right at the top of the query before the column (or columns) that you want to look at.

SELECT DISTINCT
region
FROM
sale

The above will list out the regions in which sales have occurred. If there are only 4 regions there will only be 4 results, even if there are millions of sales records.

Combining Results

In the last article, we looked at joining tables together in a way that allows you to add columns from multiple tables to produce a wider dataset. But what if the tables have the same columns and you want to ‘stack them’ one on top of another to create a dataset that contains the combined data represented as more rows? For this, you can use the UNION statement.

UNION and UNION ALL

Combining results from multiple queries is streamlined with UNION, which merges and removes duplicates, or UNION ALL, which includes all records, including duplicates if the same value appears in both tables:

SELECT
first_name,
last_name
FROM
employee
UNION
SELECT
first_name,
last_name
FROM
client

The above will create a dataset that contains all first and last names for all employees and clients together.

By default, a UNION produces a distinct set of data. For clarity, you can also write this as UNION DISTINCT. If you have an employee that is also a client, their name will only be listed once. If you want to see it listed twice (once from each table), you can use the UNION ALL query.

To show which table the data came from, you can add a string as an alias. In the below example, I’ve expanded on the above by adding the table name as an additional column within each dataset:

CTEs and Sub-queries

As your queries become more complex, you will want to improve their structure for readability. Common Table Expressions (CTEs) and sub-queries are powerful tools in SQL that can simplify complex queries, improve readability, and make your SQL scripts more maintainable. Below we can look into these in more depth.

Common Table Expressions (CTEs)

CTEs, or Common Table Expressions, provide a way to define a temporary result set that you can reference within another SQL statement. They are often used to simplify complex joins, aggregations, or to isolate specific operations in SQL queries.

Advantages of CTEs

  • Readability: CTEs make queries easier to read by breaking them into manageable, named blocks.
  • Maintainability: Changes can be made easily in one place without needing to alter multiple areas of the query.
  • Recursive Queries: CTEs support recursive queries, which are useful for dealing with hierarchical or tree-structured data, like organisational charts or category trees.

Example Usage

WITH RegionalSales AS (
SELECT
region,
SUM(sale) AS total_sales
FROM
sale
GROUP BY
region
)
SELECT 
region,
total_sales
FROM
RegionalSales
WHERE
total_sales > 20000

In this example, the CTE RegionalSales is used to calculate total sales by region, which is then referenced in the main query to filter regions greater than 20,000 in sales. This is a very simple example, which is an alternative to the above HAVING clause. But you can have multiple CTEs in a single query, and then bring them all together at the end with a number of joins in your final SELECT statement.

Sub-queries

Sub-queries are queries nested inside another query. They can be used in various parts of a main SQL statement, including SELECT, FROM, and WHERE clauses.

Advantages of Sub-queries

  • Flexibility: Sub-queries can be used almost anywhere in SQL statements, allowing for dynamic construction of complex conditions and calculations.
  • Isolation: Sub-queries can isolate calculations and conditions within a query, making parts of the SQL statement independent from others.

Example Usage

SELECT 
first_name,
last_name,
(
SELECT
COUNT(*)
FROM
sale
WHERE
sale.employee_id = employee.id
) AS num_sales
FROM
employee

This example uses a sub-query to count the number of sales for each employee directly in the SELECT clause.

CTEs vs. Sub-queries

When to Use Which?

  • Complexity and Performance: CTEs can sometimes offer better performance for complex queries, especially when the same sub-query would have to be executed multiple times in different parts of a query.
  • Recursion Needs: If you need to perform a recursive operation, CTEs are the way to go. We’ll look at recursive CTE’s in the next article.
  • Single Use Computations: For conditions or calculations that are used only once, a sub-query might be simpler and more direct.

Both CTEs and sub-queries enhance the power and readability of SQL queries. The choice between using a CTE or a sub-query often depends on the specific requirements of the query, such as the need for recursion, reuse of results, and the overall impact on readability and performance. The one piece of advice I would give however is that if you could write your query with either a CTE or a sub-query, I would opt for the CTE every time. They are far easier to read and understand. It will make ‘future you’ happy when something goes wrong and you need to go back and debug your code.

Window Functions

A window function is a type of function in SQL that operates on a subset of rows, known as a “window,” within the result set of a query. Unlike aggregate functions that we saw in the previous post, which collapse multiple rows into a single result, window functions perform calculations across a set of rows while preserving the individual rows in the output. This means that each row in the result set can have its own value computed by the window function based on the defined window.

Here’s a breakdown of key concepts related to window functions:

  1. Window: The window defines the set of rows over which the window function operates. It can be defined using the PARTITION BY clause, which divides the result set into partitions, and the ORDER BY clause, which specifies the order of rows within each partition.
  2. Partitioning: PARTITION BY divides the result set into partitions or groups of rows. The window function is applied separately to each partition, and calculations are performed within each partition independently.
  3. Ordering: ORDER BY determines the order of rows within each partition. It defines the logical ordering of rows used by the window function to calculate results. Ordering is optional but often used to establish meaningful sequences for functions like ranking or calculating moving averages.
  4. Frame: The frame further refines the window by specifying a subset of rows within the partition defined by the window. It is specified using clauses like ROWS BETWEEN or RANGE BETWEEN. The frame allows you to define the range of rows over which the window function operates within each partition.
  5. Types of Window Functions: Common window functions include ranking functions (e.g., ROW_NUMBER(), RANK(), DENSE_RANK()), aggregate functions (e.g., SUM(), AVG(), MAX(), MIN()), and analytic functions (e.g., LAG(), LEAD(), FIRST_VALUE(), LAST_VALUE()). These functions can be applied to the rows within the window to calculate results based on specific criteria.

Here’s an example of a window function:

SELECT
CONCAT(employee.first_name, ' ', employee.last_name) as employee,
CONCAT(client.first_name, ' ', client.last_name) as client,
sale.date,
sale.region,
sale.sale,
ROUND(AVG(sale.sale) OVER (PARTITION BY sale.employee_id, sale.region), 2) AS average_sale
FROM
sale
INNER JOIN employee
ON sale.employee_id = employee.id
INNER JOIN client
ON sale.client_id = client.id
ORDER BY
sale.date,
sale.employee_id,
sale.region,
sale.client_id

In this example, we’re doing a number of things so I’ll step through them.

We’ve started with the sale table, and joined in both the employee and client tables. In doing so, we’re able to get the employee and client names. On lines 2 & 3, you can see that we’ve used the CONCAT function to join some fields and values together. On line 2 we’ve joined the employees first name, and a <space>, and their last name, and aliased this field to ‘employee’. When the query runs it shows their full names. We’ve done the same with the client name.

We’ve pulled in the date and the region for each transaction, and then added the sale amount. The last column is our window function. Here, we’ve taken the average sale amount for the employee in the region, related to the record that is being shown. We’ve used the ROUND function as taking an average may result in many decimal places.

You can see the result of this clearly in the second and third result. Adam Knight has two transactions in the North region. One of them is for $193.21, and the second is for $491.35. For both of these, the last column (our window function) shows that the average sale amount for Adam in the North is $342.28.

Another good use for a window function is using the ROW_NUMBER() function with a CTE, and then filtering on the row number that is returned. In the below query, we’re able to filter to see only the most recent record for each of our employees:

WITH
base_data AS (
SELECT
CONCAT(employee.first_name, ' ', employee.last_name) as employee,
CONCAT(client.first_name, ' ', client.last_name) as client,
sale.date,
sale.region,
sale.sale,
ROw_number() OVER (PARTITION BY sale.employee_id ORDER BY sale.date DESC) AS row_num
FROM
sale
INNER JOIN employee
ON sale.employee_id = employee.id
INNER JOIN client
ON sale.client_id = client.id
)

SELECT
*
FROM
base_data
WHERE
row_num = 1

In some databases (like Snowflake for example) you can put the window function into a QUALIFY statement, and remove the need to wrap the query as a CTE.

Conclusion

This article expanded on the first post that introduced very basic topics, and added DISTINCT, UNION, CTE’s and sub-queries and window functions. In the final post in the series we can look at at recursive CTE’s and pivoting data as advanced query types.

--

--

Andy Sawyer

Bringing software engineering best practices and a product driven mindset to the world of data. Find me at https://www.linkedin.com/in/andrewdsawyer/