Three Very Different Approaches to the Same Task: A Register Shifts SQL Case Study

Igor Plotnikov · Published in Learning SQL · Nov 8, 2023

Today we are going to look at three distinct approaches to collecting shifts from an events log table. We’ll compare the efficiency and performance of these strategies and find out which one works best. Some of the resulting insights may surprise you and others may not, but either way you’ll find something curious here. Let’s get into it.

Consider a log table register_events containing POS register events with the following schema:

| id   | created_at              | event_type_id |
|------|-------------------------|---------------|
| 1001 | 2023-11-01 08:31:39.872 | 1             |
| 1002 | 2023-11-01 09:42:50.055 | 2             |
| 1003 | 2023-11-01 11:53:23.532 | 2             |
| 1004 | 2023-11-01 12:54:32.745 | 2             |
| 1005 | 2023-11-01 16:59:08.812 | 2             |
| 1006 | 2023-11-01 18:03:02.229 | 3             |
| ...  | ...                     | ...           |

Here we have an auto-incremented id, created_at as the timestamp of an event related to a register, and event_type_id as the identifier of the event type: 1 for opening a register shift, 3 for closing a register shift, and 2 for any other transaction. So basically, 1 and 3 are the event types we are interested in while solving the case.

Also note that we won’t be able to benefit from indexing, since we are working with a log table. If there are indexes on a table, the database must make sure every new entry is also findable via those indexes, so it has to update each index on every insert. That leads to poor write performance, and log tables are by nature designed for a high volume of frequent INSERT statements.
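For reference, here is a minimal sketch of what the table definition might look like. The article doesn’t give the exact DDL, so the types and identity syntax below are assumptions consistent with the sample data:

CREATE TABLE register_events (
    id            bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    created_at    timestamp NOT NULL,  -- event timestamp
    event_type_id integer   NOT NULL   -- 1 = shift opened, 3 = shift closed, 2 = other
);
-- Deliberately no secondary indexes on created_at or event_type_id:
-- each extra index would slow down the table's frequent inserts.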

All three SQL strategies described below should return exactly the same result, which is the following:

| opened_at               | closed_at               |
|-------------------------|-------------------------|
| 2023-11-01 08:31:39.872 | 2023-11-01 18:03:02.229 |
| 2023-11-02 08:31:07.041 | 2023-11-02 18:40:27.827 |
| 2023-11-03 08:37:53.176 | 2023-11-03 18:42:39.245 |
| 2023-11-04 08:40:11.063 | 2023-11-04 18:10:26.340 |
| 2023-11-05 08:12:06.643 | 2023-11-05 18:09:18.846 |
| 2023-11-06 08:42:26.396 | 2023-11-06 18:35:20.283 |
| ... | ... |

That is the only expected output. Each record here represents a register shift with its specific opening and closing timestamps, which makes it easy to analyze the duration of the shift and, potentially, other events that happen during it.
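As a quick illustration of such an analysis, computing shift durations from this result is a one-liner. Here shifts is a hypothetical relation standing in for the output of any of the three queries below:

SELECT
    opened_at,
    closed_at,
    closed_at - opened_at AS shift_duration  -- interval between open and close
FROM shifts
ORDER BY opened_at;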

Test conditions

  • SQL dialect: PostgreSQL 15.0
  • Size of register_events: 15477 rows

1. INEQUALITY JOIN WITH MIN()

RDBMS availability: ubiquitous

An inequality join is a join operation whose condition involves an inequality comparison (<, >, <=, >=) between columns of the joined tables. In contrast to equality joins, it matches rows based on values that fall before, after, or within a range of one another rather than on exact matches.
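As a toy illustration (the tables and columns here are made up for the example), this pairs each order with every payment that happened strictly after it:

SELECT
    o.id AS order_id,
    p.id AS payment_id
FROM orders o
JOIN payments p
    ON p.paid_at > o.ordered_at;  -- inequality instead of equality

Now let’s apply the idea to our shifts: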

WITH shift_openings AS (
    SELECT created_at
    FROM register_events
    WHERE event_type_id = 1
),
shift_closings AS (
    SELECT created_at
    FROM register_events
    WHERE event_type_id = 3
)
SELECT
    so.created_at AS opened_at,
    MIN(sc.created_at) AS closed_at
FROM shift_openings so
JOIN shift_closings sc
    ON so.created_at < sc.created_at
    -- bound the search window, assuming a shift never spans more than a day;
    -- this condition shows up in the query plan below
    AND sc.created_at - so.created_at < INTERVAL '1' DAY
GROUP BY 1;

Let’s understand what’s happening here. We split shift openings and closings into separate CTEs, then join them with an inequality condition that guarantees the closing time is always later than the opening time. Finally, the aggregate MIN() picks the earliest matching closing timestamp for each opening.

Let’s look at the query plan:

HashAggregate  (cost=41283.23..41297.30 rows=1407 width=16) (actual time=437.942..438.424 rows=1407 loops=1)
  Group Key: register_events.created_at
  Batches: 1  Memory Usage: 193kB
  ->  Nested Loop  (cost=0.00..40183.42 rows=219961 width=16) (actual time=0.032..435.010 rows=1407 loops=1)
        Join Filter: ((register_events.created_at < register_events_1.created_at) AND ((register_events_1.created_at - register_events.created_at) < '1 day'::interval day))
        Rows Removed by Join Filter: 1978242
        ->  Seq Scan on register_events  (cost=0.00..293.46 rows=1407 width=8) (actual time=0.016..3.985 rows=1407 loops=1)
              Filter: (event_type_id = 1)
              Rows Removed by Filter: 14070
        ->  Materialize  (cost=0.00..300.50 rows=1407 width=8) (actual time=0.000..0.084 rows=1407 loops=1407)
              ->  Seq Scan on register_events register_events_1  (cost=0.00..293.46 rows=1407 width=8) (actual time=0.006..2.004 rows=1407 loops=1)
                    Filter: (event_type_id = 3)
                    Rows Removed by Filter: 14070
Planning Time: 1.083 ms
Execution Time: 438.727 ms

Not a great result, to be honest. The plan is relatively complex due to the nested loop join and the filter conditions. And yes, before the aggregation this inequality join produces a lot of redundant records (note the 1,978,242 rows removed by the join filter), which can be a high price to pay with huge amounts of data. Nevertheless, it is a straightforward approach in conventional SQL, supported by literally every SQL dialect. So even if you run this on some archaic SQL engine like MySQL 5.7 (which supports none of the features considered below), you’ll get the desired result in the end. The only caveat: MySQL 5.7 doesn’t support WITH clauses either, so the CTEs have to be inlined as derived tables, as shown below.
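For completeness, here is a sketch of the same query with the CTEs inlined. I haven’t run it on MySQL 5.7, but it sticks to plain conventional constructs:

SELECT
    so.created_at AS opened_at,
    MIN(sc.created_at) AS closed_at
FROM (SELECT created_at FROM register_events WHERE event_type_id = 1) so
JOIN (SELECT created_at FROM register_events WHERE event_type_id = 3) sc
    ON so.created_at < sc.created_at
GROUP BY so.created_at;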

2. JOIN LATERAL WITH LIMIT 1

RDBMS availability:

  • Oracle since 12c
  • PostgreSQL since 9.3
  • MySQL since 8.0.14

If you are using a different dialect, check whether this feature is available to you.

Lateral joins are a way to correlate the result of a subquery with the rows of the outer query. This is useful when you need to reference values from one table or subquery within the context of another in the same query: the LATERAL keyword gives the subquery access to the current row being processed in the outer query, so it can use that row’s values directly.

SELECT
    scu.*
FROM register_events so
INNER JOIN LATERAL (
    SELECT so.created_at AS opened_at, sc.created_at AS closed_at
    FROM register_events sc
    WHERE so.created_at < sc.created_at
        AND sc.event_type_id = 3
    ORDER BY created_at
    LIMIT 1
) AS scu
    ON so.created_at = scu.opened_at
WHERE so.event_type_id = 1;

In this approach, we combine JOIN LATERAL with sorting and limiting inside the nested subquery. The principle is similar to the previous approach, just expressed with different SQL techniques.

  • The JOIN LATERAL clause specifies a subquery that retrieves the created_at timestamp for the earliest shift closing event that occurs after the shift opening event.
  • The LIMIT 1 clause limits the number of shift closing events to one.
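A side note on the join condition: since the lateral subquery is already correlated with so, the ON so.created_at = scu.opened_at predicate is always true. An equivalent sketch, which I find slightly more idiomatic, moves the columns around and joins ON TRUE:

SELECT
    so.created_at AS opened_at,
    sc.closed_at
FROM register_events so
INNER JOIN LATERAL (
    SELECT c.created_at AS closed_at
    FROM register_events c
    WHERE c.created_at > so.created_at
        AND c.event_type_id = 3
    ORDER BY c.created_at
    LIMIT 1
) AS sc ON TRUE  -- the correlation already happens inside the subquery
WHERE so.event_type_id = 1;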

Here is the query plan for the original formulation:

Nested Loop  (cost=334.50..470970.14 rows=469 width=16) (actual time=2.001..2707.531 rows=1407 loops=1)
  ->  Seq Scan on register_events so  (cost=0.00..293.46 rows=1407 width=8) (actual time=0.019..2.518 rows=1407 loops=1)
        Filter: (event_type_id = 1)
        Rows Removed by Filter: 14070
  ->  Subquery Scan on scu  (cost=334.50..334.51 rows=1 width=8) (actual time=1.919..1.919 rows=1 loops=1407)
        Filter: (so.created_at < scu.created_at)
        ->  Limit  (cost=334.50..334.50 rows=1 width=8) (actual time=1.917..1.918 rows=1 loops=1407)
              ->  Sort  (cost=334.50..335.67 rows=469 width=8) (actual time=1.914..1.914 rows=1 loops=1407)
                    Sort Key: sc.created_at
                    Sort Method: quicksort  Memory: 25kB
                    ->  Seq Scan on register_events sc  (cost=0.00..332.15 rows=469 width=8) (actual time=0.853..1.846 rows=704 loops=1407)
                          Filter: ((so.created_at < created_at) AND (event_type_id = 3))
                          Rows Removed by Filter: 14773
Planning Time: 0.206 ms
Execution Time: 2708.136 ms

This result is even worse than the previous one. Indeed, if there’s data skew in the outer table, meaning some rows have significantly more correlated rows in the subquery than others, JOIN LATERAL can become slow and even lose to a standard JOIN, because the subquery is executed more often for the skewed rows, causing a performance imbalance. Note also that the subquery here runs 1,407 times (loops=1407), and every execution scans and sorts the closings; ordering logic like this slows the query down badly when large result sets must be sorted over and over. In other situations JOIN LATERAL would shine, but this is just not one of them.
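To make the point concrete: what kills this approach here is that each of those 1,407 probes is a sequential scan plus a sort. If indexing were acceptable (and we said earlier it isn’t for a write-heavy log table), a composite index would turn each probe into a single index seek with LIMIT 1, and the lateral form could become very competitive. A purely hypothetical sketch:

-- Hypothetical: on a read-mostly copy of the table, this index would let
-- the planner answer each lateral probe (earliest closing after a given
-- opening) with one index seek instead of a seq scan and sort per row.
CREATE INDEX idx_register_events_type_time
    ON register_events (event_type_id, created_at);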

3. WINDOW (ANALYTIC) FUNCTION LEAD()

RDBMS availability:

  • Oracle since 8.1.6
  • PostgreSQL since 8.4
  • MySQL since 8.0

If you are using a different dialect, check whether this feature is available to you.

The LEAD() window function is a powerful SQL feature used to access the value of a column from the next row within a result set. It's commonly used for tasks like time-based analysis, detecting patterns, and calculating differences or intervals between consecutive rows in SQL queries.

WITH re AS (
    SELECT
        event_type_id,
        created_at,
        LEAD(created_at) OVER (ORDER BY created_at) AS next_event_at,
        LEAD(event_type_id) OVER (ORDER BY created_at) AS next_event_type_id
    FROM register_events
    WHERE event_type_id IN (1, 3)
)
SELECT
    created_at AS opened_at,
    next_event_at AS closed_at
FROM re
WHERE event_type_id = 1
    AND next_event_type_id = 3;

The re CTE retrieves data from the original table, using the window function LEAD() to determine next_event_at (the timestamp of the next event) and next_event_type_id (the type of the next event) for each row. The CTE is filtered to include only rows where event_type_id is either 1 or 3. The outer query then applies a WHERE filter, keeping only rows where event_type_id is 1 (a shift opening) and next_event_type_id is 3 (a shift closing). This isolates exactly the events of interest: an opening event immediately followed, within the filtered sequence, by a closing event. The query is pretty simple and readable.
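One caveat worth noting: the query silently drops openings that are not directly followed by a closing, such as a shift that is still open or two consecutive openings caused by dirty data. A sketch of how to surface those cases, reusing the same CTE:

WITH re AS (
    SELECT
        event_type_id,
        created_at,
        LEAD(created_at) OVER (ORDER BY created_at) AS next_event_at,
        LEAD(event_type_id) OVER (ORDER BY created_at) AS next_event_type_id
    FROM register_events
    WHERE event_type_id IN (1, 3)
)
SELECT
    created_at AS opened_at,
    next_event_at AS closed_at  -- NULL for a shift that is still open
FROM re
WHERE event_type_id = 1
    AND next_event_type_id IS DISTINCT FROM 3;  -- openings without a proper close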

OK, back to our main query: what does its plan tell us?

Subquery Scan on a  (cost=454.68..553.17 rows=1 width=16) (actual time=2.786..5.437 rows=1407 loops=1)
  Filter: ((a.event_type_id = 1) AND (a.next_event_type_id = 3))
  Rows Removed by Filter: 1407
  ->  WindowAgg  (cost=454.68..510.96 rows=2814 width=24) (actual time=2.785..4.909 rows=2814 loops=1)
        ->  Sort  (cost=454.68..461.72 rows=2814 width=12) (actual time=2.773..2.924 rows=2814 loops=1)
              Sort Key: register_events.created_at
              Sort Method: quicksort  Memory: 250kB
              ->  Seq Scan on register_events  (cost=0.00..293.46 rows=2814 width=12) (actual time=0.018..2.329 rows=2814 loops=1)
                    Filter: (event_type_id = ANY ('{1,3}'::integer[]))
                    Rows Removed by Filter: 12663
Planning Time: 0.184 ms
Execution Time: 5.614 ms

The difference in execution time is tremendous: 5.6 ms versus 439 ms and 2,708 ms for the previous two. There are no nested loops, and just look at how fast the WindowAgg step is. Unlike self-joins or correlated subqueries, which lead to increased memory usage, window functions work directly with the existing rows of the result set. Window functions are literally designed for this kind of task and provide advanced capabilities that are not easily achievable with standard SQL constructs. It is therefore the best approach in our case.

I hope you learned something curious for yourself. Keep experimenting and refining your SQL skills to stay ahead in the world of data.
