SQL Fun and Games

5 min readMar 9, 2024

Finding the last one . . .

One of the common query types which is deceptively difficult is: “find the last order for each customer.” You need to return one row for each customer no matter how old the last order was. Simple, right? Not really.

The Problem

You have a table named “ReadingList” in MSSQL database. This table has a list of records of users who added a book into ReadingList and when the book was added. Table data: ReadingList

UserID  BookID  DateAdded
1       1       2015-1-1
1       2       2015-10-1
1       3       2015-6-1
2       4       2015-1-1

Then here is the question: Please write a SQL query to select the latest book which was added for each user.

Correct output

UserID  BookID  DateAdded
1       2       2015-10-1
2       4       2015-1-1

Setup the data

DECLARE @ReadingList TABLE
 ( UserId INT NOT NULL, 
    BookId INT NOT NULL, 
    DateAdded DATETIME NOT NULL);

 INSERT INTO @ReadingList ( UserId, BookId, DateAdded ) VALUES ( 1, 1, '2015-1-1' )
 INSERT INTO @ReadingList ( UserId, BookId, DateAdded ) VALUES ( 1, 2, '2015-10-1' )
 INSERT INTO @ReadingList ( UserId, BookId, DateAdded ) VALUES ( 1, 3, '2015-6-1' )
 INSERT INTO @ReadingList ( UserId, BookId, DateAdded ) VALUES ( 2, 4, '2015-1-1' )

I have used the problem above many times while interviewing programmers with a strong SQL background. Or, at least they said that they were good at it! I always asked people to rate themselves from 0–10 on various skills like C#, JavaScript, and SQL Server. Anything over 5 and I gave them the test. The large majority failed to produce a correct answer. Many started with a GROUP BY such as:

SELECT UserID, MAX(BookID) AS BookID, MAX(DateAdded) AS DateAdded
FROM   @ReadingList
GROUP BY UserID

But that does not work. If they clearly were not going to get a correct solution even with a couple of hints, I would show them each of the 6 solutions below and have them tell me if it would work or not. I’d tell them that all may work, some might work, or none. Their ability to read and understand code became clear as they analyzed these queries. When they had guessed at all 6, I ran the query and asked them to look at any they missed now that they knew the answer.

Solutions

1

-- Query 1             0.01146446            
-- Query with order by 0.0260059            
 
WITH ReadingData
AS (SELECT UserId
         , MAX(DateAdded) DateAdded
 FROM @ReadingList
 GROUP BY UserId)
SELECT rl.*
FROM ReadingData dr
  INNER JOIN @ReadingList rl ON rl.UserId = dr.UserId
                                AND rl.DateAdded = dr.DateAdded

2

-- Query 2             0.01146446            
-- Query with order by 0.0260059            
SELECT rl.*
FROM @ReadingList rl
WHERE rl.DateAdded =
(
 SELECT MAX(rl2.DateAdded)
 FROM @ReadingList rl2
 WHERE rl2.UserId = rl.UserId
)

3

-- Query 3             0.01146446            
-- Query with order by 0.0260059            
SELECT rl.UserId
  , rl.BookId
  , rl.DateAdded
FROM @ReadingList rl
  INNER JOIN
  (
         SELECT UserId
           , MAX(DateAdded) AS DateAdded
         FROM @ReadingList
         GROUP BY UserId
  ) t ON rl.UserId = t.UserId AND rl.DateAdded = t.DateAdded

4

-- Query 4             0.0114646            
-- Query with order by 0.014645              
SELECT rlr.UserId
  , rlr.BookId
  , rlr.DateAdded
FROM
(SELECT *, ROW_NUMBER() OVER (PARTITION BY rl.UserId 
        ORDER BY rl.DateAdded DESC) RowRank
 FROM @ReadingList rl
) rlr
WHERE rlr.RowRank = 1

5

-- Query 5             0.0065716
-- Query with order by 0.0179324            
SELECT rl.*
FROM @ReadingList rl
WHERE NOT EXISTS
(
 SELECT 1
 FROM @ReadingList rl2
 WHERE rl2.UserId = rl.UserId AND rl2.DateAdded > rl.DateAdded
)

6

-- Query 6             0.0065711            
-- Query with order by 0.0179328            
SELECT rl.*
FROM @ReadingList rl
  LEFT JOIN @ReadingList rl2 ON rl2.UserId = rl.UserId AND rl2.DateAdded > rl.DateAdded
WHERE rl2.UserId IS NULL

7

-- Query 7             0.0068546            
-- Query with order by 0.018223            
SELECT rl.*
FROM @ReadingList rl
WHERE EXISTS(
 SELECT b.UserID, MAX(b.DateAdded) MaxDate
 FROM @ReadingList b
 GROUP BY b.UserID
 HAVING B.UserID = rl.UserID AND MAX(b.DateAdded) = rl.DateAdded)
ORDER BY rl.UserId

The small number to the right on the solutions (e.g. 0.01146446) is the subtree cost of the query. What is remarkable is that the first 3 have the exact same actual query plan:

The 4th query is almost identical to the first three and the last 2 are completely different than the first 4 but they have almost identical query plans themselves:

We either have a sort and a table scan or two table scans. Which is always faster? Neither. Any one of these patterns may be more or less efficient depending on the row count, indices, and even the statistics. If you look at the default statistics for these queries, they are poor.

Stylistically, I do not like queries 1 and 3. For queries with lots of joins, I find that having the correlated subquery in the JOIN rather than the WHERE clause more difficult to read but that does not make them bad choices given the right circumstances.

These types of queries are often useful in removing duplicates which is often quite challenging but that is another topic.

LEFT JOIN or NOT EXISTS?

The topic of left joins vs not exists get a lot of coverage like this. I used to work with someone who insisted that the LEFT JOIN was always the most efficient. Since he was the CTO, I used that pattern, but while that was true in early versions of SQL Server, it is not true now. When I’ve scaled this exact query, the LEFT JOIN pattern did poorly. In this situation, neither the left join or the not exists are the most efficient solutions.

Happy Coding

Happy coding and if you are going to be interviewed about SQL, if you come to clearly understand these 7 queries and why they work, I think you will be ahead of the game. And, if you have a different solution, I would love to hear about it and I will add it to the list and give you credit for it.