SQL Fun and Games
Finding the last one . . .
One of the common query types which is deceptively difficult is: “find the last order for each customer.” You need to return one row for each customer no matter how old the last order was. Simple, right? Not really.
The Problem
You have a table named “ReadingList” in MSSQL database. This table has a list of records of users who added a book into ReadingList and when the book was added. Table data: ReadingList
UserID BookID DateAdded
1 1 2015-1-1
1 2 2015-10-1
1 3 2015-6-1
2 4 2015-1-1
Then here is the question: Please write a SQL query to select the latest book which was added for each user.
Correct output
UserID BookID DateAdded
1 2 2015-10-1
2 4 2015-1-1
Setup the data
DECLARE @ReadingList TABLE
( UserId INT NOT NULL,
BookId INT NOT NULL,
DateAdded DATETIME NOT NULL);
INSERT INTO @ReadingList ( UserId, BookId, DateAdded ) VALUES ( 1, 1, '2015-1-1' )
INSERT INTO @ReadingList ( UserId, BookId, DateAdded ) VALUES ( 1, 2, '2015-10-1' )
INSERT INTO @ReadingList ( UserId, BookId, DateAdded ) VALUES ( 1, 3, '2015-6-1' )
INSERT INTO @ReadingList ( UserId, BookId, DateAdded ) VALUES ( 2, 4, '2015-1-1' )
I have used the problem above many times while interviewing programmers with a strong SQL background. Or, at least they said that they were good at it! I always asked people to rate themselves from 0–10 on various skills like C#, JavaScript, and SQL Server. Anything over 5 and I gave them the test. The large majority failed to produce a correct answer. Many started with a GROUP BY such as:
SELECT UserID, MAX(BookID) AS BookID, MAX(DateAdded) AS DateAdded
FROM @ReadingList
GROUP BY UserID
But that does not work. If they clearly were not going to get a correct solution even with a couple of hints, I would show them each of the 6 solutions below and have them tell me if it would work or not. I’d tell them that all may work, some might work, or none. Their ability to read and understand code became clear as they analyzed these queries. When they had guessed at all 6, I ran the query and asked them to look at any they missed now that they knew the answer.
Solutions
1
-- Query 1 0.01146446
-- Query with order by 0.0260059
WITH ReadingData
AS (SELECT UserId
, MAX(DateAdded) DateAdded
FROM @ReadingList
GROUP BY UserId)
SELECT rl.*
FROM ReadingData dr
INNER JOIN @ReadingList rl ON rl.UserId = dr.UserId
AND rl.DateAdded = dr.DateAdded
2
-- Query 2 0.01146446
-- Query with order by 0.0260059
SELECT rl.*
FROM @ReadingList rl
WHERE rl.DateAdded =
(
SELECT MAX(rl2.DateAdded)
FROM @ReadingList rl2
WHERE rl2.UserId = rl.UserId
)
3
-- Query 3 0.01146446
-- Query with order by 0.0260059
SELECT rl.UserId
, rl.BookId
, rl.DateAdded
FROM @ReadingList rl
INNER JOIN
(
SELECT UserId
, MAX(DateAdded) AS DateAdded
FROM @ReadingList
GROUP BY UserId
) t ON rl.UserId = t.UserId AND rl.DateAdded = t.DateAdded
4
-- Query 4 0.0114646
-- Query with order by 0.014645
SELECT rlr.UserId
, rlr.BookId
, rlr.DateAdded
FROM
(SELECT *, ROW_NUMBER() OVER (PARTITION BY rl.UserId
ORDER BY rl.DateAdded DESC) RowRank
FROM @ReadingList rl
) rlr
WHERE rlr.RowRank = 1
5
-- Query 5 0.0065716
-- Query with order by 0.0179324
SELECT rl.*
FROM @ReadingList rl
WHERE NOT EXISTS
(
SELECT 1
FROM @ReadingList rl2
WHERE rl2.UserId = rl.UserId AND rl2.DateAdded > rl.DateAdded
)
6
-- Query 6 0.0065711
-- Query with order by 0.0179328
SELECT rl.*
FROM @ReadingList rl
LEFT JOIN @ReadingList rl2 ON rl2.UserId = rl.UserId AND rl2.DateAdded > rl.DateAdded
WHERE rl2.UserId IS NULL
7
-- Query 7 0.0068546
-- Query with order by 0.018223
SELECT rl.*
FROM @ReadingList rl
WHERE EXISTS(
SELECT b.UserID, MAX(b.DateAdded) MaxDate
FROM @ReadingList b
GROUP BY b.UserID
HAVING B.UserID = rl.UserID AND MAX(b.DateAdded) = rl.DateAdded)
ORDER BY rl.UserId
The small number to the right on the solutions (e.g. 0.01146446) is the subtree cost of the query. What is remarkable is that the first 3 have the exact same actual query plan:
The 4th query is almost identical to the first three and the last 2 are completely different than the first 4 but they have almost identical query plans themselves:
We either have a sort and a table scan or two table scans. Which is always faster? Neither. Any one of these patterns may be more or less efficient depending on the row count, indices, and even the statistics. If you look at the default statistics for these queries, they are poor.
Stylistically, I do not like queries 1 and 3. For queries with lots of joins, I find that having the correlated subquery in the JOIN rather than the WHERE clause more difficult to read but that does not make them bad choices given the right circumstances.
These types of queries are often useful in removing duplicates which is often quite challenging but that is another topic.
LEFT JOIN or NOT EXISTS?
The topic of left joins vs not exists get a lot of coverage like this. I used to work with someone who insisted that the LEFT JOIN was always the most efficient. Since he was the CTO, I used that pattern, but while that was true in early versions of SQL Server, it is not true now. When I’ve scaled this exact query, the LEFT JOIN pattern did poorly. In this situation, neither the left join or the not exists are the most efficient solutions.
Happy Coding
Happy coding and if you are going to be interviewed about SQL, if you come to clearly understand these 7 queries and why they work, I think you will be ahead of the game. And, if you have a different solution, I would love to hear about it and I will add it to the list and give you credit for it.