Why Your SQL is Slow

Techboom
Techboom
Mar 13 · 5 min read
Much like your slow SQL query, this sleeping baby elephant isn’t going anywhere anytime soon.

Here’s the situation: you run a subscription business with hundreds of thousands of users. Users can subscribe or unsubscribe at any time, as often as they like.

To keep track of the subscriptions, you keep a table in your PostgreSQL database that looks something like this:

CREATE TABLE subscription_history (
user_id BIGINT NOT NULL
, product_id INT NOT NULL
, is_active BOOLEAN NOT NULL
, ts TIMESTAMP NOT NULL
)

To follow along with the code, you can use the following script to generate some sample data.

INSERT INTO subscription_history
( user_id, product_id, is_active, ts )
SELECT
a.n AS user_id
, ROUND(RANDOM() * 3) AS product_id
, CASE WHEN RANDOM() < 0.5 THEN FALSE ELSE TRUE END AS is_active
, TIMESTAMP '2017-03-01' + RANDOM() * INTERVAL '2 years' as ts
FROM generate_series(1, 500000) AS a(n), generate_series(1,5)

The data might not be totally realistic, but it gives us the basic attributes needed: many users updating their subscription status at unpredictable times over the course of 2 years. You can adjust how much data is inserted by changing the second argument of generate_series, currently set at “500,000”.

Go ahead and take a look at the sample data:

SELECT *
FROM subscription_history
ORDER BY user_id, ts
LIMIT 1000

Then, one foul and stormy Thursday, the head of marketing stops by your desk and asks, “Can you pull a quick report on how many active users we have right now?”

As she hovers expectantly over your desk, you cast blankly through your memories of database class so many years ago. Um… maybe something with a subquery?

You tried typing this out, but when you ran it, it was so slow that you limited it to the first 5 users for testing:

SELECT
a.user_id
, a.product_id
, MAX(a.ts) AS last_event_ts
, (SELECT is_active FROM subscription_history b WHERE b.ts = MAX(a.ts)
AND b.user_id = a.user_id AND b.product_id = a.product_id) AS current_status
FROM subscription_history a
WHERE a.user_id < 5
GROUP BY a.user_id, a.product_id

You tell your imperious marketing professional that it might take a while to process the report, and you can send it to them by end of day. Looks like another late night.

Keep in mind that even after getting the results of this query, you’d need to do some grouping and filtering. SQL is a declarative language — it works best when you tell the database engine exactly what you want. You just need to think of a better way to phrase the question.

Of course, moments after the hurried executive rushes away, another query pops into your head:

SELECT
a.user_id
, a.product_id
, a.is_active AS current_status
FROM subscription_history a
WHERE NOT EXISTS (
SELECT 1 FROM subscription_history b
WHERE b.user_id = a.user_id
AND b.product_id = a.product_id
AND b.ts > a.ts
)
ORDER BY a.user_id, a.product_id

This one only took about 8 seconds to run, and does effectively the same thing as the previous query.

The difference: instead of putting the subquery in the SELECT list, you wrapped it in a WHERE NOT EXISTS clause. Your query now means “look for any records that are the most recent entry for that user/product combo (i.e., where no records exist with a more recent timestamp).

Now the database doesn’t need to look up all those other rows… as soon as it finds out that there is at least one record where b.ts > a.ts, it’s free to stop and move on to the next task. This cuts down the cost of running your query by a lot, and you didn’t need to add any indexes.

Taking that code, here’s how to get the actual report that was requested:

SELECT
q.product_id
, COUNT(DISTINCT q.user_id) AS currently_active_users
FROM (
SELECT
a.user_id
, a.product_id
, a.is_active AS current_status
FROM subscription_history a
WHERE NOT EXISTS (
SELECT 1 FROM subscription_history b
WHERE b.user_id = a.user_id
AND b.product_id = a.product_id
AND b.ts > a.ts
)
) q
WHERE q.current_status
GROUP BY q.product_id

That took me about 5 seconds to run on my laptop, and gives you a list of each product along with the number of currently active users.

If only you could have whipped that out while your boss was standing behind you! (Though, on second thought, maybe you shouldn’t let them know that this kind of thing only takes 5 seconds now…)

Subqueries are smelly code. They exist for a reason, and sometimes they’re pretty handy — but the moment I see one, it’s like I just wandered a bit too near one of the polluted canals here in Cebu… I want to get away as soon as possible!

“WHERE NOT EXISTS” is less smelly — that’s a great pattern to be familiar with. But in this case, I’ll argue that there’s an even cleaner way to solve the problem.

WITH history AS (
SELECT
a.user_id
, a.product_id
, a.is_active AS current_status
, ROW_NUMBER() OVER (PARTITION BY a.user_id, a.product_id ORDER BY a.ts DESC) AS rn
FROM subscription_history a
)
SELECT
b.product_id
, COUNT(DISTINCT b.user_id) AS currently_active_users
FROM history b
WHERE b.rn = 1
AND b.current_status
GROUP BY b.product_id

That took me about 6 seconds to run — so only slightly slower than the previous attempt. However, now there are no subqueries to deal with, and the code looks a lot easier to modify if (for example) they ask to also include a field for how long ago it was to the previous subscription:

WITH history AS (
SELECT
a.user_id
, a.product_id
, a.is_active AS current_status
, ROW_NUMBER() OVER (PARTITION BY a.user_id, a.product_id ORDER BY a.ts DESC) AS rn
, a.ts - LEAD(a.ts)
OVER (PARTITION BY a.user_id, a.product_id ORDER BY a.ts DESC)
AS time_since_last_subscription_change
FROM subscription_history a
)
SELECT
b.product_id
, COUNT(DISTINCT b.user_id) AS currently_active_users
, AVG(time_since_last_subscription_change) AS avg_time_since_last_subscription_change
FROM history b
WHERE b.rn = 1
AND b.current_status
GROUP BY b.product_id

This one also took about 6 seconds, and I’m not sure how I would have done it with WHERE NOT EXISTS.

The way this works is simple: for each user/product, it sorts all the timestamps in descending order, listing the most recent as “1” and counting up. Then in the main select, we are filtering by this field to only get row number “1” — the most recent row. We use PARTITION BY to make the count start over for each user and each product. If you wanted to get each user’s most recent action, regardless of product, just remove “product_id” from the PARTITION BY clause.

This is accomplished using “window functions” — functions like ROW_NUMBER(), RANK(), LEAD(), LAG(), FIRST_VALUE(), etc. All these functions allow you to access other rows in the result set without using joins. Remember: more joins (usually) = slower. By allowing you to avoid expensive joins, window functions vastly expand the kinds of queries you can do without watching a spinning circle in pgAdmin for 30 minutes.

“Use window functions & WHERE NOT EXISTS!” — Peegey, the Postgres elephant

There’s a lot more you can do with the window functions in Postgres— this is only scratching the surface! Let us know in the comments what SQL-related topics you’d like us to cover next.

Written by

Techboom

Database experts from the isle of Cebu. Visit us at techboom.ai

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade