Making charts with Postgres

In 2016 at RJMetrics (now Magento Analytics), we built out yet another chart builder called the SQL Report Builder.

I was excited.

RJMetrics long built itself on a reputation for being able to do reporting “without SQL” but now we’d get the best of both worlds. Business users and data analysts could live in the same reporting system and use the power of Postgres SQL to build reports. We launched. Within a few months, over 60% of our customers had used it at least once.

It also made me realize how bad my SQL skills really were.

I’m fine at application SQL. CRUD, normalization, indexes, datatypes, foreign keys, and auto updated_at are my jam. But I realized several months ago when I sat down to build a time series report…

So I got better. Here are some SQL tips for building reports from a Postgres or Redshift data warehouse.

Time series reports

Take an event or a value, add a function (sum, count), and see it change over time. The query below gives you a value for every day.

SELECT COUNT(id) AS “orders”, to_char(date, ‘YYYY-MM-DD’) AS “day” 
FROM orders
GROUP BY “day”
ORDER BY “day”

Variation: Use ‘YYYY-MM’ for monthly data or ‘YYY-MM-DD HH24’ for hourly data.

Moving date ranges

Only want to see the last 30 days for this metric?

SELECT COUNT(id) AS “orders”, to_char(date, ‘YYYY-MM-DD’) AS “day” 
FROM orders
WHERE date > current_date — interval ‘30’ day
GROUP BY “day”
ORDER BY “day”

Variation: use interval ‘3’ month for the last three months of data

Case statements

Does your database include a bunch of enumerated IDs and their corresponding names aren’t listed in the database anywhere?

SELECT COUNT(id), 
CASE WHEN status=0 THEN 'Pending'
WHEN status=2 THEN 'Complete'
WHEN status=3 THEN 'Error'
WHEN status=4 THEN 'Returned'
ELSE 'Unknown status type'
END as "status"
FROM orders
GROUP BY status

Time between two timestamps

Your table has two timestamp columns and you want to calculate the difference between them. This query calculates the seconds between the two timestamps.

SELECT EXTRACT(epoch FROM (ended_at — started_at))
FROM orders

Variation: Check the average order duration over time

SELECT AVG(EXTRACT(epoch FROM(ended_at — started_at))), to_char(started_at, 'YYYY-MM-DD') as “day”
FROM orders
GROUP BY “day”
ORDER BY “day”

IN clauses

For the love of SQL, don’t do this

SELECT * FROM orders
WHERE store_id = 100
OR store_id = 127
OR store_id = 156
OR store_id = 189
OR store_id = 212

Do this

SELECT * FROM orders
WHERE store_id IN (100, 127, 156, 189, 212)

JOIN against a key value table for WHERE clauses

If your database is sufficiently complex, it’s likely you have at least one key-value table, maybe a bunch. Their data might look like this.

id user_id key value

Example rows might look like this

1 1 name Kim
2 1 email kim@gmail.com
3 1 tier premium
4 2 name Thor
5 2 email thor@gmail.com
6 2 tier basic
7 3 name Ben
8 3 email ben@bengarvey.com
9 3 tier premium

These can be tricky to join against, because you can easily end up writing nasty, many-to-many queries. Here’s a strategy I like to use if you need to use the key-value table in a WHERE clause. This query gets all orders for users in the premium tier who are on gmail.

SELECT COUNT(o.id) FROM orders o
WHERE o.user_id IN
(SELECT u.user_id FROM users u WHERE u.tier = 'premium')
AND o.user_id IN
(SELECT u.user_id FROM users u WHERE u.email LIKE '%gmail.com')

Now, be careful because these queries aren’t going to scale well and won’t be very performant on huge data sets, but whoever set this db up in the first place knew that tradeoff, right? RIGHT?