Three useful, yet neglected, SQL statements

Mailei Vargas
The Startup
Published in
5 min readAug 7, 2019

Make your database extraction easier or prep for your next interview…

Schools wouldn’t neglect to teach the ABC’s before teaching to read or write, so why do many data science programs often neglect to teach SQL thoroughly before teaching some of the more complex data science concepts?

When I first began working as a Data Analyst, I was green when it came to writing SQL queries. I had done some online crash courses and learned the basics such as SELECT, UPDATE, CREATE, WHERE, GROUP BY, HAVING, LIMIT, SUM, AVG, etc. It wasn’t until I was asked to extract some more complex finds that I realized there were some major limitations to my SQL knowledge.

I want to share with you some of the more useful SQL statements that may make your quest for that perfect outcome easier. These statements are CASE, FILTER, and WITH.

Let’s begin with the very useful CASE statement.

My very knowledgeable engineering coworker made me aware of CASE. CASE is SQL’s way of handling IF/ELSE or IF/THEN logic.

It is typically organized as seen below:

CASE WHEN <condition1> THEN <result1> 
[WHEN <condition2> THEN <result2>
. . . ]
[ELSE <result>]
END

The END statement is necessary to complete the statement block, however the ELSE statement is optional.

Let’s say we have a table with the following schema:

From this table you want a count of the number of patients who are categorized into each decade of life. (e.g. 20s, 30s, etc.) per physician. Here is one way you could structure the query:

SELECT 
physician_last_name,
CASE WHEN age >= 60 THEN '60+'
WHEN age >= 50 THEN '50s'
WHEN age >= 40 THEN '40s'
WHEN age >= 30 THEN '30s'
WHEN age >= 20 THEN '20s'
WHEN age >= 13 THEN 'Teens'
ELSE '12 or younger'
END AS age_group,
COUNT(*) AS age_group_count
FROM tutorial.patient_list
GROUP BY 1, 2
ORDER BY 1, 2;

Below is a sample of the resulting table:

Which brings me to the next useful SQL statement (one of my favorites!), FILTER.

I find the FILTER statement very useful when trying to create the equivalent of a pivot table as in Excel (or other languages). A pivot-like table can also be accomplished with the CASE statement mentioned above used inside of an aggregate function, however I find FILTER to be more straightforward and readable.

FILTER is used inside of an aggregate function and is structured as seen below:

SELECT SUM(<expression>) FILTER (WHERE <condition>)

NOTE: You can replace SUM with any other aggregate function.

So let’s pivot the above table of count, by age group, by physician using FILTER. Here is how the query would look:

SELECT
physician_last_name,
COUNT(*) FILTER (WHERE age >= 60) AS “60+”,
COUNT(*) FILTER (WHERE age >=50 AND age < 60) AS "50s",
COUNT(*) FILTER (WHERE age >= 40 AND age < 50) AS "40s",
COUNT(*) FILTER (WHERE age >= 30 AND age < 40) AS "30s",
COUNT(*) FILTER (WHERE age >= 20 AND age < 30) AS "20s",
COUNT(*) FILTER (WHERE age >= 13 AND age <20) AS "Teens",
COUNT(*) FILTER (WHERE age <= 12) AS "12 or younger"
FROM tutorial.patient_list
GROUP BY 1
ORDER BY 1;

And the resulting pivot-like table appears as the following:

Lastly is the widely unused but very handy WITH statement.

Let me first explain why I needed WITH. WITH has saved me many times when a nested SELECT statement just wasn’t cutting it for what I needed. The company I worked for had a large database of many relational tables. I needed to join 5 tables with many aggregated SELECT, WHERE and GROUP BY clauses in order to get all the info I needed for a single client in one place. Then I needed to do an aggregation on the joined tables. In my research for trying to get this into one query, I stumbled across WITH.

Now let me explain what WITH does. WITH is an alternative to a VIEW, but the result is not stored in the database schema. Instead, it creates a temporary table only valid in the query it belongs to. This makes it possible to improve the structure of a statement without polluting the global namespace.

WITH temp_table_name (column_name1, ...) AS
(SELECT ...)
SELECT column_name
FROM temp_table_name ...

Here is an example to demonstrate how WITH could be used.

Let’s say I have executed the following query:

WITH acquisitions_paid AS (
SELECT
ca.acquirer_name,
ca.acquired_year,
cc.category_code,
cc.region,
SUM(ca.price_amount) total_paid_in_acquisitions
FROM tutorial.crunchbase_companies cc
INNER JOIN tutorial.crunchbase_acquisitions ca
ON cc.permalink = ca.company_permalink
WHERE cc.state_code = 'CA'
GROUP BY 1, 2, 3, 4
HAVING SUM(ca.price_amount) > 0
)
SELECT
region,
AVG(total_paid_in_acquisitions) avg_per_region_2010
FROM acquisitions_paid
WHERE acquired_year = 2010
GROUP BY 1;

The inner query joins two tables and aggregates the total price paid for acquisition per acquiring company, year, category, and region. The name of the temporary table is acquisitions_paid.

Here is a sample of the temporary table acquisitions_paid:

Now typical SELECT queries can be executed on this table. The outer query above produces the following result.

Conclusion

In my quest to become a better and more efficient Data Analyst/Data Scientist I realize that requires getting better at SQL, a language many courses and programs neglect to properly teach. However, at my job as a Data Analyst, and from speaking to many Data Scientists, SQL is very fundamental for extracting warehoused data in order to do some of the “cooler” visualizations and modeling.

Let’s give SQL more attention and credit where it is due!

References:

  1. Mode analytics has public databases that were used in this article. They also contain tutorials.
  2. Discussion on the global namespace in SQL

--

--

Mailei Vargas
The Startup

Data Detective | Problem Solver | Puzzle Lover | Trail Junkie | Are you going to eat that?