Basic SQL Concepts and Query Strategies

Josh Orrick
J&T Tech
Published in
7 min readApr 8, 2022

Aspiring data scientist or data analyst? Then one thing you most definitely will need to know for the job is how to extract relevant data from a database. Companies store their data in a structured database, and you as an scientist/analyst have to know how to query it to answer business relevant questions. In this post, I will walk through some basic examples, and some tips and tricks for the basic types of problems that any aspiring data scientist/analyst might encounter.

Case Study

We will examine a case study where a diner has just hired you to run their analytics department (yes, in 2022 a diner has an analytics department!). Danny’s diner has data about sales, the menu, and members of its frequent customer club. The datasets have the following relationship:

Danny’s Diner Data Tables

In terms of lingo, we would say that each set of data resides in a table, named sales, members, or menu. These tables live within the same database, called dannys_diner although in reality they don’t have to. Our goal is simple: write a query to the database to gather pieces of information we desire. We do this through the use of Structured Query Language (SQL), which is our interface to selecting data we want. Let’s go through some examples of how to approach writing SQL for questions we might be interested in.

Simple Queries

How many unique customers are there?

So, we want to determine how many unique customers there are. This is similar to the problem of just listing out all of the customers that have made a purchase in our diner. Let’s write two queries to find both. The problem solving strategy for SQL queries should be:

  1. Identify what we want as a result of our query: we want a listing of unique customers, and then a count of unique customers.
  2. Work backwards from the result. What groupings/aggregations/ordering would you need to get to the result?
  3. What fields must you select as part of your result?

Since we want a listing of unique customers, we want to return each individual customer record. We only want to return distinct records. This should give us a record of each customer who made a purchase. Here is the query and result:

Query
Result

A couple of observations:

  • We used SELECT in order to state what we actually want to return from the table.
  • We used the word DISTINCT to return only unique customer_id field records.
  • We had to use FROM to tell the system what table to lookup our data from. Notice the syntax, we have the database.table format.

Now, let’s return the total number of unique customers. We know the result should be 3 from the results, but here is the query we can use:

Notice that the aggregation function COUNT helped count the total records we had from the distinct customers. Also, we returned the result under a new alias using the AS clause. This can be really helpful to label our results.

How many days has each customer visited the diner?

To answer this, we need to gather up each customer’s order records. We will use the GROUP BY clause to aggregate the count of order_id’s for each customer. Anytime we need to aggregate data for each value in a group, we can use the GROUP BY clause:

The distinct order dates per customer allow us to examine each customer’s rows in the table, and total up how many days they visited. When we use a GROUP BY clause, it has the effect of first selecting all rows relevant to a single customer, then work with the data all related to that single customer.

More Complex Queries

Some questions may require us to utilize multiple tables for the information we need. For example, notice that we have a sales table with sales records, but then a menu table with actual data on each menu item a customer might purchase in a sale. So, these tables are linked together, and we can run a JOIN operation to join together the tables for queries.

What is the total amount spent by each customer?

This is another aggregation, and we really want to highlight each customer and understand how much money did they spend. This of course is related to their products they ordered, so we need to connect each sale record to a product record in the menu, and grab the price. Then it’s a matter of summing the prices for each product record for each customer:

A few observations:

  • The customer_id is included in what we select because it’s used in the GROUP BY clause. Anytime a GROUP BY is used, any fields in the SELECT clause have to be either used in the GROUP BY clause or have to be involved in an aggregation.
  • The key to this query is the JOIN operation. We want to merge together the sales and menu table, and link it together based on the field product_id. This is what we call the “join key”. We take all records from each table and merge them together into the rows where the product_id’s are equal. This has the effect of appending new columns to the sales table that give relevant product data for each product_id.
  • The ORDER BY clause at the end is optional, but it presents the results in an orderly fashion by having the largest total as the first returned record.

Complex Queries

Some queries might require us to calculate a “subquery” first, and then use it to gather information from that intermediate result. Some queries might also require us to partition the table into rankings or windows of time to aggregate our results over. Let’s take a look at some examples.

What item was the most popular for each customer?

This requires a ranking of each of the customer’s purchases. To do a ranking, we must instruct our query to partition by a group (very similar to a GROUP BY ), and within this partition, create an ordering by some value. In our case, we want to partition the data by customer and determine which product had the most orders as our ranking. Here is a way to generate a ranked_item subquery, than can then be queried for the top ranking product:

So, here are some observations:

  • The WITH clause sets up a named subquery that we can use in subsequent queries. Think of it as returning an intermediate table that we can actually query from. The intermediate result is stored in the table rank_item.
  • We select a RANK() OVER clause which will create a rank column in the returned table. Inside the clause we specify how to construct the ranking. In this case, we rank product_name counts for each customer.
  • The final query actually queries our intermediate result table rank_item, and just returns the ranks and products for each customer. This doesn’t directly answer the question, but gives us all the data we need to, we just have to downselect to the number 1 ranked item for each customer. This is easily accomplished with a WHERE clause in the final query:

Notice that customer B actually has a tie where Ramen, curry, and sushi all were ordered twice as the top ranked order.

Conclusion

SQL can be hard! If you want that data scientist or analyst role, you are going to have to practice more and more queries. It’s worth noting that the syntax isn’t exactly what’s important, it’s the problem solving process. Here, I was using Postgres13, but any SQL language will be similar. The case study I used for the examples is from Danny Ma’s website, which I highly encourage a visit to: https://8weeksqlchallenge.com/case-study-1/. This can provide some more practice and even an interactive SQL session to practice those queries.

Happy queries!

--

--

Josh Orrick
J&T Tech

Mathematician by trade, converted data scientist and ML enthusiast.