Basic SQL statements you need to know (GROUP BY/HAVING/INNER JOIN)

Little Dino
CodeX
Published in
8 min readJun 24, 2022

Introduction

In the previous article, we talked about the very basic SQL statements. In this article, we’ll focus on slightly more complex ones — GROUP BY, HAVING, and INNER JOIN, yet they are very powerful.

SQL statements

In the following demonstration, we’ll use the same table as an example. The table is called customers, which records the information of customers. It has 4 features — ID, Name,City, Age, and there are 6 customers in this table. Further, the primary key is ID since you can only use ID to uniquely identify each row.

If you are unfamiliar with relational database, go check out my previous article!

1. GROUP BY

The first one is used to group rows that have the same values into summary rows, and you can decide which type of summary you want (i.e., MAX/MIN/AVG/SUM/COUNT).

Say we want to know the average age of our customers in each city. How do we group our customers?

Yes, by city. Then, we can use the AVG function to get the average age in each city. To be more specific, the query goes like SELECT City, AVG(Age) FROM customers GROUP BY City;, and it returns —

What if we now want to know the total number of customers in each city?

We change the function to COUNT. One thing to note is that COUNT(*) will include duplicates and null values, it simply calculates the number of rows. If you want to count the number of UNIQUE rows, you have 2 options.

  • Add DISTINCT keyword before column name. It will remove duplicates (see how to use DISTINCT) — SELECT City, COUNT(DISTINCT Name) Customers FROM customers GROUP BY City;.
  • Count the number of rows in Primary key. As we talked about before, primary key is used to uniquely identify the rows. There won’t be any duplicate values in it, so we can use this property to find the number of unique rows — SELECT City, COUNT(ID) Customers FROM customers GROUP BY City;.

The two methods generate the same result, even though Name is not primary key.

⚡ The second column name is Customers because we rename the column by assigning new name a space after the original column name.

That’s pretty much how we use GROUP BY! Next question is that what if we want to find the city with at least 2 customers?

You might think of WHERE clause (see how to use WHERE). Unfortunately, WHERE CANNOT be used with aggregate functions, such as COUNT, SUM, AVG. Therefore, we need another clause — HAVING.

2. HAVING

To put it simply, HAVING acts exactly same as WHERE, except that it’s designed for aggregate functions. If we want to find the city with at least 2 customers, we simply write SELECT City FROM customers GROUP BY City HAVING COUNT(ID) > 1;, which returns —

If you check the previous result, you will find Paris is indeed the only city with at least 2 customers. Similarly, if you want to find the city which customers have average age higher than 30, you write SELECT City FROM customers GROUP BY City HAVING AVG(Age) > 30;.

3. INNER JOIN

Knowing how to extract information from 1 table is useful and critical, but a lot of times we would want to merge information from multiple tables. In particular, INNER JOIN allows us to select records that have matching values in 2 tables.

The syntax goes like SELECT * FROM table1 INNER JOIN table2 ON table1.column1 = table2.column2. As you might notice, ON is used to select the columns you want to match. For instance, you have 2 tables in the database. One stores the personal information of customers, another stores the order information of customers. We might want to match 2 tables to see the order preference of customers in a particular age range, so we need to join 2 tables. Which column should we choose?

I know I didn’t give you the detailed information of 2 tables, but we’d use customer IDs. ID is appropriate because it can be used to uniquely identify each customer. Imagine we use name as matching attribute. It’s almost certain that there must be more than 1 customers with the same name, how do we know exactly who orders which items? Is it the John in Chicago or the John in London?

Let’s do an interesting experiment. Say we inner join the customers table with itself on Name attribute — SELECT * FROM customers c1 INNER JOIN customers c2 ON c1.Name = c2.Name. Yes, we can do that (might look a little bit insane I know). What would the result be like?

Ideally, each customer should be match to himself/herself. However, this is not what happened, the result is actually—

Most customers are merged as we expected, but John appears 4 times. Why would that happen?

Yes, you are right, we didn’t use primary key (ID) to merge tables. Hence, John in Chicago has to merge with himself and John in London. Same thing happens to another John. John in London has to merge with himself and John in Chicago. Now you see the problem. We prefer using primary key as matching attributes to avoid this problem (i.e., the matching customers are in fact different people).

However, if you insist, we can still use Name to merge 2 tables, only that we need an extra column — City. Since our table is pretty small, you can see the only problem is that there are 2 John live in different city. If we use both Name and City to merge table, they will be viewed as different people. In other words, we use multiple column conditions. The query goes like — SELECT * FROM customers c1 INNER JOIN customers c2 ON c1.Name = c2.Name AND c1.City = c2.City;, and it returns —

Problem fixed! John in Chicago is only matched to himself, and John in London is also matched to himself. Nevertheless, our database is so small that we can find the problem manually. If our database is large enough, we should find 2 people with same name and age live in the same city. In that case, we’ll still need ID to merge tables (which is actually the best option).

Sometimes you might be able to use INNER JOIN without raising any error, but the result is problematic or does not reflect the reality. An analysis based on incorrect information retrieval will be biased, and the decision based on biased analysis will have consequences. Thus, we’d want to be as careful as we can, and here are some notes that can help us reduce errors.

  • Avoid ambiguity by renaming tables and specifying table’s name

From the query above, you might notice I renamed the first customers table to c1, the second customers table to c2. Then, when I refer to a column, I was specific about the table, i.e., c1.Name (Name column from first customers), c2.Name (Name column from second customers). The renaming is mandatory in this case since the names of 2 tables are exactly the same. Moreover, the specification of table name is also mandatory here since the Name column appears in both table. If we don’t specify which Name we are referring to, SQL will be confused and return error.

However, if you have different table names and different column names, neither renaming nor specifying is mandatory. Having said that, we tend to rename the table to a shorter name for convenience (don’t need to type the whole name when specifying). We also specify table name before the column for clarity and debugging (It’s easier to debug when you know where a column is from). Thus, renaming and specifying are absolutely good habits to keep.

  • No matching value —Returns empty table

When there is no matching value in the column, SQL just returns an empty table. This is normally a sign of failed query. You might use the wrong table, refer to the wrong column, or the database itself has issues.

  • Check result on partition of tables

A table in the real world might have tens of thousands of rows (or even more). It’s difficult to figure out if the join result is correct given that amount of data. What I often do is partitioning the table first and validate the join result on small tables. For instance, We can randomly take 20 rows from each table, inner join them, and check if the result is as expected. Then, we repeat the validation process a few times since the rows are randomly selected. We might encounter situation where the query works for part of data (i.e., Johns in different cities were not selected in validation dataset), not the whole dataset.

  • Select the columns we need

As you can see, the columns in both tables will be displayed even when they have exact same names and matched values. We should select the column we need to avoid redundancy. For example, we join 2 tables c1, c2 on Name and City attributes, then the resulting values in these 2 columns will be the same in both tables (which is essentially the meaning of join). Thus, we can select either c1.Name, c1.City or c2.Name, c2.City. The resulting table will be neater than the original one (showing the same attributes twice).

  • Join more than 2 tables

In fact, we can join multiple tables! For instance, we have 3 tables in the database. The first one stores the personal information of customers; the second one stores the order information of customers; the last one stores the details of products. We might want to match 3 tables to see the product preference of customers in a particular age range, so we need to match 3 tables. To illustrate, matching the first and second table is not enough since the second table doesn’t contain product details; matching the second and third table is not enough since the second table doesn’t contain customer details (i.e., age). Matching the first and third table is not feasible since there is not common attribute.

You might wonder why don’t we store all the data in a table. Actually, doing that could cause redundancy and anomalies, so we need to normalize tables (This is however the topic for another day). The key point here is that we can join as many tables as we wish, and the syntax goes like SELECT * FROM table1 INNER JOIN (table2 INNER JOIN table3 ON table2.column1 = table3.column2) ON table1.column1 = table2.column2;.

Congratulations, you’ve come so far! In the previous article and this article, we went through SELECT, ORDER BY, LIMIT, DISTINCT, WHERE, GROUP BY, HAVING, and INNER JOIN, and these keywords/clauses should be enough for you to extract information from a database. However, practice makes perfect. You can find tons of SQL practice online, go practice so you can combine the clauses fluently. Good luck!

💛 If you like this article, make sure to follow me! It really encourages me and motivates me to keep sharing. Thank you so much.

References

  1. https://www.w3schools.com/sql/sql_groupby.asp
  2. https://www.w3schools.com/sql/sql_having.asp
  3. https://www.w3schools.com/sql/sql_join_inner.asp

--

--

Little Dino
CodeX
Writer for

Welcome to my little world! I LOVE talking about machine learning, data science, coding, and statistics!