Grokking GROUP BY

Photo by Oleg Ivanov on Unsplash

Please note: the examples in this article were created with PostgreSQL, and other varieties of SQL may produce slightly different results.

Introduction

Upon first encountering the SQL GROUP BY clause, it appears simple enough. It’s easy to understand that GROUP BY forms groups of rows based on columns which share a common value. It’s also easy to understand that those groups can be used to refine our queries and produce more meaningful results based on those groups. However, when we begin using GROUP BY, we may find that it is also very easy to produce errors. There are certain requirements when using GROUP BY and if these are not sufficiently understood (or grokked) errors become likely occurrences. In this article, I will attempt to demonstrate what these requirements are, why they are necessary, and how GROUP BY can be used with ease to attain powerful results. Let’s begin!

What does GROUP BY do?

Most simply put, the GROUP BY clause makes groups of rows where values in the specified columns are the same. Those groups are then used in the final output of the query.

In the results, a row can contain values from grouped columns and ungrouped columns which belong to the same group. It is important to note that the output can only contain one value for each column in the group. The value for a grouped column will be the shared value of each row in the group. The value for an ungrouped column will be the single value that is the result of aggregating all of the values in the column within the group.

If this is already starting to get confusing, then let’s see how it works in action.

For the rest of this article, we will be working with the following data set about a number of pets which have various traits and belong to various categories. Right now, the data is quite messy, but we will see how GROUP BY can be used to make sense of our data in many ways.

Let’s start with a very simple example and return a list of all the species:

SELECT species FROM pets GROUP BY species;
species
-----------
chameleon
dog
cat
parrot
tortoise
iguana
owl
(7 rows)

We know that our data set has 10 rows, and the results of our query returned 7 rows. That’s because the query produced 7 groups in which the value in the species column is the same. We can see this more clearly with the following visualization of the grouped data.

Notice how the values in the grouped column are condensed to a single value in the output, e.g. ‘dog’ occurs twice in the species column, but only once in the output.

Let’s also take a look at what happens if we group the data set by class:

SELECT class FROM pets GROUP BY class;
class
— — — — -
mammal
reptile
bird

So far so good. Let’s try to return the names of all of the pets grouped by class:

SELECT class, name FROM pets GROUP BY class;ERROR: column “pets.name” must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: SELECT class, name FROM pets GROUP BY class;
^

This is where we start to run into trouble. Our error message reveals that the name column needs to be included in the GROUP BY clause, or be used in an aggregate function. The reason for this is that each column in the output must only contain one value per group. Since there are multiple names, we get an error. To understand this better, lets compare the situation with a real world example.

A Simple Analogy

Let’s say that a teacher divides their class into groups of students based on their eye colors.

The teacher then says, ‘Ok, brown-eyed kids, how old are you?’

‘…’

The kids freeze. Some kids are 8, and some are 9. They don’t know how to respond. It’s impossible for them to choose a single age for all of the kids in the group.

However, one brave and clever student does respond, ‘Teacher, we need more information. Do you mean the average of our ages, or all of our ages in a list?’

You can probably see where I’m going with this. SQL runs into the same problem that the students did. If the data is not grouped, or aggregated, an error will be thrown, just like the brave and clever student asking for better instructions.When there are multiple values in a group, SQL can not choose a single value to represent, and needs to be told in exactly what way the values should be combined together to produce one value. This is precisely what aggregate functions do.

Aggregate Functions

An aggregate function is simply a function which takes multiple values, performs some operation on them, and then produces a single value as a result. This is exactly what is needed if we want to include some information from an ungrouped column in the output of a data set that has been grouped with GROUP BY. That way, wherever there is a group of rows, and some rows have multiple different values in a column, those different values can be combined in some way to form a single value for that group.

There are many different aggregate functions, which can be explored here, but a few very useful aggregate functions include:

  • count(column_name) — This will return the number of rows in the group.
  • sum(column_name) — This will add up the numeric values of all the rows in the specified column in the group.
  • string_agg(column_name, delimiter) — This will concatenate all of the string values separated by the delimiter of all of the rows in the specified column in the group.

With this understanding of aggregate functions, let’s try again to return all of the names of the pets grouped by class. Since we know we must combine them as a single value, let’s use the string_agg function to do just that.

SELECT class, string_agg(name, ‘, ‘) FROM pets GROUP BY class;    class | string_agg 
— — — — -+ — — — — — — — — — — — — — — — — -
mammal | coco, cinderella, django, donut
reptile | tim, charlie, ivan
bird | polly, olly, oswald
Multiple separate values aggregated with the string_agg function.

This time we are successful. The values in the name column have been aggregated into a single value (as shown by the highlighted blocks above) and can now be included in the results. The diagram above also shows how the grouped column class is also essentially aggregated from many identical values to a single value.

Let’s entrench our understanding with another quick example. This time, we want to discover the total weight of each class of animals. This time, we’ll use the sum function.

SELECT class, sum(weight) FROM pets GROUP BY class;    class | sum 
— — — — -+ — — -
mammal | 93
reptile | 258
bird | 12
Multiple separate values aggregated with the sum function.

The HAVING Clause

Let’s dive a little deeper and say that we want to find out all of the species of animals of which there are more than one pet. For this we’re going to need to filter out all of the species groups which have one or less pets. The WHERE clause won’t work because we need a way to count the animals in each group and aggregate functions are not valid in a WHERE clause. Luckily, there is another clause which is meant for just this purpose, the HAVING clause. The HAVING clause filters groups by comparing them to a conditional statement. If the comparison returns true, the group is included in the results, otherwise the group is not included. Let’s try it out.

SELECT species, count(id) FROM pets GROUP BY species HAVING count(id) > 1; species | count 
— — — — -+ — — —
dog | 2
cat | 2
owl | 2

In the diagram above, we can see that rows are first grouped by species, then filtered by the HAVING clause. the HAVING clause selects only groups which have a count of more than 1 row. All others are eliminated from the results.

Functional Dependency

Before this article comes to a close, there is one more subject to explore, the concept of functional dependency. The PostgreSQL documentation has this to say:

When GROUP BY is present, or any aggregate functions are present, it is not valid for the SELECT list expressions to refer to ungrouped columns except within aggregate functions or when the ungrouped column is functionally dependent on the grouped columns, since there would otherwise be more than one possible value to return for an ungrouped column. A functional dependency exists if the grouped columns (or a subset thereof) are the primary key of the table containing the ungrouped column.

We have already explored the first condition in the above statement, namely that ungrouped columns cannot be included in the SELECT list except within aggregate functions. We have not yet looked at the second valid condition for an ungrouped column, that it may exist in the SELECT list if it is functionally dependent on one of the grouped columns. The above statement declares that ‘a functional dependency exists if one of the grouped columns is the primary key of the table containing the ungrouped column.’ That is likely not too hard to understand, but let’s make it even more clear with a quick example. Let’s select the names of the pets grouped by their id (the primary key of the table).

SELECT name FROM pets GROUP BY id;
name
— — — — — —
cinderella
oswald
charlie
polly
donut
tim
django
ivan
coco
olly
(10 rows)

That worked well. One thing that is important to note here is that the name column was included in the SELECT list without using an aggregate function or including it in the GROUP BY clause. This is possible because the name column (and all of the other columns in the table) is functionally dependent on the id column which was explicitly set as a primary key. Every primary key requires that it’s values be unique and not null, which means that if the table is grouped by these values, each group will be comprised of only one row as there cannot by definition be multiple rows in the table with the same value. If multiple tables are joined together, there can be multiple rows with the same primary key value, but all of the values in the original table will belong to only one primary key, which is another way to say that those values are functionally dependent on the primary key.

Let’s look at one final example before closing. This time we will try to select the id values of the pets table grouped by name. All of the names are different, so there should be only one id value per name. Will this work? Let’s try it out.

SELECT id FROM pets GROUP BY name;ERROR: column “pets.id” must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: SELECT id FROM pets GROUP BY name;
^

This doesn’t work. Even though we satisfy the condition that there can only be one value per group, the id column is not functionally dependent on the name column. The technical reason for this is that the name column is not the primary key of the table. Another reason is that the name column could have multiple values which are the same and then form groups of multiple rows. However, even if UNIQUE and NOT NULL constraints were set on the name column, it would not be the primary key and would not technically satisfy the definition of functional dependency.

Conclusion

In this article, I have tried to clarify some of the confusion that can arise when using the GROUP BY clause. The main thing to remember is that GROUP BY forms groups and each group can only contain a single value per column in the results. Grouped columns are automatically condensed into a single value, while ungrouped columns must either be aggregated into a single value or be functionally dependent on one of the grouped columns.

I hope this article enhanced your understanding of the GROUP BY statement and will help you to avoid some of those frustrating errors.

Thanks for reading and happy programming!

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Ian Austin

Ian Austin

Aspiring Software Engineer, currently studying at Launch School. “And in the end, the love you take, is equal to the love you make”