## Using Pokémon data to practice the GROUP BY clause

Jul 15 · 4 min read

The code I used for this blog can be found on my GitHub.

As we already know how to do basic queries in SQL (and if you don’t or need a refresher, please read: “A Gentle Introduction to SQL Basics in Python”), we can start utilizing more and more tools that SQL has to offer.

The `GROUP BY` clause is a pretty useful tool in SQL. With it, we can drill down even further into our data and arrange identical data into groups using some functions.

If a column has the same values in different rows it would put those rows in an individual bucket.

There are three important points to using the `GROUP BY` clause:

1. `GROUP BY` is used with the `SELECT` statement.
2. In the query, `GROUP BY` is placed after the `WHERE` clause.
3. In the query, `GROUP BY` is placed before `ORDER BY` (if it’s used).

Now that we have some rules in place, let’s set up the notebook!

# Setting Up

We’ll be using Pokémon data from a Kaggle dataset for these examples.

While this is to put a fun spin on SQL, these examples could easily be applied to more business-like solutions, such as grouping by age groups, income levels, location, etc.

Let’s begin with importing the libraries we need and loading the CSV file in Python:

Next, we start to clean the data before moving it into an SQLite instance:

`#find NaN valuesnan_rows = df[df.isnull().T.any().T]nan_rows.head()`

This will find any null values and return some of them (if any).

As all of the null values exist in the type2 column, we’ll change all of these null values to `‘none’`.

`#change all Type 2 NaN values to 'None':df['type2'] = df['type2'].fillna('none')`

Now, as SQL is very specific about the strings within the table (upper case and lower case matters in the rows), we’ll set everything to lower case.

`#change all strings within the dataframe to lower casedf = df.astype(str).apply(lambda x: x.str.lower())`

Now we’re ready to set this as an SQL database!

`#set the database for pokemondf.to_sql('pokemon', con=cnx, if_exists='append', index=False)#function for the SQL queries belowdef sql_query(query):    return pd.read_sql(query, cnx)`

Awesome, we can start running some SQL queries!

# General Syntax for GROUP BY’s

The general syntax for a GROUP BY function is:

`SELECT column_name(s), function_name(column_name)FROM table_nameWHERE conditionGROUP BY column_name(s)ORDER BY column_name(s);function_name: SUM(), AVG(), MIN(), MAX(), COUNT().table_name: name of the table. In this example, there is only the pokemon tablecondition: condition used.`

With this, we can reorganize and manipulate the data so that we can find better insights.

# Simple GROUP BY’s

Say, we want to only find the name, type, and total stats of the legendary Pokémon with the highest (max) stats. We would start with a simple `MAX()` query:

This would output Mega Mewtwo X, a psychic/fighting Pokémon with an impressive total stat of 780!

What if we wanted to know the Pokémon with the highest max stats, but in each type1 category? This is where the `GROUP BY` clause comes in handy:

Now, instead of just a single Pokémon as the output (Mega Mewtwo X), there are 14 legendary Pokémon.

The SQL query finds all legendary Pokémon and groups them into individual buckets based on the type1 column.

It separates them into dark, dragon, electric, fairy, etc., then the query returns the name, type1, type2, and total stats of the Pokémon in each bucket.

# GROUP BY and HAVING Clause

We use the `WHERE` clause to place conditions on columns, but what about placing conditions on groups? Introducing the `HAVING` clause!

The `WHERE` keyword can’t be used with aggregate functions, so we use the `HAVING` clause with `GROUP BY`'s.

We can use the `HAVING` clause to enter conditions to determine which group will be part of the final result.

Also, aggregate functions cannot be affected by the `WHERE` clause. Instead, the `HAVING` clause is used with aggregate functions if we want to apply conditions.

## The general syntax for a HAVING clause

`SELECT column_name(s)FROM table_nameWHERE conditionGROUP BY column_name(s)HAVING conditionORDER BY column_name(s);`

# GROUP BY With HAVING Clause Example

Imagine that we wanted to find the count of individual Pokemon, the type1, the minimum and maximum total stats, and average HP of all Pokémon grouped by type1, and only include the groups of Pokémon whose summed HP is higher than 4000:

This helps us determine which groups of Pokémon have the highest amount of HP in their class, whilst cutting out the groups that are lower than the 4000 cap we set.

If we wanted high HP Pokémon to choose from, we’d want to go with those of the normal type1 as they have the highest average HP (77.28) and can definitely take a hit or two.

The `HAVING` clause really helps to thin out our data to more useful and insightful stuff!

# Conclusion

Writing this blog helped me understand how to utilize `GROUP BY`’s and I hope it helps you as well! Stay tuned for the next SQL blog coming soon!