Using Pokémon data to practice the GROUP BY clause

Randy Macaraeg
Jul 15 · 4 min read
Photo by Jay on Unsplash

The code I used for this blog can be found on my GitHub.

As we already know how to do basic queries in SQL (and if you don’t or need a refresher, please read: “A Gentle Introduction to SQL Basics in Python”), we can start utilizing more and more tools that SQL has to offer.

The GROUP BY clause is a pretty useful tool in SQL. With it, we can drill down even further into our data and arrange identical data into groups using some functions.

If a column has the same values in different rows it would put those rows in an individual bucket.

There are three important points to using the GROUP BY clause:

  1. GROUP BY is used with the SELECT statement.
  2. In the query, GROUP BY is placed after the WHERE clause.
  3. In the query, GROUP BY is placed before ORDER BY (if it’s used).

Now that we have some rules in place, let’s set up the notebook!


Setting Up

We’ll be using Pokémon data from a Kaggle dataset for these examples.

While this is to put a fun spin on SQL, these examples could easily be applied to more business-like solutions, such as grouping by age groups, income levels, location, etc.

Let’s begin with importing the libraries we need and loading the CSV file in Python:

Next, we start to clean the data before moving it into an SQLite instance:

#find NaN values
nan_rows = df[df.isnull().T.any().T]
nan_rows.head()

This will find any null values and return some of them (if any).

As all of the null values exist in the type2 column, we’ll change all of these null values to ‘none’.

#change all Type 2 NaN values to 'None':
df['type2'] = df['type2'].fillna('none')

Now, as SQL is very specific about the strings within the table (upper case and lower case matters in the rows), we’ll set everything to lower case.

#change all strings within the dataframe to lower case
df = df.astype(str).apply(lambda x: x.str.lower())

Now we’re ready to set this as an SQL database!

#set the database for pokemon
df.to_sql('pokemon', con=cnx, if_exists='append', index=False)
#function for the SQL queries below
def sql_query(query):
return pd.read_sql(query, cnx)

Awesome, we can start running some SQL queries!


General Syntax for GROUP BY’s

The general syntax for a GROUP BY function is:

SELECT column_name(s), function_name(column_name)
FROM table_name
WHERE condition
GROUP BY column_name(s)
ORDER BY column_name(s);

function_name: SUM(), AVG(), MIN(), MAX(), COUNT().
table_name: name of the table. In this example, there is only the pokemon table
condition: condition used.

With this, we can reorganize and manipulate the data so that we can find better insights.


Simple GROUP BY’s

Say, we want to only find the name, type, and total stats of the legendary Pokémon with the highest (max) stats. We would start with a simple MAX() query:

This would output Mega Mewtwo X, a psychic/fighting Pokémon with an impressive total stat of 780!

What if we wanted to know the Pokémon with the highest max stats, but in each type1 category? This is where the GROUP BY clause comes in handy:

Now, instead of just a single Pokémon as the output (Mega Mewtwo X), there are 14 legendary Pokémon.

The SQL query finds all legendary Pokémon and groups them into individual buckets based on the type1 column.

It separates them into dark, dragon, electric, fairy, etc., then the query returns the name, type1, type2, and total stats of the Pokémon in each bucket.


GROUP BY and HAVING Clause

We use the WHERE clause to place conditions on columns, but what about placing conditions on groups? Introducing the HAVING clause!

The WHERE keyword can’t be used with aggregate functions, so we use the HAVING clause with GROUP BY's.

We can use the HAVING clause to enter conditions to determine which group will be part of the final result.

Also, aggregate functions cannot be affected by the WHERE clause. Instead, the HAVING clause is used with aggregate functions if we want to apply conditions.

The general syntax for a HAVING clause

SELECT column_name(s)
FROM table_name
WHERE condition
GROUP BY column_name(s)
HAVING condition
ORDER BY column_name(s);

GROUP BY With HAVING Clause Example

Imagine that we wanted to find the count of individual Pokemon, the type1, the minimum and maximum total stats, and average HP of all Pokémon grouped by type1, and only include the groups of Pokémon whose summed HP is higher than 4000:

This helps us determine which groups of Pokémon have the highest amount of HP in their class, whilst cutting out the groups that are lower than the 4000 cap we set.

If we wanted high HP Pokémon to choose from, we’d want to go with those of the normal type1 as they have the highest average HP (77.28) and can definitely take a hit or two.

The HAVING clause really helps to thin out our data to more useful and insightful stuff!


Conclusion

Writing this blog helped me understand how to utilize GROUP BY’s and I hope it helps you as well! Stay tuned for the next SQL blog coming soon!

Better Programming

Advice for programmers.

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade