Introduction to SQL Using Python: Computing Statistics & Aggregating Data

Erika D
Analytics Vidhya
Published in
13 min readNov 3, 2019

In my last blog, I discussed how to filer data in SQL using the WHERE statement. This blog is a tutorial on how to compute statistics and aggregate data with SQL. We will discuss the following:

  • COUNT Function
  • AS Command
  • SELECT DISTINCT
  • MAX & MIN Functions
  • AVG & SUM Functions
  • Using the WHERE Statement with Computed Statistics
  • ORDER BY Statement
  • GROUP BY Statement
  • HAVING Statement

In this tutorial we will continue working with the same database from my last blog, which can be downloaded here. We will be using the Matches table and the Teams table from this database. If you haven’t seen my last blog post, I recommend reading it before this tutorial. It covers filtering data in SQL with the WHERE statement and also reviews the various columns in the Matches table. The columns in the Teams table are described as the follows from https://www.kaggle.com/laudanum/footballdelphi:

  • Season (str): Football season for which the data is valid
  • TeamName (str): Name of the team the data concerns
  • KaderHome (str): Number of Players in the squad
  • AvgAgeHome (str): Average age of players
  • ForeignPlayersHome (str): Number of foreign players (non-German, non-English respectively) playing for the team
  • OverallMarketValueHome (str): Overall market value of the team pre-season in EUR (based on data from transfermarkt.de)
  • AvgMarketValueHome (str): Average market value (per player) of the team pre-season in EUR (based on data from transfermarkt.de)
  • StadiumCapacity (str): Maximum stadium capacity of the team’s home stadium

To begin, we will download the necessary libraries, sqlite3 and pandas.

Import necessary libraries

Next, you will need to connect to the database and create a cursor object.

Connect to database & create cursor object

The following is format we will be using to run our SQL queries in Python.

Format for SQL queries in Python

COUNT Function

The first function we will look at is the COUNT function. Previously we have either put a single asterisk in the SELECT statement or we had listed the names of the columns we wanted to be returned. The COUNT function can also be listed in the SELECT statement and will count the number of rows specified. The query below uses the function COUNT(*) in the SELECT statement. COUNT(*) will count all the rows from the table specified. The below query counts all the rows in the Teams dataset.

SELECT COUNT(*) FROM Teams;

There are total of 468 rows included in the Teams dataset. We can also insert a column name instead of an asterisk between the parentheses of the COUNT function, for example, COUNT(Season). The function will then count all the rows in the Season column where there is a non-null value. If a value of Season is null for any row, then that row will not be counted. If there are no null values in the Season column then the following function will return 468, the same number as the total number of rows in the Teams dataset.

SELECT COUNT(Season) FROM Teams;

Great, the number of rows with non-null values is 468, the same as the total amount of rows included in the Teams dataset. That means there are no null values. To practice using the COUNT function, count the number of rows in the Matches dataset and compare your query to the one below:

SELECT COUNT(*) FROM Matches;

There should be 24,625 rows in the Matches dataset. Check to see if there are any missing values in the column Season from the Matches dataset and compare your query to the one below:

SELECT COUNT(Season) FROM Matches;

There are 24,625 rows with non-null values in the Season column in the Matches dataset. Since this is the same as the total number of rows in the Matches dataset, there are no missing/null values in the Season column.

AS Command

When we get our returned results using the COUNT function, the name of our returned column is COUNT(Season). This column name isn’t very clear. This would be a good time to rename the column. We can rename a column using the AS command. The AS command is used to rename a column or even table with an alias. To see how the AS command is used view the query below:

SELECT COUNT(Season) AS Num_of_NonNull_Season_Rows FROM Matches;

We used the query you had just practiced on your own. The difference is that the column COUNT(Season) was renamed to the alias, Num_of_NonNull_Season_Rows using the AS command. To practice using the AS command on your own, write a query that counts all the rows where TeamName is not a null value from the Teams dataset, rename the column to Num of_NonNull_TeamName_Rows. Compare your query to the one below:

SELECT COUNT(TeamName) AS Num_of_NonNull_TeamName_Rows FROM Teams;

SELECT DISTINCT

With the COUNT function, we were able to see how many rows were listed in both the Teams and Matches datasets. We also saw the number of rows that did not have null values in the Seasons column for both datasets. What if we want to know the distinct, unique values in the Seasons column in the Teams dataset. We can’t use COUNT(Season), because that just gives us the number of rows that have a specified value in the Season column. Instead we will use the DISTINCT clause in the SELECT statement. The below query will show us all of unique values that are included in the Season column in the Teams dataset.

SELECT DISTINCT Season FROM Teams;

Practice selecting all the unique AwayTeams there are in the Matches dataset and compare your query to the one below:

SELECT DISTINCT AwayTeam FROM Matches;

Using COUNT & DISTINCT Together

We just saw how to get all the unique AwayTeam names from the Matches dataset above. What happens when we want to know how many unique values are returned? To get the count of the unique AwayTeam names, we can use the COUNT function together with the DISTINCT clause. The below query will find all the distinct AwayTeam names in the Matches dataset and then return the count or number of rows that are included in distinct AwayTeam names.

SELECT COUNT (DISTINCT AwayTeam) From Matches;

To practice using the COUNT function and DISTINCT clause together, query the number of distinct Seasons included in the Teams dataset and rename the column to Num_of_Seasons. Compare your query to the one below:

SELECT COUNT(DISTINCT Season) AS Num_of_Seasons FROM Teams;

MAX & MIN Functions

We will continue to look at how data can be computed in the SELECT statement. We will now look at how to use the MAX function and MIN function. The MAX function will return the largest value from a specified column and the MIN function will return the smallest value from a specified column. Perhaps we want to know what is the largest stadium capacity that is included in the Teams dataset. To do this, we can use the query below:

SELECT MAX(StadiumCapacity) AS Largest_StadiumCapacity FROM Teams;

In the example above, 81,359 was the largest value in the StadiumCapacity column in the Teams dataset. To see what the smallest value is in the StadiumCapacity column, we will use the MIN function.

SELECT MIN(StadiumCapacity) AS Smallest_StadiumCapacity FROM Teams;

We can see that the smallest stadium capacity included in the Teams dataset is 15,000. To practice using the MAX and MIN functions, query, the largest number of players in a squad, KaderHome, and rename the column to Max_Players. In the same query, return the smallest number of players in a squad and rename that column to Min_Players and compare your query to the one below:

SELECT MAX(KaderHome) AS Max_Players, MIN(KaderHome) AS Min_Players FROM Teams;

AVG Function & SUM Function

Now we will look at two more functions that can be used in the SELECT statement. The AVG function will return the average value of a specified, numeric column. The SUM function will return the sum of a specified, numeric column. We will continue to look at the StadiumCapacity column in the Teams dataset. The below query shows the average stadium capacity in the Teams dataset.

SELECT AVG(StadiumCapacity) AS Average_StadiumCapacity FROM Teams;

There are a lot of numbers following the decimal. If we want to round to the hundredth place we can add a ROUND clause around our AVG function, like the query below:

SELECT ROUND(AVG(StadiumCapacity), 2) AS Average_StadiumCapacity FROM Teams;

A quick note on the ROUND function. The format is as follows: ROUND(number to be rounded, number of decimal places to round number to). The number we wanted to round was the AVG(StadiumCapacity) and we wanted to round to the nearest hundredth, or 2__ numbers following the decimal, therefore the format we used was __ROUND(AVG(StadiumCapacity), 2). To read more about the ROUND function visit, https://www.w3schools.com/sql/func_sqlserver_round.asp.

The SUM function will return the sum of a specified column. For example, if we wanted to see the total number of home goals scored in the Matches dataset, we could use the following query:

SELECT SUM(FTHG) AS Total_HomeGoals FROM Matches;

To practice using the AVG and SUM functions on your own, query from the Teams dataset the average overall market value of the team pre-season in EUR, OverallMarketValueHome, and round the value to the nearest hundredth and then label that column, Avg_Market_Val. In the same query, return the sum of the average age of players, AvgAgeHome, and rename that column to Avg_Age. Compare your query to the one below:

SELECT ROUND(AVG(OverallMarketValueHome), 2) AS Avg_Market_Val, SUM(AvgAgeHome) AS Avg_Age FROM Teams;

Using the WHERE Statement with Computed Statistics

If you did not read my last blog post about filtering data in the WHERE statement, now may be a good time to take a look. We are now going to start computing statistics for filtered data.

What if you wanted to know what how many goals Man United scored during home games on average. To get this answer, we could use the following query:

SELECT ROUND(AVG(FTHG), 2) AS Avg_HomeGoals FROM Matches WHERE HomeTeam = ‘Man United’;

By adding the WHERE statement, we are telling our query that we only want to filter data that has a HomeTeam of ‘Man United’. To understand this better, we will briefly discuss the SQL order of operations in relation to this query. Although the SELECT statement is the first statement in our SQL query, the first statement to be executed is the FROM statement. Our FROM statement above tells SQL that we are using the Matches dataset as our table. The next statement executed is the WHERE statement. The WHERE statement filters our dataset to include only the information we want, in this case, we only wanted the rows where HomeTeam = ‘Man United’ to be included. Lastly, the SELECT statement is executed, it returns the data in the form we specify. In this case, we want to return the rounded average value of home goals scored from the data the SQL query filtered through earlier during the query execution. Practice aggregating and using the WHERE statement by writing a query that shows the sum of squad players for all teams during the 2014 season from the Teams dataset. Rename the sum of squad players column to Total_Soccer_Players. When you are done compare your query to the one below:

SELECT SUM(KaderHome) AS Total_Soccer_Players FROM Teams WHERE Season = 2014;

ORDER BY Statement

The ORDER BY statement is used to sort returned data in either ascending or descending order given a specified column. Without an ORDER BY statement, the data will not be sorted in any particular order. If you look at the query below, the distinct average age of players for teams during the 2015 season is not returned in any specific order:

SELECT DISTINCT AvgAgeHome FROM Teams WHERE Season = 2015;

We will now add an ORDER statement to the same query. The returned results will now be in ascending order, with the lowest distinct average age of players at the top of the returned results and the highest distinct average age of players at the bottom of our returned results.

SELECT DISTINCT AvgAgeHome FROM Teams WHERE Season = 2015 ORDER BY AvgAgeHome;

We can also return the results of this query in DESCENDING order by AvgAgeHome. The following query is the same query as the one above but this one includes a DESC statement after the ORDER BY AvgAgeHome statement. This will sort the returned results in DESCENDING order. The highest distinct average age of players on a team will be returned at the top of the results and the lowest distinct average age of players will be returned at the bottom of the returned results. To see the DESC clause in action, view the query below:

SELECT DISTINCT AvgAgeHome FROM Teams WHERE Season = 2015 ORDER BY AvgAgeHome DESC;

Our results were now returned in DESCENDING order by the distinct average age of players on teams during the 2015 season. If we do not put the DESC clause at the end of an ORDER BY statement, the results will be returned in ASCENDING order by default. You can put ASC at the end of an ORDER BY statement to return results in ascending order but it is not necessary. Practice using the ORDER BY statement by writing a query that returns the HomeTeam, FTHG (number of home goals scored in a game) and FTAG (number of away goals scored in a game) from the Matches table. Only include data from the 2010 season and where Aachen is the name of the home team. Return the results first by the number of home goals scored in a game in descending order. For results that share the same number of home goals scored, order those results by the number of away goals scored in a game in ascending order. Compare your query with the one below:

SELECT HomeTeam, FTHG, FTAG FROM Matches WHERE HomeTeam = ‘Aachen’ AND Season = 2010 ORDER BY FTHG DESC, FTAG ASC;

GROUP BY Statement

Finally we have reached the GROUP BY function. The GROUP BY function will group together the rows that have matching values in a specified column. To understand how the GROUP BY statement works, we will discuss the query below:

SELECT HomeTeam, COUNT(FTR) AS Total_Home_Wins FROM Matches WHERE FTR = ‘H’ AND Season = ‘2016’ GROUP BY HomeTeam ORDER BY COUNT(FTR) DESC;

This query returns the total number of home games each team won during the 2016 season in descending order of number of home games won. There are two variables in our SELECT statement, HomeTeam and COUNT(FTR). The first variable is an unaggregated column from the Matches dataset, the second variable is an aggregated column from the Matches dataset. The WHERE statement says to only use data where FTR = ‘H’ (outcome of a game is a home win) and Season = 2016. The GROUP BY groups the rows together by HomeTeam. The GROUP BY statement tells the query to combine all the rows with the same home team name together. The result is that each home team is grouped together and every time they won a home game, it was added to the total on the left for the final result.

GROUP BY statements are typically used with aggregate functions like the ones we covered earlier, COUNT, MAX, MIN, AVG, and SUM_. Try practicing using a __GROUP BY statement by writing a query that shows the average number of foreign players for each team from the Teams dataset. Rename the column of average number of foreign players to Avg_Num_Foreign_Players and round the column to the nearest hundredth. Return your results by average number of foreign players in ascending order. Compare your query to the one below:

ELECT TeamName, ROUND(AVG(ForeignPlayersHome),2) AS Avg_Num_Foreign_Players FROM Teams GROUP BY TeamName ORDER BY ROUND(AVG(ForeignPlayersHome),2) ASC;

HAVING Statement

The HAVING statement filters groups. It is always used right after the GROUP BY statement. It is like the WHERE statement but is only used to filter aggregated rows. To see how it works, we will run the same query we did before but this time we only want to show the teams where the average number of foreign players is greater than or equal to 15.

SELECT TeamName, ROUND(AVG(ForeignPlayersHome),2) AS Avg_Num_Foreign_Players FROM Teams GROUP BY TeamName HAVING ROUND(AVG(ForeignPlayersHome),2) >= 15 ORDER BY ROUND(AVG(ForeignPlayersHome),2) ASC;

Try practicing using the HAVING statement to query the sum of the average market value (per player) of the team pre-season in EUR (AvgMarketValueHome) for each season where the sum of the average market value (per player) of the team pre-season in EUR is is equal to or less than 65,000,000. Compare your query to the one below:

SELECT Season, SUM(AvgMarketValueHome) FROM Teams GROUP BY Season HAVING SUM(AvgMarketValueHome) <= 65000000;

We have reached the end of this tutorial. We have covered the following topics:

  • COUNT Function
  • AS Command
  • SELECT DISTINCT
  • MAX & MIN Functions
  • AVG & SUM Functions
  • Using the WHERE Statement with Computed Statistics
  • ORDER BY Statement
  • GROUP BY Statement
  • HAVING Statement

I encourage you to keep playing around with the functions and statements we have covered in this tutorial to gain a deeper understanding of how they work.

--

--