In my last post, I mentioned reshaping and pivoting in Pandas library. Categorizing the data set and applying a function for each group is one of the important stages of data analysis. After making the data set suitable for analysis, calculations are needed to calculate and tabulate or visualize group statistics. Pandas has the groupby for these operations.
In summary, I am going to talk about as follows topics;
- What is grouby and how to use it?
- How to iterate on groups?
- How are operations performed by selecting sub-columns?
- How are grouping operations done with Dictionary and Series?
- How are functions applied to groups?
- How to do group operations with hierarchical indexing?
- How are the described operations applied to the real data set?
Before starting the topic, our Medium page includes posts on data science, artificial intelligence, machine learning, and deep learning. Please don’t forget to follow us on Medium 🌱 to see these posts and the latest posts.
Let’s get started.
What is GroupBy?
First, let’s explain how the groupby mechanism works. Suppose you have a key column in the data set and you want to group this column. To do this;
1- The data set is divided into groups according to the categories in the key column.
2- One function is applied for each group. For example, the mean of each group is calculated.
3- The results of the applied functions are combined and a new table is created.
Let’s create a data set in the form of a table to show these stages. First, let’s import Pandas and Numpy libraries.
Now let’s create a data set named df.
In this data set, key1 and key2 columns are key columns. These key columns consist of categories. For example, key1 consists of two categories, a and b. key2 consists of three categories one two three. Let’s want to calculate the mean of data1 for the categories of key1. There are a number of ways to do this. One of them reaches data1 and then we write key1 column into groupby. Then let’s assign this command to the group variable.
When we print the group object on the screen, there is no result. Currently, we only created a group object.
We can apply various functions to this object with group information. For example, we can use groupby’s mean () method to calculate the mean of groups.
Notice that the key1 column was divided into categories a and b, and then the average of each category was found. The result found was translated in Series structure with key1 index. If we want, we can group them according to two key columns.
Here, the results are presented in a hierarchical manner with key1 and key2 indices. If we want to see these results in a table, we use the unstack() method.
We have done various calculations for data1 so far, if we want, we can make calculations for data1 and data2 together. For example, let’s calculate the mean of data1 and data2 for the categories of key1.
We came across means for the categories of key1 in data1 and data2. If we want, we can calculate the means according to both keys.
Iterating over Groups
We can iterate on groupby. For example, let’s create a for loop.
We may want to see group data for both keys. Let’s be careful, the first item I will write key values for here should be a tuple.
We operate according to the groups of key columns in the super data set. Can we break the data set according to any group in the key column? We can do this again using groupby.
We can take the part we want from the data set that has been translated into dictionary structure. For example, let’s print group a on the screen.
Selecting a Column or Subset of Columns
When making calculations for grouped data, we can select the column we want. For example, let’s find the averages for the data1 column into the groups of key1 and key2.
Grouping with Dicts and Series
Groupings have so far been composed of array. Groupings can also be made up of a dictionary or Series. Let’s create a data set now.
Now suppose we want to group the columns. Let’s map the columns in the data set using the dictionary data structure.
Now let’s create groups using this tag variable. By default, grouping is done according to the lines. We use the axis = 1 parameter since we will group by columns here.
The match was for yellow and green. The label variable also has a value of e: purple. This value was not received where there is no value matching e. Let’s find the sum of these groups.
Thus, the columns were grouped by yellow and green, and the group totals for each row were found. We can do the same for Series data structures. First, let’s convert the label variable to Series structure.
Now let’s create groups with variable s and calculate group numbers for each row.
Grouping with Functions
Using Python functions is a more productive way than mapping with Series and dictionary. We can call any function by typing it in groupby as a key. Returning values will be group names. Now let’s find the sum of the fruits in the fruit data set by the number of letters.
Notice that each fruit was grouped according to the number of letters and the totals of the grouped data were found. For example, there is a three-letter fruit and this has become a group. There are two fruits with four letters, they are grouped together. There is a cherry fruit with five letters. This became a group. Total values were found according to these groups.
Grouping by Index Levels
Finally, let’s examine how to group with hierarchical indexes. For this, let’s create a data set with hierarchical indexes.
Now let’s name the column indexes.
Note that the columns are made up of hierarchical indexes. Now let’s group the data by letter index. Since we will group by columns, we use the axis = 1 parameter.
Application with Real Data Set
Let’s show what we learned on a real data set. First of all, let’s import the data set showing the worldwide game sales since 1980.
You can access this data set from here. Let’s see the first five rows of the data set.
Next, let’s look at the structure of the variables in the data set.
We can find summary statistics for numeric variables in this data set with the describe () method. While finding summary statistics, I also use the dropna () method to remove rows with missing data. The blog post about missing data can be found here.
These summary statistics are a good indicator to understand the general characteristics of the data set. Video games, for example, go back to 1980. Finally, we see that it is the game of 2020. Let’s want to find the means for the Global_Sales column.
Okay, what do we do if we want to see global sales by species? We use the groupby method for this. Now let’s create a group object according to the genre, the game type.
We have now created the group object by type. Let’s want to see the number of global sales by type.
Let’s want to see summary statistics according to global sales.
Summary statistics came up as a table. Let’s filter to see the average by just one type.
We had created groups by genre. Let’s want to find the average of all numeric type columns by genre.
Let’s want to see a bar chart of global sales means by genre. First, let’s use the %matplotlib inline magic command to see the graph between lines.
Let’s plot the graph.
If you pay attention, the average of the platform games is the highest and the adventure games are the least sold. Let’s see the means of the game sales in America, Europe and Japan by genre as a bar chart.
Notice that America is shown in blue, Europe in orange, and Japan in green. In average sales, we see that the most sales are in the USA according to the types and the least in Japan in general. It can be used in different methods. I have shown some methods to give an idea.
That’s it. I hope you enjoy this post. You can access the notebook I used for this post on our GitHub page. 🚩
If you haven’t read it, I strongly recommend you to read the following articles about Pandas. 👇👇👇
- Introduction to the Pandas
- Indexing-Selection-Filtering in Pandas
- Important Methods in Pandas
- Arithmetic Operations in Pandas
- Sorting and Ranking in Pandas
- Summarizing And Computing Descriptive Statistics in Pandas
- Reading and Writing Data in Pandas
- How to Fix Missing Data in Pandas
- Data Transformation in Pandas
- Hierarchical Indexing in Pandas
- Combining And Merging Datasets in Pandas
- Reshaping And Pivoting in Pandas
- Groupby in Pandas
- Working With Groupby in Pandas
- Pivot Tables in Pandas
- Categorical Data in Pandas
- Working With Text Data in Pandas
- Practical Data Analysis with Pandas
- Multiple Selecting-Filtering in Pandas