Groupby in Pandas
Grouping a dataset and applying a function to each group is one of the most important steps in data analysis. I’ll cover the following topics:
- What is groupby and how does it work?
- Iterating over groups
- Grouping on a column or subset of columns
- Grouping with dictionaries and series
- How to apply a function to grouped data?
- How to group with hierarchical indexing?
- Practice with a real dataset
Let’s get started.
What is groupby?
The groupby method is used to arrange rows that share the same key value into groups. A groupby operation involves the following steps:
- The dataset is divided into groups using the key column.
- A function is applied to each group.
- The results of the applied functions are combined and a new table is created.
Let’s create a dataset to show these steps. First, let’s import Pandas and Numpy.
Now, let’s create a dataset named df.
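The original listing is not shown here, so the following is only a minimal sketch of such a dataset. The values are made up for illustration, chosen so that the group means come out as round numbers.

```python
import numpy as np
import pandas as pd

# Illustrative values only (an assumption, not the post's original data)
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a", "b"],
    "key2": ["one", "two", "three", "one", "two", "three"],
    "data1": np.array([1.0, 2.0, 3.0, 4.0, 6.0, 8.0]),
    "data2": np.array([10.0, 20.0, 30.0, 40.0, 60.0, 80.0]),
})
print(df)
```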
In this dataset, key1 and key2 are the key columns, and each holds categorical values. For example, key1 consists of two categories, a and b, and key2 consists of three categories, one, two, and three. Let’s calculate the mean of data1 for the categories of key1.
Let’s take a look at this object.
No result is shown yet because we have only created a GroupBy object; nothing is computed until a function is applied to it. Let’s apply various functions to this object. For example, you can use the mean method to calculate the mean of each group.
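Here is a small sketch of these two steps, using a made-up df with the same column names (the values are my assumption). Printing the grouped object only shows its type; the means appear once mean is called.

```python
import pandas as pd

# Illustrative values only (an assumption, not the post's original data)
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a", "b"],
    "key2": ["one", "two", "three", "one", "two", "three"],
    "data1": [1.0, 2.0, 3.0, 4.0, 6.0, 8.0],
    "data2": [10.0, 20.0, 30.0, 40.0, 60.0, 80.0],
})

# Group the data1 column by the categories of key1
grouped = df["data1"].groupby(df["key1"])
print(grouped)  # shows a SeriesGroupBy object, not a table of results

# Applying a function triggers the actual computation
means = grouped.mean()
print(means)
```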
Here you go. Notice that the key1 column was divided into categories a and b, and then the mean of each category was found. You can also group by two key columns.
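Grouping by two keys could look like the sketch below, again with made-up values. The result is a series with a two-level index, one level per key.

```python
import pandas as pd

# Illustrative values only (an assumption, not the post's original data)
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a", "b"],
    "key2": ["one", "two", "three", "one", "two", "three"],
    "data1": [1.0, 2.0, 3.0, 4.0, 6.0, 8.0],
    "data2": [10.0, 20.0, 30.0, 40.0, 60.0, 80.0],
})

# Pass a list of key columns to group by both of them
means = df["data1"].groupby([df["key1"], df["key2"]]).mean()
print(means)
```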
You can use the unstack method to see these results as a table, with one key on the rows and the other on the columns.
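A sketch of unstack on the two-key result, with made-up values. Combinations that never occur in the data show up as NaN in the table.

```python
import pandas as pd

# Illustrative values only (an assumption, not the post's original data)
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a", "b"],
    "key2": ["one", "two", "three", "one", "two", "three"],
    "data1": [1.0, 2.0, 3.0, 4.0, 6.0, 8.0],
    "data2": [10.0, 20.0, 30.0, 40.0, 60.0, 80.0],
})

means = df["data1"].groupby([df["key1"], df["key2"]]).mean()

# Move the inner index level (key2) to the columns
table = means.unstack()
print(table)
```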
You can also make calculations for data1 and data2 together. Let me calculate the mean of data1 and data2 for the categories of key1.
You can also calculate the means using both keys.
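Both calculations might look like this on a made-up df. I select the numeric columns explicitly, because recent pandas versions raise an error if mean is asked to average a text column such as key2.

```python
import pandas as pd

# Illustrative values only (an assumption, not the post's original data)
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a", "b"],
    "key2": ["one", "two", "three", "one", "two", "three"],
    "data1": [1.0, 2.0, 3.0, 4.0, 6.0, 8.0],
    "data2": [10.0, 20.0, 30.0, 40.0, 60.0, 80.0],
})

# Mean of data1 and data2 for the categories of key1
by_key1 = df.groupby("key1")[["data1", "data2"]].mean()
print(by_key1)

# Mean of data1 and data2 for every (key1, key2) combination
by_both = df.groupby(["key1", "key2"])[["data1", "data2"]].mean()
print(by_both)
```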
Iterating over Groups
You can iterate over a GroupBy object. Each iteration yields the group name and the rows that belong to that group. For example, let’s create a for loop.
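A minimal sketch of such a loop, on a made-up df:

```python
import pandas as pd

# Illustrative values only (an assumption, not the post's original data)
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a", "b"],
    "key2": ["one", "two", "three", "one", "two", "three"],
    "data1": [1.0, 2.0, 3.0, 4.0, 6.0, 8.0],
    "data2": [10.0, 20.0, 30.0, 40.0, 60.0, 80.0],
})

# Each step gives the group name and a DataFrame with that group's rows
for name, group in df.groupby("key1"):
    print(name)
    print(group)
```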
You can also iterate over groups formed from both keys. In that case, the group name is a tuple of key values.
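With two keys, the loop could be written as below (made-up values again). The tuple is unpacked directly in the for statement.

```python
import pandas as pd

# Illustrative values only (an assumption, not the post's original data)
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a", "b"],
    "key2": ["one", "two", "three", "one", "two", "three"],
    "data1": [1.0, 2.0, 3.0, 4.0, 6.0, 8.0],
    "data2": [10.0, 20.0, 30.0, 40.0, 60.0, 80.0],
})

# The group name is a (key1, key2) tuple
for (k1, k2), group in df.groupby(["key1", "key2"]):
    print((k1, k2))
    print(group)
```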
You can also split the dataset into separate pieces, one for each group in the key column. Let me show you this.
Now, let’s take a look at the group named a.
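One way to do this, sketched on a made-up df, is to turn the groups into a dictionary. The name pieces is my own choice here. The get_group method gives the same rows without building the dictionary.

```python
import pandas as pd

# Illustrative values only (an assumption, not the post's original data)
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a", "b"],
    "key2": ["one", "two", "three", "one", "two", "three"],
    "data1": [1.0, 2.0, 3.0, 4.0, 6.0, 8.0],
    "data2": [10.0, 20.0, 30.0, 40.0, 60.0, 80.0],
})

# Map each group name to the DataFrame holding that group's rows
pieces = dict(list(df.groupby("key1")))
group_a = pieces["a"]
print(group_a)

# Alternative: ask the GroupBy object for a single group
same_group = df.groupby("key1").get_group("a")
```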
Selecting a Column or Subset of Columns
When making calculations for grouped data, you can select a column. For example, let’s find the means for the data1 column using key1 and key2.
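A sketch of column selection on a made-up df. Indexing the GroupBy object with a single column name returns a series, while a list of names returns a DataFrame.

```python
import pandas as pd

# Illustrative values only (an assumption, not the post's original data)
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a", "b"],
    "key2": ["one", "two", "three", "one", "two", "three"],
    "data1": [1.0, 2.0, 3.0, 4.0, 6.0, 8.0],
    "data2": [10.0, 20.0, 30.0, 40.0, 60.0, 80.0],
})

# Single column name: the result is a Series
means = df.groupby(["key1", "key2"])["data1"].mean()
print(means)

# List of column names: the result is a DataFrame
means_frame = df.groupby(["key1", "key2"])[["data1"]].mean()
print(means_frame)
```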
Grouping with Dicts and Series
You can also define the groups with a dictionary or a series. To show this, let’s create a dataset.
Let’s map the columns in the dataset using the dictionary.
Now, let’s create groups using this label variable. By default, groupby groups along the rows (axis=0). To group the columns instead, you pass the axis=1 parameter.
Let’s find the sum of these groups.
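The sketch below shows the idea on a hypothetical dataset: the fruit names, column letters, and color labels are all my assumptions, since the original listing is not shown. Note that pandas 2.1 and later deprecate axis=1 in groupby, so this version groups the transposed frame and transposes the result back, which gives the same table.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset: fruit names on the rows, letter columns
fruit_df = pd.DataFrame(
    np.arange(20).reshape(4, 5),
    columns=["a", "b", "c", "d", "e"],
    index=["kiwi", "pear", "apple", "mango"],
)

# Hypothetical mapping from column name to group label
label = {"a": "red", "b": "red", "c": "blue", "d": "blue", "e": "red"}

# Group the columns by label: transpose, group the rows, transpose back
by_column = fruit_df.T.groupby(label)
sums = by_column.sum().T
print(sums)
```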
You can do the same with a series. To do this, let’s first convert the label variable to a series.
Now, let’s create groups with the variable s and count the values in each group for every row.
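Here is how that might look, using the same hypothetical fruit dataset and labels (both are my assumptions). As in the dictionary case, I group the transposed frame rather than passing axis=1.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset: fruit names on the rows, letter columns
fruit_df = pd.DataFrame(
    np.arange(20).reshape(4, 5),
    columns=["a", "b", "c", "d", "e"],
    index=["kiwi", "pear", "apple", "mango"],
)
label = {"a": "red", "b": "red", "c": "blue", "d": "blue", "e": "red"}

# Convert the dictionary to a Series; its index holds the column names
s = pd.Series(label)

# Count how many columns fall into each group, for every row
counts = fruit_df.T.groupby(s).count().T
print(counts)
```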
Grouping with Functions
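Using a Python function is a more flexible way to define groups than mapping with a dictionary or a series. The function is called once for each index value, and its return values are used as the group names. Now, let’s find the sum of the fruits.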
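As a sketch, the built-in len function can group the rows by the length of their index label. The fruit dataset is hypothetical, and grouping by name length is my assumption about what the original example did.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset: fruit names on the rows, letter columns
fruit_df = pd.DataFrame(
    np.arange(20).reshape(4, 5),
    columns=["a", "b", "c", "d", "e"],
    index=["kiwi", "pear", "apple", "mango"],
)

# len is called on each index label, so rows are grouped by name length
sums = fruit_df.groupby(len).sum()
print(sums)
```

Here kiwi and pear land in group 4, while apple and mango land in group 5.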
Grouping by Index Levels
To show how to work with hierarchical indexes, let’s create a dataset.
Now, let’s name the column indexes.
Now, let’s group the data by letter index.
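The sketch below walks through these steps on a hypothetical dataset. The level names letter and number, the values, and the choice of count as the aggregation are my assumptions. Because the hierarchical index sits on the columns, I group the transposed frame by level and transpose the result back.

```python
import numpy as np
import pandas as pd

# Hypothetical two-level column index: a letter and a number per column
columns = pd.MultiIndex.from_arrays(
    [["A", "A", "A", "B", "B"], [1, 2, 3, 1, 2]]
)
hier_df = pd.DataFrame(np.arange(20).reshape(4, 5), columns=columns)

# Name the column index levels
hier_df.columns.names = ["letter", "number"]

# Group by the letter level and count the columns in each group, per row
counts = hier_df.T.groupby(level="letter").count().T
print(counts)
```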
Practice with a Real Dataset
Let’s practice with a real dataset. First of all, let’s import the dataset.
You can find the dataset here. Let’s see the first five rows of the dataset.
Next, let’s take a look at the structure of the variables in the dataset.
Let’s find summary statistics for the numeric variables in this dataset with the describe() method.
Let’s find the means for the Global_Sales column.
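Since the file itself is not included here, the sketch below uses a tiny made-up stand-in called games in place of the real dataset. The rows are invented, and the column names other than Global_Sales (Name, Genre, NA_Sales, EU_Sales, JP_Sales) are my assumptions. With the real file you would load it with pd.read_csv instead.

```python
import pandas as pd

# Made-up stand-in for the real dataset (rows are invented)
games = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game C", "Game D", "Game E", "Game F"],
    "Genre": ["Sports", "Sports", "Racing", "Racing", "Puzzle", "Sports"],
    "NA_Sales": [4.0, 2.0, 3.0, 1.0, 1.0, 3.0],
    "EU_Sales": [2.0, 1.0, 2.0, 1.0, 2.0, 0.0],
    "JP_Sales": [1.0, 0.0, 1.0, 0.0, 3.0, 0.0],
    "Global_Sales": [8.0, 4.0, 6.0, 2.0, 7.0, 3.0],
})

print(games.head())      # first five rows
games.info()             # column types and non-null counts
print(games.describe())  # summary statistics for numeric columns

global_mean = games["Global_Sales"].mean()
print(global_mean)
```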
Now, let’s create a group object by genre.
Let’s see the number of global sales records by genre.
Let’s see summary statistics for global sales by genre.
Let’s filter to see the mean for just one genre.
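These steps could look like the sketch below. The games table is a made-up stand-in for the real dataset, and the Genre column name is my assumption.

```python
import pandas as pd

# Made-up stand-in for the real dataset (rows are invented)
games = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game C", "Game D", "Game E", "Game F"],
    "Genre": ["Sports", "Sports", "Racing", "Racing", "Puzzle", "Sports"],
    "NA_Sales": [4.0, 2.0, 3.0, 1.0, 1.0, 3.0],
    "EU_Sales": [2.0, 1.0, 2.0, 1.0, 2.0, 0.0],
    "JP_Sales": [1.0, 0.0, 1.0, 0.0, 3.0, 0.0],
    "Global_Sales": [8.0, 4.0, 6.0, 2.0, 7.0, 3.0],
})

# Group object keyed by genre
by_genre = games.groupby("Genre")

# Number of Global_Sales records in each genre
counts = by_genre["Global_Sales"].count()
print(counts)

# Summary statistics of Global_Sales for each genre
stats = by_genre["Global_Sales"].describe()
print(stats)
```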
Let’s find the mean of all numeric type columns by genre.
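A sketch of both steps on the made-up games stand-in (the rows and the Genre column name are assumptions). Passing numeric_only=True keeps text columns such as Name out of the calculation.

```python
import pandas as pd

# Made-up stand-in for the real dataset (rows are invented)
games = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game C", "Game D", "Game E", "Game F"],
    "Genre": ["Sports", "Sports", "Racing", "Racing", "Puzzle", "Sports"],
    "NA_Sales": [4.0, 2.0, 3.0, 1.0, 1.0, 3.0],
    "EU_Sales": [2.0, 1.0, 2.0, 1.0, 2.0, 0.0],
    "JP_Sales": [1.0, 0.0, 1.0, 0.0, 3.0, 0.0],
    "Global_Sales": [8.0, 4.0, 6.0, 2.0, 7.0, 3.0],
})

# Mean global sales per genre, then pick out a single genre
genre_means = games.groupby("Genre")["Global_Sales"].mean()
racing_mean = genre_means.loc["Racing"]
print(racing_mean)

# Mean of every numeric column per genre
all_means = games.groupby("Genre").mean(numeric_only=True)
print(all_means)
```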
Let’s see a bar plot of global sales. First, let’s use the %matplotlib inline magic command so that the graph is displayed inline in the notebook.
Let’s plot the graph.
Let’s see the means of the game sales in America, Europe, and Japan by genre as a bar plot.
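The two plots might be produced as in the sketch below. It assumes matplotlib is installed and uses the made-up games stand-in, with NA_Sales, EU_Sales, and JP_Sales as assumed column names for America, Europe, and Japan. The plot titles are my own.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for the real dataset (rows are invented)
games = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game C", "Game D", "Game E", "Game F"],
    "Genre": ["Sports", "Sports", "Racing", "Racing", "Puzzle", "Sports"],
    "NA_Sales": [4.0, 2.0, 3.0, 1.0, 1.0, 3.0],
    "EU_Sales": [2.0, 1.0, 2.0, 1.0, 2.0, 0.0],
    "JP_Sales": [1.0, 0.0, 1.0, 0.0, 3.0, 0.0],
    "Global_Sales": [8.0, 4.0, 6.0, 2.0, 7.0, 3.0],
})

global_means = games.groupby("Genre")["Global_Sales"].mean()
regional_means = games.groupby("Genre")[["NA_Sales", "EU_Sales", "JP_Sales"]].mean()

# In a notebook, run %matplotlib inline first; in a script, call plt.show()
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
global_means.plot(kind="bar", ax=ax1, title="Mean global sales by genre")
regional_means.plot(kind="bar", ax=ax2, title="Mean regional sales by genre")
fig.tight_layout()
```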
That’s it. In this post, I explained groupby in Pandas. I hope you enjoyed this post. Thanks for reading. You can find the notebook here.