Pandas:The ML Building Block-Part 2

Santosh Kumar

Published in

Xebia Engineering Blog

5 min readMay 16, 2021

Part-1

Let’s begin from where we left in the 1st part.

In this part we will discuss about slicing/dicing, merging/concatenating, aggregation, and applying function.

We follow the same rule as last time, for better understanding read the comments in the code block.

Jupyter Notebook

Let’s start with loading the data again:

Subsetting Rows Based on Conditions

Often, we want to select rows which satisfy some given conditions. For e.g. select all the orders where the Sales > 3000, or all the orders where 2000 < Sales < 3000 and Profit < 100.

Arguably, the best way to do these operations is using df.loc[], since df.iloc[] would require us to remember the integer column indices, which is tedious.We will also discuss some other ways of doing the same job.

Let’s see some examples.

In the above we have used the df.loc[] but we can do it in more elegant way stated below.

We can combine multiple operations using the & (and) operator and |(or) operator.Along with this we can apply the == (equality) and !=(unequal) operators.

Merging and Concatenating Dataframes

In this section, we will merge and concatenate multiple dataframes.

Merging is one of the most common operations we will do, since data often comes in various files.

In our case, we have sales data of a retail store spread across multiple files. We will now work with all these data files and learn to:

Merge multiple dataframes using common columns/keys using pd.merge()
Concatenate dataframes using pd.concat()

We will start with loading multiple files so that they can be merged to make a master dataframe.

Merging Dataframes

Say we want to select all orders and observe the Sales of any segment of customer . Since customer segment details are present in the dataframe customer_df, we will first need to merge it with market_df.

We can see that after merging we get a total of 22 columns in master_df. Similarly, we can perform left, right and outer merges (joins) by using the argument how = ‘left’ / ‘right’ / ‘outer’.

Concatenating Dataframes

Concatenation is much more straightforward than merging. It is used when you have dataframes having the same columns and want to append them (pile one on top of the other), or having the same rows and want to append them side-by-side.

Concatenating Dataframes having the same column

Say you have two dataframes having the same columns, like :

Concatenating Dataframes having the same rows

We may also have dataframes having the same rows but different columns (and having no common columns). In this case, we may want to concat them side-by-side. For e.g.:

Grouping and Summarising Dataframes

Grouping and aggregation are some of the most frequently used operations in data analysis, especially while doing exploratory data analysis (EDA), where comparing summary statistics across groups of data is common.

E.g. in the retail sales data we are working with, we may want to compare the average sales of various regions, or compare the total profit of two customer segments.

Grouping analysis can be thought of as having three parts:

Splitting the data into groups (e.g. groups of customer segments, product categories, etc.)
Applying a function to each group (e.g. mean or total sales of each customer segment)
Combining the results into a data structure showing the summary statistics

Let’s work through some examples.

Step 1. Grouping using df.groupby()

Typically, we group the data using a categorical variable, such as customer segments, product categories, etc. This creates as many subsets of the data as there are levels in the categorical variable.

For example, in this case, we will group the data along Customer_Segment.

Step 2. Applying a Function

After grouping, you apply a function to a numeric variable, such as mean(Sales), sum(Profit), etc.

Step 3. Combining the results into a Data Structure

You can optionally show the results as a dataframe.

We can apply other function like mean(),min(),max(),count() etc in place of sum().

Lambda Functions

Say we want to create a new column indicating whether a given order was profitable or not (1/0).

We need to apply a function which returns 1 if Profit > 0, else 0. This can be easily done using the apply() method on a column of the dataframe.

So this brings us to a logical end to study of pandas basic and widely used features. In the next and final part of this series, I will discuss about some of the issues which I faced while using pandas in performing Machine Learning algorithms. Till then stay tuned.