7 Unique Pandas Functions to Elevate Your Analysis

Obalabi adepoju
Learning Data
7 min read · Sep 2, 2024


And no, I don’t mean dropna() or fillna().

As a data analyst who has worked on multiple projects and studied even more, I’ve found these functions to be incredibly helpful in the data analytics process.

Whether you’ve been here before or you’re coming across this article for the first time, know that what I aim to achieve with these write-ups is to explain a concept or topic as simply as possible, so that a little knowledge is gained with each swipe down the post.

Now let’s begin.

What is Pandas?

Pandas is the most widely used Python library for working with structured and semi-structured data, especially in data analytics, where it handles data manipulation and analysis tasks. From data cleaning to data visualization, it’s your go-to library for most of the tasks you need in a project.

It has two primary data structures: the one-dimensional (Series) data structure and the more widely used two-dimensional (DataFrame) data structure. To give you an idea of how essential it is, for a small project, it is entirely possible to execute all your tasks using the Pandas library.
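To make the two structures concrete, here is a minimal sketch (the variable names and values are my own): a Series is a single labeled column, while a DataFrame is a table of such columns.

```python
import pandas as pd

# One-dimensional Series: a single labeled column of values
s = pd.Series([10, 20, 30], name='value')

# Two-dimensional DataFrame: a table of named columns
df = pd.DataFrame({
    'month': ['Jan', 'Feb', 'Mar'],
    'value': [10, 20, 30]
})

print(s.shape)   # (3,)
print(df.shape)  # (3, 2)
```

A single column pulled out of a DataFrame, such as `df['value']`, is itself a Series, which is why most Series methods appear throughout the examples below.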

For more detailed information on the functions and techniques available, visit the Pandas Documentation.

Pandas provides an extensive number of functions, so I thought it essential to compile a short list of no more than seven functions associated with Pandas DataFrames or the Pandas library in general that I have found to be especially useful in my analytics projects.

The first step in using the pandas library is to import it.

import pandas as pd
1. .update():

It can be used in multiple ways, but what I like to use it for is a pattern I call “Isolation and Fix.”

Let’s imagine I discovered some rows in the month column of my dataset that are inconsistent with the rest of the data. I also noticed that other columns in these rows differ from the general data. To address this, I extract these odd rows, correct all the issues, and then reintegrate the corrected rows back into the dataset.

This is an example of how that would go.

# Sample data
data1 = pd.DataFrame({
    'month': ['Jan', 'Feb', 'Marh', 'Aprh'],
    'value': [10, 20, 305, 405]
}, index=[1, 2, 3, 4])

# Extracted odd rows (misspelled months, out-of-range values)
data2 = pd.DataFrame({
    'month': ['Marh', 'Aprh'],
    'value': [305, 405]
}, index=[3, 4])

# Correcting the odd rows
data2 = pd.DataFrame({
    'month': ['Mar', 'Apr'],
    'value': [30, 40]
}, index=[3, 4])

# Update the original data with the corrected values (aligned on index)
data1.update(data2)
print(data1)
An example of our corrected data after updating

2. .apply(): This is a powerful method that applies a function along an axis of a DataFrame or Series. It’s highly versatile and can be used for a variety of operations, including applying custom functions, transforming data, and categorizing data, just to name a few. I’ll go through three cases I normally use this function for:

  • Applying a function to each value in a column
# Sample data
data = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
})

# Convert each value in column A to a string
data['A'] = data['A'].apply(lambda x: str(x))
print(data.info())
The “A” column is now an object after using the apply method.
  • Categorizing data for a new column
# Sample data
data = pd.DataFrame({
    'Age': [11, 16, 30, 51]
})

# Bin each age into a category
data['Category'] = data['Age'].apply(
    lambda x: 'Child' if x <= 12 else 'Teenager' if x <= 18 else 'Adult'
)

print(data)
  • Creating a new dataframe
# Sample data
data = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
})

# Apply a lambda function to every element
data_squared = data.apply(lambda x: x ** 2)

print(data_squared)

3. .pipe() : This function allows you to apply a series of functions to a DataFrame or Series in a pipeline, quite similar to `%>%` in R. It helps to make the code more readable and maintainable by chaining operations together.

For example, let’s assume I have a dataset and its columns are not in the correct scale, and I have to perform a series of operations to clean the data. Here’s how it can be applied.

# Sample data
data = pd.DataFrame({
    'A': [10, 20, 30],
    'B': [1, 2, 3]
})

def multiply(df):
    df['A'] *= 100
    return df

def add(df):
    df['B'] = df['B'] + 10
    return df

# Multiplying values in A by 100, then adding 10 to values in B
data.pipe(multiply) \
    .pipe(add) \
    .pipe(print)
Both operations have been applied to the data in a single chained operation

4. pd.crosstab() : This function is one I use quite frequently when working with categorical data. It generates a table that displays the frequency distribution of the intersection between two or more categorical variables.

The frequency it displays can either be the count of occurrences of combinations of values or proportions representing the data in terms of percentage or relative frequency.

Let’s assume I have a dataset containing road accident information with a gender column and an accident severity column, and I want to find information such as, “Are men involved in more serious accidents than women?” This is an approach that can be used to answer the question.

# df is the (hypothetical) accident dataset
# gives frequency as a count of occurrences
crossdf = pd.crosstab(df['Severity'], df['Gender'])

crossdf = crossdf.reset_index()
print(crossdf)
A cross table showing the number of both genders in each accident category
# gives frequency as a proportion of each row (severity level)
crossdf = pd.crosstab(df['Severity'], df['Gender'], normalize='index')
crossdf = crossdf.reset_index()
print(crossdf)
A cross table showing the proportion of both genders in each severity of an accident

5. .pct_change() : This function calculates the percentage change between the current and a prior element along a given axis. It is useful for analyzing the relative changes in data, such as financial or time series data.

A relatively simple but very useful function used in trend analysis to show changes over time. Let’s assume we have a dataset containing monthly sales figures. Here’s how we can calculate the percentage change.

data = pd.DataFrame({
    'Month': ['Jan', 'Feb', 'Mar', 'Apr', 'May'],
    'Sales': [2000, 2200, 2100, 2500, 2700]
})

# Set 'Month' as index
data.set_index('Month', inplace=True)

# Calculate percentage change
data['Sales Change (%)'] = round((data['Sales'].pct_change() * 100), 2)

print(data)
Each row in the Sales Change column shows by how much sales figures increased since the previous month

6. .get_dummies() : This function is used to convert categorical data into a format that can be fed into a machine learning algorithm for data modelling, a technique popularly known as “one-hot encoding”. This is typically done by converting categorical variables into a series of binary (0 or 1) or boolean (True or False) columns.

If you’re tempted to ask what use this is when scikit-learn has a LabelEncoder that performs a similar task, the two are quite different. .get_dummies() is preferred for non-ordinal data since it avoids implying any order among categories. LabelEncoder, on the other hand, is more compact and suited to ordinal data such as education level, which has an ordered relationship (Middle School < High School < University).
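To illustrate the contrast without pulling in scikit-learn, the same ordinal-encoding idea can be sketched with pandas alone using an ordered Categorical (the education values here are made up for the example):

```python
import pandas as pd

# Hypothetical education-level column (ordinal data)
edu = pd.Series(['High School', 'Middle School', 'University', 'High School'])

# Ordinal encoding: an ordered Categorical maps each level to an integer
# code, preserving Middle School < High School < University
levels = ['Middle School', 'High School', 'University']
codes = pd.Categorical(edu, categories=levels, ordered=True).codes
print(list(codes))  # [1, 0, 2, 1]

# One-hot encoding: one 0/1 column per category, no order implied
print(pd.get_dummies(edu, dtype=int))
```

The integer codes carry the ranking (0 < 1 < 2), which is exactly what you want for ordinal data and exactly what you want to avoid for unordered categories.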

Let’s assume I have a column containing answers to a yes/no question. .get_dummies() will create new columns, one for each category; each row will have a 1 in the column corresponding to its answer and a 0 in the others.

# Sample DataFrame with a categorical column
data = pd.DataFrame({
    'Answer': ['yes', 'no', 'no', 'yes', 'no']
})

# Convert the categorical column to 0/1 dummy variables
# (dtype=int; recent pandas versions default to booleans)
dummies = pd.get_dummies(data['Answer'], dtype=int)

print(dummies)
Each row has a 1 in the column matching its answer and a 0 in the other

7. .combine_first(): This function combines two DataFrames by filling in missing values in one with values from the other. It’s particularly useful when you have two datasets that complement each other and you want to create a complete dataset by merging them.

It’s quite similar to the .merge() function; one difference is that it automatically includes all distinct columns from both DataFrames, aligns them on their index, and fills missing values in the first DataFrame with corresponding values from the second. Depending on your use case, it can sometimes be used in place of .merge().
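A rough side-by-side of the two, using two small made-up frames that share an `id` key, may make the difference clearer:

```python
import pandas as pd

# Two small made-up frames sharing an 'id' key
left = pd.DataFrame({'id': [1, 2], 'score': [90.0, None]}).set_index('id')
right = pd.DataFrame({'id': [1, 2], 'score': [85.0, 70.0],
                      'bonus': [5, 3]}).set_index('id')

# combine_first: keeps left's existing values, fills its gaps from right,
# and brings in right's extra columns automatically
combined = left.combine_first(right)
print(combined)  # score: 90.0 and 70.0; bonus filled from right

# merge: joins on the key and keeps both score columns side by side
merged = left.merge(right, left_index=True, right_index=True,
                    suffixes=('_left', '_right'))
print(merged)
```

Note that combine_first resolves the overlapping `score` column (left wins wherever it has a value), while merge keeps both versions as `score_left` and `score_right` and leaves the reconciliation to you.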

Now, let’s assume, for the last time, that a company collects employee performance data from two departments, HR and Sales. Both departments track employee data, but they focus on different aspects. If I want to analyze HR data but discover some values are missing and I also need additional information from Sales, here’s how I would do this:

# HR Data (data1)
data1 = pd.DataFrame({
    'EmployeeID': [101, 102, 103, 104],
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Department': ['HR', 'Sales', 'HR', None],
    'Salary': [60000, 55000, 58000, 62000],
})

# Sales Data (data2)
data2 = pd.DataFrame({
    'EmployeeID': [101, 102, 104, 105],
    'SalesTarget': [50000, 70000, 65000, 80000],
    'SalesAchieved': [48000, 75000, 66000, 81000],
    'Department': ['HR', None, 'Sales', 'Sales']
})

# Combine on EmployeeID using combine_first
combined_data = (data1.set_index('EmployeeID')
                 .combine_first(data2.set_index('EmployeeID'))
                 .reset_index())

print(combined_data)
Missing Department values in both datasets have been filled in, and the other columns were combined as well

That’s all! If you enjoyed this article as much as I enjoyed writing it, please leave as many claps as possible (up to 50) and let me know in the comments which other Pandas functions you normally use.

The contents of external submissions are not necessarily reflective of the opinions or work of Maven Analytics or any of its team members.

We believe in fostering lifelong learning and our intent is to provide a platform for the data community to share their work and seek feedback from the Maven Analytics data fam.

Submit your own writing here if you’d like to become a contributor.

Happy learning!

-Team Maven
