7 Unique Pandas Functions to Elevate Your Analysis
And no, I don’t mean dropna() or fillna().
As a data analyst who has worked on multiple projects and studied even more, I’ve found these functions to be incredibly helpful in the data analytics process.
Whether you’ve been here before or you’re coming across this article for the first time, know that my general aim with these write-ups is to explain a concept or topic as simply as possible, so a little knowledge is gained with each swipe down the post.
Now let’s begin.
What is Pandas?
Pandas is the most widely used Python library for working with structured and semi-structured data, especially in the data analytics process, where it handles data manipulation and analysis tasks. From data cleaning to data visualization, it’s your go-to library for most of the tasks you need in a project.
It has two primary data structures: the one-dimensional (Series) data structure and the more widely used two-dimensional (DataFrame) data structure. To give you an idea of how essential it is, for a small project, it is entirely possible to execute all your tasks using the Pandas library.
For more detailed information on the functions and techniques used in Python, visit the Pandas Documentation.
Pandas provides an extensive number of functions, so I thought it essential to compile a short list of no more than seven functions associated with Pandas DataFrames or the Pandas library in general that I have found to be especially useful in my analytics projects.
The first step in using the Pandas library is to import it.
import pandas as pd
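To make those two data structures concrete, here’s a minimal sketch of each (the values are just illustrations):
# A Series is one-dimensional: a single labeled column of values
s = pd.Series([10, 20, 30], name='sales')

# A DataFrame is two-dimensional: a table with labeled rows and columns
df = pd.DataFrame({
    'month': ['Jan', 'Feb', 'Mar'],
    'sales': [10, 20, 30]
})
print(s)
print(df)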
1. .update()
: It can be used in multiple ways, but what I like to use it for follows a pattern I call “Isolate and Fix.”
Let’s imagine I discovered some rows in the month column of my dataset that are inconsistent with the rest of the data. I also noticed that other columns in these rows differ from the general data. To address this, I extract these odd rows, correct all the issues, and then reintegrate the corrected rows back into the dataset.
This is an example of how that would go.
# Sample data with two inconsistent rows (index 3 and 4)
data1 = pd.DataFrame({
    'month': ['Jan', 'Feb', 'Marh', 'Aprh'],
    'value': [10, 20, 305, 405]
}, index=[1, 2, 3, 4])

# Isolate: extract the odd rows
data2 = pd.DataFrame({
    'month': ['Marh', 'Aprh'],
    'value': [305, 405]
}, index=[3, 4])

# Fix: correct the odd rows
data2 = pd.DataFrame({
    'month': ['Mar', 'Apr'],
    'value': [30, 40]
}, index=[3, 4])

# Reintegrate: update the original data with the corrected values
data1.update(data2)
print(data1)
2. .apply()
: This function is a powerful method used to apply a function along an axis of a DataFrame, or element-wise on a Series. It’s highly versatile and can be used for a variety of operations, including applying custom functions, transforming data, and categorizing data, just to name a few. I’ll go through three cases I normally use this function for, followed by a quick row-wise sketch:
- Applying a function to each value in a column
# Sample data
data = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
})

# Convert each value in column A to a string
data['A'] = data['A'].apply(lambda x: str(x))
print(data.info())
- Categorizing data for a new column
# Sample data
data = pd.DataFrame({
    'Age': [11, 16, 30, 51]
})

# Bucket each age into a category
data['Category'] = data['Age'].apply(lambda x: 'Child' if x <= 12 else
                                     'Teenager' if x <= 18 else 'Adult')
print(data)
- Creating a new DataFrame
# Sample data
data = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
})

# Apply a lambda function to each column, squaring every element
data_squared = data.apply(lambda x: x ** 2)
print(data_squared)
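Since .apply() also works row-wise, here’s a minimal sketch using axis=1, reusing the small sample frame above (the 'Total' column name is just an illustration):
# Sample data
data = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
})

# axis=1 passes each row to the function as a Series,
# so we can combine values across columns per row
data['Total'] = data.apply(lambda row: row['A'] + row['B'], axis=1)
print(data)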
3. .pipe()
: This function allows you to apply a series of functions to a DataFrame or Series in a pipeline, quite similar to `%>%` in R. It helps to make the code more readable and maintainable by chaining operations together.
For example, let’s assume I have a dataset and its columns are not in the correct scale, and I have to perform a series of operations to clean the data. Here’s how it can be applied.
# Sample data
data = pd.DataFrame({
    'A': [10, 20, 30],
    'B': [1, 2, 3]
})

def multiply(df):
    df['A'] *= 100
    return df

def add(df):
    df['B'] = df['B'] + 10
    return df

# Multiply values in A by 100, then add 10 to values in B
data.pipe(multiply) \
    .pipe(add) \
    .pipe(print)
4. pd.crosstab()
: This function is one I use quite frequently when working with categorical data. It generates a table that displays the frequency distribution of the intersection between two or more categorical variables.
The frequency it displays can either be the count of occurrences of combinations of values or proportions representing the data in terms of percentage or relative frequency.
Let’s assume I have a dataset containing road accident information with a gender column and an accident severity column, and I want to find information such as, “Are men involved in more serious accidents than women?” This is an approach that can be used to answer the question.
# Sample road accident data
df = pd.DataFrame({
    'Severity': ['Minor', 'Serious', 'Minor', 'Serious', 'Minor'],
    'Gender': ['Male', 'Female', 'Female', 'Male', 'Male']
})

# gives frequency as a count of occurrences
crossdf = pd.crosstab(df['Severity'], df['Gender'])
crossdf = crossdf.reset_index()
print(crossdf)

# gives frequency as a proportion within each severity level (row)
crossdf = pd.crosstab(df['Severity'], df['Gender'], normalize='index')
crossdf = crossdf.reset_index()
print(crossdf)
5. .pct_change()
: This function calculates the percentage change between the current and a prior element along a given axis. It is useful for analyzing the relative changes in data, such as financial or time series data.
It’s a relatively simple but very useful function for trend analysis, showing how values change over time. Let’s assume we have a dataset containing monthly sales figures. Here’s how we can calculate the percentage change.
data = pd.DataFrame({
    'Month': ['Jan', 'Feb', 'Mar', 'Apr', 'May'],
    'Sales': [2000, 2200, 2100, 2500, 2700]
})

# Set 'Month' as index
data.set_index('Month', inplace=True)

# Calculate the percentage change from the previous month
data['Sales Change (%)'] = round((data['Sales'].pct_change() * 100), 2)
print(data)
6. pd.get_dummies()
: This function is used to convert categorical data into a format that can be fed into a machine learning algorithm for data modelling, a technique popularly known as “one-hot encoding”. This is typically done by converting categorical variables into a series of binary (0 or 1) or boolean (True or False) columns.
If you’re tempted to ask what use this is when scikit-learn has a LabelEncoder that performs a similar task, know that the two are quite different. pd.get_dummies() is preferred for non-ordinal data since it avoids implying any order among categories. LabelEncoder, on the other hand, is more compact and suited to ordinal data such as education level, which has an ordered relationship (Middle School → High School → University); see the sketch after the example below.
Let’s assume I have a column containing answers to a question that are either “yes” or “no”. pd.get_dummies() will create one new column for each category. Each row will have a 1 in the column corresponding to its category and 0 in the others.
# Sample DataFrame with a categorical column
data = pd.DataFrame({
    'Answer': ['yes', 'no', 'no', 'yes', 'no']
})

# Convert the categorical column to 0/1 dummy variables
# (dtype=int gives 1/0 instead of the boolean default in recent pandas)
dummies = pd.get_dummies(data['Answer'], dtype=int)
print(dummies)
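For contrast, here’s a minimal sketch of the LabelEncoder approach on ordinal data, assuming scikit-learn is installed; the 'Education' column and its values are just illustrations:
from sklearn.preprocessing import LabelEncoder

# Sample ordinal column (hypothetical values)
edu = pd.DataFrame({
    'Education': ['Middle School', 'High School', 'University', 'High School']
})

# LabelEncoder maps each category to a single integer code
# Note: codes are assigned alphabetically, so check they match your intended order
le = LabelEncoder()
edu['Education_Code'] = le.fit_transform(edu['Education'])
print(edu)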
7. .combine_first()
: It is used to combine two DataFrames by filling in missing values in one DataFrame with values from another. It’s particularly useful when you have two datasets that complement each other and you want to create a complete dataset by merging them.
It’s quite similar to the .merge() function; one difference is that it automatically includes all distinct columns from both DataFrames, aligns them on their shared index and column labels, and fills missing values in the first DataFrame with corresponding values from the second. It can be used interchangeably with .merge() depending on your use case (a quick comparison follows the example below).
Now, let’s assume one last time that a company collects employee performance data from two departments, HR and Sales. Both departments track employee data, but they focus on different aspects. If I want to analyze the HR data but discover some values are missing, and I also need additional information from Sales, here’s how I would do it:
# HR Data (data1)
data1 = pd.DataFrame({
    'EmployeeID': [101, 102, 103, 104],
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Department': ['HR', 'Sales', 'HR', None],
    'Salary': [60000, 55000, 58000, 62000],
})

# Sales Data (data2)
data2 = pd.DataFrame({
    'EmployeeID': [101, 102, 104, 105],
    'SalesTarget': [50000, 70000, 65000, 80000],
    'SalesAchieved': [48000, 75000, 66000, 81000],
    'Department': ['HR', None, 'Sales', 'Sales']
})

# Combine using combine_first, aligning both frames on EmployeeID
combined_data = data1.set_index('EmployeeID') \
    .combine_first(data2.set_index('EmployeeID')) \
    .reset_index()
print(combined_data)
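For comparison, a rough .merge() equivalent is sketched below; note that an outer merge duplicates the overlapping Department column instead of filling it, so we have to patch it manually (the suffix names are just illustrations):
# Outer merge keeps all employees from both frames
merged = data1.merge(data2, on='EmployeeID', how='outer',
                     suffixes=('_hr', '_sales'))

# Unlike combine_first, the overlapping Department column is split in two,
# so we fill the gaps ourselves and drop the duplicates
merged['Department'] = merged['Department_hr'].combine_first(merged['Department_sales'])
merged = merged.drop(columns=['Department_hr', 'Department_sales'])
print(merged)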
That’s all! If you enjoyed this article as much as I enjoyed writing it, please leave as many claps as possible (up to 50) and let me know in the comments what other Pandas functions you normally use.
The contents of external submissions are not necessarily reflective of the opinions or work of Maven Analytics or any of its team members.
We believe in fostering lifelong learning and our intent is to provide a platform for the data community to share their work and seek feedback from the Maven Analytics data fam.
Submit your own writing here if you’d like to become a contributor.
Happy learning!
-Team Maven