Pandas Performance: Optimizing Your Data Analysis Workflow

Santhosh Kannan · Published in featurepreneur · Mar 5, 2023

What is Pandas?

Pandas is an open-source library for data manipulation and analysis. Its DataFrame and Series structures provide a simple way of handling numerical tables and time-series data, with methods for filtering, grouping, and reshaping data, as well as for handling missing values, merging, and visualization.

Why might your Pandas code be slow?

Although Pandas is a powerful library for handling data, it can perform slower than expected in several scenarios: large datasets, complex data transformations, iterative operations, and non-vectorized operations.

Most of the time, however, the slowness arises from unoptimized code and techniques. Below, we look at such examples and show the corrections needed to reduce the run-time of your data analysis.

Dataset used

The CSV files used in the examples can be generated by running the following Python code.

import pandas as pd
import numpy as np
import random

companies_list = ["Amazon", "Google", "Infosys", "Mastercard", "Microsoft",
                  "Uber", "IBM", "Apple", "Wipro", "Cognizant"]

jobs_list = ["Software Development Engineer", "Machine Learning Engineer",
             "Data Scientist", "Data Analyst",
             "Artificial Intelligence Engineer", "Back-end Developer",
             "Front-end Developer", "Research Scientist",
             "IOS Developer", "Android Developer"]

cities_list = ["Alberta", "British Columbia", "Manitoba", "New Brunswick",
               "Newfoundland and Labrador", "Ontario", "Quebec", "Nunavut",
               "Prince Edward Island", "Northwest Territories"]

data1 = []
data2 = []
for i in range(4_096_000):
    # Randomly generated attributes for one employee
    emp_id = i + 1
    company = random.choice(companies_list)
    job = random.choice(jobs_list)
    city = random.choice(cities_list)
    salary = int(round(np.random.rand(), 3) * 10**6)
    employment = random.choice(["Full Time", "Intern"])
    rating = round(np.random.rand() * 5, 1)

    data1.append([emp_id, company, job, city])
    data2.append([emp_id, salary, employment, rating])

data = pd.DataFrame(data1, columns=["Employee Id", "Company Name", "Title",
                                    "Location"])
data.to_csv("dataset1.csv", index=False)
data = pd.DataFrame(data2, columns=["Employee Id", "Employee Salary",
                                    "Employment Status", "Employee Rating"])
data.to_csv("dataset2.csv", index=False)

1. Filtering categorical data

We frequently have to filter a DataFrame, keeping only the rows that satisfy a particular condition. Normally, filtering is done by indexing the DataFrame with a boolean mask, as follows:

%%timeit
dataset1[dataset1["Location"] == "Ontario"]

############################################################
# 175 ms ± 3.46 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
############################################################

An optimized way of doing the same is to use the groupby() method to group the dataset on a column and then obtain an individual group with the get_group() method. Building the groups is a one-time cost, so this pays off most when you filter the same column by many different values:

data_grp = dataset1.groupby("Location")

%%timeit
data_grp.get_group("Ontario")

############################################################
# 18.5 ms ± 287 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
############################################################
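For instance, the same grouped object can be reused for several lookups; a minimal sketch:

# The groupby() cost is paid once; each get_group() lookup is cheap.
for province in ["Ontario", "Quebec", "Manitoba"]:
    subset = data_grp.get_group(province)
    print(province, len(subset))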

2. Combining DataFrames

Two DataFrames can be combined on a common column to get a single DataFrame, similar to the JOIN statement in SQL. The merge can be done as follows:

%%timeit
pd.merge(dataset1,dataset2,on="Employee Id", how="inner")

############################################################
# 970 ms ± 32.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
############################################################

A faster way to do the same is to use the join() method, setting the index of both DataFrames to the common column to join on:

%%timeit
dataset1.set_index("Employee Id").join(dataset2.set_index("Employee Id"))

############################################################
# 386 ms ± 6.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
############################################################
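If the same key is joined on repeatedly, setting the index once ahead of time avoids paying the set_index() cost on every join; a sketch under that assumption:

# Prebuild the indexes once; only the join itself is paid per call.
d1 = dataset1.set_index("Employee Id")
d2 = dataset2.set_index("Employee Id")
merged = d1.join(d2)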

3. Iterating over a DataFrame

The process of visiting every row in a DataFrame and performing an operation on it is called iterating, or looping, over the DataFrame. Iterating is a costly operation and is best avoided altogether in favour of vectorized operations.

%%timeit
lst = []
for row in dataset2["Employee Salary"]:
    lst.append(row / 12)

############################################################
# 412 ms ± 17.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
############################################################

%%timeit
lst = dataset2["Employee Salary"]/12

############################################################
# 6.08 ms ± 66.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
############################################################

However, there are situations where looping may be necessary. Pandas provides two methods to iterate through a DataFrame: iterrows() and itertuples(). Although both can be used for the same operation, itertuples() is much faster than iterrows().

%%timeit
lst = []
for row in dataset2.itertuples():
    # Column names containing spaces get positional names,
    # so row._2 is the "Employee Salary" column.
    lst.append(row._2 / 12)

############################################################
# 2.48 s ± 160 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
############################################################
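For comparison, the equivalent iterrows() loop looks like this; it constructs a full Series object for every row, which is why it is typically far slower (no timing was recorded in the original run):

lst = []
for idx, row in dataset2.iterrows():
    # Each `row` here is a freshly built Series -- convenient, but slow.
    lst.append(row["Employee Salary"] / 12)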

4. Specifying the datatypes

When Pandas reads a CSV file into a DataFrame, it infers the data type of each column and assigns the largest suitable datatype (int64, float64, etc.). In many cases, such a large datatype is unnecessary. The datatypes for the columns can instead be specified when reading the CSV itself by passing a dtype dict.

dataset2 = pd.read_csv("dataset2.csv")
dataset2.info()

"""
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4096000 entries, 0 to 4095999
Data columns (total 4 columns):
# Column Dtype
--- ------ -----
0 Employee Id int64
1 Employee Salary int64
2 Employment Status object
3 Employee Rating float64
dtypes: float64(1), int64(2), object(1)
memory usage: 125.0+ MB
"""

dtypes = {
    "Employee Id": "uint32",      # ids reach 4,096,000, far beyond uint8's 255
    "Employee Salary": "uint32",  # salaries reach 1,000,000
    "Employment Status": "object",
    "Employee Rating": np.float16,
}
dataset2 = pd.read_csv("dataset2.csv", dtype=dtypes)
dataset2.info()

"""
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4096000 entries, 0 to 4095999
Data columns (total 4 columns):
# Column Dtype
--- ------ -----
0 Employee Id uint8
1 Employee Salary uint8
2 Employment Status object
3 Employee Rating float16
dtypes: float16(1), object(1), uint8(2)
memory usage: 46.9+ MB
"""

By specifying smaller datatypes, we cut the DataFrame's memory footprint from about 125 MB to about 70 MB, nearly half. Be careful when downsizing integers: a dtype too small for the column's largest value (for example uint8, which tops out at 255) will overflow or raise an error instead of saving memory.
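A further saving, not benchmarked above, is to store low-cardinality string columns such as Employment Status using pandas' category dtype; a minimal sketch:

# Only two distinct values, so small categorical codes replace
# per-row Python strings.
dataset2["Employment Status"] = dataset2["Employment Status"].astype("category")
dataset2.info()  # the former object column shrinks substantially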
