Mastering Data Analysis with Pandas: An Introduction to the Powerful Python Library

Unlock the Full Potential of Pandas for Efficient Data Manipulation and Analysis

Rajesh Srivastava
Mar 23, 2023

Image by Ekaterina Bolovtsova on Pexels

Pandas is an essential Python library for anyone looking to work with data. Whether you’re analyzing, manipulating, or visualizing data, Pandas makes the process easy and efficient. Let’s dive in and start using this powerful tool to unlock the full potential of your data.

In case you missed my previous article where I shared my experience of learning Data Science, you can find it below. This article is a follow-up to that one and will build on the concepts and ideas I introduced earlier.

B. Pandas Introduction

You can install Pandas by running the following command in a Jupyter Notebook cell:

!pip install pandas

Putting an exclamation mark (!) at the beginning of a cell in a Jupyter Notebook lets you run shell commands as if you were typing them directly into a terminal.
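Any shell command works the same way. For example, to confirm the installation and see which version of Pandas you have:

!pip show pandas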

  1. The primary building blocks of Pandas

1.1 Pandas Series

A Series is a one-dimensional labeled array that can hold any data type (integers, strings, floating-point numbers, Python objects, etc.). It is similar to a column in a spreadsheet or a SQL table.

Code snippet to create a Pandas Series object:

import pandas as pd

# create a Pandas Series object
s = pd.Series([10, 20, 30, 40, 50])

# print the Series object
print(s)
0    10
1    20
2    30
3    40
4    50
dtype: int64

1.2 Pandas DataFrame

A DataFrame is a two-dimensional labeled data structure with columns of potentially different types. It is like a spreadsheet or a SQL table. You can think of a DataFrame as a collection of Pandas Series objects that share the same index.

Code snippet to create a Pandas DataFrame object:

import pandas as pd

# create a dictionary with some data
data = {'name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
        'age': [25, 32, 18, 47, 22],
        'country': ['USA', 'Canada', 'USA', 'Canada', 'USA']}

# create a Pandas DataFrame object from the dictionary
df = pd.DataFrame(data)

# print the DataFrame object
print(df)
      name  age country
0    Alice   25     USA
1      Bob   32  Canada
2  Charlie   18     USA
3    David   47  Canada
4    Emily   22     USA
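Selecting a single column returns a Pandas Series that shares the DataFrame’s index, which is exactly the “collection of Series” view described above:

# a single column is a Series with the same index as the DataFrame
print(df['age'])
print(type(df['age']))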

1.3 Different ways to read data in Pandas

1.3.1 Read data from a CSV file:

import pandas as pd

# read CSV file into a Pandas DataFrame
df = pd.read_csv('data.csv')

# print the DataFrame
print(df)

1.3.2 Read data from an Excel file:

import pandas as pd

# read Excel file into a Pandas DataFrame
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')

# print the DataFrame
print(df)

1.3.3 Read data from a SQL database:

When working with data from a SQL database, you first need to establish a connection between Python and the database using an appropriate Python library. This connection allows you to pass a query to the database and retrieve the result in a format that can be processed with Pandas. For SQLite databases, Python ships with the built-in sqlite3 module, which the example below uses; pysqlite3 is an installable alternative with the same interface, and you can add it by running !pip install pysqlite3 in a notebook cell.

If you have data stored in a PostgreSQL, MySQL, or other SQL server, you’ll likewise need to establish a connection between Python and the database to access and manipulate the data. To do this, use a Python library that is compatible with the specific database management system (DBMS) you’re working with; a sketch for PostgreSQL follows the SQLite example below.

import pandas as pd
import sqlite3

# create a database connection
conn = sqlite3.connect('employee.db')

# read SQL query into a Pandas DataFrame
df = pd.read_sql_query('SELECT * FROM employee', conn)

# print the DataFrame
print(df)
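For a PostgreSQL server, the same read_sql_query pattern works once you have a connection object. Here is a minimal sketch assuming the sqlalchemy and psycopg2 packages are installed; the connection string, credentials, and table name are placeholders:

import pandas as pd
from sqlalchemy import create_engine

# create a connection engine for PostgreSQL (placeholder credentials and database)
engine = create_engine('postgresql+psycopg2://user:password@localhost:5432/company_db')

# read SQL query into a Pandas DataFrame
df = pd.read_sql_query('SELECT * FROM employee', engine)

# print the DataFrame
print(df)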

1.3.4 Read data from a JSON file:

import pandas as pd

# read JSON file into a Pandas DataFrame
df = pd.read_json('data.json')

# print the DataFrame
print(df)

1.3.5 Read data from a URL:

import pandas as pd

# read URL into a Pandas DataFrame
url = 'https://example.com/data.csv'
df = pd.read_csv(url)

# print the DataFrame
print(df)

4. Important DataFrame operations

Pandas is a powerful library with a large number of operations, but not all of them are equally useful or necessary for every analysis task. In fact, many data analysis tasks can be accomplished using a subset of the most important and useful DataFrame operations.

Therefore, when working with Pandas, it’s important to focus on learning the most commonly used and useful DataFrame operations that cover 70–80% of data analysis tasks. By mastering these core operations, you can quickly and efficiently perform data cleaning, manipulation, and analysis tasks on your datasets without getting overwhelmed by the full range of available options.

It’s time to give your Jupyter notebook a new friend: the dataset! We will start with the Telco Customer Churn dataset, which can be downloaded from here. Go ahead, upload it, and let the magic of data analysis unfold.

4.1 Reading the dataset from a CSV file:

import pandas as pd

df = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')

4.2 Subset operations

4.2.1 Viewing the first few rows of the dataset:

df.head(n) # Replace n with an integer to get the top n rows (5 by default).

4.2.2 Viewing the last few rows of the dataset:

df.tail(n) # Replace n with an integer to get the bottom n rows (5 by default).

4.2.3 Selecting rows and columns

loc is used to select rows and columns by label:

# select rows and columns using loc
df.loc[1:3, ['customerID', 'Dependents']]

Here, we are selecting the rows with labels 1 to 3 (loc slices include both endpoints) and the columns “customerID” and “Dependents” using loc.

iloc is used to select rows and columns by integer position:

# select rows and columns using iloc
df.iloc[1:3, [0, 1]]

Here, we are selecting the rows at positions 1 and 2 (iloc slices exclude the end position) and columns 0 and 1 (the integer positions of “customerID” and “gender”) using iloc.
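Because the churn DataFrame uses a default integer index, labels and positions coincide here, which makes the difference easy to see:

# loc slices by label and includes both endpoints: 3 rows
print(df.loc[1:3, ['customerID', 'gender']].shape)   # (3, 2)

# iloc slices by position and excludes the end: 2 rows
print(df.iloc[1:3, [0, 1]].shape)                    # (2, 2)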

4.3 Checking the data types of each column:

df.dtypes

4.4 Checking the overview of your dataset:

The .info() method in Pandas is a quick and easy way to get an overview of your dataset. It provides essential details such as the number of rows and columns, which gives you a rough idea of how much data you're dealing with. It also reports the number of non-null values and the data type of each column, which gives you an idea of how complete your dataset is.

df.info()

4.5 Summarization operations

# Check the summary statistics of the numerical columns
df.describe()

# Check the shape of the dataset (number of rows and columns)
df.shape

# Count occurrences of each unique value in a column
df['TotalCharges'].value_counts()

# Checking the unique values in a column
df['Contract'].unique()

# Get Sum of numerical field
df['MonthlyCharges'].sum()

# Get Minimum of numerical field
df['MonthlyCharges'].min()

# Get Maximum value of numerical field
df['MonthlyCharges'].max()

# Get Mean of numerical field
df['MonthlyCharges'].mean()

# Get Median of numerical field
df['MonthlyCharges'].median()

# Get Quantile of numerical field
df['MonthlyCharges'].quantile([0.25, 0.75])

# Count non-null values in each column
df.count()

# Get Standard Deviation of each numerical field
df.std(numeric_only=True)

# Get Variance of each numerical field
df.var(numeric_only=True)

4.6 Handling missing data

When dealing with nulls in data, you generally have two options.

  • The first option is to remove any rows or columns that contain null values. This can be a good choice if you have a lot of nulls and the rows or columns they are in aren’t critical to your analysis.
  • The second option is to replace the null values with non-null values. This technique is called imputation and it can help you retain more data for your analysis. Imputation can be especially useful if you have a lot of missing values and removing them would leave you with too little data to work with.

4.6.1 Checking each column in DataFrame for missing values:

df.isnull()

4.6.2 Checking the number of missing values in each column:

df.isnull().sum()

4.6.3 Dropping columns with missing values (Python’s None and NumPy’s np.nan):

df.dropna(axis=1, inplace=True)

axis=1 drops columns; axis=0 (the default) drops rows.

Note: When you set the parameter inplace to True, the changes you make to the DataFrame will happen directly to the original DataFrame. In other words, you won't get a new DataFrame returned to you after the operation. Instead, the original DataFrame will be updated with the changes that you made.

On the other hand, if you leave the inplace parameter as False, which is the default, the operation will return a new copy of the DataFrame that includes the changes you made. However, this new DataFrame does not replace the original DataFrame, so you will need to assign it to a new variable if you want to keep the changes.
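A small sketch of the difference, using dropna as the example:

# inplace=False (the default): df itself is unchanged, so capture the result
df_clean = df.dropna(axis=1)

# inplace=True: df is updated directly and the call returns None
df.dropna(axis=1, inplace=True)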

4.6.4 Replacing missing values with value (mean, median, mode, etc):

We’ll use example DataFrames since there are no missing values in our dataset.

import pandas as pd
import numpy as np

# create a sample dataframe with missing values
df = pd.DataFrame({'A': [1, 2, np.nan, 4],
                   'B': [5, np.nan, 7, 8],
                   'C': [np.nan, 10, 11, 12]})

# replace missing values with column mean
df.fillna(df.mean(), inplace=True)

# display the dataframe with missing values replaced
print(df)

The replacement value could be the mean, median, or mode of the column, depending on the type of data and its distribution (more on this in later articles).
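Applied to a DataFrame that still contains missing values, the same fillna pattern works with the median or the mode; here is a quick sketch reusing the column names from the example above:

# replace missing values in a column with its median
df['A'] = df['A'].fillna(df['A'].median())

# replace missing values in a column with its mode (the most frequent value)
df['B'] = df['B'].fillna(df['B'].mode()[0])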

4.7 Reshaping Data (Renaming, sorting, re-indexing)

4.7.1 Renaming columns:

df.rename(columns={'PaymentMethod': 'NewPaymentMethod'}, inplace=True)

4.7.2 Sorting the DataFrame by a column:

df.sort_values('MonthlyCharges', ascending=False) #ascending=True is default

4.7.3 Setting and Resetting index of DataFrame:

Set a new index on the DataFrame

df.index = df['customerID']

Reset the index of DataFrame

df.reset_index(drop=True, inplace=True)

4.7.4 Drop column of DataFrame:

df.drop('customerID', axis=1, inplace=True)

4.8 Grouping the DataFrame by a column and aggregating values:

df.groupby('gender').agg({'tenure': 'mean'})
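The agg dictionary can hold several columns and functions at once. For example, using columns from the churn dataset:

# average tenure and average monthly charges for each contract type
df.groupby('Contract').agg({'tenure': 'mean', 'MonthlyCharges': 'mean'})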

4.9 Adding a new column to the DataFrame:

df['NewCustomerID'] = df['customerID'] + '-' + df['gender']

4.10 Combining DataSets

Let’s say we have two example DataFrames, df1 and df2, with a common column key that we want to merge on:

import pandas as pd

# Create first dataframe
df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
                    'value': [1, 2, 3, 4]})

# Create second dataframe
df2 = pd.DataFrame({'key': ['B', 'D', 'E', 'F'],
                    'value': [5, 6, 7, 8]})

# Merge df1 and df2 on the 'key' column
merged_df = pd.merge(df1, df2, on='key')

This merges df1 and df2 on the 'key' column, performing an inner join by default, so the resulting DataFrame merged_df contains only the rows where the 'key' value appears in both DataFrames. The result has two 'value' columns (one from each original DataFrame), which Pandas distinguishes with the suffixes _x and _y by default; you can choose your own names with the suffixes parameter of the merge function.
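A short sketch of the same merge with explicit suffixes, so the two 'value' columns get readable names:

# give the overlapping columns from each DataFrame distinct suffixes
merged_df = pd.merge(df1, df2, on='key', suffixes=('_left', '_right'))

print(merged_df)
#   key  value_left  value_right
# 0   B           2            5
# 1   D           4            6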

C. Alternatives to Pandas

  1. NumPy: NumPy is a fundamental library for scientific computing in Python. It provides a powerful N-dimensional array object, along with functions for working with these arrays.
  2. Dask: Dask is a parallel computing library that provides advanced parallelism for analytics in Python. It is designed to scale from single machines to clusters of thousands of machines.
  3. Apache Arrow: Apache Arrow is an in-memory data structure that can be used for efficient data interchange between different programming languages. It provides a standard for representing data in memory, making it easier to move data between different systems.
  4. Modin: Modin is a distributed DataFrame library that provides a Pandas-like API for working with large datasets. It can automatically scale Pandas operations to use multiple cores or nodes.
  5. Vaex: Vaex is a Python library for lazy Out-of-Core DataFrames that works well with huge datasets. It can handle datasets that are too large to fit into memory by using memory mapping and other techniques.

Conclusion

When you’re learning a programming library like Pandas, it can be challenging to remember all of the different operations and syntax rules. However, working with real-world datasets and projects can make the learning process much more efficient and effective. By repeatedly practicing with different datasets and use cases, you’ll start to develop muscle memory and gain more confidence in using these tools effectively.

Happy Learning !!

If you found this article helpful, please consider clapping and sharing it with others who might benefit. Your support can help this content reach the relevant audience and provide them with valuable insights and knowledge🙂
