Deep Dive in Machine Learning with Python

Part — IX: Advanced PANDAS operations

Rajesh Sharma
Analytics Vidhya
5 min read · Jan 2, 2020


Thanks to Kung Fu Panda:)

Welcome to the ninth blog of Deep Dive in Machine Learning with Python. In the last blog, we covered how to perform several Pandas operations using the Autism Spectrum Disorder dataset. In today’s blog, we will understand how to work with various advanced Pandas functions.

In today’s blog as well, I’ll continue with my interest in medical data and use the popular Heart Disease Dataset, which is available on the UCI Machine Learning Repository and was donated by:

1. Hungarian Institute of Cardiology, Budapest: Andras Janosi, M.D.
2. University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
3. University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
4. V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.

Similar to my previous blogs, I’ll also share a bonus tip at the end of this blog.

So, let’s get started.

Download the dataset

We will be using the Cleveland dataset, so download the file processed.cleveland.data from the UCI ML repository link. Brief information on the dataset features is also provided in the file heart-disease.names.

Import the dataset

Import the data file in Pandas DataFrame
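A minimal sketch of the import step. The data file has no header row, so header=None is passed to read_csv. To keep the sketch self-contained, a few illustrative rows in the same comma-separated layout are read from an in-memory buffer; with the downloaded file you would pass its path instead.

```python
import io

import pandas as pd

# Illustrative sample rows in the header-less, comma-separated layout of the
# data file (stand-in values, used only to keep this sketch self-contained).
sample_text = (
    "63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0\n"
    "67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2\n"
    "52.0,1.0,3.0,172.0,199.0,1.0,0.0,162.0,0.0,0.5,1.0,?,7.0,0\n"
)

# header=None stops pandas from treating the first record as column names.
# With the real file: pd.read_csv("processed.cleveland.data", header=None)
cleveland_df = pd.read_csv(io.StringIO(sample_text), header=None)

print(cleveland_df.shape)
print(cleveland_df.head())
```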

Here, the dataset is imported as a Pandas DataFrame, but its columns are only numbered. Next, add the column names to the DataFrame.

Adding column names to the DataFrame
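One way to attach the names is to assign a list to the columns attribute. The names below are an assumption: they are upper-cased versions of the 14 attributes listed in heart-disease.names, and the notebook may use different labels.

```python
import pandas as pd

# Assumed names: upper-cased versions of the 14 attributes in heart-disease.names.
column_names = [
    "AGE", "SEX", "CP", "TRESTBPS", "CHOL", "FBS", "RESTECG",
    "THALACH", "EXANG", "OLDPEAK", "SLOPE", "CA", "THAL", "NUM",
]

# Stand-in for the header-less DataFrame produced by read_csv (columns 0..13).
cleveland_df = pd.DataFrame(
    [[63.0, 1.0, 1.0, 145.0, 233.0, 1.0, 2.0, 150.0, 0.0, 2.3, 3.0, "0.0", "6.0", 0]]
)

# Assigning to .columns renames all columns in one step.
cleveland_df.columns = column_names
print(cleveland_df.head())
```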

Problem-1: How to print the dataframe rows by using ‘for’ loop and ‘iterrows()’ method?

Solution-1
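A minimal sketch of such a loop, assuming a small stand-in DataFrame with two of the dataset's columns:

```python
import pandas as pd

# Small stand-in for cleveland_df (values are illustrative).
cleveland_df = pd.DataFrame(
    {"AGE": [63.0, 67.0, 52.0], "CHOL": [233.0, 286.0, 199.0]}
)

rows_seen = []
# iterrows() yields one (index label, row as Series) pair per row.
for row in cleveland_df.head(2).iterrows():
    print(row)
    rows_seen.append(row)
```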

In the above example, we got the first two rows of the dataframe. Each item yielded by iterrows() is a tuple of the form (index, row), where the row is a Pandas Series.

Problem-2: How to view the rows where a missing value is recorded as ‘?’ in any of the columns?

CASE-I

Solution-2.1
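As a sketch, one explicit way is to loop over the columns and check each one for the ‘?’ marker. The stand-in DataFrame below is an assumption that mimics the ‘CA’ and ‘THAL’ columns.

```python
import pandas as pd

# Stand-in data: '?' marks a missing value, as in the Cleveland file.
cleveland_df = pd.DataFrame(
    {
        "AGE": [63.0, 67.0, 52.0],
        "CA": ["0.0", "?", "1.0"],
        "THAL": ["6.0", "3.0", "?"],
    }
)

columns_with_missing = []
for column in cleveland_df.columns:
    # Element-wise comparison; any() is True if at least one cell equals '?'.
    if (cleveland_df[column] == "?").any():
        columns_with_missing.append(column)

print(columns_with_missing)
```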

In the above example, we found that ‘CA’ and ‘THAL’ are the two columns having missing values.

CASE-II

Solution-2.2
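A shorter variant, sketched on the same kind of stand-in data, uses DataFrame.isin to test every cell at once:

```python
import pandas as pd

# Stand-in data: '?' marks a missing value.
cleveland_df = pd.DataFrame(
    {
        "AGE": [63.0, 67.0, 52.0],
        "CA": ["0.0", "?", "1.0"],
        "THAL": ["6.0", "3.0", "?"],
    }
)

# isin() builds a boolean DataFrame; any() reduces it to one flag per column.
has_question_mark = cleveland_df.isin(["?"]).any()
columns_with_missing = has_question_mark[has_question_mark].index.tolist()

print(has_question_mark)
```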

That’s a much simpler way of finding the columns that contain the value ‘?’.

CASE-III

Solution-2.3
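A sketch of filtering with a boolean mask on the ‘CA’ column (stand-in data, illustrative values):

```python
import pandas as pd

cleveland_df = pd.DataFrame(
    {
        "AGE": [63.0, 67.0, 52.0],
        "CA": ["0.0", "?", "1.0"],
        "THAL": ["6.0", "3.0", "?"],
    }
)

# The comparison yields a boolean Series; indexing with it keeps matching rows.
missing_ca_rows = cleveland_df[cleveland_df["CA"] == "?"]
print(missing_ca_rows)
```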

In the above example, we got the records with missing values in the ‘CA’ feature.

CASE-IV

Solution-2.4
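The same idea applied to ‘THAL’, sketched here with .loc so that the row filter is explicit:

```python
import pandas as pd

cleveland_df = pd.DataFrame(
    {
        "AGE": [63.0, 67.0, 52.0],
        "CA": ["0.0", "?", "1.0"],
        "THAL": ["6.0", "3.0", "?"],
    }
)

# .loc with a boolean mask selects the rows where THAL holds the '?' marker.
missing_thal_rows = cleveland_df.loc[cleveland_df["THAL"] == "?"]
print(missing_thal_rows)
```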

In the above example, we got the records with missing values in the ‘THAL’ feature.

Problem-3: How to replace the missing values ‘?’ with None in the DataFrame?

Solution-3
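A minimal sketch of the replacement using an element-wise function. The helper name question_mark_to_none is hypothetical. Because newer pandas releases expose the element-wise method as DataFrame.map, the sketch picks whichever name the installed version provides.

```python
import pandas as pd

cleveland_df = pd.DataFrame(
    {
        "AGE": [63.0, 67.0, 52.0],
        "CA": ["0.0", "?", "1.0"],
        "THAL": ["6.0", "3.0", "?"],
    }
)


def question_mark_to_none(value):
    """Return None for the '?' marker and the value itself otherwise."""
    return None if value == "?" else value


subset = cleveland_df[["CA", "THAL"]]
# Use DataFrame.map where it exists, otherwise fall back to applymap.
elementwise = subset.map if hasattr(subset, "map") else subset.applymap
cleveland_df[["CA", "THAL"]] = elementwise(question_mark_to_none)

# Count the missing values per column after the replacement.
print(cleveland_df.isna().sum())
```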

So, in the above example, we replaced the ‘?’ with None in ‘CA’ and ‘THAL’ columns.

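The DataFrame.applymap() method applies a function to every element of the DataFrame and returns a DataFrame of the same shape holding one result per element. In pandas 2.1 and later, the same operation is available as DataFrame.map(), and applymap() is deprecated.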

Problem-4: How to perform the SQL UNION operation in the Pandas DataFrame?

CASE-I: Performing UNION at the row-level

Solution-4.1
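A sketch of a row-level union with pd.concat, assuming two slices of a small stand-in DataFrame. Note that concat keeps duplicate rows, which matches SQL UNION ALL; adding drop_duplicates() gives the distinct rows that plain UNION returns.

```python
import pandas as pd

cleveland_df = pd.DataFrame(
    {"AGE": [63.0, 67.0, 67.0, 37.0], "CHOL": [233.0, 286.0, 229.0, 250.0]}
)

first_part = cleveland_df.iloc[:2]    # rows 0 and 1
second_part = cleveland_df.iloc[1:4]  # rows 1, 2 and 3 (row 1 overlaps)

# Stack the rows; ignore_index=True renumbers the result from 0.
union_all = pd.concat([first_part, second_part], ignore_index=True)

# Dropping duplicates leaves each distinct row once.
union_distinct = union_all.drop_duplicates().reset_index(drop=True)

print(union_all)
print(union_distinct)
```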

In the above example, we took the top five records of cleveland_df as two separate DataFrames and combined them, row by row, into a single output DataFrame.

CASE-II: Performing the UNION at the column level

Solution-4.2
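A sketch of the column-level case with pd.concat and axis=1. The two stand-in DataFrames have different shapes; rows are aligned on the index, and cells without a match are filled with NaN.

```python
import pandas as pd

left_df = pd.DataFrame({"AGE": [63.0, 67.0, 52.0], "SEX": [1.0, 1.0, 1.0]})
right_df = pd.DataFrame({"CHOL": [233.0, 286.0]})  # one row fewer

# axis=1 places the DataFrames side by side, aligned on the row index.
combined_df = pd.concat([left_df, right_df], axis=1)
print(combined_df)
```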

Here, in the above example, two DataFrames with different dimensions (i.e. different numbers of rows and columns) were combined into a single DataFrame at the column level.

Problem-5: How to build the frequency table for certain categories in the records?

Solution-5
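A sketch of a frequency table built with pd.crosstab. The column names chest_pain and blockage and their category labels are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical categorical columns (names and labels are illustrative).
heart_df = pd.DataFrame(
    {
        "chest_pain": ["typical", "asymptomatic", "asymptomatic", "non-anginal"],
        "blockage": ["no", "yes", "yes", "no"],
    }
)

# Rows: chest pain type; columns: blockage category; cells: record counts.
frequency_table = pd.crosstab(heart_df["chest_pain"], heart_df["blockage"])
print(frequency_table)
```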

Here, in the above example, the crosstab function built the cross-tabulation table having the frequency of certain Chest Pain types across the Blockage Category.

Problem-6: How to join the two DataFrames?

Dataset-1 and Dataset-2
INNER and LEFT JOIN
RIGHT and OUTER JOIN
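The four join types can be sketched with pd.merge and its how parameter. The two small DataFrames and the patient_id key below are hypothetical.

```python
import pandas as pd

# Two hypothetical datasets sharing the key column patient_id.
patients_df = pd.DataFrame({"patient_id": [1, 2, 3], "AGE": [63, 67, 52]})
labs_df = pd.DataFrame({"patient_id": [2, 3, 4], "CHOL": [286, 199, 250]})

joined = {}
for how in ["inner", "left", "right", "outer"]:
    # how controls which keys survive: both sides, left only, right only, or all.
    joined[how] = pd.merge(patients_df, labs_df, on="patient_id", how=how)
    print(how.upper())
    print(joined[how])
```

The inner join keeps only the keys present in both DataFrames, the left and right joins keep every key from one side, and the outer join keeps every key from both.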

Problem-7: How to add a new column ‘FIX_CHOL’ in the dataframe by using column ‘CHOL’?

Solution-7
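A sketch of deriving a column with Series.apply. The condition is an assumption chosen for illustration: values above a hypothetical threshold CHOL_CAP are replaced by the threshold, and all other values are copied unchanged.

```python
import pandas as pd

cleveland_df = pd.DataFrame({"CHOL": [233.0, 286.0, 564.0]})

CHOL_CAP = 300.0  # hypothetical threshold, not taken from the dataset docs


def cap_chol(value):
    """Return the value itself, or CHOL_CAP when the value exceeds it."""
    return CHOL_CAP if value > CHOL_CAP else value


# apply() calls cap_chol once for every element of the CHOL column.
cleveland_df["FIX_CHOL"] = cleveland_df["CHOL"].apply(cap_chol)
print(cleveland_df)
```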

Here, in this example, we added a new column ‘FIX_CHOL’, which contains values derived from the ‘CHOL’ column based on a condition.

Series.apply() lets us pass a function and apply it to every element of a Pandas Series (that is, a column).

Problem-8: How to convert the innermost columns of the DataFrame to the row index?

CASE-I

Solution-8.1
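A sketch of stacking on a two-column stand-in DataFrame. The row labels p1 and p2 are hypothetical.

```python
import pandas as pd

heart_df = pd.DataFrame(
    {"AGE": [63.0, 67.0], "CHOL": [233.0, 286.0]}, index=["p1", "p2"]
)

# stack() moves the column labels into a new innermost level of the row index,
# so the result is a Series indexed by (row label, column label).
stacked = heart_df.stack()
print(stacked)
```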

Stacking a DataFrame means moving the innermost column index to become the innermost row index.

CASE-II: Unstack

Unstacking a DataFrame means moving the innermost row index to become the innermost column index; in other words, it is the inverse of stacking.

Solution-8.2
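A sketch of unstacking, starting from a stacked version of a stand-in DataFrame (row labels p1 and p2 are hypothetical):

```python
import pandas as pd

heart_df = pd.DataFrame(
    {"AGE": [63.0, 67.0], "CHOL": [233.0, 286.0]}, index=["p1", "p2"]
)
stacked = heart_df.stack()

# unstack() moves the innermost row-index level back to the columns.
restored = stacked.unstack()
print(restored)
```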

Problem-9: How to convert the categorical variables into dummy/indicator variables?

Solution-9
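A sketch with pd.get_dummies on a stand-in ‘CA’ column. The prefix argument is optional and is used here only to make the new column names easy to read.

```python
import pandas as pd

cleveland_df = pd.DataFrame(
    {"AGE": [63.0, 67.0, 52.0], "CA": ["0.0", "3.0", "0.0"]}
)

# One indicator column is created per distinct category found in CA.
ca_dummies = pd.get_dummies(cleveland_df["CA"], prefix="CA")
print(ca_dummies)
```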

Here, in this example, we converted the ‘CA’ feature into indicator variables, one per category, by using the get_dummies function.

Problem-10: How to add new columns in the DataFrame having values of columns ‘chest_pain’ and ‘exc_angina’?

Solution-10
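A sketch of the melt call on a small stand-in DataFrame. The category labels are illustrative, and keeping AGE as the identifier column is an assumption.

```python
import pandas as pd

heart_df = pd.DataFrame(
    {
        "AGE": [63.0, 67.0],
        "chest_pain": ["typical", "asymptomatic"],
        "exc_angina": ["no", "yes"],
    }
)

# Each (row, value column) pair becomes one row in the long format:
# Angina holds the source column name, AnginaType holds its value.
long_df = heart_df.melt(
    id_vars=["AGE"],
    value_vars=["chest_pain", "exc_angina"],
    var_name="Angina",
    value_name="AnginaType",
)
print(long_df)
```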

Here, in the above example, we used the melt function to add two new columns: Angina, which contains the name of the angina form (either chest pain or exercise-induced angina), and AnginaType, which stores the corresponding angina type.

The melt function in pandas unpivots a DataFrame from wide format to long format, optionally leaving identifier variables set.

Bonus Tip

1. Loops v/s Apply method

Using Loops

Using the Apply method
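A sketch of how the two approaches can be timed against each other. The data, the flag_high helper, and the 240 threshold are all hypothetical; the measured times depend on the machine and the pandas version, so no timings are quoted here.

```python
import time

import pandas as pd

# Synthetic cholesterol values, large enough for the timing to be visible.
cleveland_df = pd.DataFrame({"CHOL": [float(200 + i % 150) for i in range(5000)]})


def flag_high(value):
    """Hypothetical rule: flag values above 240."""
    return value > 240


# Using a loop: visit the rows one by one with iterrows().
start = time.perf_counter()
loop_result = []
for _, row in cleveland_df.iterrows():
    loop_result.append(flag_high(row["CHOL"]))
loop_seconds = time.perf_counter() - start

# Using apply: pass the function once and let pandas walk the column.
start = time.perf_counter()
apply_result = cleveland_df["CHOL"].apply(flag_high)
apply_seconds = time.perf_counter() - start

print(f"loop:  {loop_seconds:.4f} s")
print(f"apply: {apply_seconds:.4f} s")
```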

This shows us that explicit loops are inefficient and take more time than the apply method. So, prefer apply (or a built-in vectorized Pandas operation, where one exists) over row-by-row loops when manipulating a DataFrame.

Congratulations, we have come to the end of this blog. To summarize, we covered some advanced Pandas operations. From the next blog, we will start visualizing the data by using Pandas, Matplotlib and Seaborn.

If you want to download the Jupyter Notebook of this blog, you can find it in the GitHub repository below:

https://github.com/Rajesh-ML-Engg/Deep_Dive_in_ML_Python

Thank you and happy learning!!!!

Blog-10: Data Visualization using Pandas & Matplotlib
