Complete guide to Pandas library (Python Code) — Part 2/4

Vijay Yadav
Published in Analytics Vidhya
Jul 18, 2021 · 9 min read

We learnt a lot of fundamentals in part one. You now know how to subset a DataFrame using multiple conditions, filters and selections.
We also covered how to handle missing values, how to identify the reasons behind them, and how to summarize datasets using multiple Pandas functions.

This part will focus on the Data Wrangling capabilities of the Pandas library.

Photo by Firmbee.com on Unsplash

Index

  • Work with different data types
  • Reshaping DataFrame objects
  • Pivoting
  • Data aggregation and GroupBy operations

For the next two articles, we will be using the Stack Overflow Developer Survey 2020 dataset. This dataset has nearly 65,000 responses from over 180 countries. The goal of the survey is to cover all aspects of the developer experience, from career satisfaction and job search to education and opinions on open source software.

Dataset : https://insights.stackoverflow.com/survey

Load the dataset and basic data understanding.

import pandas as pd

data = pd.read_csv("survey_results_public.csv")
data.head()
First Glance of the Dataset

The first goal of your EDA process should always be to understand and verify the authenticity of every attribute in your dataset.
Before starting anything, check the few basics mentioned below.

Assuming the dataset is loaded as df:

df.shape                        # Shape of the whole dataset
df.dtypes                       # Data type of every attribute
df.describe()                   # Statistical summary of numerical columns
df.describe(include='object')   # The same summary for object columns
df.isnull().sum()               # Number of null values per column
df['column name'].unique()      # Unique values of a categorical column

Work with different data types

Photo by Deon Black on Unsplash

After performing the steps above you will have a good overview of your dataset. Now let’s say you find attributes that are supposed to be in object or categorical format but are stored as integers, or vice versa; or a column that is supposed to be a date is stored as an object. We need to fix this before going ahead with our analysis, as this is a very important step to make sure our dataset is telling the correct story.

# Converting numeric datatype to Object
data['Respondent'] = data['Respondent'].astype('object')

# Converting object datatype to Numeric
data['Age'] = pd.to_numeric(data.Age)

# Converting object datatype to Date format
new_data['date'] = pd.to_datetime(new_data['date'])
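One caveat worth noting: pd.to_numeric() raises an error by default if a column contains values it cannot parse. A minimal sketch of a more forgiving conversion, reusing the same Age column as above:

# errors='coerce' turns unparseable entries into NaN instead of raising
data['Age'] = pd.to_numeric(data['Age'], errors='coerce')

# Always verify the result of a conversion before moving on
print(data.dtypes)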

Reshaping and Pivot tables

Now, reshaping a dataset simply means restructuring it into the format that best suits your requirement. For example, if you want to create a time series plot, you need a date column as your index, and the value to be plotted has to be in a column against each date.

For such a requirement you don’t even need the rest of the columns in the dataset, so you can either drop them or simply create a separate dataset with only the required columns, as sketched below.
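A quick sketch of both options (the column names here are just illustrative):

# Option 1: drop the columns you don't need (returns a new dataframe)
df = df.drop(columns=['unused_col_1', 'unused_col_2'])

# Option 2: keep only the columns you do need
df = df[['date', 'value']]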

Now, to reshape the dataset you can use several Pandas functions, namely pivot(), pivot_table(), melt(), stack() and unstack(), depending on the manipulation you want to perform.
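stack() and unstack() don’t get their own demo below, so here is a minimal toy sketch of what they do: stack() moves the column labels into an inner level of the row index (wide to long), and unstack() reverses it.

import pandas as pd

wide = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}, index=['x', 'y'])

long = wide.stack()   # column labels 'a'/'b' become an inner index level
long.unstack()        # round-trips back to the original wide shape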

Let’s see this in action. For this we will use a separate dataset, just for convenience.

Now, as you can see, this dataset has more numerical columns but no date attribute, so we can create one just for demonstration purposes.

new_data['date'] = pd.date_range(start='2020/01/01', end='2020/10/29', freq='D')

# pd.date_range() -- This function creates a date range.
# start   -- Specifies the start of the date range.
# end     -- Specifies the end of the date range.
# freq    -- 'D' creates daily dates. Other options are W, M, Y.
# periods -- Specifies how many dates to generate (12 for a year of months).
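As a side note, periods can replace end when you already know how many dates you want; a small sketch:

# 12 month-end dates starting from January 2020 -- no end date needed
pd.date_range(start='2020/01/01', periods=12, freq='M')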

Creating a Pivot

new_data.pivot(index='date', columns='cp', values='age').head()

When you create a pivot, index, columns and values are the three important parameters that need to be specified, and the data will be reshaped accordingly. Note that the values parameter is the one that will be plotted against these row and column indexes.

Important: remember that if the values parameter is omitted and there are multiple remaining columns in the dataset, the resulting pivoted dataframe will take all those columns and create a hierarchical column index.

In the example below, no values parameter is provided, hence pivot creates a multi-index dataframe with all the other columns in the dataset.
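That example would look something like the following sketch, reusing the same index and columns as above:

# No values parameter: all remaining columns are kept, producing
# hierarchical (multi-index) columns like ('age', 0), ('sex', 0), ...
new_data.pivot(index='date', columns='cp').head()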

There are many NaN values in the resulting dataset, and we can fill them with any value using the fillna() method.

new_data.pivot(index='date', columns='cp', values=['age', 'sex', 'chol']).fillna(0).head()

pivot() will throw ValueError: Index contains duplicate entries, cannot reshape if the index/column pair is not unique. In this case, consider using pivot_table(), which is a generalization of pivot that can handle duplicate values for one index/column pair.

new_data.pivot_table(index='age', columns='sex', values=['chol', 'thall', 'oldpeak']).head()

melt()

melt() is another function that can be used to reshape the dataset in a similar way, and it takes the parameters shown below.

new_data[['age', 'sex', 'cp']].melt(id_vars=['age', 'sex'], var_name='Rest of the column').head()

Important: in melt(), id_vars is the parameter that will be treated as the index values for reshaping, the same as index and columns in pivot(), and value_vars is the same as the values parameter.

However, in melt(), if you do not provide the value_vars parameter, it will stack every remaining column in the dataset vertically, as in the sketch below. (Try to plot it once.)
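A minimal sketch of that behaviour:

# Without value_vars, every non-id column is stacked vertically
# into a single 'variable'/'value' pair of columns
new_data.melt(id_vars=['age', 'sex']).head()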

Data aggregation and GroupBy operations

Photo by Chang Duong on Unsplash

This section of the article is the most interesting part; there is absolutely no limit to how much can be done using the groupby() and agg() functions, especially by combining them together.

new_data.groupby(["sex" , "cp"]).count()

Here, we used the groupby() function to group a large amount of data, and now we can apply various functions to each group.
A groupby() operation involves some combination of splitting the object, applying a function, and combining the results. The output is a GroupBy object that contains information about the groups.
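Before applying any function, you can inspect the GroupBy object itself; a quick sketch:

grouped = new_data.groupby(['sex', 'cp'])

print(grouped.ngroups)       # number of groups formed
print(grouped.size())        # row count per (sex, cp) combination
grouped.get_group((0, 3))    # pull out one group as a dataframe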

Functions like max(), min(), mean(), first() and last() can be quickly applied to the GroupBy object.

new_data.groupby(["sex" , "cp"]).max()

What you see here as output is basically the maximum value of each attribute when grouped by the sex and cp columns. This can be interpreted as:

  • the maximum age of a female (sex = 0) having low chest pain (cp = 0) is 71
  • the maximum age of a female (sex = 0) having high chest pain (cp = 3) is 69

Similarly, you can use the rest of the functions and get these meaningful insights just as quickly.

You can choose which columns to show in the output by applying filters after the groupby clause; let’s select only the age and chol columns and display their mean values.
This is useful in case you want to see the aggregated values for specific columns only and not all of them.

new_data.groupby(["sex", "cp"])[['age', 'chol']].mean()

Suppose you want to find all these values for only female patients, or only for people with a high chest pain value of 3 (maybe to prioritize their medication). You can do so by filtering on a column first, to apply the groupby functions to a specific group only and not the whole dataset.

new_data[new_data['cp'] == 3].groupby(['sex', 'thall']).max()
new_data[new_data['sex'] == 0].groupby(['sex', 'thall']).max()

Also, in case you don’t want to keep the columns used for groupby as the index, and want to treat them like regular columns in the dataframe (say you are done manipulating the dataset with groupby functions and now want to export it as a final dataset), you can pass the additional parameter as_index=False, and they will no longer be index values but columns in the dataframe. Very cool, isn’t it?

new_data[new_data['cp'] == 3].groupby(['sex', 'thall'], as_index=False).max()

These are only a few of the ways in which you can use groupby, and I can’t emphasize enough how powerful this function is for all-purpose analysis.

Now, let’s talk about the aggregation function, which is just as amazing as groupby; combined, they are even more powerful.
The aggregation function agg() is used to run multiple computation tasks at once. You can also pass a list of functions to be applied to a particular column, and it will calculate all the respective values, returning a multi-level index dataframe.

new_data.groupby(['sex', 'cp']).agg(max)

This is the simplest way to use agg(): pass a function to carry out some operation, and it will apply that function to every column and output the computed values in a dataframe. But wait, isn’t that the same thing as specifying the max() function directly, like in the last example? Why do we need the agg() function?

This is where the agg() function really shines and shows us its true power: you can use key:value pairs to carry out different operations across the dataset at once. Here key:value actually means column:function,

import numpy as np

new_data.groupby(['sex', 'cp']).agg({'age': max, 'trtbps': np.mean})

In the example above, we were able to compute two different values for two different columns at the same time; you can imagine how powerful this is in terms of generating insights. The agg() function takes a dict or list of key:value pairs to operate on the grouped values.

You can literally create your own dict or list mapping different functions to different columns, pass it to the agg() function, and it will return a dataframe with all the computed values.

agg_function = {'age': [max, min, sum], 'trtbps': min}
new_data.groupby(['sex', 'cp']).agg(agg_function)

As you can see, I passed a list of functions to be applied to the column age, and one function to be applied to trtbps. The resulting dataframe outputs the values as expected, and the fact that this can be used in so many analysis situations to carry out complicated logic with so much ease is exciting.

Conclusion

Photo by Alexander Schimmeck on Unsplash

So yeah, there you go: all of the most used features and functions for data wrangling with Pandas in Python, which you will definitely need in your EDA process.

  • We covered working with different datatypes, and how to handle them or change their datatype as required.
  • Reshaping the dataset using the melt() and pivot() functions, and understanding how pivot and pivot_table can be used in different situations.
  • We looked into how to use the groupby() and agg() functions in the best way possible and how to make the most of them.
  • There is of course a lot more you can do with them, but once you get the concept and understand how each function works, you can use them in numerous ways.

Next article in this series is Plotting and Visualization using Pandas.

Other articles

  • Part 1 and Part 3 are linked here, do check them out in case you haven’t already.
