A Complete Guide and Mastery of Pandas Library

PANDAS COMPLETE GUIDE (A-Z)

Nikhil pentapalli
Published in Analytics Vidhya
12 min read · May 2, 2020

Master pandas and become the Dragon Warrior in the world of data.

Why Learn Pandas?

Pandas is an essential tool for anyone who wants to be a data scientist. Master it and you can be the person who provides incredible insights from data, influences managers to make critical decisions, and adds value for business heads.

There are plenty of amazing techniques that come in handy in the pandas library. Also, if you are interested in machine learning, then data pre-processing is an important step in building a model. This step requires in-depth knowledge of the pandas library, which is used in pre-processing the data, since good data results in a much more accurate machine learning model.

The contents of this article are shown below so they can be accessed directly with just a click. This is not just theoretical learning: an implementation of the entire code, from beginner to advanced level, is attached at the end of the article and can be accessed “here”.

Bookmark this page, so that the answers to your doubts are just a click away. Navigate through the contents below whenever required.

CONTENTS:

Chapter 1:

Chapter 2:

Chapter 3:

Once you have mastered the above contents, you can also access the following tricks, which I will keep adding to. Do visit the page to learn new tricks.

BONUS Chapter(Tricks and Mastering the Pandas):

  • Assigning ‘=’ between two DataFrames, and the advantages of using copy() instead
  • Filtering the data using conditions
    ->Specifying select conditions like if and else
    ->Selecting specific dtypes
    ->Using value_counts() to get specific stats
  • Replacing a specific value with NaN (replace ‘ ’ with NaN, etc.)
  • Checking for NaN values in the DataFrame
  • Replacing or filling NaN values (with 0, the mean, or the ffill method)
  • In a numerical DataFrame, getting row totals and column totals
  • Pivot tables using pandas

What is Pandas?

Pandas is a high-level data manipulation tool developed by Wes McKinney. It is built on the NumPy package, and its key data structure is called the DataFrame. DataFrames allow you to store and manipulate tabular data in rows of observations and columns of variables. The data may be converted into different data types and can be analysed using the many built-in functions of the pandas library.
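To make the "rows of observations, columns of variables" idea concrete, here is a minimal sketch (with made-up data, not the article's dataset) that builds a DataFrame from a plain dictionary:

```python
import pandas as pd

# Each dictionary key becomes a column (variable);
# each position across the lists becomes a row (observation).
cars = pd.DataFrame({
    "company": ["audi", "bmw", "toyota"],
    "horsepower": [110, 182, 92],
})

print(cars.shape)   # (3, 2): 3 observations, 2 variables
print(cars.dtypes)  # pandas infers a dtype per column
```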

Enough of the intro; let us get into the content.

Chapter 1:

1) Import the pandas library and read a CSV or Excel file into a DataFrame.

import pandas as pd
df = pd.read_csv('Automobile_data.csv')
# if the file format is xlsx, use pd.read_excel
df.head(10)

df.head() gives the first 5 rows of the DataFrame as a sample to visualize. The count can be adjusted as required by passing a number into it; df.head(10) gives 10 rows, for example.

If the data does not contain column names, or if the values are separated by a separator like ‘|’ or ‘,’, pandas is able to handle such cases.

Reading the data with any separator and assigning column names, as shown:

df2 = pd.read_csv('Automobile_data_without_col.csv', sep='|', names=['company', 'body-style', 'wheel-base'])
df2.tail()

2) Let us understand some basic methods for analyzing the data.

df.columns
#Output:
Index(['index', 'company', 'body-style', 'wheel-base', 'length', 'engine-type',
       'num-of-cylinders', 'horsepower', 'average-mileage', 'price'],
      dtype='object')

The .info method gives us the following details of the DataFrame:

df.info()
#Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61 entries, 0 to 60
Data columns (total 10 columns):
index 61 non-null int64
company 61 non-null object
body-style 61 non-null object
wheel-base 61 non-null float64
length 61 non-null float64
engine-type 61 non-null object
num-of-cylinders 61 non-null object
horsepower 61 non-null int64
average-mileage 61 non-null int64
price 58 non-null float64
dtypes: float64(3), int64(3), object(4)
memory usage: 4.8+ KB

The .describe method gives us statistics of the data in the DataFrame, such as count, mean, standard deviation, etc.

df.describe()
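By default, describe() summarizes only the numeric columns. Passing include='all' (a standard parameter of describe) also covers object columns, reporting count, unique, top, and freq for them. A small sketch on made-up data:

```python
import pandas as pd

df_demo = pd.DataFrame({
    "company": ["audi", "bmw", "bmw"],
    "price": [17000.0, 21000.0, 25000.0],
})

# include='all' adds rows like 'unique' and 'top' for the object column,
# alongside the usual numeric statistics for 'price'.
stats = df_demo.describe(include="all")
print(stats)
```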

Chapter 2 :

Slicing the DataFrame

In the below example, slicing from index 1 to 4 gives us rows 1 to 3; the row with index 4 will not be selected.

#slicing of Dataframe
df[1:4]
#Accessing the data with the index value
df.loc[1]
#alternate method that is using iloc
df.iloc[1]
#output for both methods is shown below
The output for both loc and iloc of index 1

If you observe the outputs from loc and iloc, they are equal in this case. So what is the difference between loc and iloc? Let us get into detail.

loc vs iloc ?

loc gets rows (or columns) with particular labels from the index, while iloc gets rows (or columns) at particular positions in the index (so it only takes integers).
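A toy frame whose index labels do not match their positions makes the difference visible (this uses made-up data, not the automobile dataset):

```python
import pandas as pd

# Index labels are [7, 3, 1], so label 1 sits at position 2.
s = pd.DataFrame({"val": [10, 20, 30]}, index=[7, 3, 1])

print(s.loc[1])   # by LABEL 1  -> the last row (val 30)
print(s.iloc[1])  # by POSITION 1 -> the middle row (val 20)
```

With the default RangeIndex, labels and positions coincide, which is why loc[1] and iloc[1] agreed above.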

#accessing data in a specific column
df['company'].head()
#the above can be done using iloc as shown below (slicing)
df.iloc[0:5]['company']
#using index 1, getting the values from the specified columns
df.loc[[1], ['body-style', 'length']]

Setting the index to company so that loc can be properly understood.

df1 = df.set_index('company')
df1.head()
df1.index.values
#observe the output: there are many duplicate values. We will get only the unique ones in the next steps.

The code below means: go to the rows labeled alfa-romero and get the data from the given columns accordingly.

df1.loc[['alfa-romero'], ['body-style', 'length']]

Chapter 3:

cols = df.columns
print(cols)
#Output:
Index(['index', 'company', 'body-style', 'wheel-base', 'length', 'engine-type',
       'num-of-cylinders', 'horsepower', 'average-mileage', 'price'],
      dtype='object')

Let us now print all the names of the columns.

#print all column names in order
for i in cols:
    print(i)
#Output:
index
company
body-style
wheel-base
length
engine-type
num-of-cylinders
horsepower
average-mileage
price

Let us eliminate or drop some columns using the drop method (using head just to get the top 5 rows of the DataFrame).

df.drop(['num-of-cylinders', 'average-mileage'], axis=1).head()
df['company'].unique()
#Output:
array(['alfa-romero', 'audi', 'bmw', 'chevrolet', 'dodge', 'honda',
'isuzu', 'jaguar', 'mazda', 'mercedes-benz', 'mitsubishi',
'nissan', 'porsche', 'toyota', 'volkswagen', 'volvo'], dtype=object)

There you go! We got unique company names as promised above.

nunique gives the number of unique values (n stands for number).

df['company'].nunique()
#Output:
16

value_counts() Method:

This gives the count of each of the different values in the single column that is taken.

df['company'].value_counts()
#Output:
toyota 7
bmw 6
nissan 5
mazda 5
audi 4
mercedes-benz 4
mitsubishi 4
volkswagen 4
chevrolet 3
isuzu 3
alfa-romero 3
jaguar 3
porsche 3
honda 3
dodge 2
volvo 2
Name: company, dtype: int64

CONCAT Function in Pandas:

(Note: when appending rows, the names of the columns in the two DataFrames must be the same; otherwise it will throw an error.)

df3=df[3:8]
new_data=pd.concat([df,df3],axis=0)
df.info()
print('#######')
new_data.info()
#Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61 entries, 0 to 60
Data columns (total 10 columns):
index 61 non-null int64
company 61 non-null object
body-style 61 non-null object
wheel-base 61 non-null float64
length 61 non-null float64
engine-type 61 non-null object
num-of-cylinders 61 non-null object
horsepower 61 non-null int64
average-mileage 61 non-null int64
price 58 non-null float64
dtypes: float64(3), int64(3), object(4)
memory usage: 4.8+ KB
#######
<class 'pandas.core.frame.DataFrame'>
Int64Index: 66 entries, 0 to 7
Data columns (total 10 columns):
index 66 non-null int64
company 66 non-null object
body-style 66 non-null object
wheel-base 66 non-null float64
length 66 non-null float64
engine-type 66 non-null object
num-of-cylinders 66 non-null object
horsepower 66 non-null int64
average-mileage 66 non-null int64
price 63 non-null float64
dtypes: float64(3), int64(3), object(4)
memory usage: 5.7+ KB
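Notice the concatenated frame above reports "0 to 7": the duplicated rows carry their old index labels with them. A small sketch (on made-up data) showing the effect, and the standard ignore_index parameter of pd.concat that rebuilds a clean 0..n-1 index instead:

```python
import pandas as pd

a = pd.DataFrame({"company": ["audi", "bmw"], "price": [17000, 21000]})
b = pd.DataFrame({"company": ["toyota"], "price": [9000]})

# Stacking rows: both frames share the same column names.
stacked = pd.concat([a, b], axis=0)
print(stacked.index.tolist())   # [0, 1, 0] -- duplicate labels carried over

# ignore_index=True discards the old labels and renumbers from 0.
clean = pd.concat([a, b], axis=0, ignore_index=True)
print(clean.index.tolist())     # [0, 1, 2]
```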

Rename the columns

df4 = df3.rename(columns={'company': 'brand'})
df4

MERGE:

merged_df = df.merge(df4, how='left', left_on='company', right_on='brand')
merged_df.head()
#merged the dataframes on company and brand with a left join (which takes keys from the left dataframe, i.e. df)
#_x and _y are attached since some column names are the same in both dataframes
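Those _x and _y names come from merge's suffixes parameter, whose default is ('_x', '_y'); you can pass your own labels. A minimal sketch on made-up data:

```python
import pandas as pd

left = pd.DataFrame({"company": ["audi"], "price": [17000]})
right = pd.DataFrame({"brand": ["audi"], "price": [18000]})

# 'price' exists in both frames, so it gets a suffix on each side;
# here custom suffixes replace the default ('_x', '_y').
m = left.merge(right, how="left", left_on="company", right_on="brand",
               suffixes=("_old", "_new"))
print(m.columns.tolist())   # ['company', 'price_old', 'brand', 'price_new']
```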

JOIN:

Join is similar to merge but offers a lower level of control. It combines columns from two tables, with the common column names renamed using the given lsuffix and rsuffix. You can define the type of combination using ‘how’.

In the example, lsuffix and rsuffix are defined. You can also join DataFrames by passing the ‘on’ parameter, which takes the column name on which the join is to be performed.

how{‘left’, ‘right’, ‘outer’, ‘inner’}, default ‘left’

How to handle the operation of the two objects.

  • left: use calling frame’s index (or column if on is specified)
  • right: use other’s index.
  • outer: form union of calling frame’s index (or column if on is specified) with other’s index, and sort it lexicographically.
  • inner: form intersection of calling frame’s index (or column if on is specified) with other’s index, preserving the order of the calling’s one.
df_joined = df.join(df2, how='left', lsuffix='_x', rsuffix='_y')
df_joined.head()
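For the ‘on’ variant mentioned above: join matches the calling frame's column against the other frame's index, so the usual pattern is to first set the key column as the other frame's index. A sketch with made-up frames:

```python
import pandas as pd

orders = pd.DataFrame({"company": ["audi", "bmw"], "qty": [2, 1]})
info = pd.DataFrame({"company": ["audi", "bmw"], "country": ["DE", "DE"]})

# 'on' names a column of the CALLING frame; the other frame is
# looked up by its index, hence the set_index('company') first.
joined = orders.join(info.set_index("company"), on="company")
print(joined)
```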

GroupBy function in Pandas:

A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.


#Groupby demonstrated in simplest way possible
grouped = df.groupby('company')
grouped
#Output:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001A67A64A400>

This creates a groupby object, to which many functions can be applied, as demonstrated below.

grouped1 = df.groupby('body-style')
grouped1.groups
#Output:
{'convertible': Int64Index([0, 1, 46], dtype='int64'),
'hardtop': Int64Index([35, 45], dtype='int64'),
'hatchback': Int64Index([2, 13, 14, 16, 17, 27, 28, 29, 30, 36, 37, 47, 48, 49, 50], dtype='int64'),
'sedan': Int64Index([ 3, 4, 5, 7, 8, 9, 10, 11, 12, 15, 19, 20, 21, 22, 23, 24, 25,
26, 31, 32, 34, 38, 39, 40, 41, 42, 44, 55, 56, 57, 58, 59],
dtype='int64'),
'wagon': Int64Index([6, 18, 33, 43, 51, 52, 53, 54, 60], dtype='int64')}

Select a specific group:
By calling get_group with the name of the group, we can return the respective subset of the data.

grouped1.get_group('sedan').head()

Once we obtain a groupby object, we can apply the size function as shown:

grouped1.size()
#Output
body-style
convertible 3
hardtop 2
hatchback 15
sedan 32
wagon 9
dtype: int64

Another important function that can be applied is ‘mean’ of the groupby object, for a given column.

grouped['average-mileage'].mean()
## we get average mileage of car belonging to different companies
##Output:
company
alfa-romero 20.333333
audi 20.000000
bmw 19.000000
chevrolet 41.000000
dodge 31.000000
honda 26.333333
isuzu 33.333333
jaguar 14.333333
mazda 28.000000
mercedes-benz 18.000000
mitsubishi 29.500000
nissan 31.400000
porsche 17.000000
toyota 28.714286
volkswagen 31.750000
volvo 23.000000
Name: average-mileage, dtype: float64

Aggregate(.agg):

Aggregate is a way of aggregating groups. The .agg method can work with function names (i.e., strings) or actual functions (i.e., Python objects). This gives us an overall aggregate of all the required data, in a format that can be customized according to our need.


df.groupby('company').agg({
    'body-style': 'size',
    'average-mileage': ['sum', 'mean'],
    'price': ['sum', 'mean']
})

Here we are passing a dictionary to the aggregate function, in which body-style, average-mileage, and price are the keys (column names) and the values are the functions being applied, like 'size', ['sum', 'mean'], etc.

The most commonly used functions for agg include the following:
  • ‘size’: counts the rows
  • ‘sum’: sums the column up
  • ‘mean’/‘median’: mean/median of the column
  • ‘max’/‘min’: maximum/minimum of the column
  • ‘idxmax’/‘idxmin’: index of the maximum/minimum of the column. Getting the index of the minimal or maximal value is helpful for mapping other columns.
  • pd.Series.nunique: counts unique values. Note that, unlike the previous functions, this is an actual function and not a string.

(Read the official documentation page for more info on groupby).
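A small sketch (on made-up data) mixing a string name with the actual function pd.Series.nunique in one .agg call:

```python
import pandas as pd

df_demo = pd.DataFrame({
    "company": ["audi", "audi", "bmw"],
    "body-style": ["sedan", "wagon", "sedan"],
    "price": [17000, 18000, 21000],
})

# 'body-style' uses a real function object; 'price' uses string names.
out = df_demo.groupby("company").agg({
    "body-style": pd.Series.nunique,
    "price": ["sum", "mean"],
})
print(out)
```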

  • Map, Apply, ApplyMap

1) Using map to see if the mileage is good or not. (The lambda function is applied to all the values in the average-mileage column.)

df['good mileage?'] = df['average-mileage'].map(lambda mileage: 'yes' if mileage>=22 else 'no')
df.head()

2) Using the apply function on the DataFrame.
Let us convert the horsepower (hp) to kW (kilowatts): 1 hp = 0.745 kW.

def converttokW(x):
    out = x*0.745
    return out

df['horsepower(kW)'] = df['horsepower'].apply(lambda x: converttokW(x))
df.head()

3) Using applymap on the DataFrame. It is mostly used to apply a function to all the elements of a DataFrame, i.e., when the datatypes of all elements are the same. Let us take a DataFrame with elements of similar datatypes.

columns = ['wheel-base', 'length', 'horsepower', 'average-mileage']
df_new = df[columns]
df_new.head()
#multiplying every element by 4
df_new = df_new.applymap(lambda x: x*4)

Final chapter:

Tricks and Mastering the Pandas

  1. Assign ‘=’ for two DataFrames.
new_df = df3
new_df.head()
new_df['horsepower(inKw)'] = ""
new_df['horsepower(inKw)'] = new_df['horsepower'].apply(converttokW)
new_df.head()

Now let us look at df3.head().

df3.head()

If you observe, the above two DataFrames are the same, since we assigned them as equal. So changing one DataFrame also changes the values in the other.

To overcome this problem, we use the copy() function, which just copies the DataFrame. So the solution is to use the line below and perform the other steps as before.

new_df=df3.copy()
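A tiny sketch (with made-up data) of why this matters: plain ‘=’ only binds a second name to the same object, while copy() creates an independent DataFrame.

```python
import pandas as pd

df_a = pd.DataFrame({"horsepower": [111, 154]})

alias = df_a        # '=' : both names point to the SAME object
deep = df_a.copy()  # copy(): an independent DataFrame

alias["horsepower"] = alias["horsepower"] * 2
print(df_a["horsepower"].tolist())  # [222, 308] -- original changed through the alias
print(deep["horsepower"].tolist())  # [111, 154] -- the copy is untouched
```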

Filter the DataFrame using conditions. Here we define a condition where we want only the rows whose body-style is equal to ‘wagon’.


df_filtered = df[df['body-style']=='wagon']

Using value_counts()

To know the different dtypes present in the DataFrame, and how many columns have each:

df.dtypes.value_counts()

To show the stats sorted by the distinct values in the column ‘company’:

df['company'].value_counts().reset_index().sort_values(by='index')
#alternate method
df['company'].value_counts().sort_index()
#selecting only the numeric columns
df_choice=df.select_dtypes(include=['int64','float64'])

Now let us replace some values with “ ” just to manipulate the data for visualizing :-)
You can replace any value with another using the format below:


import numpy as np
# because 111 is in int format
df_replace = df_choice.replace({111: " ", 102: " "})
#if the value to be replaced is a string, it needs to be in quotes.
#df_replace = df_choice.replace({'hi': " ", 'hello': " "})
df_replace.head()
df_new = df_replace.copy()
df_new = df_new.replace(r'^\s*$', np.nan, regex=True)
df_new.head()
print(df_new.isnull().sum())

Now you can either remove the rows with NaN, or replace those values with a value like zero or the mean of the column, according to your dataset.


df_new1 = df_new.replace(np.nan, 0)  #method 1, i.e. using numpy
df_new2 = df_new.fillna(0)           #method 2, i.e. using the pandas built-in
df_new3 = df_new.copy()
df_new3['horsepower'] = df_new['horsepower'].fillna(df_new['horsepower'].mean())
#note: you can also apply these methods just to specific columns of your choice
print(df_new1.head())
print(df_new3.head())

You can also use ‘ffill’, which forward-fills the data: it takes the previous non-empty value in the column and fills it in.


df_new.fillna(method='ffill', inplace=True)
#note: with inplace=True the method modifies df_new and returns None, so do not assign the result
df_new.head()

If you observe the horsepower column, the row with index 3 now has a horsepower of 154: ffill took the previous value and filled it in.
PS: observe that the 1st row has horsepower NaN, so it stays unaffected since there is no previous value to copy.
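That leading-NaN caveat is easy to see on a toy Series (made-up values): ffill has nothing above the first NaN to copy, whereas bfill fills from below.

```python
import pandas as pd
import numpy as np

s = pd.Series([np.nan, 154.0, np.nan, 102.0])

print(s.ffill().tolist())  # leading NaN stays: [nan, 154.0, 154.0, 102.0]
print(s.bfill().tolist())  # fills from below:  [154.0, 154.0, 102.0, 102.0]
```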

df_new1 = df_new1.astype(int)
df_new1['col_total'] = df_new1.apply(lambda x: x.sum(), axis=1)
df_new1.loc['row_total'] = df_new1.apply(lambda x: x.sum())
df_new1.head()
pivot = pd.pivot_table(df, index=["company"], values=["price"], aggfunc=[np.mean])
pivot.head()

Here pd.pivot_table is used to generate a pivot table for visualizing the data in an organised way.

It is more flexible because you can define custom aggregation functions.

pivot = pd.pivot_table(df, index=["company", "average-mileage"], values=["price"], aggfunc=[np.mean, len])
pivot.head()

Voila! You have mastered the pandas library to a great level. Practice is the key, so go through the attached Jupyter notebook, try manipulating the code, and experiment with the different functions of pandas. That is the best way to master pandas.

You can access the Jupyter notebook via Colab, and you can download the data and add it into Colab after connecting to a runtime :)

GitHub Repo for jupyter notebook and Dataset:

https://github.com/psssnikhil/medium

LinkedIn:https://www.linkedin.com/in/nikhil-pentapalli-5744bb18b/

I hope this article empowers your knowledge. Keep supporting, and happy learning.


Data Scientist, Machine Learning Engineer | Loves to share knowledge and empower data science enthusiasts.