Pandas - Fundamentals of Data Analysis

azam sayeed
Published in Analytics Vidhya
7 min read · Sep 20, 2019
  • Open-source library used for data manipulation and analysis
  • Pandas stands for Panel Data: multi-dimensional data involving measurements over time

Features of Pandas

  1. Series and DataFrame objects — for 1D labelled data (Series), 2D labelled data (DataFrame) and panel data (3D, time series)
  2. Handling of missing data — NaN
  3. Data alignment — you can omit explicit labels and let Pandas auto-label the data for you
  4. Group-by functionality
  5. Slicing, indexing and subsetting
  6. Merging and joining
  7. Reshaping
  8. Hierarchical labeling of axes
  9. Robust input/output tools — useful for handling tabular, arbitrary and time-series data
  10. Time-series-specific functionality

Pandas performs better than NumPy for large datasets (roughly 500K rows or more).

A Pandas Series object is more flexible than a NumPy array: we can define a labeled index to access its elements, whereas NumPy array elements are accessed through their default integer positions only.
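A minimal sketch of the difference (the values and labels here are made up for illustration):

```python
import numpy as np
import pandas as pd

arr = np.array([10, 20, 30])
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# NumPy arrays are indexed by integer position only
print(arr[1])      # 20

# A Series can be indexed by label or by position
print(s['b'])      # 20
print(s.iloc[1])   # 20
```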

import pandas as pd

a = pd.read_csv('…')

Series Object

  • One Dimensional labeled array
  • Contains data of a single type or of mixed (heterogeneous) types
import pandas as pd
data = [1,2,3,4]
series1 = pd.Series(data)
series1
o/p:
0 1
1 2
2 3
3 4
dtype: int64
type(series1)
o/p:
pandas.core.series.Series
#Changing the index of a Series object; data can be an array, dictionary or scalar
series1 = pd.Series(data, index =['a','b','c','d'])
series1
o/p:
a 1
b 2
c 3
d 4
dtype: int64
#series1 = pd.Series(data, index=['a','b','c']) <-- notice only 3 labels
#Error: ValueError: Length of passed values is 4, index implies 3
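The comment above notes that the data can also be a dictionary or a scalar; a short sketch of both cases (values chosen for illustration):

```python
import pandas as pd

# From a dictionary: the keys become the index labels
s_dict = pd.Series({'a': 1, 'b': 2, 'c': 3})
print(s_dict['b'])    # 2

# From a scalar: the value is repeated for every index label
s_scalar = pd.Series(5, index=['x', 'y', 'z'])
print(s_scalar['y'])  # 5
```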

Pandas DataFrame

  • Two Dimensional labeled data structures with columns of potentially different types
  • Different column types, mutable size, labeled axes; arithmetic operations can be applied to both rows and columns
# creating a DataFrame using a list
import pandas as pd
data = [1,2,3,4,5]
df =pd.DataFrame(data)
df
#creating a dataframe using a series
series = pd.Series([6,12],index=['a','b'])
df = pd.DataFrame(series)
df
#creating a dataframe using a dictionary
dictionary ={'fruits':['apples','banana','mangoes'],'count':[10,20,15]}
df = pd.DataFrame(dictionary)
df
#creating a dataframe using a numpy array
import numpy as np
import pandas as pd
numpyarray = np.array([[50000,60000],['John','Sam']])
df =pd.DataFrame({'name':numpyarray[1],'salary':numpyarray[0]})
df
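One caveat with the NumPy-array example above: a NumPy array holds a single dtype, so mixing the numeric salaries with the string names coerces everything to strings, and the salary column arrives as text. A sketch of the problem and one way to fix it:

```python
import numpy as np
import pandas as pd

# mixing numbers and strings in one array coerces all values to strings
numpyarray = np.array([[50000, 60000], ['John', 'Sam']])
df = pd.DataFrame({'name': numpyarray[1], 'salary': numpyarray[0]})
print(df['salary'].dtype)   # object (strings, not numbers)

# convert the column back to numeric before doing any arithmetic
df['salary'] = df['salary'].astype(int)
print(df['salary'].sum())   # 110000
```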

Merge, Join and Concatenate

Merge

import pandas as pd
player =['Player1','Player2','Player3']
point =[8,9,6]
title = ['Game1','Game2','Game3']
df1 =pd.DataFrame({'Player':player,'Points':point,'Title':title})
df1
player =['Player1','Player5','Player6']
power =['Punch','Kick','Elbow']
title = ['Game1','Game5','Game6']
df2 =pd.DataFrame({'Player':player,'Power':power,'Title':title})
df2
#inner merge
df1.merge(df2, on = 'Title', how ='inner')
df1.merge(df2) # by default it is inner merge
#left merge
df1.merge(df2,on='Player',how ='left')
#Right Merge
df1.merge(df2, on ='Player',how='right')
#Outer Merge
df1.merge(df2, on ='Player', how ='outer')
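To see how the merge types above differ, note that only Player1 appears in both frames, so each `how` keeps a different number of rows:

```python
import pandas as pd

df1 = pd.DataFrame({'Player': ['Player1', 'Player2', 'Player3'],
                    'Points': [8, 9, 6],
                    'Title': ['Game1', 'Game2', 'Game3']})
df2 = pd.DataFrame({'Player': ['Player1', 'Player5', 'Player6'],
                    'Power': ['Punch', 'Kick', 'Elbow'],
                    'Title': ['Game1', 'Game5', 'Game6']})

inner = df1.merge(df2, on='Player', how='inner')  # 1 row (only Player1)
left = df1.merge(df2, on='Player', how='left')    # 3 rows (all of df1)
right = df1.merge(df2, on='Player', how='right')  # 3 rows (all of df2)
outer = df1.merge(df2, on='Player', how='outer')  # 5 rows (union of players)
print(len(inner), len(left), len(right), len(outer))
```

Since `Title` exists in both frames but is not the merge key here, Pandas disambiguates it as `Title_x` and `Title_y` in the result.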

Join

  • join is based on index values, not on column names in the DataFrame
  • All column names of the DataFrames being joined should be different, otherwise an error is thrown: ValueError: columns overlap but no suffix specified: Index([…], dtype='object')
player =['Player1','Player2','Player3']
point =[8,9,6]
title = ['Game1','Game2','Game3']
df3 =pd.DataFrame({'Player':player,'Points':point,'Title':title},index=['L1','L2','L3'])
df3
player =['Player2','Player3','Player4']
point =[38,39,36]
title = ['Game2','Game3','Game4']
df4 =pd.DataFrame({'Playerx':player,'Pointsx':point,'Titlex':title},index=['L2','L3','L4'])
df4
df3.join(df4,how='inner')
df3.join(df4,how='left')

df3.join(df4,how='right')
df3.join(df4,how='outer')
#how to concatenate dataframes in pandas
pd.concat([df3,df4], sort=True)
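A sketch of what `concat` does with these two frames: row-wise concatenation stacks them (filling missing columns with NaN), while `axis=1` lines them up side by side on the index instead:

```python
import pandas as pd

df3 = pd.DataFrame({'Player': ['Player1', 'Player2', 'Player3'],
                    'Points': [8, 9, 6],
                    'Title': ['Game1', 'Game2', 'Game3']},
                   index=['L1', 'L2', 'L3'])
df4 = pd.DataFrame({'Playerx': ['Player2', 'Player3', 'Player4'],
                    'Pointsx': [38, 39, 36],
                    'Titlex': ['Game2', 'Game3', 'Game4']},
                   index=['L2', 'L3', 'L4'])

# row-wise: 3 + 3 rows, union of the 6 distinct columns
stacked = pd.concat([df3, df4], sort=True)
print(stacked.shape)  # (6, 6)

# column-wise: aligned on the index labels L1..L4
side = pd.concat([df3, df4], axis=1)
print(side.shape)     # (4, 6)
```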

Importing and Analyzing the Dataset

DataSet: https://www.dropbox.com/s/9zg3u3zqj3ct3zf/mtcars2.csv?dl=0

Importing the dataset to pandas dataframe

#import pandas library
import pandas as pd
#read dataset and store into a dataframe
cars=pd.read_csv("mtcars2.csv")
#print the dataframe
cars

Analysis of Data set

#check the type
type(cars)
o/p:
pandas.core.frame.DataFrame
#view only the first five records
cars.head()
#view only the first ten records
cars.head(10)
#view only the last five records
cars.tail()
#view only the last ten records
cars.tail(10)
#view number of rows and columns in the dataframe
cars.shape
o/p: (32,13)
#print a concise summary of the columns
#(null_counts was renamed show_counts in newer pandas)
cars.info(null_counts=True)
#mean of each numeric column (newer pandas requires numeric_only=True
#when the frame also has string columns)
cars.mean(numeric_only=True)

Other statistical measures on column wise

#median
cars.median()
#standard deviation
cars.std()
#maximum of each attribute
cars.max()
#minimum of each attribute
cars.min()
#number of non-null records in each column
cars.count()
#descriptive statistics summary
cars.describe()

Cleaning dataset

#rename column (1)
cars=cars.rename(columns={'Unnamed: 1':'model'})
#cars
#Fill the null values with mean of the column (2)
cars.qsec=cars.qsec.fillna(cars.qsec.mean())
#cars
#drop an unwanted column that adds no value to the analysis (3)
cars = cars.drop(columns=['S.No'])
cars
#find correlation matrix
df=cars[['mpg','cyl','disp','hp','drat','wt','qsec','vs','am','gear','carb']].corr()
#df
#mpg was read in as a string, so convert it to float before correlating
#(compare the info() output before and after the conversion)
cars.mpg = cars.mpg.astype(float)
cars.info(null_counts=True)
#Again find the correlation including mpg
df=cars[['mpg', 'cyl','disp','hp','drat','wt','qsec','vs','am','gear','carb']].corr()
df

1 - The diagonal is always 1: every attribute correlates perfectly with itself.
2 - mpg vs cyl is strongly negatively correlated (about -0.85): cars with more cylinders tend to have lower mpg. cyl vs disp is strongly positively correlated.
3 - Ignore attribute pairs that are barely correlated, for example carb vs drat.
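A tiny made-up example to illustrate how to read a correlation matrix: the diagonal is 1, and two columns that move in exactly opposite directions correlate at -1:

```python
import pandas as pd

# y decreases by 2 whenever x increases by 1: a perfect negative
# linear relationship
df = pd.DataFrame({'x': [1, 2, 3, 4], 'y': [8, 6, 4, 2]})
corr = df.corr()

print(corr.loc['x', 'x'])  # 1.0 (attribute vs itself)
print(corr.loc['x', 'y'])  # ≈ -1.0 (perfect negative correlation)
```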

Manipulation

Indexing by position

#view the drat column only; don't forget the [] brackets
cars.iloc[:,5]
#first five records of hp column
cars.iloc[0:5,4]
#all rows, all columns
cars.iloc[:,:]
#for the columns from hp to carb, view all records from row index 6 onward
cars.iloc[6:,4:]
#Now we want to look at all the rows and only the column at position 1
cars.iloc[:,1]
cars.iloc[:,lambda df: [1,2,5]]
# alternatively - cars.iloc[:,[1,2,5]]

Similar operations are possible using column labels with loc

#see all the record of mpg column
cars.loc[:,"mpg"]
#display the records from index 0 to index 6 from mpg column
cars.loc[:6,"mpg"]
#see the first 7 records from mpg to qsec column
cars.loc[:6,"mpg":"qsec"]
#set value 1 to column 'am'
cars['am'] = 1
cars

#double the values in 'am' using a lambda function
f = lambda x: x*2
cars['am']= cars['am'].apply(f)
cars

Sorting

#sort cyl in descending order
cars.sort_values(by='cyl', ascending=False)

Filtering

#filter records with more than 6 cylinders
cars['cyl'] > 6
o/p:
0 False
1 False
2 False
3 False
4 True
5 False
6 True
7 False
8 False
9 False
10 False
...
#filter records with more than 6 cylinders
filter1 = cars['cyl'] > 6
#apply filter to dataframe
filtered_new = cars[filter1]
#view filtered dataframe
filtered_new
#filtering with multiple columns
filtered_new = cars[(cars['cyl'] > 6) & (cars['vs'] == 0)]
#view filtered dataframe
filtered_new
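The boolean-mask filtering above can also be written with `DataFrame.query`, which expresses the same condition as a string. A sketch on a small made-up frame (the real article uses the mtcars dataset):

```python
import pandas as pd

cars = pd.DataFrame({'cyl': [4, 6, 8, 8], 'vs': [1, 0, 0, 1]})

# boolean-mask filtering, as in the article
mask_result = cars[(cars['cyl'] > 6) & (cars['vs'] == 0)]

# query() expresses the same condition as a string
query_result = cars.query('cyl > 6 and vs == 0')

print(len(mask_result))                  # 1 matching row
print(mask_result.equals(query_result))  # True
```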

More Examples:

Dataset: Iris Data Set

  1. Load the Iris data set, take a glance at the first 5 and last 5 rows of the dataset, and find the number of rows and columns
import pandas as pd
iris =pd.read_csv('IRIS.csv')
print(iris.head(5))
print(iris.tail(5))
print(iris.shape)

2. Find simple summary statistics for the dataset

3. Rename the sepal_width column to sepWid, and find the correlation between petal_length and petal_width

4. Extract rows 10 to 15 for the columns petal_length and petal_width

5. Sort the dataset based on petal_length
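A solution sketch for exercises 2-5, assuming the usual Iris column names (sepal_length, sepal_width, petal_length, petal_width); the values below are a tiny made-up stand-in for the real IRIS.csv:

```python
import pandas as pd

iris = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 6.3, 5.8, 7.1],
    'sepal_width':  [3.5, 3.0, 3.3, 2.7, 3.0],
    'petal_length': [1.4, 1.4, 6.0, 5.1, 5.9],
    'petal_width':  [0.2, 0.2, 2.5, 1.9, 2.1],
})

# 2. summary statistics
print(iris.describe())

# 3. rename sepal_width, then correlate petal_length with petal_width
iris = iris.rename(columns={'sepal_width': 'sepWid'})
print(iris['petal_length'].corr(iris['petal_width']))

# 4. rows 10 to 15, petal columns only (empty here: the toy frame is small)
print(iris.loc[10:15, ['petal_length', 'petal_width']])

# 5. sort by petal_length
print(iris.sort_values(by='petal_length'))
```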
