Let’s learn Panda

image source : imgur

Pandas…. Abbreviation of Panel data. Panda is a Python library module for data analysis and data manipulation. Data, as it is said, is one of the most important aspect in the field of Machine Learning. Machines cannot be trained until and unless it is provided with data (because Duh!!! What will it learn ???). That being said, In this world of infinite type of data formats and their uneven scattering, Data will not present itself every time in a decorated and beautiful manner(Damn! Its a big mess). This is where Panda comes to the rescue. Panda makes data cleaning, formatting, refining and all sorts of data manipulation millions of times faster and efficient ! Its ease of use, open source nature, Fantastic documentation, Huge community, active development since its initial release, are just some of the features that panda owes to for its fame among the data scientist.

Without any further due let’s get our hands dirty and get familiar with some of the basic functionalities Panda has to offer.

NOTE: In this post I am going to assume that you already have panda installed on your machine. If not, you might want to do that first i.e. ‘pip install pandas


Lets start with some basic panda setup and usage!

#basic usage
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import style
style.use('fivethirtyeight')
stats = {
'Day':[1,2,3,4,5,6],
'Visitors':[43,34,65,56,29,76],
'Bounce_Rate':[65,67,78,65,45,52]
}
df = pd.DataFrame(stats)
print(df.head())
print(df.tail())
print(df.head(2))
print(df.tail(2))
print(df.shape)
df.set_index('Day', inplace = True)
print(df)
df.index.name = 'Day'
df.reset_index(inplace = True)
print(df)

Now lets move on to some of the basic column operation.

#column operations
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import style
style.use('fivethirtyeight')
stats = {
'Day':[1,2,3,4,5,6],
'Visitors':[43,34,65,56,29,76],
'Bounce_Rate':[65,67,78,65,45,52]
}
df = pd.DataFrame(stats)
AXIS = 0 STANDS FOR ROWS
df.drop(['Day', 'Visitors'], axis = 1, inplace = True)
print(df.Bounce_Rate)
print('Bounce_Rate')
df.rename(columns = {'Day': 'day', 'Visitors': 'visitors'}, inplace = True)
print(df.columns)

Some basic Filtering and Sorting operations can be done as follows.

#filtering and sorting
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import style
style.use('fivethirtyeight')
stats = {
'Day':[1,2,3,4,5,6],
'Visitors':[43,34,65,56,29,76],
'Bounce_Rate':[65,67,78,65,45,52]
}
df = pd.DataFrame(stats)
x = df.Visitors.sort_values()
print(x)
'''
OR another method to do the same would be:
x = df['Visitors'].sort_values()
print(x)
'''
x = df.sort_values('Day')
print(x)
x = df[df.Day > 2]
print(x)
x = df[(Day > 2) and (Visitors > 29)]
print(x)
y = df[(Day > 2) or (Visitors > 29)]
print(y)

Let’s read from a csv file and write to a new csv file.

#csv operations
import pandas as pd
df = pd.read_csv('somecsvfile.csv')
df.to_csv('anothercsvfile.csv')
df['somecolumn'].to_csv('yetanothercsvfile.csv')
df = pd.read_csv('somecsvfile.csv', usecols = ['column_name', 'another_column_name'])
df = pd.read_csv('somecsvfile.csv', index_col = 0)
df.rename(columns = {'originalname' : 'newname'}, inplace = True)

Let’s read from an HTML page.

import pandas as pd
df = pd.read_html('here goes the url link of the website')
print(df)
print(df[0])
print(df[0][0])
for item in df[0][0][1:]:
print(str(item))

Now, lets try to join two pandas dataframe in 4 different ways namely Joining, Concatenating, Appending and Merging. All of them might seem to work in a similar manner at first glance but dont be mistaken because they all have very unique and different methodologies of operations.

#CONCATENATING AND APPENDING
import pandas as pd
df1 = pd.DataFrame({'HPI':[80,85,88,85],
'Int_rate':[2, 3, 2, 2],
'US_GDP_Thousands':[50, 55, 65, 55]},
index = [2001, 2002, 2003, 2004])
df2 = pd.DataFrame({'HPI':[80,85,88,85],
'Int_rate':[2, 3, 2, 2],
'US_GDP_Thousands':[50, 55, 65, 55]},
index = [2005, 2006, 2007, 2008])
df3 = pd.DataFrame({'HPI':[80,85,88,85],
'Int_rate':[2, 3, 2, 2],
'Low_tier_HPI':[50, 52, 50, 53]},
index = [2001, 2002, 2003, 2004])
concat_1 = pd.concat([df1, df2])
print concat
concat_2 = pd.concat([df1, df2, df3])
print(concat_2)
df4 = df1.append(df2)
print(df4)
s = pd.Series([40, 90, 65], index = ['HPI', 'Int_rate', 'US_GDP_Thousands'])
df5 = df1.append(s, ignore_index = True)
print(df5)

#JOINING AND MERGING
import pandas as pd
df1 = pd.DataFrame({
'id':[1,2,3,4,5],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5']})
df2 = pd.DataFrame(
{'id':[1,2,3,4,5],
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5']})
first_merge = pd.merge(df1, df2, on = 'id')
print(first_merge)
second_merge = pd.merge(df1, df2, on = ['id', 'subject_id'])
print(second_merge)
# MERGE USING 'HOW' ARGUMENT
"""
THERE ARE 4 TYPES OF MERGING METHOD USING 'HOW' ARGUMENT
1) LEFT OUTER
2) RIGHT OUTER
3) FULL OUTER
$) INNER
"""
left_merge = pd.merge(df1, df2, on = 'subject_id', how = 'left')
print(left_merge)
right_merge = pd.merge(df1, df2, on = 'subject_id', how = 'right')
print(right_merge)
outer_merge = pd.merge(df1, df2, on = 'subject_id', how = 'outer')
print(outer_merge)
inner_merge = pd.merge(df1, df2, on = 'subject_id', how = 'inner')
print(inner_merge)
'''
JOINING WILL BE PERFORMED ON INDEX. JOIN OPERATION HONORS THE OBJECT ON WHICH IT IS CALLED. SO, A.JOIN(B) IS NOT EQUAL TO B.JOIN(A).
'''
df1.set_index('subject_id', inplace=True)
df3.set_index('subject_id', inplace=True)
right_joined = df1.join(df3, how="right")
print(right_joined)
right_joined = df1.join(df3, how="left")
print(left_joined)
right_joined = df1.join(df3, how="inner")
print(inner_joined)
right_joined = df1.join(df3, how="outer")
print(outer_joined)

One very common occurring instance in data sets is the missing of values. A NAN(Not A Number). To handle those missing values, Panda has some built in features which we can utilize.

#handling missing data
import pandas as pd
df = pd.read_csv('http://bit.ly/uforeports')
df.isnull().tail()
df.notnull().tail()
df.isnull().sum()

df[df.City.isnull()]
df.dropna(how = 'any')
df.dropna(how = 'all')
df.dropna(subset = ['City', 'Shape Reported'], how = 'any')\
df.fillna(value = 'heregoesthevalueyouwanttofill')

Quandl is an online website with tons of dataset in a variety of fields (e.g: American housing price indexes, equities, commodities, foreign exchange and more). For more on Quandl visit https://www.quandl.com . Let’s setup Quandl with pandas to fetch data from Quandl and use it in analysis.

#setting up Quandl with pandas
import quandl
api_key = 'yoursuperamazingquandlAPIkey'
df = quandl.get('heregoesthequandlcode', authtoken = api_key)
print(df.head())

Here we conclude this post of dirtying our hands with Pandas module library. Pandas is a very feature rich module and has a Myriad of functionalities to offer, most of which cannot be readily covered in a single blog post or even in a series of blog posts. This post was intended to get your hands adept with the library and to show its capabilities, but it does not cover even the slightest part of what truly pandas has to offer. Thus if you are inspired to learn more of pandas then please head to the official docs page.

Footnote : I have a Github repository which contains all of the above code in a very well commented structure alongside all the necessary references that I have used to explore this beautiful library. I will keep updating this repository as the time passes and as I learn something worth updating.

If you liked this article, please don’t hesitate in clapping.

Stay Tuned. Until next time !