Data Frames

In my previous blog, I talked about using the Series Data Structure in Pandas. Feel free to have a look at that blog.

This blog post will be a brief walk through of Data Frames. Data Frame is one of the most widely used tool in python by Data Scientists/Analyst.

A DataFrame can be considered as a data table. Data Frame provides a wide range of functionality like

  • filtering
  • grouping
  • sorting
  • joins
  • merging

Let’s start with creating a data frame

#importing pandas and DataFrame
import pandas as pd
from pandas import DataFrame
#constructor to create a data frame
#df=DataFrame( data, index, columns, dtype, copy)
#Lists, dict, Series, Numpy ndarrays, Another DataFrame
#creating an empty data frame
df=DataFrame()
#creating a data frame form an array
df=DataFrame([1,2,3,4,5,5])
print df
Output:0
0 1
1 2
2 3
3 4
4 5
5 5
# creating a dataframe with an array of array
# each inner array represents a row
d2=[[1,2,3,4],
['a','b','c','d']]
df= DataFrame(d2)
print df
Output:0 1 2 3
0 1 2 3 4
1 a b c d
#creating a datframe from a dict
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data,columns=['Age','Name','a'])
print df
Output:Age Name a
0 28 Tom NaN
1 34 Jack NaN
2 29 Steve NaN
3 42 Ricky NaN
# creating a df from a list of dictionaries
import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
print df
Output:a b c
0 1 2 NaN
1 5 10 20.0

Now that we know how to create a dataframe. Let’s try some functions and statistics with Data Frames

#create a data frame
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data,columns=['Age','Name','a'])
print df
Output:Age Name a
0 28 Tom NaN
1 34 Jack NaN
2 29 Steve NaN
3 42 Ricky NaN
#understand the data
#gives brief statistics about the data like mean,count,standard deviation, percentile limits
print df.describe()
Output:Age
count 4.000000
mean 33.250000
std 6.396614
min 28.000000
25% 28.750000
50% 31.500000
75% 36.000000
max 42.000000
#gives information about the data frame like numbr of columns,adta types memory usage
print df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
Age 4 non-null int64
Name 4 non-null object
a 0 non-null object
dtypes: int64(1), object(2)
memory usage: 168.0+ bytes
None
#to get all the columns
print df.columns
print "---------"
# to print all the values
print df.values
print "---------"
# to print the index
print df.index
Output:Index([u'Age', u'Name', u'a'], dtype='object')
---------
[[28 'Tom' nan]
[34 'Jack' nan]
[29 'Steve' nan]
[42 'Ricky' nan]]
---------
RangeIndex(start=0, stop=4, step=1)

Indexing in DataFrame

# selecting a column from the dataframe
print df['Name']
print "----------"
#selecting multiple columns
print df[['Name','Age']]
print "-----------"
# adding a new column
df['Gender']=['M','M','M','M']
print df
print "------------"
#deleting a column from data frame
del df['Gender']
# or df.pop('a')
Output:0 Tom
1 Jack
2 Steve
3 Ricky
Name: Name, dtype: object
----------
Name Age
0 Tom 28
1 Jack 34
2 Steve 29
3 Ricky 42
-----------
Age Name a Gender
0 28 Tom NaN M
1 34 Jack NaN M
2 29 Steve NaN M
3 42 Ricky NaN M
------------

For selecting values from a DataFrame we have 3 functions

  • iloc = For accessing the data frame with integer indexing
  • loc = For accessing the dataframe with label indexing
  • ix = For accessing teh dataframe with mixture of integer and label indexing
# selecting a row by integer index
df.iloc[1,1]
Output:
'Jack'# selecting a row by label index
df.loc[1,"Name"]
Output:
'Jack'# selecting a row by integer index
df.ix[3,1]
Output:
'Ricky'

Iterating over the Data Frame Rows

for row in df.iterrows():
#returns the tuples with (label,row) mapping
#check the data type of tuple keys
print "return type of iterrows ",type(row[0])
print "values of tupe are of type ",type(row[1])
break

for row in df.iterrows():
print "Name is : ",row[1]['Name']
print "Age is : ",row[1]['Age']
Output:return type of iterrows <type 'numpy.int64'>
values of tupe are of type <class 'pandas.core.series.Series'>
Name is : Tom
Age is : 28
Name is : Jack
Age is : 34
Name is : Steve
Age is : 29
Name is : Ricky
Age is : 42
# returns the columns with values in columns
for row in df.iteritems():
print row[0]
print row[1]
Output:Age
rank1 28
rank3 34
rank2 29
rank4 42
Name: Age, dtype: int64
Name
rank1 Tom
rank3 Jack
rank2 Steve
rank4 Ricky
Name: Name, dtype: object
a
rank1 NaN
rank3 NaN
rank2 NaN
rank4 NaN
Name: a, dtype: object
# gives rows as tuples
for row in df.itertuples():
print row
Output:Pandas(Index='rank1', Age=28, Name='Tom', a=nan)
Pandas(Index='rank3', Age=34, Name='Jack', a=nan)
Pandas(Index='rank2', Age=29, Name='Steve', a=nan)
Pandas(Index='rank4', Age=42, Name='Ricky', a=nan)

Reindexing a DataFrame

Reindexing a dataframe creates a new Dataframe with the desired index from the orignal dataframe.

print df
df1=df.reindex(index=[1,2],columns=['Name','Age'])
df1
Output:Age Name a
0 28 Tom NaN
1 34 Jack NaN
2 29 Steve NaN
3 42 Ricky NaN
#changing the index values of a Data Frame
df.index=['rank1','rank3','rank2','rank4']

Sorting a DataFrame

Sorting can be done in 2 ways

  • Sort by indexes
  • Sort y values
import numpy as np
unsorted_df=pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],columns=['col2','col1'])
#sorting can be done in
#sorting based on index
print unsorted_df.sort_index()
print "-----------------------------"
print unsorted_df.sort_index(ascending=False)
Output:col2 col1
0 0.991140 -1.203227
1 -1.335899 1.494410
2 0.182436 0.167360
3 0.379386 -0.028920
4 -0.076693 1.179620
5 0.418442 -0.282703
6 -0.136885 0.812825
7 0.808891 0.336993
8 2.177074 1.805119
9 1.240525 0.104993
-----------------------------
col2 col1
9 1.240525 0.104993
8 2.177074 1.805119
7 0.808891 0.336993
6 -0.136885 0.812825
5 0.418442 -0.282703
4 -0.076693 1.179620
3 0.379386 -0.028920
2 0.182436 0.167360
1 -1.335899 1.494410
0 0.991140 -1.203227
#sorting based on values
#meathod takes a column name or list of columns as input
unsorted_df.sort_values('col1')
unsorted_df.sort_values(['col1','col2'])
#### Statistical Functions with DataFrame#calculate the percent change on rolling window of 1
print unsorted_df.pct_change(1)
print "---------------"
print "correlation matrix"
print unsorted_df.corr()
Output:col2 col1
1 NaN NaN
4 -0.942591 -0.210645
6 0.784835 -0.310944
2 -2.332776 -0.794100
3 1.079551 -1.172798
5 0.102945 8.775505
9 1.964630 -1.371389
8 0.754962 16.192776
0 -0.544738 -1.666564
7 -0.183878 -1.280075
---------------
correlation matrix
col2 col1
col2 1.000000 -0.157285
col1 -0.157285 1.000000
#rolling functions are window based functions which can be applied on a a set of rows
# calclulate the mean based on every 4 rows in a data frame
print df.rolling(window=2).mean()
print df.rolling(window=2).sum()
#applying custome function to adta frame
def fun(s):
print s
return s[0]
print df.rolling(window=2).agg(fun)
Output:Age Name a
rank1 NaN Tom NaN
rank3 31.0 Jack NaN
rank2 31.5 Steve NaN
rank4 35.5 Ricky NaN
Age Name a
rank1 NaN Tom NaN
rank3 62.0 Jack NaN
rank2 63.0 Steve NaN
rank4 71.0 Ricky NaN
[28. 34.]
[34. 29.]
[29. 42.]
Age Name a
rank1 NaN Tom NaN
rank3 28.0 Jack NaN
rank2 34.0 Steve NaN
rank4 29.0 Ricky NaN

Grouping a DataFrame

# import the pandas library
import pandas as pd
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
#group by creates groups of the data frame rows
print df.groupby('Team').groups
#printing the group names
for name,group in df.groupby('Team'):
print name
print group
Output:{'Kings': Int64Index([4, 6, 7], dtype='int64'), 'Devils': Int64Index([2, 3], dtype='int64'), 'Riders': Int64Index([0, 1, 8, 11], dtype='int64'), 'Royals': Int64Index([9, 10], dtype='int64'), 'kings': Int64Index([5], dtype='int64')}
Devils
Points Rank Team Year
2 863 2 Devils 2014
3 673 3 Devils 2015
Kings
Points Rank Team Year
4 741 3 Kings 2014
6 756 1 Kings 2016
7 788 1 Kings 2017
Riders
Points Rank Team Year
0 876 1 Riders 2014
1 789 2 Riders 2015
8 694 2 Riders 2016
11 690 2 Riders 2017
Royals
Points Rank Team Year
9 701 4 Royals 2014
10 804 1 Royals 2015
kings
Points Rank Team Year
5 812 4 kings 2015
#applying sum function in a dataframe
print df.groupby('Team').sum()
#applying sum function on a column
print df.groupby('Team')['Points'].sum()
#or
print df.groupby('Team')['Points'].agg(np.sum)
Output:Points Rank Year
Team
Devils 1536 5 4029
Kings 2285 5 6047
Riders 3049 7 8062
Royals 1505 5 4029
kings 812 4 2015
Team
Devils 1536
Kings 2285
Riders 3049
Royals 1505
kings 812
Name: Points, dtype: int64
Team
Devils 1536
Kings 2285
Riders 3049
Royals 1505
kings 812
Name: Points, dtype: int64
# applying multiple functions
print df.groupby(['Rank'])['Points'].agg([np.sum, np.mean, np.std])
print df.groupby(['Rank','Team'])['Points'].agg([np.sum, np.mean, np.std])
Output:sum mean std
Rank
1 3224 806.0 50.754310
2 3036 759.0 83.070251
3 1414 707.0 48.083261
4 1513 756.5 78.488853
sum mean std
Rank Team
1 Kings 1544 772.000000 22.627417
Riders 876 876.000000 NaN
Royals 804 804.000000 NaN
2 Devils 863 863.000000 NaN
Riders 2173 724.333333 56.038677
3 Devils 673 673.000000 NaN
Kings 741 741.000000 NaN
4 Royals 701 701.000000 NaN
kings 812 812.000000 NaN

Hope the blog was helpful. Share with your fellow Python practitioners.

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade