A Gentle Introduction to Pandas

Wilson Busaka

I hope you had an exciting intro to NumPy, because Pandas builds on it. What is Pandas? Forget the song ‘Panda’ by American hip-hop artist Desiigner, or the animal (sorry, Kung Fu Panda enthusiasts): this is a Python library that offers powerful and flexible data structures that make data manipulation and analysis easy. It is often expanded as “Python Data Analysis Library” and, according to Wikipedia, the name Pandas is derived from the term “panel data”, an econometrics term for multidimensional structured data sets.

Pandas has historically offered three data structures: Series, DataFrames and Panels.

  1. A Series is a 1-dimensional labelled array that can hold data of any type (integer, string, float, Python objects, etc.). Its axis labels are collectively called an index.
  2. A DataFrame is a 2-dimensional labelled data structure with columns that can each hold a different type.
  3. A Panel is 3-dimensional. Panels were deprecated and later removed from Pandas, so this post does not discuss them.
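To make the "1-dimensional" and "2-dimensional" wording concrete, here is a minimal sketch. The variable names `series` and `frame` and the toy values are my own, chosen only for illustration.

```python
import pandas as pd

# A Series has a single axis: its index.
series = pd.Series([10, 20, 30])

# A DataFrame has two axes: the row index and the columns.
frame = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

print(series.ndim)  # 1
print(frame.ndim)   # 2
print(frame.shape)  # (2, 2)
```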

Series

The syntax is: pd.Series( data, index, dtype, copy)
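As a quick sketch of how these parameters fit together, the example below passes `data`, `index` and `dtype` by keyword. The `temperatures` name and its values are made up for illustration.

```python
import pandas as pd

# data:  the values to store
# index: the label for each value
# dtype: the type every value is converted to
temperatures = pd.Series(
    data=[21, 23, 19],
    index=["Mon", "Tue", "Wed"],
    dtype="float64",
)

print(temperatures)
print(temperatures.index.tolist())  # ['Mon', 'Tue', 'Wed']
print(temperatures.dtype)           # float64
```

Here the integers are stored as floats because `dtype` asks for it. If you leave `dtype` out, Pandas infers a type from the data.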

Console:
import pandas as pd
import numpy as np
data = np.array(['Tom', 'Jerry', 'Nick', 'Harry', 'Ruth', 'Gloria'])
names = pd.Series(data)
print(names)
Output:
0       Tom
1     Jerry
2      Nick
3     Harry
4      Ruth
5    Gloria
dtype: object

Create a series from array with index

Console:
import pandas as pd
import numpy as np
data = np.array(['Tom', 'Jerry', 'Nick', 'Harry', 'Ruth', 'Gloria'])
names = pd.Series(data, index=[100, 101, 102, 103, 104, 105])
print(names)
Output:
100       Tom
101     Jerry
102      Nick
103     Harry
104      Ruth
105    Gloria
dtype: object
Notice the difference in the index labels: they changed from the default 0–5 to 100–105.
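Once a Series has custom labels, those labels are what you use to look values up. The sketch below rebuilds the same `names` Series from a plain list and reads from it with `.loc`, the label-based accessor.

```python
import pandas as pd

names = pd.Series(
    ["Tom", "Jerry", "Nick", "Harry", "Ruth", "Gloria"],
    index=[100, 101, 102, 103, 104, 105],
)

# Look up a single value by its index label
print(names.loc[102])  # Nick

# Label slices with .loc include both endpoints
print(names.loc[103:105].tolist())  # ['Harry', 'Ruth', 'Gloria']
```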

Create a series from Dictionary

Console:
import pandas as pd
import numpy as np
data = {
    'student': ['Tom', 'Jerry', 'Gloria', 'Hillary'],
    'age': [21, 34, 45, 67],
    'gender': ['Male', 'Female', 'Female', 'Male']
}
Student = pd.Series(data)
print(Student)
Output:
student    [Tom, Jerry, Gloria, Hillary]
age                     [21, 34, 45, 67]
gender      [Male, Female, Female, Male]
dtype: object
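In the example above each dictionary value is a whole list, so every element of the Series is a list. A more common pattern, sketched here with made-up data, is a dictionary of single values: the keys become the index labels and the values become the data.

```python
import pandas as pd

# Keys become index labels, values become the Series data
ages = pd.Series({"Tom": 21, "Jerry": 34, "Gloria": 45, "Hillary": 67})

print(ages)
print(ages["Gloria"])       # 45
print(ages.index.tolist())  # ['Tom', 'Jerry', 'Gloria', 'Hillary']
```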

Create a Series from Scalar

Console:
import pandas as pd
import numpy as np
score = pd.Series(10, index=['A', 'B', 'C'])
print(score)
Output:
A    10
B    10
C    10
dtype: int64

Accessing data from series with position

Console:
import pandas as pd
import numpy as np
data = np.array([1, 2, 3, 4, 5])
position = pd.Series(data)
Console:
position[0]  # first element in the array
Output:
1
Console:
position[:3]  # first three elements in the array
Output:
0    1
1    2
2    3
dtype: int64
Console:
position[-1:]  # the last element in the array
Output:
4    5
dtype: int64
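A small caveat: `position[0]` works here because the default index labels happen to be 0, 1, 2 and so on, so label and position coincide. A more explicit way to ask for a position, whatever the labels are, is `.iloc`. A minimal sketch:

```python
import pandas as pd

position = pd.Series([1, 2, 3, 4, 5])

first = position.iloc[0]         # first element
first_three = position.iloc[:3]  # first three elements
last = position.iloc[-1]         # last element, as a single value

print(first)                 # 1
print(first_three.tolist())  # [1, 2, 3]
print(last)                  # 5
```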

DataFrames

Those who have used R before can relate to this data structure, as it was inspired by R’s own data frames. A DataFrame can be built from another Pandas DataFrame, from Series, from a NumPy array, or from a dictionary of 1-dimensional arrays or lists.
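As a quick sketch of those input types, the snippet below builds the same small table three ways. The column names `x` and `y` and the numbers are invented for the illustration.

```python
import numpy as np
import pandas as pd

# From a dictionary of lists: keys become column names
from_dict = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})

# From a 2-dimensional NumPy array, with column names given explicitly
from_array = pd.DataFrame(np.array([[1, 4], [2, 5], [3, 6]]), columns=["x", "y"])

# From a dictionary of Series: the Series are aligned on their index labels
from_series = pd.DataFrame({"x": pd.Series([1, 2, 3]), "y": pd.Series([4, 5, 6])})

print(from_dict.shape, from_array.shape, from_series.shape)  # (3, 2) (3, 2) (3, 2)
```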

Let’s go ahead and get our hands dirty.

Console:
import pandas as pd
import numpy as np
data = {
    'name': ['Kwadwo', 'Nana', 'Kwame', 'Naa'],
    'age': [20, 19, 22, 21],
    'favorite_color': ['red', 'orange', 'green', 'purple'],
    'grade': [67, 78, 90, 12]
}
df = pd.DataFrame(data)
print(df)
Output:
     name  age favorite_color  grade
0  Kwadwo   20            red     67
1    Nana   19         orange     78
2   Kwame   22          green     90
3     Naa   21         purple     12
Console:
df.columns
Output:
Index(['name', 'age', 'favorite_color', 'grade'], dtype='object')
Console:
df.values
Output:
array([['Kwadwo', 20, 'red', 67],
       ['Nana', 19, 'orange', 78],
       ['Kwame', 22, 'green', 90],
       ['Naa', 21, 'purple', 12]], dtype=object)
Console:
df.shape
Output:
(4, 4)
Console:
df.dtypes
Output:
name              object
age                int64
favorite_color    object
grade              int64
dtype: object

From the outputs above we can tell that our data is a 4-by-4 table with four rows and four columns. We can also tell the data type of each column. Notice that each column is homogeneous, meaning that all of its entries share one type: for example, the grade column only holds grades, which are integers. (We don’t expect to see a string in that column; that would be a data entry error, since grades are recorded as numbers in this scenario.)
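To see why a stray string matters, here is a small sketch. The `messy` frame and its "ninety" entry are a hypothetical data entry error that I made up; `pd.to_numeric` with `errors="coerce"` is one possible way to flag such entries, not the only one.

```python
import pandas as pd

clean = pd.DataFrame({"grade": [67, 78, 90, 12]})
print(clean["grade"].dtype)  # an integer dtype

# Hypothetical data entry error: one grade typed as text
messy = pd.DataFrame({"grade": [67, 78, "ninety", 12]})
print(messy["grade"].dtype)  # object, because the column now mixes types

# errors="coerce" turns entries that cannot be parsed into NaN
coerced = pd.to_numeric(messy["grade"], errors="coerce")
print(coerced.isna().sum())  # 1
```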

Sorting

We can sort the rows of a dataframe by using one of its columns as the key. In this instance we will sort by age in ascending order.

Console:
df.sort_values(by='age')
Output:
     name  age favorite_color  grade
1    Nana   19         orange     78
0  Kwadwo   20            red     67
3     Naa   21         purple     12
2   Kwame   22          green     90
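Two details are worth a quick sketch: `sort_values` accepts `ascending=False` for descending order, and it returns a new DataFrame rather than changing `df` itself. The example rebuilds a trimmed copy of the same table so that it runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Kwadwo", "Nana", "Kwame", "Naa"],
    "age": [20, 19, 22, 21],
})

# Sort from oldest to youngest
oldest_first = df.sort_values(by="age", ascending=False)
print(oldest_first["name"].tolist())  # ['Kwame', 'Naa', 'Kwadwo', 'Nana']

# The original DataFrame keeps its order
print(df["name"].tolist())  # ['Kwadwo', 'Nana', 'Kwame', 'Naa']
```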

Slicing

Console:
df[['age', 'grade']]  # Display the age and grade columns only
Output:
   age  grade
0   20     67
1   19     78
2   22     90
3   21     12
Console:
df['age']  # Display the age column only
Output:
0    20
1    19
2    22
3    21
Name: age, dtype: int64

When accessing data, you can select either by label or by position.

Selection by label

Console:
df.loc[:2]  # Display the first three rows (label slices include the end label)
Output:
     name  age favorite_color  grade
0  Kwadwo   20            red     67
1    Nana   19         orange     78
2   Kwame   22          green     90
Console:
df.loc[2:]  # Display the last two rows
Output:
    name  age favorite_color  grade
2  Kwame   22          green     90
3    Naa   21         purple     12
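`.loc` can also take column labels after the row labels. The sketch below rebuilds the same table and picks rows and columns by label in one step.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Kwadwo", "Nana", "Kwame", "Naa"],
    "age": [20, 19, 22, 21],
    "favorite_color": ["red", "orange", "green", "purple"],
    "grade": [67, 78, 90, 12],
})

# Rows labelled 1 to 2 (inclusive), columns 'name' and 'grade' only
rows_and_cols = df.loc[1:2, ["name", "grade"]]
print(rows_and_cols)

# A single cell: row label 0, column 'favorite_color'
single = df.loc[0, "favorite_color"]
print(single)  # red
```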

Selection by Position

Console:
df.iloc[3]  # Display the row at position 3
Output:
name                 Naa
age                   21
favorite_color    purple
grade                 12
Name: 3, dtype: object
Console:
df.iloc[2:4, 0:1]  # Rows at positions 2 and 3, first column only
Output:
    name
2  Kwame
3    Naa

You can also filter rows with a condition on a column:

Console:
df[df['age'] > 20]  # Display the rows that have age greater than 20
Output:
    name  age favorite_color  grade
2  Kwame   22          green     90
3    Naa   21         purple     12
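Conditions like the one above can be combined. As a sketch, the example below uses `&` for "and" (each condition needs its own parentheses) and `isin` to match against a list of values. The threshold of 50 and the choice of colors are arbitrary.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Kwadwo", "Nana", "Kwame", "Naa"],
    "age": [20, 19, 22, 21],
    "favorite_color": ["red", "orange", "green", "purple"],
    "grade": [67, 78, 90, 12],
})

# Older than 20 AND a grade of at least 50
older_and_passing = df[(df["age"] > 20) & (df["grade"] >= 50)]
print(older_and_passing["name"].tolist())  # ['Kwame']

# Favorite color is red or orange
warm_colors = df[df["favorite_color"].isin(["red", "orange"])]
print(warm_colors["name"].tolist())  # ['Kwadwo', 'Nana']
```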

Summary Statistics

We can compute summary statistics on our data. By default, only the numeric (integer or float) columns are summarized.

Console:
df['age'].mean()  # Get the mean of the age column
Output:
20.5
Console:
df['age'].std()  # Display the standard deviation
Output:
1.2909944487358056
Console:
df['age'].min()  # Display the minimum value
Output:
19
Console:
df['age'].max()  # Display the maximum value
Output:
22
Console:
df['age'].var()  # Display the variance
Output:
1.6666666666666667

Alternatively, we can display all of these statistics at once with a single line of code:

Console:
df.describe()
Output:
             age      grade
count   4.000000   4.000000
mean   20.500000  61.750000
std     1.290994  34.471002
min    19.000000  12.000000
25%    19.750000  53.250000
50%    20.500000  72.500000
75%    21.250000  81.000000
max    22.000000  90.000000
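If you only want a handful of statistics, one option is `agg` with a list of function names. The sketch below also shows that `describe()` skips the text columns unless you pass `include="all"`.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Kwadwo", "Nana", "Kwame", "Naa"],
    "age": [20, 19, 22, 21],
    "favorite_color": ["red", "orange", "green", "purple"],
    "grade": [67, 78, 90, 12],
})

# Pick just the statistics you need for one column
summary = df["grade"].agg(["mean", "median", "min", "max"])
print(summary)

# By default describe() reports on numeric columns only
numeric_cols = df.describe().columns.tolist()
print(numeric_cols)  # ['age', 'grade']

# include="all" adds the non-numeric columns too
all_cols = df.describe(include="all").columns.tolist()
print(all_cols)  # ['name', 'age', 'favorite_color', 'grade']
```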

Importing and exporting data

We have learnt how to create data in Pandas. But what if you have a dataset that you want to import or export? The dataset could be in one of many formats: .csv, .txt, .xlsx, .json etc.

Take for instance a CSV file called Lemonade.csv containing 365 rows and 11 columns.

Console:
data = pd.read_csv('/home/wilson/Downloads/Lemonade.csv')
print(data)

The contents of Lemonade.csv are now held in a DataFrame named ‘data’. It has the same 365 rows and 11 columns as the file.
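If you don't have the file at hand, you can still try `read_csv` on text held in memory. In the sketch below the column names and values are invented stand-ins, not the real Lemonade columns, and `io.StringIO` simply lets a string play the role of a file.

```python
import io

import pandas as pd

# Hypothetical CSV content; the real Lemonade.csv has different columns
csv_text = """Day,Temperature,Sales
1,27.0,10
2,28.9,13
3,34.5,15
"""

data = pd.read_csv(io.StringIO(csv_text))

print(type(data).__name__)  # DataFrame
print(data.shape)           # (3, 3)
print(data.head(2))         # first two rows
```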

You can also write a DataFrame to a file (also called exporting). Here I subset the Lemonade data that we imported earlier by picking its first five rows, store that subset in a new variable called ‘new’, and then export it as ‘export.csv’ in the Downloads folder.

Console:
import pandas as pd
data = pd.read_csv('/home/wilson/Downloads/Lemonade.csv')
data
new = data.head(5)  # first five rows
new.to_csv('/home/wilson/Downloads/export.csv')
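One thing to watch when exporting: by default `to_csv` also writes the row index as an extra first column. The sketch below uses made-up data and in-memory buffers instead of real files to show the difference that `index=False` makes when the data is read back.

```python
import io

import pandas as pd

# Made-up stand-in for the imported data
data = pd.DataFrame({"Day": [1, 2, 3], "Sales": [10, 13, 15]})
new = data.head(2)

# Default export: the index is written as an unnamed first column
with_index = io.StringIO()
new.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)  # ['Unnamed: 0', 'Day', 'Sales']

# index=False leaves the index out, so the round trip matches the original
without_index = io.StringIO()
new.to_csv(without_index, index=False)
without_index.seek(0)
round_trip = pd.read_csv(without_index)
print(round_trip.columns.tolist())  # ['Day', 'Sales']
```

Whether you keep the index depends on whether it carries meaning. For a default 0, 1, 2 index, dropping it usually gives a cleaner file.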

Exercise

Go to https://github.com/wbusaka/Lemonade and pull the Lemonade dataset in .xlsx format. It contains 366 rows and 11 columns. The data comes from a lemonade stand’s sales, collected from January to December 2017.

Open the ‘Lemonade.xlsx’ file in MS Excel.

Convert the ‘Lemonade.xlsx’ dataset to ‘.csv’ and save it on your computer.

Import Pandas.

Import the new ‘Lemonade.csv’ file into Python.

Slice the dataset from 245 to 321 and store it in a new variable called ‘sliced_data’.

Filter the data to obtain only the month of ‘August’ and store it in a new variable called ‘August’.

Compute summary statistics for the month of August (i.e. mean, max, median, standard deviation etc.).

  • As always, post the result on social media (Twitter, Facebook or Instagram) to receive a confirmation or assistance :-)

NB: This exercise utilizes concepts in this tutorial.

Conclusion

Pandas is the real deal when it comes to data science, as most real-world data is stored in a tabular format (rows and columns), which is exactly what a DataFrame models. As they say, learning is infinite as long as you’re breathing.

Read More

https://pandas.pydata.org/pandas-docs/version/0.22/cookbook.html#cookbook

https://www.codecademy.com/learn/data-processing-pandas

Stay Tuned!

Wilson is a Data Scientist, Pythonista, techpreneur who loves giving back to society in the form of knowledge sharing. Follow me on twitter: @wilbusaka
