A Gentle Introduction to Pandas
I hope you had an exciting intro to NumPy, because Pandas builds on it. What is Pandas? Forget about the song ‘Panda’ by American hip-hop artiste Desiigner, or the animal (Kung Fu Panda enthusiasts): this is a Python library that offers powerful and flexible data structures that make data manipulation and analysis easy. It stands for “Python Data Analysis Library” and, according to Wikipedia, the name Pandas is derived from the term “panel data”, an econometrics term for multidimensional structured data sets.
Pandas has three data structures: Series, DataFrames and Panels.
- A Series is a 1-dimensional labelled array that can hold data of any type (integer, string, float, Python objects, etc.). Its axis labels are collectively called an index.
- A DataFrame is a 2-dimensional labelled data structure with columns that can each hold a different type.
- A Panel is 3-dimensional. For this post we will not discuss Panels (they were deprecated in pandas 0.20 and removed in 0.25).
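To make the "1-dimensional" and "2-dimensional" wording concrete, here is a small sketch that builds one Series and one DataFrame and asks each for its number of dimensions. The values and the column names x and y are made up for illustration.

```python
import pandas as pd

# A Series has a single axis (its index).
s = pd.Series([1.5, 2.5, 3.5])

# A DataFrame has two axes (rows and columns), and each column can hold its own type.
df = pd.DataFrame({"x": [1, 2], "y": ["a", "b"]})

print(s.ndim)    # number of dimensions of the Series
print(df.ndim)   # number of dimensions of the DataFrame
print(df.shape)  # (rows, columns)
```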
Series
The syntax is: pd.Series(data, index, dtype, copy)
Console:
import pandas as pd
import numpy as np
data = np.array(['Tom', 'Jerry', 'Nick', 'Harry', 'Ruth', 'Gloria'])
names = pd.Series(data)
print(names)
Output:
0 Tom
1 Jerry
2 Nick
3 Harry
4 Ruth
5 Gloria
dtype: object
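The example above only passes data. As a rough sketch of what the dtype and copy arguments from the syntax line do, the snippet below forces a float type and then checks that a copied Series is unaffected when the source array changes. The variable names scores, source and copied are just for illustration.

```python
import numpy as np
import pandas as pd

# dtype forces the element type: the integers are stored as floats.
scores = pd.Series([1, 2, 3], dtype="float64")
print(scores.dtype)

# copy=True gives the Series its own copy of the data, so editing the
# source array afterwards does not change the Series.
source = np.array([10, 20, 30])
copied = pd.Series(source, copy=True)
source[0] = 99
print(copied[0])
```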
Create a series from array with index
Console:
import pandas as pd
import numpy as np
data = np.array(['Tom', 'Jerry', 'Nick', 'Harry', 'Ruth', 'Gloria'])
names = pd.Series(data, index=[100, 101, 102, 103, 104, 105])
print(names)
Output:
100 Tom
101 Jerry
102 Nick
103 Harry
104 Ruth
105 Gloria
dtype: object
Notice the difference in the index labels: they changed from 0–5 to 100–105.
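Once a Series has custom labels, those labels are what you use to look values up. The sketch below rebuilds the same names Series and reads from it by label and by position; .loc and .iloc are the explicit label-based and position-based accessors.

```python
import pandas as pd

names = pd.Series(
    ["Tom", "Jerry", "Nick", "Harry", "Ruth", "Gloria"],
    index=[100, 101, 102, 103, 104, 105],
)

# With an integer index, square brackets look up the label, not the position.
print(names[101])

# .loc always selects by label, .iloc always selects by position.
print(names.loc[103])
print(names.iloc[0])
```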
Create a series from Dictionary
Console:
import pandas as pd
import numpy as np
data = {
    'student': ['Tom', 'Jerry', 'Gloria', 'Hillary'],
    'age': [21, 34, 45, 67],
    'gender': ['Male', 'Female', 'Female', 'Male']
}
Student = pd.Series(data)
print(Student)
Output:
student [Tom, Jerry, Gloria, Hillary]
age [21, 34, 45, 67]
gender [Male, Female, Female, Male]
dtype: object
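In the output above each dictionary key became an index label and each list became a single element. A more common pattern, sketched here with made-up data, is a dictionary of scalar values, where every key labels exactly one value.

```python
import pandas as pd

# Each key becomes an index label and each value becomes one element.
ages = pd.Series({"Tom": 21, "Jerry": 34, "Gloria": 45, "Hillary": 67})

print(ages)
print(ages["Gloria"])  # look up a value by its label
```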
Create a Series from Scalar
Console:
import pandas as pd
import numpy as np
score = pd.Series(10, index=['A', 'B', 'C'])
print(score)
Output:
A 10
B 10
C 10
dtype: int64
Accessing data from series with position
Console:
import pandas as pd
import numpy as np
data = np.array([1,2,3,4,5])
position = pd.Series(data)
Console:
position[0] #first element in the array
Output: 1
Console:
position[:3] #first three elements in the array
Output:
0 1
1 2
2 3
dtype: int64
Console:
position[-1:] #the last element in the array
Output:
4 5
dtype: int64
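One detail worth checking for yourself: slicing by position leaves out the end point, while slicing by label with .loc includes it. A small sketch on the same five-element Series (the variable names are only illustrative):

```python
import pandas as pd

position = pd.Series([1, 2, 3, 4, 5])

first_three_by_position = position.iloc[0:3]  # positions 0, 1, 2 (end excluded)
up_to_label_three = position.loc[0:3]         # labels 0, 1, 2, 3 (end included)
last_value = position.iloc[-1]                # a single value, not a Series

print(len(first_three_by_position))
print(len(up_to_label_three))
print(last_value)
```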
DataFrames
Those who have used R before can relate to this data structure, as it was inspired by R’s own data frames. A DataFrame can be built from another Pandas DataFrame, a Series, a NumPy array, or a dictionary of 1-dimensional arrays or lists.
Let’s go ahead and get our hands dirty.
Console:
import pandas as pd
import numpy as np
data = {
    'name': ['Kwadwo', 'Nana', 'Kwame', 'Naa'],
    'age': [20, 19, 22, 21],
    'favorite_color': ['red', 'orange', 'green', 'purple'],
    'grade': [67, 78, 90, 12]
}
df = pd.DataFrame(data)
print(df)
Output:
name age favorite_color grade
0 Kwadwo 20 red 67
1 Nana 19 orange 78
2 Kwame 22 green 90
3 Naa 21 purple 12
Console:
df.columns
Output:
Index(['name', 'age', 'favorite_color', 'grade'], dtype='object')
Console:
df.values
Output:
array([['Kwadwo', 20, 'red', 67],
       ['Nana', 19, 'orange', 78],
       ['Kwame', 22, 'green', 90],
       ['Naa', 21, 'purple', 12]], dtype=object)
Console:
df.shape
Output:
(4, 4)
Console:
df.dtypes
Output:
name object
age int64
favorite_color object
grade int64
dtype: object
From the above outputs we can tell that our data is a 4 by 4 table with four rows and four columns. We can also tell the data type of each column. Notice that each column is homogeneous, meaning it holds entries of a single type; for example, the grade column only holds grades, which are integers. (We don’t expect to see a string in that column: since grades are recorded as numbers in this scenario, a string would be a data entry error.)
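To see what such a data entry error looks like in practice, here is a small sketch with a made-up bad value, 'ninety'. A single string in the grade column changes its type from integer to object, and pd.to_numeric with errors='coerce' is one way to flag the offending entry.

```python
import pandas as pd

clean = pd.DataFrame({"grade": [67, 78, 90, 12]})
dirty = pd.DataFrame({"grade": [67, 78, "ninety", 12]})  # hypothetical typo

print(clean["grade"].dtype)  # an integer type
print(dirty["grade"].dtype)  # object, because the column now mixes types

# errors="coerce" turns anything that is not a number into NaN.
fixed = pd.to_numeric(dirty["grade"], errors="coerce")
print(fixed.isna().sum())    # how many entries could not be parsed
```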
Sorting
We can sort the rows of a DataFrame by the values in one of its columns. In this instance we will sort by age in ascending order.
Console:
df.sort_values(by='age')
Output:
name age favorite_color grade
1 Nana 19 orange 78
0 Kwadwo 20 red 67
3 Naa 21 purple 12
2 Kwame 22 green 90
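Sorting the other way round only needs the ascending argument. The sketch below uses a trimmed copy of the same table, sorts by age in descending order, and shows that sort_values returns a new DataFrame rather than changing df.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Kwadwo", "Nana", "Kwame", "Naa"],
    "age": [20, 19, 22, 21],
    "grade": [67, 78, 90, 12],
})

# ascending=False puts the largest age first.
by_age_desc = df.sort_values(by="age", ascending=False)
print(by_age_desc["name"].tolist())

# The original DataFrame keeps its row order.
print(df["name"].tolist())
```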
Slicing
Console:
df[['age', 'grade']] #Display the age and grade columns only
Output:
age grade
0 20 67
1 19 78
2 22 90
3 21 12
Console:
df['age'] #Display the age column only
Output:
0 20
1 19
2 22
3 21
Name: age, dtype: int64
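The two outputs above look different because single brackets return a Series while double brackets return a DataFrame. A quick check on a trimmed copy of the table (the variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Kwadwo", "Nana", "Kwame", "Naa"],
    "age": [20, 19, 22, 21],
    "grade": [67, 78, 90, 12],
})

age_series = df["age"]   # one column name -> Series
age_frame = df[["age"]]  # list of column names -> DataFrame

print(type(age_series).__name__, age_series.shape)
print(type(age_frame).__name__, age_frame.shape)
```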
When accessing data, you can select either by label or by position.
Selection by label
Console:
df.loc[:2] #Display the first three rows (label slices include the end label 2)
Output:
name age favorite_color grade
0 Kwadwo 20 red 67
1 Nana 19 orange 78
2 Kwame 22 green 90
Console:
df.loc[2:] #Display the last two rows
Output:
name age favorite_color grade
2 Kwame 22 green 90
3 Naa 21 purple 12
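.loc also accepts column labels after the row labels, and it works with any kind of label, not just the default numbers. A short sketch, assuming we set the name column as the index for the last lookup:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Kwadwo", "Nana", "Kwame", "Naa"],
    "age": [20, 19, 22, 21],
    "grade": [67, 78, 90, 12],
})

# Rows labelled 1 to 2 (inclusive), and only two of the columns.
subset = df.loc[1:2, ["name", "grade"]]
print(subset)

# A single cell: row label 0, column label 'name'.
print(df.loc[0, "name"])

# With names as the index, rows can be looked up by name.
by_name = df.set_index("name")
print(by_name.loc["Kwame", "age"])
```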
Selection by Position
Console:
df.iloc[3] #Display the fourth row (position 3)
Output:
name Naa
age 21
favorite_color purple
grade 12
Name: 3, dtype: object
Console:
df.iloc[2:4, 0:1] #Display rows at positions 2 and 3, first column only
Output:
name
2 Kwame
3 Naa
Console:
df[df['age'] > 20] #Display the rows that have age greater than 20
Output:
name age favorite_color grade
2 Kwame 22 green 90
3 Naa 21 purple 12
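The last example filters with a condition rather than a position. The condition produces a column of True/False values, and several conditions can be combined with & (and) or | (or), each wrapped in parentheses. A sketch on a trimmed copy of the table, with thresholds picked purely for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Kwadwo", "Nana", "Kwame", "Naa"],
    "age": [20, 19, 22, 21],
    "grade": [67, 78, 90, 12],
})

# The condition alone is a Series of booleans, one per row.
mask = df["age"] > 20
print(mask.tolist())

# Both conditions must hold.
older_and_passing = df[(df["age"] > 20) & (df["grade"] > 50)]
print(older_and_passing["name"].tolist())

# At least one condition must hold.
young_or_top = df[(df["age"] < 20) | (df["grade"] >= 90)]
print(young_or_top["name"].tolist())
```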
Summary Statistics
We can compute summary statistics on our data. Only the numeric (integer or float) columns are summarized.
Console:
df['age'].mean() #Get the mean of the age column
Output:
20.5
Console:
df['age'].std() #Display the standard deviation
Output:
1.2909944487358056
Console:
df['age'].min() #Display the minimum value
Output:
19
Console:
df['age'].max() #Display the maximum value
Output:
22
Console:
df['age'].var() #Display the variance
Output:
1.6666666666666667
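If these numbers differ from what NumPy gives you, the likely reason is that Pandas computes the sample standard deviation and variance (dividing by n - 1) by default, while NumPy's np.std divides by n. The ddof argument switches between the two, as this sketch on the same four ages shows:

```python
import numpy as np
import pandas as pd

age = pd.Series([20, 19, 22, 21], name="age")

sample_var = age.var()            # divides by n - 1 (ddof=1 is the Pandas default)
population_var = age.var(ddof=0)  # divides by n

print(sample_var)
print(population_var)

# With ddof=0, Pandas agrees with NumPy's default.
print(np.isclose(age.std(ddof=0), np.std(age.to_numpy())))
```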
Alternatively, we can display all of these statistics at once with a single line of code:
Console:
df.describe()
Output:
age grade
count 4.000000 4.000000
mean 20.500000 61.750000
std 1.290994 34.471002
min 19.000000 12.000000
25% 19.750000 53.250000
50% 20.500000 72.500000
75% 21.250000 81.000000
max 22.000000 90.000000
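Notice that describe() left out the text columns. If you only want a few statistics, agg() lets you name them yourself. This is a sketch on a trimmed copy of the table; the choice of mean and median is just an example.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Kwadwo", "Nana", "Kwame", "Naa"],
    "age": [20, 19, 22, 21],
    "grade": [67, 78, 90, 12],
})

# By default describe() summarizes only the numeric columns.
print(list(df.describe().columns))

# agg() computes just the statistics you ask for.
summary = df[["age", "grade"]].agg(["mean", "median"])
print(summary)
```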
Importing and exporting data
We have learnt how to create data in Pandas. But what if you have a dataset that you want to import/export? The dataset could be in any format: .csv, .txt, .xlsx, .json etc.
Take for instance a csv file called Lemonade.csv containing 365 rows and 11 columns.
Console:
data = pd.read_csv('/home/wilson/Downloads/Lemonade.csv')
print(data)
Now the contents of Lemonade.csv have been loaded into a DataFrame named ‘data’. It still contains 365 rows and 11 columns.
You can also write a DataFrame to a file (also called exporting). Here we subset the Lemonade data that we imported earlier by picking the first five rows, store that subset in a new variable called ‘new’, and then export it as ‘export.csv’ in the Downloads folder.
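If you don't have the file at hand, you can still try read_csv, because it also accepts an in-memory text buffer. The sketch below uses three made-up rows; the column names Date, Temperature and Sales are assumptions and not necessarily those of the real Lemonade file.

```python
import io
import pandas as pd

# Hypothetical stand-in for a few rows of a sales file.
csv_text = """Date,Temperature,Sales
2017-01-01,27.0,10
2017-01-02,28.9,13
2017-01-03,34.5,15
"""

data = pd.read_csv(io.StringIO(csv_text))

print(data.shape)    # (rows, columns)
print(data.head(2))  # first two rows
print(data.dtypes)   # type inferred for each column
```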
Console:
import pandas as pd
data = pd.read_csv('/home/wilson/Downloads/Lemonade.csv')
data
new = data.head(5)
new.to_csv('/home/wilson/Downloads/export.csv')
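One thing to watch when exporting: by default to_csv also writes the row index as an extra first column, which comes back as 'Unnamed: 0' when the file is read again. Passing index=False avoids that. The sketch below uses a small made-up DataFrame (the day and sales columns are hypothetical) and a temporary folder instead of the Downloads path.

```python
import os
import tempfile
import pandas as pd

# Hypothetical stand-in for the imported data.
data = pd.DataFrame({"day": range(1, 8), "sales": [10, 13, 15, 17, 18, 20, 21]})
new = data.head(5)

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "export.csv")

    new.to_csv(path)               # default: the index is written too
    with_index = pd.read_csv(path)

    new.to_csv(path, index=False)  # leave the index out of the file
    without_index = pd.read_csv(path)

print(list(with_index.columns))
print(list(without_index.columns))
```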
Exercise
Go to https://github.com/wbusaka/Lemonade and pull the Lemonade dataset in .xlsx format. It contains 366 rows and 11 columns. The data comes from a lemonade stand’s sales, collected from January to December 2017.
Open the ‘Lemonade.xlsx’ file in MS Excel.
Convert the ‘Lemonade.xlsx’ dataset to ‘.csv’ and save it on your computer.
Import Pandas.
Import the new ‘Lemonade.csv’ file into python.
Slice the dataset from row 245 to row 321 and store it in a new variable called ‘sliced_data’.
Filter the data to obtain only the month of ‘August’ and store it in a new variable called ‘August’.
Compute summary statistics for the month of August (i.e. mean, max, median, standard deviation etc.).
- As always, post the result on social media (Twitter, Facebook or Instagram) to receive a confirmation or assistance :-)
NB: This exercise utilizes concepts in this tutorial.
Conclusion
Pandas is the real deal when it comes to data science, as most real-world data is stored in a tabular format (rows and columns). As they say, learning is infinite as long as you’re breathing.
Read More
https://pandas.pydata.org/pandas-docs/version/0.22/cookbook.html#cookbook
Python Data Analysis Library (pandas): pandas.pydata.org
https://www.codecademy.com/learn/data-processing-pandas
Stay Tuned!