In this article, I will show how to use the pandas library to perform real data analysis. Pandas offers a large collection of methods, so I can only cover a handful of them here. If you are comfortable with the Python programming language but completely new to pandas, I advise you to skim the pandas documentation first so that you gain an insight into the library's basic operations before continuing with this post. If you are proficient in Python and have encountered pandas before, you are good to go. Either way, the best way to master pandas operations is to analyse a real dataset.
What is Pandas?
Pandas is an open-source data science library built on top of NumPy. It enables fast, expressive data analysis, cleaning and preparation. Pandas stores datasets in in-memory data frames, and essentially all of its operations run on these data frames. To illustrate, a pandas DataFrame is similar to an in-memory Excel sheet. In this tutorial, I will use the library to gain more insight into a food price dataset.
Brief Description of the Dataset
The food price dataset was obtained from the National Bureau of Statistics. It covers a number of selected food items and compares their prices month to month and year to year in Lagos state, alongside the states with the highest and lowest prices. It lists the prices of different food items from January 2017 to August 2018, and shows the percentage increase or decrease in food prices from 2017 to 2018 as well as between July and August 2018. The last four columns show the states with the highest price and the states with the lowest price of each food item respectively.
Setting up your environment
In this tutorial, I will use Jupyter Notebook as my development environment. If there is one tool every Data Scientist should be comfortable with, it is Jupyter Notebook, also known as the IPython Notebook. It provides an interactive environment during development: you can run your code cell by cell, which helps you identify errors very quickly. It also boosts productivity and collaboration, because Data Scientists can create and share notebooks with their peers.
If you are new to Jupyter Notebook, you can follow these instructions to set it up on your system manually.
So, I assume you now have Jupyter Notebook successfully installed on your system. To launch the notebook, simply run:

jupyter notebook

and let's get started.
Note: This article will not teach Jupyter Notebook itself. You can follow this Jupyter Notebook for beginners guide if you would like to learn how to use it.
To start with, let's import pandas in the first cell of your notebook:
import pandas as pd
In the next cell, use the method below to load the CSV data into a pandas DataFrame. The name of the CSV file that contains the data is food_price.csv:
fp = pd.read_csv('food_price.csv')
You can also load an Excel file with pandas. Insert the method below into the next cell to load Excel data into a pandas DataFrame. The name of the file is food_price.xlsx. Even though this file has only one sheet, you still specify the sheet name, Sheet 1; if your Excel file has several sheets, you must indicate which sheet to load into the DataFrame. Note that the parameter is called sheet_name in current pandas versions.

df = pd.read_excel("food_price.xlsx", sheet_name="Sheet 1")
You might want to preview your DataFrame by taking a snapshot of the data, that is, its first n rows and last n rows. Use the pandas head and tail functions to obtain the first five rows and the last five rows of the DataFrame respectively.
# returns first five rows
fp.head(5)

# returns last five rows
fp.tail(5)
The handling of missing values is a very sensitive issue in data management. Failing to handle missing values properly can lead to inaccurate inferences drawn from the data. There are several ways to handle missing data, depending on the nature of your dataset. Some of these ways include:
- removing rows with missing values entirely
- replacing missing values with the column median or mean
- setting a value for missing entries based on your existing data
However, I will take the first option, because my data features are independent and there are no outliers in this case, so deleting a row with a missing value will not affect the inferences I want to draw from the data. You can call the dropna function to remove rows with missing values.
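As a sketch, here is how dropna removes every row that contains at least one missing value. The small frame below is a stand-in I made up so the snippet runs on its own; in the notebook you would call dropna on the fp DataFrame loaded earlier.

```python
import pandas as pd

# tiny stand-in for the food price DataFrame (illustrative values only);
# 'Aug-18' mirrors a column name used later in this tutorial
fp = pd.DataFrame({
    "ItemLabels": ["Yam tuber", "Rice local, sold loose", "Beans brown"],
    "Aug-18": [620.0, None, 305.0],
})

# drop every row that contains at least one missing value
fp = fp.dropna()
print(fp.shape)  # the row with the missing price is gone
```

By default dropna returns a new DataFrame, so reassigning (or passing inplace=True) is needed to keep the cleaned result.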
Any data-driven professional should understand the nature of his or her dataset early, before performing operations on it. The shape of the data describes basic features of the data frame, such as the number of rows and columns. Whether you work as a Data Analyst, Data Scientist, Data Engineer or Machine Learning Engineer, knowing the shape helps you avoid shape errors later in your work. You can check the shape of your data frame with the shape attribute.
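For illustration (again using a small made-up stand-in frame rather than the real food price data), shape is an attribute that returns a (rows, columns) tuple:

```python
import pandas as pd

# stand-in frame; in the notebook you would inspect fp.shape directly
fp = pd.DataFrame({
    "ItemLabels": ["Yam tuber", "Rice local, sold loose"],
    "Aug-18": [620.0, 365.0],
})

# .shape is a (rows, columns) tuple
rows, cols = fp.shape
print(rows, cols)  # → 2 2
```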
It is also a good idea to stay up to date on the make-up of your DataFrame. You can obtain the index data type, the column data types, the non-null counts and the memory usage. The info function prints this summary of your DataFrame.
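A minimal sketch of info on a stand-in frame (the real fp frame would show the full food price columns):

```python
import pandas as pd

# stand-in frame with one deliberately missing value,
# so the non-null counts in the summary differ per column
fp = pd.DataFrame({
    "ItemLabels": ["Yam tuber", "Rice local, sold loose"],
    "Aug-18": [620.0, None],
})

# prints the index type, each column's dtype and non-null count,
# and the frame's approximate memory usage
fp.info()
```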
You can check the range of the index of the DataFrame using the index attribute.
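On a freshly loaded frame, index is a RangeIndex covering the row count. A sketch with a stand-in frame:

```python
import pandas as pd

# stand-in frame; call fp.index on your own DataFrame in the notebook
fp = pd.DataFrame({
    "ItemLabels": ["Yam tuber", "Rice local, sold loose"],
    "Aug-18": [620.0, 365.0],
})

idx = fp.index
print(idx)  # → RangeIndex(start=0, stop=2, step=1)
```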
Now that you know your index range, you might want to obtain the second row of the data frame, at index 1, using the integer location (iloc) indexer. The iloc indexer takes an integer position as its argument. You can also obtain the same second row using a label instead, with the location (loc) indexer. You can run either command to print the second row of your data frame.
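A sketch of both indexers on a stand-in frame. With the default RangeIndex, the row labels happen to be 0, 1, ..., so loc[1] and iloc[1] return the same row here; on a frame with a custom index they would differ.

```python
import pandas as pd

# stand-in frame with the default RangeIndex
fp = pd.DataFrame({
    "ItemLabels": ["Yam tuber", "Rice local, sold loose"],
    "Aug-18": [620.0, 365.0],
})

# integer position 1 → the second row
second_by_position = fp.iloc[1]

# label 1 → also the second row, because labels equal positions here
second_by_label = fp.loc[1]
```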
The describe function computes summary statistics such as the count, mean, standard deviation, quartile values, minimum and maximum for numeric columns only; character columns are excluded. Run the command to display the summary statistics of the food price DataFrame.
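A sketch on a stand-in frame, showing that the non-numeric ItemLabels column is left out of the summary:

```python
import pandas as pd

# stand-in frame mixing a text column and a numeric column
fp = pd.DataFrame({
    "ItemLabels": ["Yam tuber", "Rice local, sold loose", "Beans brown"],
    "Aug-18": [620.0, 365.0, 305.0],
})

# count, mean, std, min, quartiles and max for numeric columns only
stats = fp.describe()
print(stats)
```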
Now, let us solve some real-world problems with the library using the food price data.
To start with, let's print out the food items whose purchase price was above 1,500 Naira in both June and July 2018 in Lagos state.
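One way to express this is a boolean filter over the two month columns. The column names 'Jun-18' and 'Jul-18' are my assumption, mirroring the 'Aug-18' naming used later in this post, and the stand-in frame below uses made-up prices just so the snippet runs:

```python
import pandas as pd

# stand-in frame; 'Jun-18'/'Jul-18' column names and all prices are assumed
fp = pd.DataFrame({
    "ItemLabels": ["Mudfish dried", "Yam tuber", "Catfish smoked"],
    "Jun-18": [1750.0, 610.0, 1820.0],
    "Jul-18": [1790.0, 615.0, 1860.0],
})

# keep rows priced above 1,500 Naira in both months
expensive = fp[(fp["Jun-18"] > 1500) & (fp["Jul-18"] > 1500)]["ItemLabels"]
print(expensive)
```

Note the parentheses around each condition: & binds more tightly than the comparisons, so they are required.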
The result indicates three food items, dried Mudfish, dried Catfish and smoked Catfish, with prices above 1,500 Naira in June and July 2018 in Lagos state.
Next, let's calculate the mean of the highest and lowest prices of all the food items. The column names are higher price and lower price respectively.
# returns the average of the higher price spent on the purchase of food items
fp['higher price'].mean()

# returns the average of the lower price spent on the purchase of food items
fp['lower price'].mean()
The averages of the higher and lower prices spent on the purchase of food items come to approximately 890 and 421 Naira respectively.
Now let’s calculate the highest and lowest purchase price for food items in August 2018 using the max and min functions respectively. The column name is Aug-18.
# returns highest purchase price for food items
fp['Aug-18'].max()

# returns lowest purchase price for food items
fp['Aug-18'].min()
The maximum and minimum purchase prices of food items in August 2018 are approximately 1,992 and 43 Naira respectively.
Next, we want to find the seven states that most often record high food prices, and the number of times each appears as the most expensive state for a food item.
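This is a frequency count over the highest-price state column, which value_counts handles directly. The column name 'State with highest price' is my assumption for one of the last four columns described earlier, and the stand-in frame uses made-up rows:

```python
import pandas as pd

# stand-in frame; column name and values are assumed for illustration
fp = pd.DataFrame({
    "State with highest price": [
        "Bayelsa", "Kogi", "Bayelsa", "Ekiti", "Bayelsa",
    ],
})

# how often each state records the highest price, top seven states
top_states = fp["State with highest price"].value_counts().head(7)
print(top_states)
```

value_counts returns the counts sorted in descending order, so the first entry is the state that appears most often.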
The seven states returned with the highest food prices, in decreasing order, are Bayelsa, Kogi, Ekiti, Lagos, Anambra, Sokoto and Yobe. Bayelsa records the largest number of food items at high prices.
Let's obtain the seven states that most often record low food prices, and the number of times each appears as the cheapest state for a food item.
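The same value_counts pattern applies to the lowest-price column. As before, the column name 'State with lowest price' is an assumption and the stand-in rows are made up:

```python
import pandas as pd

# stand-in frame; column name and values are assumed for illustration
fp = pd.DataFrame({
    "State with lowest price": [
        "Taraba", "Gombe", "Taraba", "Bauchi", "Taraba",
    ],
})

# how often each state records the lowest price, top seven states
top_cheap = fp["State with lowest price"].value_counts().head(7)
print(top_cheap)
```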
The seven states printed with the lowest food prices, in decreasing order, are Taraba, Gombe, Bauchi, Cross River, Oyo, Ogun and Osun. It shows that Taraba, Gombe and Bauchi have the highest numbers of food items at lower prices.
Let’s obtain the percentage change in price of yam tuber from January 2017 to August 2018 in Lagos state.
fp[fp["ItemLabels"] == 'Yam tuber']['Year on Year %']
It indicates there was a 37.55% increase in the price of yam tuber from January 2017 to August 2018 in Lagos state.
Let’s obtain the percentage change in the price of yellow garri from January 2017 to August 2018 in Lagos state.
fp[fp["ItemLabels"] == 'Gari yellow,sold loose']['Year on Year %']
It indicates there was a 9.7% reduction in the price of yellow garri from January 2017 to August 2018 in Lagos state.
Let's obtain the percentage change in the price of chicken wings from July 2018 to August 2018.
fp[fp["ItemLabels"] == 'Chicken Wings']['Month on Month %']
It indicates there was a 3.9% reduction in the price of chicken wings from July 2018 to August 2018.
Overall, we have drawn some meaningful conclusions from the food price dataset with the pandas library. You can use this tutorial to learn how to apply pandas operations to real datasets, and you can get your hands dirty by trying other pandas operations this tutorial did not cover. For the complete notebook code, you can access the GitHub repo here.
I hope you found this tutorial helpful. Look out for upcoming articles on different Data Science and Machine Learning technologies. In the meantime, feel free to drop your comments.