Pandas for food price analysis

In this article, I will show how to use the pandas library to perform real data analysis. Pandas offers a vast collection of methods, so I can only cover a few of them here. If you know the Python programming language but are absolutely new to pandas, I advise you to skim the pandas documentation first so that you gain some insight into the library's basic operations before going through this post. If you are proficient in Python and have encountered pandas before, you are good to go. Either way, the best way to master pandas operations is to work through a real dataset, which is exactly what we will do.

What is Pandas?

Pandas is an open-source data science library built on top of NumPy. It supports fast, expressive data analysis, cleaning and preparation. Its central structure is the DataFrame, an in-memory table that holds your dataset and on which virtually all pandas operations run. To illustrate, a pandas DataFrame is similar to an in-memory Excel sheet. In this tutorial, I will use the library to gain insight into a food price dataset.
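To make the Excel-sheet analogy concrete, here is a minimal, self-contained sketch. The items and prices below are made up for illustration; they are not taken from the food price dataset.

```python
import pandas as pd

# A DataFrame is a labelled, in-memory table, much like a single Excel sheet:
# named columns, an index for the rows, and cell values.
sheet = pd.DataFrame({
    "item": ["Rice", "Beans"],   # hypothetical food items
    "price": [450.0, 380.0],     # hypothetical prices in Naira
})

print(sheet.shape)           # (2, 2) -> two rows, two columns
print(list(sheet.columns))   # ['item', 'price']
```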

Brief Description of the Dataset

The food price dataset was obtained from the National Bureau of Statistics. It contains a number of selected food items and compares their prices month to month and year to year in Lagos state, along with the states recording the highest and lowest prices. It covers the prices of different food items from January 2017 to August 2018, the percentage increase or decrease in price from 2017 to 2018, and the percentage change between July and August 2018. The last four columns display, for each food item, the states with the highest and the lowest price respectively.

Setting up your environment

In this tutorial, I will use Jupyter Notebook, also known as IPython Notebook, as my development environment. If there is one tool every Data Scientist should be comfortable with, this is it. It provides an interactive environment during development: you can run your code cell by cell, which helps you identify errors very quickly. It also increases productivity and collaboration among Data Scientists, because notebooks are documents that can be created and shared with peers.

If you are new to Jupyter Notebook, you can follow these instructions to set it up on your system manually.

So, I assume you now have Jupyter Notebook successfully installed on your system. To launch the notebook, simply run:

jupyter notebook

and let’s get started.

Note: This article will not teach Jupyter Notebook. You can follow this Jupyter Notebook for beginners if you would like to learn its application.

To start with, let's import pandas in the first cell of the notebook:

import pandas as pd

In the next cell, use the call below to load the CSV data into a pandas DataFrame. The CSV file that contains the data is named food_price.csv.

fp = pd.read_csv('food_price.csv')

You can just as well load an Excel file with pandas. Use the call below to load the Excel data into a DataFrame; the file is named food_price.xlsx. The sheet_name argument (called sheetname in older pandas versions) selects which sheet to load. Our file has a single sheet named Sheet 1, so the argument could be omitted here, but if your Excel file has more than one sheet you need to indicate which one you would like to load into the DataFrame.

df = pd.read_excel("food_price.xlsx", sheet_name="Sheet 1")

You might want to preview your DataFrame by taking a snapshot of the data. The head and tail functions return the first n and last n rows respectively; with n = 5 they return the first and last five rows of the DataFrame.

# returns first five rows
fp.head(5)
# returns last five rows
fp.tail(5)

The concept of missing values is a very sensitive issue in data management. Handling missing values improperly can lead to inaccurate inferences being drawn from the data. There are several ways to handle missing data depending on the nature of your dataset. Some of these approaches include:

  • Drop rows with missing values entirely
  • Replace missing values with the column median or mean
  • Substitute a value derived from your existing data
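The second option can be sketched with pandas' fillna; the tiny DataFrame below is hypothetical, not the food price data:

```python
import pandas as pd

# Hypothetical prices with one missing value
prices = pd.DataFrame({
    "item": ["Rice", "Beans", "Yam"],
    "price": [450.0, None, 600.0],
})

# Option 2: replace missing values with the column median
median_price = prices["price"].median()       # median of 450 and 600 -> 525
filled = prices.fillna({"price": median_price})

print(filled["price"].tolist())  # [450.0, 525.0, 600.0]
```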

However, I will use the first option, because my data features are independent and there are no outliers in this case; deleting a row with a missing value will not affect the inferences I want to draw from the data. Note that dropna returns a new DataFrame rather than modifying the existing one in place, so assign the result back:

fp = fp.dropna()

It is highly recommended that any data-driven professional understand the nature of his or her dataset early, before performing operations on it. The shape of a DataFrame gives its basic dimensions: the number of rows and columns. Knowing it will save you from shape errors later, whatever your specialization as a Data Analyst, Data Scientist, Data Engineer or Machine Learning Engineer. Enter the command below to check the shape of your DataFrame.

fp.shape

It is also a good idea to stay up to date on the contents of your DataFrame. You can obtain the index data type, the column data types, the counts of non-null values and the memory usage. Enter the info function below to print this summary of your DataFrame.

fp.info()

You can check the range of the index of the DataFrame using the index attribute.

fp.index

Now that you know your index range, you might want to obtain the second row of the DataFrame, at index 1, using the integer-location indexer iloc, which selects rows by integer position. Alternatively, the label-location indexer loc selects rows by index label instead. Enter the command below to print the second row of your DataFrame.

fp.iloc[1]
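To see the difference between iloc and loc, here is a small sketch on a hypothetical miniature of the food price frame (the labels and prices are made up); with the default integer index, the two indexers happen to select the same row:

```python
import pandas as pd

# Hypothetical miniature of the food price DataFrame
df = pd.DataFrame({
    "ItemLabels": ["Agric eggs", "Beans brown", "Bread sliced"],
    "Aug-18": [550.0, 380.0, 300.0],
})

row_by_position = df.iloc[1]  # second row, by integer position
row_by_label = df.loc[1]      # row whose index label is 1

print(row_by_position["ItemLabels"])  # Beans brown
```

On a DataFrame whose index is a column of labels rather than the default range, the two calls would generally select different rows.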

The describe function computes summary statistics for the numeric columns only: count, mean, standard deviation, minimum and maximum, and the 25th, 50th and 75th percentiles. Character columns are excluded by default. Enter the command below to display the summary statistics of the food price DataFrame.

fp.describe()
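If you also want summary statistics for the text columns (count, number of unique values, most frequent value and its frequency), you can pass the include argument to describe. A sketch with hypothetical data:

```python
import pandas as pd

# Hypothetical data: one text column, one numeric column
df = pd.DataFrame({
    "ItemLabels": ["Rice", "Rice", "Yam"],
    "Aug-18": [450.0, 470.0, 600.0],
})

numeric_summary = df.describe()               # numeric columns only (default)
text_summary = df.describe(include="object")  # text columns only

print(text_summary.loc["top", "ItemLabels"])  # Rice (the most frequent label)
```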

Now, let us solve some real-world problems with the library using the food price data.

To start with, let's print the food items whose purchase price was above 1,500 Naira in both June and July 2018 in Lagos state.

fp[(fp['Jun-18'] > 1500) & (fp['Jul-18'] > 1500)]

The result indicates three food items, dried Mudfish, dried Catfish and smoked Catfish, with prices above 1,500 Naira in both June and July 2018 in Lagos state.


Next, let's calculate the mean of the highest and lowest prices of all the food items. The column names are higher price and lower price respectively.

# returns the average of the higher price paid for food items
fp['higher price'].mean()

# returns the average of the lower price paid for food items
fp['lower price'].mean()

The averages of the higher and lower prices paid for food items should come out to approximately 890 and 421 Naira respectively.


Now let's calculate the highest and lowest purchase prices for food items in August 2018, using the max and min functions respectively. The column name is Aug-18.

# returns highest purchase price for food items
fp['Aug-18'].max()
# returns lowest purchase price for food items
fp['Aug-18'].min()

The maximum and minimum purchase prices of food items in August 2018 are approximately 1,992 and 43 Naira respectively.


Next, we want to find the seven states that most often recorded the highest price of a food item, and the number of times each appeared as the most expensive.

fp['high state'].value_counts().head(7)

The seven states returned, in decreasing order, are Bayelsa, Kogi, Ekiti, Lagos, Anambra, Sokoto, and Yobe. Bayelsa records the highest price for the largest number of food items.


Similarly, let's obtain the seven states that most often recorded the lowest price of a food item, and the number of times each appeared as the cheapest.

fp['low state'].value_counts().head(7)

The seven states printed, in decreasing order, are Taraba, Gombe, Bauchi, Cross River, Oyo, Ogun, and Osun. Taraba, Gombe, and Bauchi record the lowest price for the largest number of food items.


Let's obtain the percentage change in the price of yam tuber from January 2017 to August 2018 in Lagos state.

fp[fp["ItemLabels"] == 'Yam tuber']['Year on Year %']

It indicates there was a 37.55% increase in the price of yam tuber from January 2017 to August 2018 in Lagos state.


Let’s obtain the percentage change in the price of yellow garri from January 2017 to August 2018 in Lagos state.

fp[fp["ItemLabels"] == 'Gari yellow,sold loose']['Year on Year %']

It indicates there was a 9.7% reduction in the price of yellow garri from January 2017 to August 2018 in Lagos state.


Let's obtain the percentage change in the price of chicken wings from July 2018 to August 2018 in Lagos state.

fp[fp["ItemLabels"] == 'Chicken Wings']['Month on Month %']

It indicates there was a 3.9% reduction in the price of Chicken wings from July 2018 to August 2018 in Lagos state.


Overall, we have drawn some meaningful conclusions from the food price dataset with the pandas library. You can use this tutorial to learn how to apply pandas operations to real datasets, and get your hands dirty by trying other pandas operations that were not employed here. For the complete codebase on the notebook, you can access the Github repo here.

I hope you found this tutorial helpful. Look out for upcoming articles on different Data Science and Machine Learning technologies. In the meantime, feel free to drop your comments.