Food Price Prediction using Regression —Data Cleaning and Preprocessing

Rusiri Illesinghe
6 min readOct 13, 2021

--

Photo by Daria Shevtsova from Pexels

Table of Contents :

1. Brief Introduction

2.The Dataset

3.Data Cleaning

4.Data Pre-processing

Brief Introduction

This is the 1st part of my demonstration on how to predict food prices using a real world dataset. This contains the introduction, data cleaning and preprocessing steps. For the feature selection, model training and evaluation please click here to refer the second part of this!

The dataset I’m using here contains the food prices details of several years in Sri Lanka.

Food price varies with time and some other factors depending on the context. By looking at the patterns affected by external factors we can predict prices using machine learning approaches.

For the demonstration I’m using Google Colab Notebook

Dataset

Thankfully, the dataset was taken from https://data.humdata.org/

You can download the dataset from here :https://data.humdata.org/dataset/wfp-food-prices-for-sri-lanka/resource/3638f0d6-9969-48cf-a919-1d879d037ec6

Now we have the dataset! But yet we don’t know how the data is organized there. Data can be in different forms. And also some of the data fields may obviously will not affect the food price. If that so you may remove them before proceeding further.

If you open the .csv file you can see different data field and their values. To get a clear idea on the dataset we deal with, we can view it in our Colab notebook.

Here I have downloaded and stored the datafile in Google Drive.

creating a DataFrame from datafile

We can see that the 0th row contains some unwanted data. Hence we remove that row.

This is how the dataframe appears now

Initial DataFrame without 0th row

Data cleaning

At this point, there are 4168 rows and 13 columns in the data frame.

Then I’m looking for missing values, because otherwise our model will go wrong with missing values. There are many ways to deal with missing values.

We can simply remove the records with missing values and also we can give a value for the missing data, based on the rest of the data etc. Here let’s see how many columns will remain if we drop the records (rows) which contain the missing data values. Then I re-arrange the data frame by removing empty spaces of deleted rows.

(Row count, Column count) of data frame

Still there are 3995 rows, which is a satisfiable amount of data points compared to the data points we initially had.

Data Preprocessing

Let’s start!

LKR (Sri Lankan Rupees) is the currency used in Sri Lanka. The ‘price’ field contains the price of a particular item in LKR. Hence we don’t need a separate field to contain the ‘currency’. I simply remove that column from dataframe.

‘usdprice’ field contains the same price converted in USD.

We are going to do a price prediction in LKR. So you may identify that usdprice field is not helpful either in feature set or label set. So, simply I’m removing that field.

I remove the ‘pricetype’ field also because it’s not related to the study.(I’m not going to explain these things more, to prevent flying away the focus of this article)

Also fields containing different administrative divisions were renamed according to Sri Lankan context, for the clarity of understanding.‘admin1’ was renamed as ‘province’ and ‘admin2’ was renamed as ‘district’ .

Now let’s proceed further.!

In this dataset we have price values of a given item on a given date. Especially in a developing country like Sri Lanka, the prices of the market items varies rapidly with time. So the year and month are very important facts that, according to them the prices will vary. But, the date is of the format YYYY-MM-dd. But how the machine and the underlying generic algorithms understand them while training. Although they are dates it will act as different strings in the training process which will not help for a better training.

So let’s extract year and month from the price list and include them in two separate fields. No more we need the field ‘date’. Hence we remove it.

Now you see that the month column only contains months (1 to 12) and year column have the year.

Year is a numeric value. Apart from that, as numerical values we have only ‘price’, ‘longitude’ and ‘latitude’. All the rest are type of Nominal.

What about the month? It’s categorical. Can we give weights for months? Weight of 1 for January? 12 for December. It’s not a comparison right! We use 1 for January since it’s first month of the year according to standard way. It’s an identifier. Now you may understand that there’s no point of training the month field with values from 1 to 12!

See again, what else left now?

Apart from the fields we discussed, all the other fields are also Nominal. We cant use them to train our model as it is. We have to process them.

First we’ll take the field ‘category’.

You can see the unique values it contains by using following line of code.

We can see that, there are 6 categories. As same as in the example of ‘months’ we discussed previously, here also our model will trains with false facts, if we number these categories from 1 to 6. Because the models gets that as a weight value for the particular category. No point of assigning weights for food categories based on our own desire right !

So I’m making dictionary of those 6 categories.

To see what we are going to do, for example look at the following row,

Now what I’m doing is, I create 6 fields (6 columns) in my dataframe for those 6 categories. Now we have columns ‘pulses and nuts’, ‘vegetables and fruits’, ‘meat, fish and eggs’ and so on..!

Since the category of the above data record is ‘cereals and tubers’, I mark ‘1’ in ‘cereals and tubers’ field and 0 in rest of the category fields.

Like this :

We put just 1 ,0 there! It’s binary! Because you can see that , if category is not belongs to vegetables and fruits we just put zero there. No weight is added! And if its ‘cereals and tubers’ type, then put 1 as the field values for ‘cereals and tubers’. The highest weight value those 6 fields can have is 1. So you see those fields depicts only the existence. Not a number. So you are done with those nominal values now!

Now we don’t need the ‘category’ field, since we have these 6 fields. So simply drop it!
Now, instead of field category in our data frame we have 6 fields with names of those categories as field names!

At this point this is how your dataframe appears!

You can preprocess all the other nominal values in the same way we did for ‘category’ field! Try it out!!!

After preprocess all the other fields like the above, then write your dataframe to csv and save it in your drive to access it further!

We have come a long way! Now you know how to preprocess the values in the dataset before training a machine learning model.

Click here to refer the second part of the article to predict the prices, including feature selection, model training and evaluation!

Cheers!!!

--

--