Food Price in Mozambique — Building the Prediction Algorithm

Herco ZauZau
6 min read · Oct 13, 2023


https://dumbanengue.up.railway.app

Introduction

In this tutorial, we’ll explore in detail how to build a food price prediction model using Machine Learning techniques. This process is essential for making informed decisions, identifying opportunities and mitigating risks related to the food market.

Through clear and practical steps, we’ll guide you from collecting and preparing data to building and evaluating the model. You’ll learn how to transform historical data into valuable insights, allowing you to anticipate changes in food prices.

Let’s begin the journey towards building a reliable and effective food price prediction model.

Dataset Information

Dataset Dictionary

  • date: The date when the price data was recorded
  • admin1: The primary administrative region (e.g., state, province)
  • admin2: The secondary administrative region (e.g., district)
  • market: The name of the market where the price data was collected
  • latitude: The latitude coordinate of the market location
  • longitude: The longitude coordinate of the market location
  • category: The category to which the commodity belongs (e.g., cereals and tubers, pulses and nuts)
  • commodity: The specific name or type of commodity (e.g., rice, maize, beef)
  • unit: The unit of measurement for the price (e.g., kg, liter)
  • priceflag: A flag indicating any specific conditions or annotations related to the price data
  • pricetype: The type of price data (e.g., retail, wholesale)
  • currency: The currency in which the price is recorded
  • price: The price of the commodity in the local currency
  • usdprice: The price of the commodity converted to US dollars (USD)

Data Preprocessing

With the dataset in hand, let's start preprocessing. First of all, we need to visualize the data to get a better understanding of it.

To manipulate the dataset, we'll use the trusty pandas library.

Let’s visualize the top and bottom points of the dataset.
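A minimal sketch of loading and peeking at the data with pandas. The `raw_csv` string below is a tiny stand-in for the real WFP food-prices CSV (the column names follow the dataset dictionary above, and the second line imitates the comment row we'll remove shortly):

```python
import io
import pandas as pd

# Tiny stand-in for the real CSV; in practice you'd pass a file path to read_csv.
raw_csv = """date,admin1,admin2,market,latitude,longitude,category,commodity,unit,priceflag,pricetype,currency,price,usdprice
#date,#adm1,#adm2,#market,#lat,#lon,#category,#item,#unit,#flag,#type,#currency,#value,#value+usd
2005-01-15,Maputo,Matola,Matola,-25.96,32.46,cereals and tubers,Maize,KG,actual,Retail,MZN,5.0,0.21
2005-02-15,Sofala,Beira,Beira,-19.84,34.84,cereals and tubers,Rice,KG,actual,Retail,MZN,14.0,0.58
"""
df = pd.read_csv(io.StringIO(raw_csv))

print(df.head())  # top rows of the dataset
print(df.tail())  # bottom rows of the dataset
```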

Describe and search for null values.

Note that the first row of the dataset contains comments about the keys. Let's remove that row along with the rows containing null values.
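The null check and cleanup can be sketched like this, on a small stand-in frame where row 0 holds the comment tags and one row has a missing value:

```python
import pandas as pd

# Stand-in frame: row 0 imitates the comment row, and one row has a null.
df = pd.DataFrame({
    "date": ["#date", "2005-01-15", "2005-02-15"],
    "commodity": ["#item", "Maize", None],
    "price": ["#value", "5.0", "14.0"],
})

print(df.isnull().sum())                # count nulls per column
df = df.drop(index=0).dropna()          # drop the comment row, then rows with nulls
df = df.reset_index(drop=True)
```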

Next, let's check and convert the data types.

In the case of the date, we are interested in working with years and months, so let’s extract and divide them into different columns.
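A sketch of the type conversion and the date split, again on a small stand-in frame (after removing the comment row, numeric columns arrive as strings):

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2005-01-15", "2010-06-15"],
    "price": ["5.0", "30.5"],
})

df["price"] = df["price"].astype(float)   # string -> numeric price
df["date"] = pd.to_datetime(df["date"])   # string -> proper datetime
df["year"] = df["date"].dt.year           # split date into separate columns
df["month"] = df["date"].dt.month
```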

In our dataset, there are two distinct types of price flags.

Some samples display the real price of the commodity, while others show the predicted price. However, in all samples where the price should be predicted, the value is set to 0.

So let's remove any samples that don't provide the actual current price of the commodity.
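Since every to-be-predicted sample carries a price of 0, keeping only strictly positive prices does the job. The flag values below are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "priceflag": ["actual", "forecast", "actual"],  # illustrative flag values
    "price": [5.0, 0.0, 14.0],
})

# Predicted rows have price 0, so keeping positive prices leaves
# only samples with a real, observed price.
df = df[df["price"] > 0].reset_index(drop=True)
```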

Let's see which commodities appear in the dataset.

The dataset has prices in both MZN (Mozambican Metical) and USD (US Dollar). However, in this task, we’re only interested in working with prices in MZN. So, let’s remove the USD price column, as well as the currency type column.

Since all the samples in our dataset have real prices, we can safely remove the column that indicates price flags.
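Dropping these three columns is a single call. A sketch on a stand-in frame:

```python
import pandas as pd

df = pd.DataFrame({
    "commodity": ["Maize"],
    "price": [5.0],
    "usdprice": [0.21],
    "currency": ["MZN"],
    "priceflag": ["actual"],
})

# USD prices, the currency column, and the (now constant) price flag
# carry no extra information once everything is an actual MZN price.
df = df.drop(columns=["usdprice", "currency", "priceflag"])
```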

And finally, we'll rename some columns to make the keys more descriptive.

Let’s take a look at how the dataset looks now that we’ve cleaned it up.

Then we'll save the new dataset in the processed-data directory.
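The rename and save steps might look like this; the new column names and the output path are assumptions, not the article's exact choices:

```python
from pathlib import Path
import pandas as pd

df = pd.DataFrame({"admin1": ["Maputo"], "commodity": ["Maize"], "price": [5.0]})

# Hypothetical rename: "admin1" reads better as "province".
df = df.rename(columns={"admin1": "province"})

# Save to a processed-data directory (hypothetical layout).
out = Path("data/processed")
out.mkdir(parents=True, exist_ok=True)
df.to_csv(out / "food_prices_clean.csv", index=False)
```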

Feature Engineering

Okay, now that we’ve cleaned and transformed the dataset, let’s get it ready for the model by selecting and transforming important features.

For the current year (2023), we need to know how many months of data the dataset contains.

For the year 2023, our dataset only includes data up to February. Upon closer examination, we also identified a lack of data samples from years prior to 2000.

This data imbalance has the potential to introduce bias into our model’s predictions. Such bias could be problematic, especially if the years with few data points include uncommon events or conditions not adequately represented in the dataset. In such situations, the model may generate inaccurate price predictions, either overestimating or underestimating them.

To address this concern, our analysis will primarily concentrate on the years spanning from 2000 to 2022.

Once this is done, we’ll eliminate the “non-food” category from the dataset.
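The month check and both filters can be sketched as follows, on a stand-in frame:

```python
import pandas as pd

df = pd.DataFrame({
    "year": [1998, 2005, 2022, 2023],
    "month": [3, 6, 9, 2],
    "category": ["cereals and tubers", "non-food", "pulses and nuts", "cereals and tubers"],
    "price": [2.0, 50.0, 80.0, 90.0],
})

# Which months does 2023 cover?
print(sorted(df.loc[df["year"] == 2023, "month"].unique()))

# Keep 2000-2022 (inclusive) and drop the non-food category.
df = df[df["year"].between(2000, 2022)]
df = df[df["category"] != "non-food"].reset_index(drop=True)
```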

Pay close attention to this particular step:

The dataset initially organizes prices by market, and each province may include multiple markets. For this model, our main concern is to establish generalized prices for each province.

To achieve this, we'll remove the market and district columns. The new price for each commodity will be calculated as the average of the original prices across all markets within the province.

And voilà! A dataset of commodity prices by province.
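A sketch of the aggregation with a pandas `groupby`. The column names here assume the hypothetical renames from earlier (e.g. `admin1` → `province`):

```python
import pandas as pd

df = pd.DataFrame({
    "province": ["Maputo", "Maputo", "Sofala"],
    "district": ["Matola", "KaTembe", "Beira"],
    "market": ["Matola", "KaTembe", "Beira"],
    "year": [2020, 2020, 2020],
    "month": [1, 1, 1],
    "commodity": ["Maize", "Maize", "Maize"],
    "price": [10.0, 14.0, 9.0],
})

# Drop market-level detail and average prices per province/period/commodity.
df = (
    df.drop(columns=["district", "market"])
      .groupby(["province", "year", "month", "commodity"], as_index=False)["price"]
      .mean()
)
```

Maputo's two markets (10 and 14 MZN) collapse into a single provincial average of 12 MZN.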

Now let’s associate each commodity with its respective unit. Certain commodities may be available in multiple units. For instance, “Rice” is available in both kilograms and 25-kilogram units.
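One simple way to do this (an assumption on my part, not necessarily the article's exact formatting) is to fold the unit into the commodity name, so that "Rice (KG)" and "Rice (25 KG)" become distinct products:

```python
import pandas as pd

df = pd.DataFrame({
    "commodity": ["Rice", "Rice"],
    "unit": ["KG", "25 KG"],
    "price": [60.0, 1400.0],
})

# Merge unit into the commodity name, then drop the now-redundant column.
df["commodity"] = df["commodity"] + " (" + df["unit"] + ")"
df = df.drop(columns=["unit"])
```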

Let’s see how many types each column has.

We have 54 different types of commodities in the column. However, some of these commodities occur infrequently in the dataset. To mitigate the risk of bias, we’ll develop a function to remove these infrequent occurrences.

This will also help reduce the number of columns in the dataset after one-hot encoding is applied.
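A minimal version of such a function, using `value_counts` to find frequent values (the threshold of 4 here is arbitrary):

```python
import pandas as pd

df = pd.DataFrame({
    "commodity": ["Maize"] * 5 + ["Rice"] * 4 + ["Saffron"],
    "price": [float(i) for i in range(10)],
})

def drop_rare(frame, column, min_count):
    """Keep only rows whose value in `column` appears at least `min_count` times."""
    counts = frame[column].value_counts()
    frequent = counts[counts >= min_count].index
    return frame[frame[column].isin(frequent)].reset_index(drop=True)

df = drop_rare(df, "commodity", min_count=4)  # "Saffron" appears once and is dropped
```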

Now we can perform one-hot encoding on columns containing non-numeric values in order to prepare the dataset for use with the model.
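With pandas this is a single `get_dummies` call over the categorical columns:

```python
import pandas as pd

df = pd.DataFrame({
    "province": ["Maputo", "Sofala"],
    "commodity": ["Maize (KG)", "Rice (KG)"],
    "year": [2020, 2020],
    "month": [1, 2],
    "price": [12.0, 60.0],
})

# One-hot encode the non-numeric columns; each category becomes a 0/1 column.
df = pd.get_dummies(df, columns=["province", "commodity"])
```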

Model

And finally the work is over. From now on it’s just fun.

Let’s build our model based on the transformed features of the dataset.

For this specific problem, we’ll use the Linear Regression Algorithm in our model.

Linear regression is an excellent choice for predicting prices because of its simplicity, ease of interpretation, computational efficiency, and its capacity to serve as a starting point for more advanced models.

In simple terms, the idea is to draw a straight line that best fits historical data, in order to represent the general relationship between these factors and the prices. Once this line is found, we can use it to make future price predictions based on the values of these factors.

Split the data into training and testing sets with the classic 20% for testing.

Create the model and fit the training sets. Then make predictions with the testing set.
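These two steps, sketched with scikit-learn on synthetic features standing in for the encoded dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded feature matrix and price target.
rng = np.random.default_rng(0)
X = rng.random((200, 5))
y = X @ np.array([3.0, -1.0, 2.0, 0.5, 4.0]) + rng.normal(0, 0.1, 200)

# Classic 80/20 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)       # fit on the training set
y_pred = model.predict(X_test)    # predict on the held-out set
```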

The most crucial aspect!

We’ll evaluate the model’s performance using Mean Squared Error and Coefficient of Determination.
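Both metrics are one-liners in scikit-learn. The true/predicted values below are illustrative, not the article's results:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Illustrative true vs. predicted prices.
y_test = np.array([10.0, 20.0, 30.0])
y_pred = np.array([11.0, 19.0, 30.5])

mse = mean_squared_error(y_test, y_pred)  # mean of squared errors
r2 = r2_score(y_test, y_pred)             # fraction of variance explained
print(f"MSE: {mse:.3f}  R²: {r2:.3f}")
```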

The Mean Squared Error (MSE) is a metric that measures how well a model’s predictions match the actual values. A lower MSE indicates that the model is performing better, as it means that the predicted values are closer to the real ones. Essentially, it shows how small the errors in the model’s predictions are.

On the other hand, the Coefficient of Determination (often denoted as R²) assesses the quality of the model's predictions relative to the variability seen in the actual data. It has a scale from 0 to 1, where 1 signifies a perfect fit of the model to the data. A value close to 1 (such as 0.976) suggests that the model does an excellent job explaining the variation in the price data. This indicates that the model is accurate in its predictions.

In summary, based on these evaluation results, the model is performing very well. It has a low MSE, indicating minimal prediction errors, and a high R² value (97.6%), suggesting that it accurately explains the fluctuations in pricing data.

Lastly, let’s create a function to predict prices using our model. The function will take the features as input parameters and provide the predicted price as its output.
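One possible shape for such a function, trained here on a toy encoded frame (the column names, values, and `predict_price` helper are all hypothetical). Any one-hot column not supplied defaults to 0:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy encoded training frame; real feature columns come from get_dummies.
train = pd.DataFrame({
    "year": [2020, 2021, 2022, 2020, 2021, 2022],
    "province_Cabo Delgado": [1, 1, 1, 0, 0, 0],
    "price": [60.0, 65.0, 70.0, 40.0, 42.0, 44.0],
})
features = [c for c in train.columns if c != "price"]
model = LinearRegression().fit(train[features], train["price"])

def predict_price(model, feature_cols, values):
    """Build a one-row frame from the given feature values and predict its price."""
    row = pd.DataFrame([{col: values.get(col, 0) for col in feature_cols}])
    return float(model.predict(row)[0])

price = predict_price(model, features, {"year": 2023, "province_Cabo Delgado": 1})
print(round(price, 2))
```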

And that's when the magic happens. We entered data for a commodity linked to Cabo Delgado province in January 2023, and the outcome was 72 MZN, while the actual price is 75 MZN: an error of just 3 MZN in this case.

Not bad, considering that the model didn’t have any data for 2023.


And so we have our model for predicting food prices in Mozambique. But what insights can we gain from it? How can it be utilized? And what financial implications does it hold?

These questions are answered in the "Understanding the Prediction Algorithm" article, where we delve into the model's socio-economic perspective, showing its true essence and its impact in Mozambique. It's certainly worth a read. You can also access the project's repository.
