Predicting Food Insecurity

By: William Nafack, Dimitrios Pilitsis, Theodore Clarke, Ishaq Gunawan, Kenneth Tan

9 min read · Aug 1, 2022
Photo by Megan Thomas on Unsplash

It is no secret that many countries across Africa, South America, Asia, and the Middle East suffer from severe famine crises. Many households in developing countries cannot yet afford sustainable food production and live in very poor conditions.

The RHoMIS (Rural Household Multi-Indicator Survey) is a survey comprising a number of core modules on farming practice, livelihoods, and household food security. The information it generates can be used to assess the prevalence of household food insecurity and to detect changes in a population's food insecurity over time, which is valuable for NGOs and other organisations working in food and agriculture.

In collaboration with the Alan Turing Institute, we explored the RHoMIS data in an attempt to automate the assignment of a food insecurity level to these households and to investigate the main factors that contribute to it. Food insecurity is evaluated by two metrics in the survey: the HFIAS (Household Food Insecurity Access Scale) and the FIES (Food Insecurity Experience Scale). These metrics are derived from a set of questions asked in the surveys, some of which can be viewed below.

Snippet of the questions asked during the survey to calculate the FIES score

Our code can be viewed on GitHub. In the following sections, we walk through our approach: data pre-processing, exploratory data analysis, feature selection, and modelling.

Data pre-processing

RHoMIS has aggregated multiple surveys into one large data set that contains over thirty-five thousand observations from 33 countries. The data set contains over eight hundred variables, ranging from the Poverty Probability Index (PPI) likelihood, which estimates the probability that a respondent is above or below the poverty line, to the amount of land owned by a farmer. In the first instance, we had to explore and understand which factors affect food insecurity and how they relate to one another. Here is a first look at the data set.

Snippet of the data set

The data wrangling process was built into our pipeline and repeated iteratively to further sanitise the data as we explored and developed the model. Initially, the RHoMIS data set contained a large amount of missing data, a consequence of being aggregated from multiple international surveys. The figure below shows the significant amount of missing data for a sample of features.

Missing Data in our Data set

We also had a range of issues affecting data quality, including translation and harmonisation of categorical data, unification of the target variable, non-standard units, outliers, and recall error. These issues warranted significant cleaning and imputation before any data analysis could be performed. Here, imputation is simply the process of replacing missing values with substitute values. We first performed the following cleaning steps iteratively before moving on to handling specific cases.

  • Features that served only to index a particular survey were removed
  • A range check was performed to remove illegal values
  • Certain categorical variables were translated to English to ensure uniform categorical values
  • Outliers in numerical features were set to null, to be imputed later
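The cleaning steps above can be sketched in pandas. The column names, category mappings, and outlier rule below are illustrative stand-ins for the real survey fields, not the exact logic of our pipeline:

```python
import numpy as np
import pandas as pd

def basic_clean(df: pd.DataFrame) -> pd.DataFrame:
    """One iterative cleaning pass: drop index-only columns, range-check,
    harmonise categories, and null out outliers for later imputation."""
    # 1. Features that only index a particular survey are removed
    df = df.drop(columns=["survey_id"], errors="ignore")

    # 2. Range check: land area cannot be negative (illegal values removed)
    df.loc[df["land_owned_ha"] < 0, "land_owned_ha"] = np.nan

    # 3. Harmonise non-English category labels into uniform English values
    df["crop"] = df["crop"].replace({"maïs": "maize", "mais": "maize"})

    # 4. Outliers (outside Tukey's fences) set to NaN, to be imputed later
    q1, q3 = df["land_owned_ha"].quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = (df["land_owned_ha"] < q1 - 1.5 * iqr) | (
        df["land_owned_ha"] > q3 + 1.5 * iqr
    )
    df.loc[mask, "land_owned_ha"] = np.nan
    return df

raw = pd.DataFrame({
    "survey_id": [1, 2, 3, 4, 5, 6],
    "land_owned_ha": [1.0, 1.5, -2.0, 2.0, 1.2, 500.0],
    "crop": ["maize", "maïs", "mais", "maize", "maize", "maize"],
})
clean = basic_clean(raw)
```

Running the pass repeatedly, as we did, lets earlier fixes (e.g. nulled illegal values) feed into later checks such as the outlier fences.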

When it comes to predicting food insecurity, we have two scores in our survey: the HFIAS status and the FIES score. Both metrics have been used by the FAO (Food and Agriculture Organization of the United Nations), with the FIES being the current metric of choice. Further investigation into the survey structures and metrics allowed us to convert the FIES to discrete scores comparable to the HFIAS categorical values. This allowed us to unify the HFIAS and FIES variables into one target variable, the Food Insecurity Level, for our 4-label classification task. The discrete assignment was as follows.
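The unification can be sketched as a mapping from the FIES raw score (the count of affirmative answers, 0 to 8) onto the four HFIAS-style categories. The cut-offs below are illustrative; the exact thresholds we used followed the survey documentation:

```python
# Hypothetical discretisation of the FIES raw score into the four
# HFIAS-style categories; the cut-offs shown are illustrative only.
FIES_BINS = {
    range(0, 1): 1,  # Food Secure
    range(1, 4): 2,  # Mildly Food Insecure
    range(4, 7): 3,  # Moderately Food Insecure
    range(7, 9): 4,  # Severely Food Insecure
}

def food_insecurity_level(fies_raw: int) -> int:
    """Map a FIES raw score (0-8) onto the unified 4-label target."""
    for bin_range, level in FIES_BINS.items():
        if fies_raw in bin_range:
            return level
    raise ValueError(f"FIES raw score out of range: {fies_raw}")
```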

After this initial step, we proceeded with imputation to handle the missing data. We experimented with different imputation techniques: case deletion (drop observations with missing values), mode imputation (replace missing values with the mode of the non-missing data), regression imputation, and random forest imputation (use a machine learning estimator to predict the missing values). We chose the technique that best retained the distribution of each imputed feature. For example, as shown in the box plots below, random forest imputation does a much better job of preserving the mean and bounds of the original PPI distribution than regression imputation does.
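One way to realise random forest imputation is scikit-learn's IterativeImputer with a random forest estimator, which models each feature with missing values as a function of the others. Our actual pipeline may have differed in details, so treat this as a sketch on synthetic data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Synthetic data with roughly 20% of values knocked out at random
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[rng.random(X.shape) < 0.2] = np.nan

# Random forest imputation: each incomplete feature is iteratively
# predicted from the remaining features by a random forest.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    max_iter=5,
    random_state=0,
)
X_imputed = imputer.fit_transform(X)
```

Comparing the box plots of `X` and `X_imputed` column by column is exactly the distribution check we applied to the PPI.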

The categorical features in our data set had to be encoded in a form suitable for our model. We used one-hot encoding for all categorical features except Months. Because of the cyclical nature of this feature, encoding it with binary values would discard cyclical information that would be beneficial to the model. As a result, we performed a cyclical encoding that maps the months to points on a circle in the plane, as shown below.

Cyclical encoding of the Months variable
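The cyclical encoding amounts to projecting each month onto the unit circle via sine and cosine, so that December and January end up adjacent instead of twelve units apart:

```python
import numpy as np

def encode_month(month: int) -> tuple[float, float]:
    """Project a month (1-12) onto the unit circle, preserving the
    fact that December and January are neighbours."""
    angle = 2 * np.pi * (month - 1) / 12
    return np.sin(angle), np.cos(angle)

dec = np.array(encode_month(12))
jan = np.array(encode_month(1))
jun = np.array(encode_month(6))
```

With this encoding, the distance between December and January is small, while June sits on the far side of the circle from January, matching seasonal intuition.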

As a last step, we normalised the data, as this improves the performance of optimisation-based techniques. We used a min-max scaler, since it makes no assumptions about the distribution of the features.
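Min-max scaling rescales each feature to [0, 1] via x' = (x − min) / (max − min), with no distributional assumptions. A minimal sketch with scikit-learn:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two features on very different scales
X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 800.0]])

# Each column is independently rescaled to [0, 1]
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
```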

Exploratory Data Analysis

Once the data was cleaned and preprocessed, we moved on to analysing which features correlated with our target variable. First, a quick look at the distribution of the Food Insecurity Level shows a class imbalance on the Mildly Food Insecure (2) label.

One of our assumptions was that a clear metric such as the PPI would have a strong correlation with the Food Insecurity Level. Unexpectedly, this was not the case, as shown in the correlation heat map below, which suggested that most of the relationships between the features were non-linear.

The number of features in the data (96) added to the complexity of whichever model we would choose because of the increased dimensionality. To this end, using a threshold of 0.7 and pairwise plots, we eliminated features that were heavily correlated with one another, as they would not increase the predictive power of the model.
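The correlation-based elimination can be sketched as: compute the absolute correlation matrix, look only at its upper triangle (so each pair is considered once), and drop one feature from every pair above the 0.7 threshold. The data below is synthetic:

```python
import numpy as np
import pandas as pd

def drop_correlated(df: pd.DataFrame, threshold: float = 0.7) -> pd.DataFrame:
    """Drop one feature from every pair whose absolute Pearson
    correlation exceeds the threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is checked once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

rng = np.random.default_rng(0)
a = rng.normal(size=100)
df = pd.DataFrame({
    "a": a,
    "b": 2 * a + 0.01 * rng.normal(size=100),  # near-duplicate of "a"
    "c": rng.normal(size=100),                  # independent feature
})
reduced = drop_correlated(df)
```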

Pairwise plot between features to investigate correlation

Feature Selection

Before developing a model, we needed a pipeline to systematically decide which features were desirable for our model. Using too many features can overfit the model, so it would not generalise well. Furthermore, any features that only added to the dimensionality of the data and could not be tracked during our data analysis were discarded at feature selection.

We performed feature selection of k features using different methods, such as the chi-squared test, mutual information, and recursive feature elimination. Some of these methods are explained here. They serve to obtain the top k features for our model of choice. We initially train an estimator on all the features and determine each feature's importance. We then prune the least important features and recursively do so until k features remain, where k ∈ {10, 20, 30, 40, 50, 60, 70, 96}. Logistic regression was used as the baseline feature estimator. The figure below shows the F1-scores of the estimator under the different feature selection methods.
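Recursive feature elimination with a logistic regression baseline is available off the shelf in scikit-learn. A sketch on synthetic data (our real runs swept k over the values listed above):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Stand-in data; the real pipeline fed the cleaned RHoMIS features here
X, y = make_classification(n_samples=300, n_features=30,
                           n_informative=8, random_state=0)

# RFE: fit the estimator, drop the least important features,
# and repeat until only k features remain
k = 10
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=k)
selector.fit(X, y)
X_k = selector.transform(X)  # reduced design matrix with k columns
```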

There is a consistent positive relationship between k and the F1 score for all the feature selection methods. In addition, the F1 scores did not converge to a value as the number of features increased, which indicated that we would need to utilise all the features.

Modelling and Analysis

The data analysis revealed the data set to be considerably noisy, with a lack of significant correlations between individual features and the Food Insecurity Level. Our aim is to produce the model with the best accuracy, which can sometimes also be a less interpretable one. Our hypothesis is that simple models such as logistic regression and support vector machines with a linear kernel will have too much bias to accurately capture the relationships. Instead, we conjecture that methods such as support vector machines with higher-dimensional kernels, neural networks, or tree-based ensembles are more likely to capture the highly complex underlying relationships while avoiding overfitting.

The models developed were assessed using their accuracy scores, F1 scores, and Shapley values. Shapley values measure the contribution of each feature to a model's output using a game-theoretic approach, and can explain the output of any machine learning model. We first experimented with different classifiers using their default hyperparameters to determine which methods showed clear potential for further hyperparameter tuning. The figure below shows the results of the classifiers on a 75:25 train/test split of the data, after passing it through the pre-processing pipeline described earlier.
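The model-comparison step can be sketched as a loop over default-configuration classifiers on a 75:25 split. The data here is synthetic and the model list is a representative subset, not our exact line-up:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Stand-in 4-class data; the real pipeline fed the cleaned RHoMIS features
X, y = make_classification(n_samples=400, n_features=20, n_classes=4,
                           n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)  # 75:25 split

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "svm_rbf": SVC(),
    "random_forest": RandomForestClassifier(random_state=0),
    "mlp": MLPClassifier(max_iter=500, random_state=0),
}
scores = {
    name: f1_score(y_test, m.fit(X_train, y_train).predict(X_test),
                   average="weighted")
    for name, m in models.items()
}
```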

The linear models were among the worst performing, as hypothesised, and were discarded. The random forest, neural network, and support vector machine models were taken forward for hyperparameter tuning. These models can often be black boxes when it comes to interpretability, so we used Shapley values to gain insight into which features were driving the model output. The three main models were most informed by a similar combination of features, as shown below.

Shapley values showing the features that contribute the most for each class

From the plot we see that the features with the highest average impact on model output were the number of months households were food insecure, followed by the PPI. Our best model was the random forest classifier after hyperparameter tuning, with an accuracy of 70% and an F1-score of 68%, as shown below.

A recall score of 80% for the Severely Food Insecure class (4) is especially important: of all households that are actually severely food insecure, the random forest correctly identifies 80%. This is very appropriate for an early warning system, where the priority in practice is to help those facing the greatest danger; a few false positives do not pose a major problem. We also showed that the PPI likelihood, an asset-based indicator, is significantly more informative about food security than most traditional features such as farm income or total income.
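Per-class recall is exactly this "fraction of true cases flagged" quantity. A toy illustration with made-up labels (not our actual predictions), where class 4 stands for Severely Food Insecure:

```python
from sklearn.metrics import recall_score

# Made-up labels: five households are truly class 4, and the model
# flags four of them, giving a recall of 4/5 = 0.8 for that class.
y_true = [4, 4, 4, 4, 4, 3, 3, 2, 1, 1]
y_pred = [4, 4, 4, 4, 3, 3, 4, 2, 1, 1]

recall_class_4 = recall_score(y_true, y_pred, labels=[4], average=None)[0]
```

Note that recall ignores false positives on class 4 (the class-3 household wrongly flagged here), which is precisely why it is the right metric for an early warning system.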

Conclusion

We used the RHoMIS data set to build a machine learning classifier that predicts the food insecurity level of a household. We believe this is a valuable contribution towards using data science for social good and hope you learnt one or two interesting things from it.
