Exploring Undernourishment: Part 7 — Research Area 4: Most Influential Indicator

A Visual Data Exploration Research Project to Better Understand the Nuances of Our Global Nutrition

Chris Mahoney
The Startup
5 min read · Oct 13, 2020


Image Source: Food and Agriculture Organisation of the United Nations

Contents

This is Part 7 of an 8-part research project aiming to better understand the nuances of our global nutrition. It explores this topic using data visualisation and data science techniques. It is complemented by a Web App, ExploringUndernourishment, which is freely available to the public.

Part 1 — Introduction and Overview
Part 2 — Literature Review
Part 3 — Data Exploration
Part 4 — Research Area 1: General Trend
Part 5 — Research Area 2: Most Successful Countries
Part 6 — Research Area 3: Surprising Trends
Part 7 — Research Area 4: Most Influential Indicator ← Selected page
Part 8 — Recommendations and Conclusions

Research Area 4: Most Influential Indicator

All of the features provided by the FAO can be categorised into dependent and independent variables. The full categorisation can be found in the data dictionary. In this section, we want to determine which of the independent variables are most influential on the Prevalence of Undernourishment target. For this, all seventeen independent features were used.

To determine the most influential of these features, a forest-type (tree ensemble) model was run. In this instance, the Gradient Boosted Machine (GBM) model was chosen. It was chosen not for its predictive power, but for its ability to determine the most influential features; for this reason the full data set was used (no train/test split). Because of the way this model resamples the data and builds each tree for its forest, it is quite powerful in determining which of the features are the most influential.
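The article does not show its model code, so the following is only a toy sketch of the general idea, written in plain Python with one-split trees (stumps) and squared-error loss. The synthetic data, the feature names, and the helper functions `sse`, `best_stump`, and `boosted_importance` are all assumptions made for illustration. It shows how boosting fits each new tree to the errors left by the previous ones, and how the error reduction credited to each feature can be totalled and scaled so that the top feature reads 100.

```python
import random

def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_stump(X, residuals):
    """Return (gain, feature, threshold, left_mean, right_mean) for the best single split."""
    parent = sse(residuals)
    best = None
    for j in range(len(X[0])):
        # Skip the largest value: splitting there would leave the right side empty.
        for t in sorted({row[j] for row in X})[:-1]:
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            gain = parent - sse(left) - sse(right)
            if best is None or gain > best[0]:
                best = (gain, j, t, sum(left) / len(left), sum(right) / len(right))
    return best

def boosted_importance(X, y, n_rounds=30, learning_rate=0.1):
    """Fit stumps to the residuals in sequence and total the error reduction per feature."""
    prediction = [sum(y) / len(y)] * len(y)
    importance = [0.0] * len(X[0])
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, prediction)]
        gain, j, t, left_mean, right_mean = best_stump(X, residuals)
        importance[j] += gain  # credit the split's error reduction to the feature it used
        prediction = [p + learning_rate * (left_mean if row[j] <= t else right_mean)
                      for row, p in zip(X, prediction)]
    top = max(importance)
    return [100.0 * imp / top for imp in importance]  # scale so the top feature reads 100

# Synthetic data (an assumption): strong effect from feature 0, weak from 1, none from 2.
random.seed(0)
X = [[random.random() for _ in range(3)] for _ in range(100)]
y = [10.0 * row[0] + 2.0 * row[1] + random.gauss(0.0, 0.1) for row in X]

scaled = boosted_importance(X, y)
for name, score in zip(["feature_0", "feature_1", "feature_2"], scaled):
    print(f"{name}: {score:.1f}")
```

A production GBM library adds deeper trees, row subsampling, and other refinements, so its importance figures will differ from this toy; the point is only the mechanism of accumulating each feature's contribution across the sequence of trees.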

Variable Importance Plot

Running a Variable Importance analysis on this model produced the plot in Figure 17. It shows each of the features on the Y-axis, and the percentage of importance on the X-axis.

The following conclusions can be drawn:

  • The feature Avg Value Of Food Production scored 100% on the importance scale, while Avg Dietary Adequacy scored 64.97%; these two features are therefore by far the most important.
  • This result is consistent with the results on the Undernourishment tab, in the Features by Target section, which showed a very strong, very consistent correlation between these two features and the target feature.
  • The features Avg Protein Supply, Political Stability, and Food Imports As Share Of Merch Exports collectively influenced 62% of the trees in the forest, and are therefore somewhat important to, and somewhat influential on, the target result.
  • Consistent with a Pareto analysis, 20% of the features contribute 80% of the influence on the target variable.
  • The Caloric Energy From Cereals Roots Tubers feature contributed 0% to the influence, and as a result should be excluded; three other features (Access To Improved Drinking Water, Rail Line Density, and Access To Basic Sanitation Services) contributed less than 1% each, and have a negligible impact on the overall result.
Figure 17: Most Influential: Variable Importance Plot
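The Pareto observation in the list above can be checked with a cumulative-share calculation. The sketch below is a minimal illustration only: the importance scores are hypothetical placeholders (not the values from Figure 17), and `pareto_summary` is a helper name invented here. It ranks the features, accumulates each one's share of the total importance, and reports the smallest set of features that reaches the chosen threshold.

```python
def pareto_summary(importance, threshold=0.80):
    """Rank features by importance and return the smallest set reaching the threshold share."""
    total = sum(importance.values())
    ranked = sorted(importance.items(), key=lambda kv: kv[1], reverse=True)
    cumulative = 0.0
    selected = []
    for name, score in ranked:
        cumulative += score / total
        selected.append(name)
        if cumulative >= threshold:
            break
    # Names selected, the share they cover, and the fraction of all features they represent.
    return selected, cumulative, len(selected) / len(importance)

# Hypothetical raw importance scores (NOT the article's values), chosen only to illustrate.
raw = {
    "feature_a": 50.0, "feature_b": 35.0, "feature_c": 6.0, "feature_d": 4.0,
    "feature_e": 2.0, "feature_f": 1.0, "feature_g": 1.0, "feature_h": 1.0,
    "feature_i": 0.0, "feature_j": 0.0,
}

selected, share, fraction = pareto_summary(raw)
print(f"{len(selected)} of {len(raw)} features ({fraction:.0%}) cover {share:.0%} of the importance")
```

With real model output, the same function would take the importance values read from the fitted model in place of the placeholder dictionary.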

Partial Dependence Plots

Another useful outcome of the GBM model is its ability to create Partial Dependence Plots. There is one plot for each of the seventeen independent features (Figure 18); each shows the relevant feature on the X-axis, and the change in Prevalence of Undernourishment on the Y-axis.

As each of these plots is scanned from left to right along the X-axis, the line indicates the corresponding expected value of the Prevalence of Undernourishment. In other words, as the value of X changes, the plot shows the expected value of Y.
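One common way to compute such a curve is to force the feature of interest to a fixed value in every row, average the model's predictions, and repeat for each value on a grid. The sketch below assumes this averaging approach; `toy_model` is a hypothetical stand-in for a fitted model, and the rows and grid are made-up numbers, so none of it reflects the article's actual data or code.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction over all rows with one feature forced to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)
            modified[feature] = value  # override only the feature of interest
            total += predict(modified)
        curve.append(total / len(rows))
    return curve

# Hypothetical stand-in for a fitted model: any function mapping a row to a prediction works.
def toy_model(row):
    return 30.0 - 2.0 * row[0] + 0.5 * row[1]

rows = [[1.0, 4.0], [2.0, 6.0], [3.0, 8.0]]
grid = [0.0, 5.0, 10.0]

curve = partial_dependence(toy_model, rows, feature=0, grid=grid)
print(curve)  # [33.0, 23.0, 13.0]
```

Plotting `grid` against `curve` gives the line described above: the X-axis carries the feature's value and the Y-axis the expected prediction with the other features left as observed.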

These plots are also arranged in the same order as the Variable Importance Plot.

Describing some features as less important is meant primarily in a statistical sense, with reference to their ability to predict the Prevalence of Undernourishment score. It does not, by any means, mean that these features are unimportant for the countries; in some instances they are incredibly important. Take, for instance, Access To Improved Drinking Water and Access To Basic Sanitation Services: the model has indicated that these are not important features in terms of their predictive power; however, they are incredibly important factors for individual countries to focus on in order to improve their own economies.

There are a number of interesting things which can be learnt from these plots, including:

  • It is clear to see why the top five features (Avg Value Of Food Production, Avg Dietary Adequacy, Avg Protein Supply, Political Stability, and Food Imports As Share Of Merch Exports) were the most influential: after some instability at the left of the plot, the lines stabilise and remain relatively consistent through to the right.
  • It is quite surprising to see that as the percentage of arable land, the percentage of food imports, and the cereal import dependency each increase, there is an adverse effect on the Prevalence of Undernourishment, forcing this score to increase. This can be seen by reviewing the plots for Percentage Of Arable Land, Food Imports As Share Of Merch Exports, and Cereal Import Dependency Ratio.
  • It is clear to see why the less-predictive features rank so low: this is primarily due to the inconsistency and instability of the PoU line, as seen particularly with Access To Basic Sanitation Services and Caloric Energy From Cereals Roots Tubers.
Figure 18: Most Influential: Partial Dependence Plot

Findings

After implementing a Gradient Boosted Machine model to establish the level of importance of the different independent variables, it was found that five features had the highest level of influence: Avg Value Of Food Production, Avg Dietary Adequacy, Avg Protein Supply, Political Stability, and Food Imports As Share Of Merch Exports. These features are logical, and follow a trend consistent with the rest of the analysis. It was also found that the Pareto principle is at play here, with 20% of the features contributing over 80% to the overall results.

Having also created Partial Dependence Plots for each of the independent features, it can be seen that the variables with a higher level of influence also have a relatively consistent and stable PDP line, whereas the features with a low level of importance have an unstable line.

Read On:

Previous section: Research Area 3: Surprising Trends
Next section: Recommendations and Conclusions

Chris Mahoney
The Startup

I’m a keen Data Scientist and Business Leader, interested in Innovation, Digitisation, Best Practice & Personal Development. Check me out: chrimaho.com