Correlating Food Safety Data
So the main question here is:
Can we collect all this data? And once collected, can we correlate the different sources with one another to support decision making?
We will focus on identifying correlations between the reason behind a recall (i.e. the hazard) and the various other factors found throughout food safety data (e.g. the ingredient involved in a recall).
What is out there?
- food recalls happening on a market level;
- border rejections;
- country indicators (risk, corruption and global food security index).
Let’s take a step back and start diving into our data. What is at our disposal?
Food recalls & border rejections. This data, if analysed fully, offers:
- date of recall;
- reason (i.e. hazard) behind the recall;
- main ingredient (product) that was recalled;
- origin country of the main ingredient;
- country to which the product was distributed.
Country indicators. These indicators concern the risk, corruption and global food security index of each country. They are produced by taking into account various other indicators and contain the following:
- country each refers to;
- date of validity from/to;
- value of the indicator.
Time for the correlation to take place
Let’s make a first attempt at a correlation matrix: we create our complete dataset, perform some minor preprocessing on the date column (splitting it into year and month), load it into a dataframe and let Python take care of the rest. The complete code can be found at the end of this post.
Time to make the first attempt at producing the correlation matrix.
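As a rough illustration of this first step, here is a minimal sketch using pandas. The sample rows and column names are made up for the example and are not the schema of the actual dataset. It also assumes that categorical columns are turned into integer codes before correlating, which is a crude choice because the codes impose an arbitrary order on the categories.

```python
import pandas as pd

# Hypothetical sample rows; the column names are assumptions for illustration.
recalls = pd.DataFrame({
    "date": ["2019-01-15", "2019-07-03", "2020-03-21", "2020-11-09"],
    "hazard": ["salmonella", "aflatoxin", "salmonella", "listeria"],
    "product": ["chicken", "peanuts", "eggs", "cheese"],
    "origin_country": ["BR", "US", "PL", "FR"],
    "distribution_country": ["DE", "NL", "DE", "BE"],
})

def build_correlation_matrix(df):
    """Split the date, encode non-numeric columns as integers, then correlate."""
    out = df.copy()
    dates = pd.to_datetime(out.pop("date"))
    out["year"] = dates.dt.year
    out["month"] = dates.dt.month
    for col in out.columns:
        if not pd.api.types.is_numeric_dtype(out[col]):
            # Integer codes follow alphabetical order, which is arbitrary here.
            out[col] = out[col].astype("category").cat.codes
    # Pairwise Pearson correlation between all (now numeric) columns.
    return out.corr()

corr = build_correlation_matrix(recalls)
print(corr.round(2))
```

The resulting dataframe can be passed to any heatmap plotting routine to get a picture of the matrix.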
Ok, not bad for a first attempt, but there is a long way ahead it seems. Let’s add some more data into the mix! Risk, corruption and global food security indicators are publicly available; it’s time to integrate them into our correlation matrix.
To do this we have to dive further into our data. Each record has an origin country and a distribution country, which may differ. An intuitive choice is to attach the risk, corruption and global food security indicators for both countries, using the date of the recall to pick the indicator value that was valid at the time.
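The date-aware lookup could be sketched as follows. Everything here is hypothetical: the `recall_id`, `valid_from`, `valid_to` and `corruption` column names, the helper `attach_indicator`, and the indicator values themselves, which are invented for the example. The idea is to join each recall to all indicator rows of the relevant country and keep only the row whose validity period contains the recall date.

```python
import pandas as pd

# Hypothetical recalls and indicator values, for illustration only.
recalls = pd.DataFrame({
    "recall_id": [1, 2],
    "date": pd.to_datetime(["2019-05-10", "2020-02-01"]),
    "origin_country": ["BR", "PL"],
    "distribution_country": ["DE", "DE"],
})
indicators = pd.DataFrame({
    "country": ["BR", "BR", "PL", "PL", "DE", "DE"],
    "valid_from": pd.to_datetime(["2019-01-01", "2020-01-01"] * 3),
    "valid_to": pd.to_datetime(["2019-12-31", "2020-12-31"] * 3),
    "corruption": [35, 38, 58, 56, 80, 80],
})

def attach_indicator(df, ind, country_col, value_col, prefix):
    """Add the indicator value that was valid on each recall's date."""
    # Pair every recall with all indicator rows of the chosen country.
    candidates = df[["recall_id", "date", country_col]].merge(
        ind, left_on=country_col, right_on="country", how="inner"
    )
    # Keep only the rows whose validity period contains the recall date.
    valid = candidates["date"].between(
        candidates["valid_from"], candidates["valid_to"]
    )
    new_col = f"{prefix}_{value_col}"
    values = candidates.loc[valid, ["recall_id", value_col]].rename(
        columns={value_col: new_col}
    )
    # Left join so recalls without a valid indicator are kept (as NaN).
    return df.merge(values, on="recall_id", how="left")

enriched = attach_indicator(
    recalls, indicators, "origin_country", "corruption", "origin"
)
enriched = attach_indicator(
    enriched, indicators, "distribution_country", "corruption", "distribution"
)
print(enriched)
```

The same helper would be called once per indicator and per country role, so each recall ends up with separate origin and distribution columns.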
Time to rerun our experiment!
Ok, it seems somewhat better but still a long way from actionable results. There is however data we have not yet taken into account.
What if we group our ingredients and hazards (i.e. the reasons behind the recall) into broader categories? FOODAKAI offers a detailed categorization of both! How will the matrix above change?
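A grouping step like this can be sketched with a simple lookup. The two mappings below are invented for the example and are not FOODAKAI's actual categorization; in practice they would come from the categorization mentioned above.

```python
import pandas as pd

# Hypothetical mappings, NOT the actual FOODAKAI categorization.
HAZARD_CATEGORY = {
    "salmonella": "biological",
    "listeria": "biological",
    "aflatoxin": "chemical",
}
PRODUCT_CATEGORY = {
    "chicken": "meat and poultry",
    "peanuts": "nuts and seeds",
    "cheese": "dairy",
}

def add_categories(df):
    """Add broader hazard and product categories as new columns."""
    out = df.copy()
    # Values missing from a mapping fall back to a catch-all "other" class.
    out["hazard_category"] = out["hazard"].map(HAZARD_CATEGORY).fillna("other")
    out["product_category"] = out["product"].map(PRODUCT_CATEGORY).fillna("other")
    return out

sample = pd.DataFrame({
    "hazard": ["salmonella", "aflatoxin", "glass fragments"],
    "product": ["chicken", "peanuts", "bottled water"],
})
categorized = add_categories(sample)
print(categorized)
```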
It seems we are on the right track. Time for some intuition: agricultural commodities are somewhat seasonal. Can this intuition be integrated into our correlation matrix and affect it in a positive way? Let’s find out!
Time to use the date of the recall to that end and group it in terms of the season the recall took place. How will this affect our matrix?
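One possible way to derive the season is a month lookup. The sketch below assumes meteorological seasons of the northern hemisphere and a `date` column; both are assumptions, since the post does not say how seasons are defined.

```python
import pandas as pd

# Assumption: meteorological seasons, northern hemisphere.
MONTH_TO_SEASON = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}

def add_season(df, date_col="date"):
    """Add a season column derived from the month of the recall date."""
    out = df.copy()
    out["season"] = pd.to_datetime(out[date_col]).dt.month.map(MONTH_TO_SEASON)
    return out

sample = pd.DataFrame(
    {"date": ["2019-01-15", "2019-07-03", "2020-03-21", "2020-11-09"]}
)
with_season = add_season(sample)
print(with_season)
```

For origin countries in the southern hemisphere the mapping would have to be shifted by six months, which this sketch ignores.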
That seems like a move in the right direction. Great, but let’s take another step forward. The country indicators vary widely in value, possibly adding noise to our dataset.
What if we group together these indicator values into classes?
This can possibly help in our correlation matrix. Let’s split corruption and global food security indices into 10 classes and see how this performs in our correlation analysis.
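Splitting an indicator into classes can be sketched with `pandas.cut`. The post does not say whether the 10 classes are equal-width or quantile-based, so equal-width bins are an assumption here, and the scores and the column name are invented.

```python
import pandas as pd

def to_classes(values, n_classes=10):
    """Bin indicator values into equal-width classes numbered from 0."""
    # labels=False returns the integer bin index instead of an interval.
    return pd.cut(values, bins=n_classes, labels=False)

# Hypothetical indicator values, for illustration only.
scores = pd.Series([12.0, 20.1, 35.5, 47.0, 63.2, 88.9], name="origin_corruption")
classes = to_classes(scores)
print(pd.DataFrame({"score": scores, "class": classes}))
```

If the indicator values are heavily skewed, `pandas.qcut` would give classes with roughly equal counts instead.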
Great, we can see that some correlations appear involving origin and distribution country, as well as hazard and product categories. This is a good base of knowledge for training a Machine Learning or Deep Learning model to predict food safety cases throughout the world, based on the above indicators and features!
If you scroll back to the top of this post and compare the initial correlation matrix with the final one, the progress made is obvious.
What should one keep in mind?
Preprocessing your data and feature expansion are two of the most important tasks in any serious data science attempt.
Know your data. To produce any kind of actionable results one should know the domain behind the dataset at hand. Only then can intuition show its full potential.
Always try to challenge yourself (and the data). The above analysis is not perfect (and never will be). For instance, it does not take into account production indicators and consumption info on a country level, not to mention price or weather data affecting agricultural commodities, or animal diseases affecting meat production (e.g. African swine fever).
What if we add this data into our model as well? Does it help in identifying relationships in a sector as diverse (in terms of data types) as the food safety one?
We plan on discussing this in the future; as mentioned, the food safety sector is one that offers a great variety of data (big data Vs beware!) so why not further increase it?
Analysis Statistics and complete code
The dataset used for the correlation analysis consists of:
- 98,281 food recalls and border rejections,
- 2,418 distinct hazards,
- 14,291 distinct products and ingredients,
- 208 countries.
The complete code for the above analysis can be found below.