Measuring food and drink purchasing via k-means clustering
This article has been co-written by Izzy Stewart and Elena Mariani.
Introduction
Nesta’s healthy life mission is to increase the average number of healthy years lived in the UK, with the goal of helping the UK halve the prevalence of obesity by 2030, compared to 2020 levels.
As part of our work to better understand diets, the data science team working on Nesta’s healthy life mission has been analysing a dataset provided by an international market research company, commissioned by Nesta. This dataset contains food and drink purchases from a sample of households in Great Britain. For every purchase there is information about the product, such as: the food and drink category it belongs to, the volume purchased and its nutritional content. In other words, all the information that is captured on a product’s package and on a store’s receipt is available in the dataset.
Using this dataset, we have identified different groups of households based on the share of calories purchased across food and drink categories. This article describes the steps we’ve taken to create these groups using k-means clustering and our initial attempt at interpreting the clusters. As an exploratory piece, it also marks our first step into our wider work of analysing food and drink shopping behaviours to better understand what people eat and why. The main aims of this article are to highlight the complexity of measuring grocery shopping behaviours and to provide some reflections on the most appropriate analytical methods to achieve this.
Why group households based on their food and drink purchases?
The purchasing of food and drink (and indirectly consumption and calorie intake) is driven by complex social, economic and demographic factors. Our hope is that by classifying purchasing behaviours, we will be able to:
- Identify which foods contribute most significantly to unhealthy eating and use this insight to inform policies and interventions targeting these foods (eg, by encouraging retailers to adapt recipes to make them healthier);
- Generate hypotheses about what factors drive purchases of certain unhealthy products and then target these factors in the real world so as to encourage healthier behaviours;
- Create a link between food environment characteristics (eg, which stores people shop in) and the health profile of their customers.
Grouping households using k-means clustering
Now that we know why grouping households based on their purchasing habits is important, we will highlight the process of building these groups using k-means clustering. Figure 1 describes the product and purchase-level features we used in the dataset and how they link together.
Preparing the data for clustering
We carried out the following tasks:
- We performed some data-cleaning tasks, such as grouping similar categories together. We also removed some categories, such as rice and cooking oil. During the exploratory data analysis, we noticed these were common bulk-buys that tended to skew the data.
- We calculated the total number of kcal (a measure of energy in nutrition and exercise) purchased for each food and drink category per household.
- Using a min-max scaler for each household, we scaled the total kcal values across food categories (to values between 0 and 1).
- Once the household representations had been created, we applied principal component analysis (PCA) to extract uncorrelated components that explain the highest proportion of variance in the data. We tested different values for ‘variance explained’ and chose 0.97 based on the silhouette score.
- Finally, we compressed the components into a two-dimensional feature space using Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP).
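The preparation steps above can be sketched roughly as follows. This is a minimal illustration on made-up data, not the code used in the analysis: the purchase records, the category names and the helper names (`household_matrix`, `minmax_rows`, `pca_by_variance`) are all assumptions. It uses plain NumPy and stops before the UMAP step, which in practice would come from a dedicated library.

```python
import numpy as np

# Hypothetical purchase records: (household id, category, kcal purchased).
purchases = [
    ("h1", "red wine", 1200.0), ("h1", "bread", 800.0), ("h1", "red wine", 600.0),
    ("h2", "lager", 2000.0), ("h2", "bread", 500.0), ("h2", "cakes", 500.0),
    ("h3", "cakes", 900.0), ("h3", "bread", 900.0), ("h3", "lager", 300.0),
    ("h4", "red wine", 100.0), ("h4", "cakes", 1500.0), ("h4", "bread", 400.0),
]

def household_matrix(records):
    """Sum kcal per household and category into a households x categories array."""
    households = sorted({h for h, _, _ in records})
    categories = sorted({c for _, c, _ in records})
    totals = np.zeros((len(households), len(categories)))
    for h, c, kcal in records:
        totals[households.index(h), categories.index(c)] += kcal
    return households, categories, totals

def minmax_rows(x):
    """Min-max scale each row (one household) to the range [0, 1]."""
    lo = x.min(axis=1, keepdims=True)
    span = x.max(axis=1, keepdims=True) - lo
    span[span == 0] = 1.0  # a household with identical values maps to all zeros
    return (x - lo) / span

def pca_by_variance(x, variance=0.97):
    """Project onto the fewest principal components that reach the variance target."""
    centred = x - x.mean(axis=0)
    _, s, vt = np.linalg.svd(centred, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    n_components = int(np.searchsorted(np.cumsum(explained), variance)) + 1
    n_components = min(n_components, len(s))
    return centred @ vt[:n_components].T, explained[:n_components]

households, categories, totals = household_matrix(purchases)
scaled = minmax_rows(totals)
components, explained = pca_by_variance(scaled, variance=0.97)
print(categories)
print(components.shape, explained.round(3))
```

The 0.97 target mirrors the value reported above; everything else is illustrative.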
Clustering households
To cluster the households, we tested both k-means and hierarchical clustering. We found k-means to be the best performing method in terms of achieving the strongest separation among clusters. We also tried a robust clustering method. Although it seemed like a promising approach, we ran into issues with the size of our dataset and the time the model took to run.
A total of 1,563 features (food and drink categories) were used to assign clusters to 23,161 households. Figure 2 shows the average silhouette score for different values of k. The best performing k-means result is achieved with 70 clusters, where the silhouette score peaks at 0.517.
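Choosing k by silhouette score, as described above, can be illustrated with a small self-contained sketch. It is not the analysis code: the data are three synthetic blobs, the k-means is a plain Lloyd's algorithm written in NumPy, and the farthest-point initialisation is an assumption made only to keep the toy example stable. In practice a library implementation such as scikit-learn's `KMeans` and `silhouette_score` would be the more usual choice.

```python
import numpy as np

def farthest_point_init(x, k, rng):
    """Start from a random point, then keep adding the point farthest from all chosen centres."""
    centres = [x[rng.integers(len(x))]]
    for _ in range(k - 1):
        nearest = np.min([np.linalg.norm(x - c, axis=1) for c in centres], axis=0)
        centres.append(x[nearest.argmax()])
    return np.array(centres)

def nearest_centre(x, centres):
    """Index of the closest centre for every row of x."""
    return np.linalg.norm(x[:, None, :] - centres[None, :, :], axis=2).argmin(axis=1)

def kmeans(x, k, rng, n_iter=100):
    """Lloyd's algorithm: alternate assignment and centre updates until stable."""
    centres = farthest_point_init(x, k, rng)
    for _ in range(n_iter):
        labels = nearest_centre(x, centres)
        updated = np.array([
            x[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
        if np.allclose(updated, centres):
            break
        centres = updated
    return nearest_centre(x, centres)

def mean_silhouette(x, labels):
    """Mean silhouette over all samples (samples in singleton clusters score 0)."""
    dists = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)
    clusters = np.unique(labels)
    scores = np.zeros(len(x))
    for i in range(len(x)):
        own = labels == labels[i]
        if own.sum() == 1:
            continue
        a = dists[i, own].sum() / (own.sum() - 1)  # mean distance within own cluster
        b = min(dists[i, labels == c].mean() for c in clusters if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()

# Synthetic data: three well-separated blobs of 30 points each.
rng = np.random.default_rng(0)
blob_centres = np.array([[0.0, 0.0], [20.0, 0.0], [10.0, 17.0]])
x = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in blob_centres])

scores = {k: mean_silhouette(x, kmeans(x, k, rng)) for k in range(2, 7)}
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

On real household data the curve is far less clear-cut than on these blobs, which is why the score is plotted across many values of k before picking one.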
Figure 3 visualises the best performing k-means result (k=70) based on the silhouette score. Looking at the plot, there is a large collection of clusters at the centre, but a few clusters are clearly separated from this central collection.
Figure 4 shows the top 30 clusters based on the average silhouette score (calculated on the samples assigned to those clusters), along with the number of households. Unsurprisingly, many of the best-performing clusters based on silhouette score also have a very low number of households.
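A ranking like the one described above can be produced by averaging per-household silhouette values within each cluster and keeping the household count alongside. The sketch below assumes those per-household values have already been computed; the cluster labels and scores are hypothetical and `rank_clusters` is an illustrative helper name.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-household results: (cluster label, silhouette value).
household_scores = [
    (0, 0.81), (0, 0.77), (0, 0.79),
    (1, 0.74), (1, 0.70),
    (2, 0.12), (2, 0.20), (2, 0.05), (2, 0.15),
    (3, 0.95), (3, 0.93),
]

def rank_clusters(scores, top_n=30):
    """Return (cluster, mean silhouette, household count), best clusters first."""
    by_cluster = defaultdict(list)
    for cluster, value in scores:
        by_cluster[cluster].append(value)
    summary = [(c, mean(values), len(values)) for c, values in by_cluster.items()]
    return sorted(summary, key=lambda row: row[1], reverse=True)[:top_n]

ranking = rank_clusters(household_scores)
for cluster, avg, size in ranking:
    print(f"cluster {cluster}: mean silhouette {avg:.2f}, {size} households")
```

Keeping the count next to the score makes it easy to spot high-scoring clusters that contain only a handful of households.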
A deeper dive into the clusters
In this section, we take a deeper look at the clusters. We generated descriptive statistics to help us interpret the clusters in terms of the underlying food and drink shopping behaviour they are capturing.
To interpret the clusters, we referred back to the share of kcal purchased across categories and compared the average for each cluster to the average of the total households used to cluster. As an additional measure, we also ran a t-test of significance between the normalised values for a feature in a cluster and those of the remaining sample. Because this involves many simultaneous tests, we adjusted the t-tests using the Bonferroni correction to limit false positives. On average, 15 features (about 1% of features) were significant, with the minimum number of significant features being 5 and the maximum being 51.
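The testing step can be sketched as below, under several loudly stated assumptions: the statistic is Welch's unequal-variance t (the article does not say which variant was used), the p-value uses a normal approximation to the t distribution so that only the standard library is needed, and the Bonferroni threshold divides alpha by the number of features tested in the cluster. The values and helper names are made up. For small clusters a proper t distribution, for example via SciPy's `ttest_ind`, would be more appropriate.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

def welch_t(sample_a, sample_b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    se = sqrt(variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b))
    return (mean(sample_a) - mean(sample_b)) / se

def two_sided_p(t):
    """Two-sided p-value using a normal approximation (an assumption of this sketch)."""
    return 2 * (1 - NormalDist().cdf(abs(t)))

def significant_features(cluster_values, rest_values, alpha=0.05):
    """Features whose p-value passes the Bonferroni threshold alpha / m."""
    threshold = alpha / len(cluster_values)  # m = number of features tested
    results = {}
    for feature, in_cluster in cluster_values.items():
        p = two_sided_p(welch_t(in_cluster, rest_values[feature]))
        if p < threshold:
            results[feature] = p
    return results

# Hypothetical normalised values per feature: households in one cluster vs the rest.
cluster_values = {
    "red wine": [0.90, 0.95, 0.88, 0.92, 0.97, 0.91],
    "bread": [0.30, 0.42, 0.35, 0.28, 0.40, 0.33],
}
rest_values = {
    "red wine": [0.05, 0.10, 0.00, 0.08, 0.12, 0.03, 0.07, 0.02],
    "bread": [0.31, 0.45, 0.29, 0.38, 0.36, 0.27, 0.41, 0.34],
}

significant = significant_features(cluster_values, rest_values)
print(sorted(significant))
```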
Well-separated clusters
We first considered the best separated clusters by silhouette score. Excluding clusters with a very low number of households in them, we looked at four examples in the top 30 clusters. Figures 6 and 7 show the four well-separated clusters, highlighting the top 20 statistically significant features (or the total significant features if there are fewer than 20) by difference in kcal share compared to the average household. The size of the circle represents the p-value and the colour represents the average kcal share of the category for that cluster. The circles are sorted in descending order based on the value of the difference in share of kcal in a cluster compared to the sample average.
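The selection behind these charts (keep significant features, sort by difference from the sample average, take up to 20) can be sketched as follows. The shares, p-values and threshold here are invented for illustration, and `top_features` is a hypothetical helper name.

```python
def top_features(cluster_mean, overall_mean, p_values, threshold, top_n=20):
    """Significant features sorted by difference from the overall average, largest first."""
    rows = [
        (feature, cluster_mean[feature] - overall_mean[feature], p_values[feature])
        for feature in cluster_mean
        if p_values[feature] < threshold
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)[:top_n]

# Hypothetical average kcal shares for one cluster and for the whole sample.
cluster_mean = {"red wine": 0.12, "white wine": 0.06, "bread": 0.04, "crisps": 0.01}
overall_mean = {"red wine": 0.01, "white wine": 0.01, "bread": 0.05, "crisps": 0.03}
p_values = {"red wine": 1e-9, "white wine": 1e-6, "bread": 0.40, "crisps": 1e-5}

rows = top_features(cluster_mean, overall_mean, p_values, threshold=0.05 / len(p_values))
for feature, difference, p in rows:
    print(f"{feature}: {difference:+.3f} (p={p:.1e})")
```

Features with a negative difference (bought less than average) stay in the list if they are significant; they simply sort to the bottom.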
In all four clusters, we observe a similar pattern: one food category appears to be the dominant feature of a cluster, although several features are statistically significant and contribute positively to the calorie content of shopping baskets. Cluster 24 has the seventh highest silhouette score and contains 212 households.
If we look back at the 2D map in Figure 3, we see cluster 24 in the far left-hand corner, quite far from the tight group of clusters. The difference in the calories purchased in this cluster is clearly driven by wine, with red, white and sparkling wine contributing to the biggest changes from the average kcal per household. Almost 12% of the calorie content of shopping baskets for households in this cluster is made up of red wine.
For cluster 11 (also in the top right hand corner of Figure 3’s left plot), purchases of lager are the feature separating the cluster’s purchasing behaviours from the sample average.
Over 12% of the calorie content of shopping baskets of households in cluster 11 is made up of lager. Four out of the six statistically significant features for this cluster are alcohol or alcohol alternative products. Although far away from the ‘red wine’ cluster 24 on the map, cluster 11’s nearest neighbour (cluster 30, chart not shown) also has only alcohol products as positive features, with cider as the most dominant feature (13% of calorie content).
Cluster 21 is characterised by a large share of purchases of cakes, making up about 10% of the calories in their shopping baskets. Lastly, for cluster 50 the dominant category is chicken, with the majority of other positive and significant features being meat.
Interestingly, the example clusters are all characterised by one category whose difference from the average is much larger than the rest, rather than capturing more complex purchasing behaviours across multiple food categories. This could be due to the way components are chosen during the PCA or UMAP stage. In the initial stages of work we tried clustering without UMAP and found the clusters to be less well-separated with lower average silhouette scores. However, in future work we will dig into this more to understand what effect PCA and UMAP have on clusters.
Less well-separated clusters
We then moved to less well-separated clusters, to see whether a different pattern of significant features emerges. Figure 8 depicts clusters 8 and 49, two clusters that are not in the top 30 by silhouette score. Cluster 8 is positioned in the middle of the graph in Figure 3, and cluster 49 is towards the top of the central group of clusters. With these two examples, we noticed considerably more spread across the x-axis (in Figure 8), indicating that it may not be just one feature dominating the result. The biggest difference in calorie content of baskets compared to the average for cluster 8 comes from wrapped white bread at just over 3%. The remaining positive features are a mix of crisps, chocolates, cheese, biscuits and potato products. For households in cluster 49, almost 9% of the calorie content of baskets comes from whole milk, followed by ready-to-eat cereals at just over 5%.
Limitations of the analysis
Whilst the analysis provides us with interesting insights, we must also recognise that it has a number of limitations.
Population representation
The dataset is created from a non-probability sample of households and so in its raw form it does not accurately represent the full GB population. For other projects we have used population weights to handle this problem. Future work on the clusters includes incorporating these weights in the analysis.
Using a subset of 1 month
We tested clustering on a whole year’s worth of data and found that the clusters were not as well separated compared to just using one month’s worth of data. This may be due to monthly differences in purchasing evening out over the course of a year. We chose October as a particularly ordinary month (rather than December or January), but accept there will be seasonal changes that are not accounted for in the clusters.
The clusters do not capture the absolute quantity of calories purchased
The household representations provide us with a picture of how a household’s calories are distributed across categories, but they cannot capture the scale of the difference in absolute value. For example, if a household was purchasing twice as much of everything compared to another household, it would be represented in the same way. In practice, this means we cannot differentiate between households that over purchase and households that under purchase.
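A tiny worked example makes this limitation concrete. Assuming made-up kcal totals for two households, one of which buys exactly twice as much of everything, both per-household min-max scaling and kcal shares give identical representations even though the totals differ.

```python
from math import isclose

def minmax_scale(values):
    """Scale one household's kcal totals to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def kcal_share(values):
    """Express each category as a share of the household's total kcal."""
    total = sum(values)
    return [v / total for v in values]

# Hypothetical kcal per category; household_b buys twice as much of everything.
household_a = [800.0, 0.0, 300.0, 1800.0]
household_b = [2 * v for v in household_a]

same_scaled = all(
    isclose(a, b) for a, b in zip(minmax_scale(household_a), minmax_scale(household_b))
)
same_share = all(
    isclose(a, b) for a, b in zip(kcal_share(household_a), kcal_share(household_b))
)
print(same_scaled, same_share, sum(household_a), sum(household_b))
```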
Conclusion and next steps
Although this is just our first step into understanding what people eat and why, it’s been a very useful process! A few things we have learnt along the way are:
- Understanding purchasing behaviours is hard, even with access to a large and detailed dataset like the one available to us. One of the aims of our work was to provide evidence for population segmentation that could be used for qualitative research and/or targeting specific behaviours. There are some trade-offs to resolve between providing easily interpretable clusters that can be used for interventions and achieving well separated clusters. Our preferred solution has been to use a k-means clustering algorithm with k=70. However, this is a very large number of groups to consider when targeting specific behaviours. Moreover, we have often been asked about describing the clusters in terms of their underlying demographic and socio-economic characteristics, which are not features included in our analysis.
- We need to do more work to interpret the clusters in a way that makes them more relatable to external audiences, and that makes them usable to the rest of our team at Nesta. We think that the analysis of significant features is a useful first step, and we will explore using natural language processing techniques to allow us to summarise the significant features to describe the content of each cluster’s shopping basket.
- There are many other possible ways to represent the data (eg, volume, quantities, calories) which might give very different results. While we tested different approaches, we chose to represent the data via share of calories for two reasons. Firstly, in the data available to us, calorie content has been reported more accurately and consistently than volume. Secondly, our research question revolved around understanding purchasing behaviours, with a view to finding the food categories that most significantly influence the calorie content of shopping baskets. The choice of how to represent the data needs to be linked to the research question and the type of evidence that one wants to generate.
- Representing shopping behaviours as a share of kcal across food and drink categories is a useful start, but it does not allow us to describe more complex relationships that are essential to understand when considering food choices; for example, what type of foods are most frequently bought together. Our next step is to explore this using techniques such as network analysis or market basket analysis, in order to better understand which products are more likely to be purchased together.
- We observe what households buy but not what they eat. The knowledge gap between what’s purchased and what’s consumed in a household can make it hard to draw conclusions. For example, some categories of food might be purchased in a large amount and therefore contain a lot of calories, but may not be actually consumed in that month or just by the purchasing household alone.
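As a pointer for the "bought together" question raised in the list above, one very simple starting point for market basket analysis is to count how often pairs of categories appear in the same basket. The sketch below is only an illustration of that idea on invented baskets; `pair_support` is a hypothetical helper, and it says nothing about how the planned analysis will actually be done.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, each a set of food and drink categories.
baskets = [
    {"whole milk", "cereal", "white bread"},
    {"whole milk", "cereal"},
    {"lager", "crisps"},
    {"whole milk", "white bread", "crisps"},
]

def pair_support(baskets):
    """Share of baskets in which each pair of categories appears together."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(sorted(basket), 2))
    return {pair: n / len(baskets) for pair, n in counts.items()}

support = pair_support(baskets)
for pair, share in sorted(support.items(), key=lambda item: item[1], reverse=True):
    print(pair, share)
```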
We hope you enjoyed reading this article! If you are working in this space please reach out to us to share ideas and discuss insights. For a more detailed look at the analysis, please also see our GitHub repository for the code.