Combining nutritionist expertise and ML to provide high-quality nutrition data worldwide

Published in WW Tech Blog · 7 min read · Apr 20, 2023

Authors: Reka Daniel-Weiner, Yameng Zhang

When you embark on a journey to improve your health and wellness, making informed choices on which foods you consume to fuel your body is crucial. At WW, our behavior and nutrition science teams conduct clinical trials [ref] and constantly monitor the latest scientific literature to provide our members with the most current information on which nutrients and overall dietary pattern will maximize health benefits. For example, we know that consuming fiber is beneficial for heart health while protein provides the body with important building blocks. On the other hand, saturated fats are not beneficial for heart health, and consuming sugar, especially added sugar, can increase the risk of both obesity and diabetes [ref]. We take the complexity of all of this nutritional information and summarize it into a single value (known as Points®) to behaviorally nudge our members to an overall healthier dietary pattern using the most cutting-edge nutrition and behavior change science.

This is why we strive to have the most comprehensive and complete food database in the world. However, with hundreds of millions of foods from around the globe, maintaining it manually is a major challenge for our nutritionist team. They therefore collaborate closely with us, the Data Science and Machine Learning Engineering team at WW, to fill in the blanks with precise estimates of missing values. This need becomes especially apparent when our members use the points calculator in our app. There they can enter the nutrition information listed on a food label and instantly receive the Points® value for any food they are considering. Unfortunately, many countries do not require labels to report every nutrient that scientific research has identified as important for estimating a food's impact on health and wellness. For example, the US only started requiring added sugar content on labels in 2020/2021 [ref], and many other countries still do not require it, despite its known detrimental effects on health.

In this blog post, we discuss how the Data Science and Machine Learning Engineering team at WW built an ML model to estimate the added sugar content of a food, and how we made its estimates available to the calculator within the WW app as well as to our internal database. As a first step, we worked with our nutrition science experts to review currently available methods of added sugar estimation. One of the most widely used approaches is a manual process [ref], which is very valuable as an internal tool for our nutritionist team when they review existing entries in the database. However, because it involves subjective evaluations, this manual process would not have been suitable for our member-facing points calculator. After in-depth discussions with the nutrition science team, we concluded that the problem could be approached as a machine learning problem. After trying alternative approaches such as linear regression and KNN, we decided to build an XGBoost [ref] model, a method that has shown great success in discovering nonlinear relationships in tabular data. Beyond the high accuracy of the resulting model, XGBoost's scalability made it an ideal choice: we expected the final endpoint to handle individual requests from the calculator with fast response times and to respond to batch requests for backfilling suggested values to be reviewed in the food database.

As a first iteration, we went ahead and collected all food data available in the US which already had entries for added sugar values. Using total sugar (i.e. all sources of sugar, including naturally occurring and added sugars), carbohydrates, fiber, total fat, sodium, calories, saturated fat as well as food category as features, we built an XGBoost model on 80% of the available data and performed hyperparameter optimization on it. During this initial exploration phase, we also learned that training on all foods, including the ones with no sugar present (where manual estimation of added sugar would automatically default to zero) yielded a more reliable model, presumably because it allowed XGBoost to learn relationships among all nutrients present. Using the resulting model to predict added sugar values on the 20% hold-out dataset, we saw very encouraging results, which we published in a scientific article [ref]. However, we did not think a model validated on only US foods would deliver sufficiently reliable results to our members, especially globally.

As a next step, we asked our global team of nutritionists to evaluate our model predictions against manually estimated added sugar values on international data (no ground-truth dataset exists there, because added sugar is not reported on the label). Unsurprisingly, this manual review showed that the added sugar predictions were suboptimal for many foods in our international food databases, particularly in specific food categories such as sugary drinks and milk products. Our nutrition science team focused on the areas where the model was most wrong, and over several iterations we added newly verified international foods to our training and test datasets. As the model will mostly be used in markets that do not have added sugar on the label, i.e. outside the US, we made sure that at least 30% of the dataset consisted of non-US foods. We also learned that adding the food category as a feature increased model accuracy only marginally, so we decided to remove it from the model in order to increase usability and decrease guesswork for members using the calculator.
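One simple way to surface the categories where predictions are weakest is to group absolute errors by food category. The sketch below is illustrative only: the field names (`category`, `reviewed_g`, `predicted_g`) and all sample values are invented for this example and are not WW data or the team's actual tooling.

```python
from collections import defaultdict


def mae_by_category(records):
    """Return {category: mean absolute error in grams}, worst category first."""
    errors = defaultdict(list)
    for rec in records:
        errors[rec["category"]].append(abs(rec["predicted_g"] - rec["reviewed_g"]))
    per_category = {cat: sum(errs) / len(errs) for cat, errs in errors.items()}
    # Sort descending so the categories needing the most review come first.
    return dict(sorted(per_category.items(), key=lambda item: item[1], reverse=True))


# Hypothetical nutritionist-reviewed foods (grams of added sugar per portion).
reviewed_foods = [
    {"category": "sugary drinks", "reviewed_g": 20.0, "predicted_g": 14.0},
    {"category": "sugary drinks", "reviewed_g": 25.0, "predicted_g": 21.0},
    {"category": "milk products", "reviewed_g": 0.0, "predicted_g": 3.0},
    {"category": "milk products", "reviewed_g": 8.0, "predicted_g": 5.0},
    {"category": "bread", "reviewed_g": 2.0, "predicted_g": 2.5},
]

category_errors = mae_by_category(reviewed_foods)
for category, mae in category_errors.items():
    print(f"{category}: {mae:.2f} g")
```

A ranking like this could then guide which categories to prioritize when collecting newly verified foods for the next training iteration.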

In order to gain a more intuitive understanding of how the XGBoost model uses feature values and their interactions to make predictions, we also looked at SHAP values, a measure of feature importance based on game theory [ref]. The figures below provide an example of SHAP plots: each dot represents one specific food, the horizontal axis shows the value of a feature (total sugar or fiber), and the vertical axis shows in which direction, and by how much, that feature value shifted the added sugar prediction for that food. The top figure shows that, on average, the model predicts higher added sugar values for foods with higher total sugar. Intuitively this makes sense: the more sugar a food contains, the more we expect it to contain added sugar. The bottom figure adds some nuance to this observation. Here the value of fiber is shown on the horizontal axis, and each dot is colored by its sugar value, i.e. foods high in sugar are shown in red, while foods low in sugar are shown in blue. For foods that are high in sugar but also high in fiber, the model predicts lower values of added sugar. This makes sense, as foods naturally high in fiber, such as fruits, tend to have little to no added sugar. Such summary plots provide transparency into how the model derives its predictions. This understanding can then serve as a basis for discussions with our nutrition science team to pinpoint areas where the model is underutilizing specific information and should be retrained with better, more specific, or simply more data.
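To make the game-theoretic idea behind SHAP more concrete, here is a minimal sketch that computes exact Shapley values for a toy two-feature function by enumerating all feature coalitions. Everything in it is an assumption made for illustration: `toy_model` is an invented stand-in and not the trained XGBoost model, the single `baseline` food is made up, and the SHAP library itself uses more efficient algorithms and a background dataset rather than this brute-force enumeration.

```python
from itertools import combinations
from math import factorial


def toy_model(total_sugar, fiber):
    """Hypothetical stand-in for a trained model (grams of added sugar)."""
    return max(0.0, 0.8 * total_sugar - 1.5 * fiber)


def shapley_values(model, food, baseline):
    """Exact Shapley value of each feature of `food` relative to `baseline`."""
    names = list(food)
    n = len(names)

    def value(coalition):
        # Features in the coalition take the food's values; the rest stay at baseline.
        inputs = {m: (food[m] if m in coalition else baseline[m]) for m in names}
        return model(**inputs)

    contributions = {}
    for name in names:
        others = [m for m in names if m != name]
        total = 0.0
        for size in range(len(others) + 1):
            # Standard Shapley weight for a coalition of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_feature = value(set(subset) | {name})
                without_feature = value(set(subset))
                total += weight * (with_feature - without_feature)
        contributions[name] = total
    return contributions


baseline = {"total_sugar": 5.0, "fiber": 1.0}  # hypothetical reference food
food = {"total_sugar": 20.0, "fiber": 6.0}     # hypothetical high-sugar, high-fiber food

phi = shapley_values(toy_model, food, baseline)
for feature, contribution in phi.items():
    print(f"{feature}: {contribution:+.1f} g")
```

In this toy case the high total sugar pushes the prediction up while the high fiber pulls it down, and the two contributions sum to the difference between the prediction for the food and the prediction for the baseline. That additivity is what makes each dot's vertical position in a SHAP plot interpretable as a per-feature shift.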

After several iterations, we were able to achieve a mean absolute error of 0.8 g (corresponding to 3.2 calories) at the default portion size on global foods. As an average portion contains 180 calories, and the USDA suggests consuming no more than 10% of calories from added sugar [ref] (about 18 calories for such a portion), we felt that this was a reasonable starting point for a first launch into production, with the prospect of iterating on the model as more data becomes available.
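The arithmetic behind these numbers can be sketched as follows. The 0.8 g error, the 180-calorie portion, and the 10% guideline come from the text above. The factor of 4 calories per gram of sugar is the usual approximation, the toy predictions are invented, and comparing the error to 10% of a single portion is only one possible reading of the reasoning.

```python
KCAL_PER_GRAM_SUGAR = 4  # common approximation for sugars and other carbohydrates


def mean_absolute_error(actual_g, predicted_g):
    """Average absolute difference, in grams, between paired values."""
    return sum(abs(a - p) for a, p in zip(actual_g, predicted_g)) / len(actual_g)


# Hypothetical per-portion added sugar values (grams), only to show the metric.
actual_g = [0.0, 5.0, 12.0, 3.0]
predicted_g = [0.5, 4.0, 13.5, 3.0]
toy_mae_g = mean_absolute_error(actual_g, predicted_g)

# Figures reported in the article.
reported_mae_g = 0.8
average_portion_kcal = 180

reported_mae_kcal = reported_mae_g * KCAL_PER_GRAM_SUGAR
added_sugar_limit_kcal = 0.10 * average_portion_kcal
error_share_of_limit = reported_mae_kcal / added_sugar_limit_kcal

print(f"toy MAE: {toy_mae_g:.2f} g")
print(f"reported MAE: {reported_mae_kcal:.1f} kcal per portion")
print(f"10% added sugar guideline for an average portion: {added_sugar_limit_kcal:.0f} kcal")
print(f"error as a share of that guideline: {error_share_of_limit:.0%}")
```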

At WW we believe that it is crucial that the Data Science and Machine Learning Engineering team itself is responsible for deploying models [ref] so iterations can be done seamlessly. To expose the model, we set up an endpoint with FastAPI [ref], a Python web framework for building REST APIs. Clients send requests containing the necessary nutrition information to the endpoint, and the response includes the estimated added sugar value. This value is consumed both by the calculator, to provide live initial estimates, and by our database, to prefill an estimated added sugar field for review by the nutrition science team. Below you see the calculator as it appears in the US (where added sugar is available on the labels; “Added sugar” is a field required to be filled out by the member) and in the UK (where added sugar needs to be estimated; no “Added sugar” field available):

Given the global application of the added sugar model, the endpoint is able to handle requests that are slightly different among different markets and apply corresponding logic. For example, in the US fiber is included under total carbohydrates, while in the UK it is not; the endpoint automatically takes this into account when it interprets the incoming information from the nutrition label. Additionally, while it is often voluntarily disclosed by manufacturers, fiber information is not required on the label in the UK. If this optional information is missing, the endpoint will first apply a fiber estimation model, which we reviewed both independently and in conjunction with the added sugar model, to get the best estimation of the field and then pass the value to the added sugar estimation model.
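The market-specific handling described above might look roughly like the sketch below. It is not the production endpoint: the function and field names are hypothetical, the fiber estimator is a trivial placeholder rather than a real model, and the choice to normalize carbohydrates to the US convention (fiber included) before prediction is an assumption made for the example.

```python
from typing import Callable, Dict, Optional

# Whether a market's label reports fiber as part of total carbohydrates
# (US: included, UK: not included, as described in the text).
CARBS_INCLUDE_FIBER = {"US": True, "UK": False}

Label = Dict[str, Optional[float]]


def prepare_features(label: Label, market: str,
                     estimate_fiber: Callable[[Label], float]) -> Dict[str, float]:
    """Return model inputs on a common basis without mutating the request."""
    features = dict(label)

    # Step 1: fill in fiber when the label omits it (optional in the UK).
    if features.get("fiber_g") is None:
        features["fiber_g"] = estimate_fiber(features)

    # Step 2 (assumed convention): express carbohydrates including fiber.
    if not CARBS_INCLUDE_FIBER[market]:
        features["carbs_g"] = features["carbs_g"] + features["fiber_g"]

    return features


def placeholder_fiber_estimator(label: Label) -> float:
    """Stand-in for the fiber estimation model; NOT a real estimate."""
    return label["carbs_g"] / 10


uk_label = {"carbs_g": 30.0, "total_sugar_g": 12.0, "fiber_g": None}
us_label = {"carbs_g": 30.0, "total_sugar_g": 12.0, "fiber_g": 4.0}

uk_features = prepare_features(uk_label, "UK", placeholder_fiber_estimator)
us_features = prepare_features(us_label, "US", placeholder_fiber_estimator)
print(uk_features)
print(us_features)
```

Keeping this normalization in one function ahead of the model call means a single added sugar model can serve all markets, with label conventions handled at the edge of the endpoint.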

This endpoint has been used to batch-backfill all foods in our current database with estimates that guide manual review, and it currently receives about 100,000 requests a day, providing our members with the information they need to make informed decisions on which foods they consume to support their health and wellness journey.

Interested in joining the WW team? Check out the careers page to view technology job listings as well as open positions for other teams.

Written by Yameng Zhang