Nutrients by the Numbers: Food and Nutrition Statistics with Wolfram Language

This is an excerpt from a post written by Gay Wilson and Isabel Skidmore that was originally published on the Wolfram Blog. The full post can be viewed here.


Statistical analysis is an important tool in food science. It can uncover patterns and relationships in food and nutrition data, leading to advances in food manufacturing, nutrition counseling, food safety and new product development. Wolfram Language offers built-in functions for all standard statistical distributions. Here, we’ll use some of these functions to evaluate relationships between nutrients and visualize the data distributions with informative plots and histograms.

Interpreter for Food Entities

Use Interpreter to gather and group the entities for the foods you want to explore. The “yellow box” entities contain the nutritional data for each food type:

Interpreter in the Wolfram Language for berries
Interpreter in the Wolfram Language for citrus
Interpreter in the Wolfram Language for greens
Interpreter in the Wolfram Language for fish

T-Tests for Zinc and Folate

A t-test is a statistical tool used to answer the question “Is the difference in the averages (means) of two groups statistically significant, or are the means different due to random chance?” Let’s use the TTest function to determine if the zinc and folate in berries are significantly different from the zinc and folate in green vegetables.

Berries and green vegetables are not significant sources of zinc, but we can use statistics to evaluate and compare trace amounts of this vital nutrient. Start with the null hypothesis that there’s no meaningful difference between berries and green vegetables in terms of their zinc content. Next, obtain the zinc amounts for each of the food types in both groups. The t-test does not require the two samples to be the same size. Get only the values, not the units, using the QuantityMagnitude function:

QuantityMagnitude function showing values of zinc for berries and greens

What is the average (mean) zinc content for each group?

Mean function showing values of zinc for berries and greens
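
The original input cells are not reproduced in this excerpt, so here is a minimal sketch of the unit-stripping and averaging steps. The amounts are made-up placeholders, not real nutrient data, and the variable names are our own:

```wolfram
(* Placeholder zinc amounts with units; NOT real nutrient data *)
berriesZincQ = {Quantity[0.16, "Milligrams"], Quantity[0.09, "Milligrams"],
   Quantity[0.53, "Milligrams"]};
greensZincQ = {Quantity[0.42, "Milligrams"], Quantity[0.36, "Milligrams"],
   Quantity[0.61, "Milligrams"], Quantity[0.48, "Milligrams"]};

(* QuantityMagnitude threads over a list and returns the bare numbers *)
berriesZinc = QuantityMagnitude[berriesZincQ]
greensZinc = QuantityMagnitude[greensZincQ]

(* The two groups have different sizes; Mean is applied to each list separately *)
{Mean[berriesZinc], Mean[greensZinc]}
```

Stripping the units first keeps the later test functions working on plain numeric lists.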

The t-test does require the data to be normally distributed. The TTest function automatically tests for normality, but you can check it yourself using the DistributionFitTest function. This function returns a p-value: the probability of seeing data at least as far from the hypothesized distribution as our sample, assuming the null hypothesis is true. The default null hypothesis for DistributionFitTest is that the data comes from a normal distribution:

DistributionFitTest applied to zinc values for berries and greens
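
As a rough sketch of this normality check, assuming two plain numeric lists (the values are placeholders, not real nutrient data):

```wolfram
(* Placeholder values; NOT real nutrient data *)
berriesZinc = {0.09, 0.12, 0.14, 0.16, 0.17, 0.19, 0.21, 0.24, 0.27, 0.30};
greensZinc = {0.28, 0.33, 0.36, 0.39, 0.42, 0.44, 0.47, 0.51, 0.55, 0.61};

(* With a single argument, DistributionFitTest tests for normality and returns a p-value *)
pBerries = DistributionFitTest[berriesZinc]
pGreens = DistributionFitTest[greensZinc]

(* True means the p-value is above the significance level,
   i.e. no evidence against normality at that level *)
alpha = 0.05;
{pBerries > alpha, pGreens > alpha}
```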

We will use the common significance level α of 0.05, or 5%, to decide whether to reject the null hypothesis. Because both of these p-values from DistributionFitTest are greater than 0.05, we fail to reject the null hypothesis: there is no evidence that the zinc data for berries or green vegetables departs from a normal distribution. The t-test is therefore reasonable to use:

TTest function performed on the zinc values of berries and greens
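
A two-sample call might look like the sketch below. The lists are placeholders rather than the post's data, so the p-value it produces says nothing about real foods:

```wolfram
(* Placeholder values; NOT real nutrient data *)
berriesZinc = {0.09, 0.12, 0.14, 0.16, 0.17, 0.19, 0.21, 0.24, 0.27, 0.30};
greensZinc = {0.28, 0.33, 0.36, 0.39, 0.42, 0.44, 0.47, 0.51, 0.55, 0.61};

(* Null hypothesis: the two population means are equal.
   The samples do not need to have the same length. *)
pValue = TTest[{berriesZinc, greensZinc}]

(* The same test, reported as a table with the test statistic and p-value;
   the second argument 0 is the hypothesized difference in means *)
TTest[{berriesZinc, greensZinc}, 0, "TestDataTable"]
```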

The p-value from the t-test is less than 0.05. Therefore, we can reject the null hypothesis and conclude that there is a significant difference in the average zinc content of berries versus green vegetables. Easily visualize this difference using PairedSmoothHistogram:

PairedSmoothHistogram function used on the berries and greens zinc values, allowing for the data to be easily compared
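
For reference, a minimal PairedSmoothHistogram call takes the two datasets as separate arguments. The values and the plot label here are illustrative only:

```wolfram
(* Placeholder values; NOT real nutrient data *)
berriesZinc = {0.09, 0.12, 0.14, 0.16, 0.17, 0.19, 0.21, 0.24, 0.27, 0.30};
greensZinc = {0.28, 0.33, 0.36, 0.39, 0.42, 0.44, 0.47, 0.51, 0.55, 0.61};

(* Smoothed densities drawn back to back on a shared axis of data values *)
zincPlot = PairedSmoothHistogram[berriesZinc, greensZinc,
  PlotLabel -> "Zinc: berries vs. greens (placeholder data)"]
```

Because the two densities share one value axis, a shift in location between the groups is easy to spot.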

Next, we examine the difference in average folate content:

Folate value of berries and greens, visualized with the previous functions

As with zinc, the t-test p-value is below 0.05, so we reject the null hypothesis: the difference in folate between berries and green vegetables is statistically significant. Wolfram Language provides both full and shortened conclusions of the test:

T-test differences and conclusions shown for the folate values between berries and greens
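
One way to get at these conclusions is through a HypothesisTestData object, sketched below. The folate values are placeholders, not real nutrient data:

```wolfram
(* Placeholder folate values; NOT real nutrient data *)
berriesFolate = {6., 9., 14., 17., 21., 24., 25., 30.};
greensFolate = {38., 46., 57., 62., 75., 88., 97., 110., 129.};

(* Run the test once and keep the full result object *)
h = TTest[{berriesFolate, greensFolate}, 0, "HypothesisTestData"];

(* Query the same object for different summaries *)
h["PValue"]
h["ShortTestConclusion"]
h["TestConclusion"]
```

Keeping the result object avoids rerunning the test for each property you want to look at.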

A paired histogram illustrates this difference in the two datasets:

Paired histogram using PairedHistogram function showing the folate value differences between berries and greens

Mann–Whitney Test for Iron

There are multiple ways to visualize the distribution of datasets. A number line plot is a compact way to compare the distribution of two datasets:

Functions showing values of iron in berries and greens
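
A number line plot of two datasets can be sketched as follows, again with placeholder values rather than real iron data:

```wolfram
(* Placeholder iron values; NOT real nutrient data *)
berriesIron = {0.28, 0.36, 0.41, 0.53, 0.62, 0.69, 0.77};
greensIron = {0.30, 0.47, 0.73, 0.86, 1.10, 1.47, 2.71, 3.20};

(* Each dataset is drawn as a row of points above a common number line *)
NumberLinePlot[{berriesIron, greensIron},
 PlotLegends -> {"berries", "greens"}]
```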

Scatter plots and bar charts are also effective visuals, with multiple options to customize the charts:

Scatter plots and bar charts of iron values for greens and berries
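
The post's exact chart options are not shown in this excerpt. A plain version of each chart, using placeholder data and labels of our own choosing, could look like this:

```wolfram
(* Placeholder iron values; NOT real nutrient data *)
berriesIron = {0.28, 0.36, 0.41, 0.53, 0.62, 0.69, 0.77};
greensIron = {0.30, 0.47, 0.73, 0.86, 1.10, 1.47, 2.71, 3.20};

(* Scatter plot: one point per food, one color per group *)
ListPlot[{berriesIron, greensIron},
 PlotLegends -> {"berries", "greens"},
 AxesLabel -> {"food index", "iron"}]

(* Bar chart of the two group means *)
BarChart[{Mean[berriesIron], Mean[greensIron]},
 ChartLabels -> {"berries", "greens"}]
```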

A related plot is a box-and-whisker chart. The box represents the middle 50% of the data values; the white line in the box represents the median. The vertical lines are the whiskers, which show the range of values, excluding any outliers (there is an option to include the outliers in the chart):

Box-and-whisker chart with iron values for berries and greens
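
Here is a small sketch of a box-and-whisker chart with the outlier display switched on. The data is a placeholder, not real iron values:

```wolfram
(* Placeholder iron values; NOT real nutrient data *)
berriesIron = {0.28, 0.36, 0.41, 0.53, 0.62, 0.69, 0.77};
greensIron = {0.30, 0.47, 0.73, 0.86, 1.10, 1.47, 2.71, 3.20};

(* "Outliers" asks the chart to mark outlying points beyond the whiskers *)
ironBoxes = BoxWhiskerChart[{berriesIron, greensIron}, "Outliers",
  ChartLabels -> {"berries", "greens"}]
```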

Let’s evaluate the average iron difference for berries versus green vegetables by first checking for normal distribution:

DistributionFitTest function for iron in berries and greens

The iron data for green vegetables has a p-value below 0.05, so we reject the hypothesis that it is normally distributed. When the sample data is skewed rather than normally distributed, you can use the Mann–Whitney U test to determine whether two population distributions have roughly the same shape and location. It is a nonparametric test: unlike the t-test, it does not require normally distributed data:

Mann–Whitney U test used for testing iron in berries and greens
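
In Wolfram Language this test is available as MannWhitneyTest. A minimal sketch with placeholder values (not the post's data) follows:

```wolfram
(* Placeholder iron values; NOT real nutrient data.
   The greens list is deliberately right-skewed. *)
berriesIron = {0.28, 0.36, 0.41, 0.53, 0.62, 0.69, 0.77};
greensIron = {0.30, 0.47, 0.73, 0.86, 1.10, 1.47, 2.71, 3.20};

(* Rank-based test; returns a p-value by default *)
pIron = MannWhitneyTest[{berriesIron, greensIron}]

(* Test statistic and p-value in table form *)
MannWhitneyTest[{berriesIron, greensIron}, 0, "TestDataTable"]
```

Because the test works on ranks rather than raw values, a single very large value has far less influence than it would on a t-test.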

The resulting p-value is slightly greater than our chosen significance level α of 5%. Therefore, we fail to reject the null hypothesis: the test does not show a statistically significant difference in the iron content of berries versus green vegetables. A smooth histogram is a good way to view the overlap between the two datasets:

Smooth histogram of iron values in berries and greens, showing overlap between the two values
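
An overlaid smooth histogram can be sketched like this. The values are placeholders and the styling options are our own choice, not necessarily the post's:

```wolfram
(* Placeholder iron values; NOT real nutrient data *)
berriesIron = {0.28, 0.36, 0.41, 0.53, 0.62, 0.69, 0.77};
greensIron = {0.30, 0.47, 0.73, 0.86, 1.10, 1.47, 2.71, 3.20};

(* Both density estimates share one set of axes, so the overlap is visible *)
SmoothHistogram[{berriesIron, greensIron},
 Filling -> Axis,
 PlotLegends -> {"berries", "greens"}]
```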

Use the TrimmedMean function to remove data outliers that may be skewing a result. In this example, we trim the outlying 10% of data from both ends and obtain a new mean:

TrimmedMean function on the iron values of greens
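
To see the effect of trimming, compare the ordinary and trimmed means on a list with one large value. The numbers are placeholders, not real iron data:

```wolfram
(* Placeholder iron values with one large outlier; NOT real nutrient data *)
greensIron = {0.30, 0.47, 0.73, 0.86, 1.10, 1.20, 1.47, 1.60, 2.71, 9.80};

(* Ordinary mean, pulled upward by the outlier *)
plainMean = Mean[greensIron]

(* Drop the smallest 10% and largest 10% of values, then average the rest.
   With 10 values, that removes one value from each end. *)
trimmed = TrimmedMean[greensIron, 1/10]
```

With this list the trimmed mean comes out lower than the plain mean, because the single large value is dropped before averaging.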

Hope you enjoyed this excerpt! The original post features much more nutritional analysis, such as an analysis of variance (ANOVA) of iron in meats and fish and a linear correlation of calories and fats in meats, among other things. Check out the original post here.

Tech-Based Teaching Editor
Tech-Based Teaching: Computational Thinking in the Classroom

Tech-Based Teaching is all about computational thinking, edtech, and the ways that tech enriches learning. Want to contribute? Reach out to edutech@wolfram.com.