Nutrients by the Numbers: Food and Nutrition Statistics with Wolfram Language
This is an excerpt from a post written by Gay Wilson and Isabel Skidmore that was originally published on the Wolfram Blog. The full post can be viewed here.
Statistical analysis is an important tool in food science. It can uncover patterns and relationships in food and nutrition data, leading to advances in food manufacturing, nutrition counseling, food safety and new product development. Wolfram Language offers built-in functions for all standard statistical distributions. Here, we’ll use some of these functions to evaluate relationships between nutrients and visualize the data distributions with informative plots and histograms.
Interpreter for Food Entities
Use Interpreter to gather and group the entities for the foods you want to explore. The “yellow box” entities contain the nutritional data for each food type:
T-Tests for Zinc and Folate
A t-test is a statistical tool used to answer the question “Is the difference in the averages (means) of two groups statistically significant, or are the means different due to random chance?” Let’s use the TTest function to determine if the zinc and folate in berries are significantly different from the zinc and folate in green vegetables.
Berries and green vegetables are not significant sources of zinc, but we can use statistics to evaluate and compare trace amounts of this vital nutrient. Start with the null hypothesis that there’s no meaningful difference between berries and green vegetables in terms of their zinc content. Next, obtain the zinc amounts for each of the food types in both groups. The t-test does not require the two samples to be the same size. Get only the values, not the units, using the QuantityMagnitude function:
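As a minimal sketch of the unit-stripping step, the snippet below applies QuantityMagnitude to a short list of quantities. The numbers and the variable names are made-up placeholders for illustration, not values from the food entities:

```wolfram
(* Made-up placeholder quantities, not real nutrition data *)
zincQuantities = {
   Quantity[0.2, "Milligrams"],
   Quantity[0.4, "Milligrams"],
   Quantity[0.5, "Milligrams"]};

(* QuantityMagnitude threads over the list and keeps only the numbers *)
zincValues = QuantityMagnitude[zincQuantities]
```

The result is a plain list of numbers, which is the form the statistical functions in the rest of this post work with.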
What is the average (mean) zinc content for each group?
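One way to answer this is with the built-in Mean function. The two lists here are hypothetical placeholder values of unequal length, not the zinc data from the post:

```wolfram
(* Hypothetical placeholder values; the two groups need not be the same size *)
groupA = {0.2, 0.4, 0.5, 0.3};
groupB = {0.5, 0.4, 0.6, 0.2, 0.8};

(* Mean of each group, returned as a pair *)
means = {Mean[groupA], Mean[groupB]}
```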
The t-test does assume that the data is normally distributed. The TTest function automatically tests for normal distribution, but you can check it yourself using the DistributionFitTest function. This function returns a p-value, which is the probability of seeing data at least as extreme as the sample if the null hypothesis is true. The default null hypothesis for DistributionFitTest is that the data comes from a normal distribution:
We will use the common significance level α of 0.05, or 5%, to determine whether to reject or fail to reject the null hypothesis. Because both of these p-values from DistributionFitTest are greater than 0.05, we fail to reject the null hypothesis: the zinc data for berries and green vegetables is consistent with a normal distribution. Therefore, the t-test is appropriate to use:
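The normality check can be sketched as follows. Because the original values are not reproduced here, the sample is simulated from a normal distribution with a fixed seed; the parameters and names are assumptions chosen only for illustration:

```wolfram
(* Simulated stand-in data, not real nutrition values *)
SeedRandom[1];
sample = RandomVariate[NormalDistribution[0.4, 0.1], 30];

(* With a single argument, DistributionFitTest tests against a normal distribution
   and returns a p-value *)
normalityP = DistributionFitTest[sample]

(* True means we fail to reject normality at the 5% level *)
normalityP > 0.05
```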
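A two-sample t-test along these lines might look like the sketch below. The two groups are simulated placeholders of different sizes, not the berry and green vegetable data:

```wolfram
(* Simulated stand-in groups of unequal size, not real nutrition values *)
SeedRandom[2];
groupA = RandomVariate[NormalDistribution[0.3, 0.1], 12];
groupB = RandomVariate[NormalDistribution[0.5, 0.1], 15];

(* Given a pair of datasets, TTest tests whether their means are equal
   and returns a p-value *)
tTestP = TTest[{groupA, groupB}]

(* True means the difference in means is significant at the 5% level *)
tTestP < 0.05
```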
The p-value from the t-test is less than 0.05. Therefore, we can reject the null hypothesis and conclude that there is a significant difference in the average zinc content of berries versus green vegetables. Easily visualize this difference using PairedSmoothHistogram:
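A sketch of the plotting call, again with simulated placeholder groups rather than the zinc data:

```wolfram
(* Simulated stand-in groups, not real nutrition values *)
SeedRandom[2];
groupA = RandomVariate[NormalDistribution[0.3, 0.1], 12];
groupB = RandomVariate[NormalDistribution[0.5, 0.1], 15];

(* Draws the two smoothed distributions back to back on a shared axis *)
pairedPlot = PairedSmoothHistogram[groupA, groupB]
```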
Next, we examine the difference in average folate content:
As with zinc, the t-test p-value is below 0.05, so we can reject the null hypothesis and conclude that the folate difference between berries and green vegetables is statistically significant. Wolfram Language provides both full and shortened conclusions of the test:
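The conclusions can be requested as properties of the test. In this sketch the data is simulated, and the second argument 0 is the hypothesized difference between the two means:

```wolfram
(* Simulated stand-in groups, not real nutrition values *)
SeedRandom[2];
groupA = RandomVariate[NormalDistribution[0.3, 0.1], 12];
groupB = RandomVariate[NormalDistribution[0.5, 0.1], 15];

(* Full-sentence conclusion of the test *)
fullConclusion = TTest[{groupA, groupB}, 0, "TestConclusion"]

(* Short form of the same conclusion *)
shortConclusion = TTest[{groupA, groupB}, 0, "ShortTestConclusion"]
```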
A paired histogram illustrates this difference in the two datasets:
Mann–Whitney Test for Iron
There are multiple ways to visualize the distribution of datasets. A number line plot is a compact way to compare the distribution of two datasets:
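For instance, passing a list of two datasets to NumberLinePlot places each one on its own row of the number line. The values below are hypothetical placeholders:

```wolfram
(* Hypothetical placeholder values, not real nutrition data *)
groupA = {0.2, 0.4, 0.5, 0.3, 0.7};
groupB = {0.5, 0.9, 1.1, 0.8, 2.6, 1.4};

(* Each dataset is drawn as a row of points along a shared number line *)
linePlot = NumberLinePlot[{groupA, groupB}]
```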
Scatter plots and bar charts are also effective visuals, with multiple options to customize the charts:
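A rough sketch of both chart types, using hypothetical placeholder values and labels that are assumptions for illustration only:

```wolfram
(* Hypothetical placeholder values, not real nutrition data *)
groupA = {0.2, 0.4, 0.5, 0.3, 0.7};
groupB = {0.5, 0.9, 1.1, 0.8, 2.6, 1.4};

(* Scatter plot: each group is plotted against its position in the list *)
scatter = ListPlot[{groupA, groupB}, AxesLabel -> {"index", "value"}]

(* Bar chart comparing the two group means, with a label under each bar *)
bars = BarChart[{Mean[groupA], Mean[groupB]},
  ChartLabels -> {"group A", "group B"}]
```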
A related plot is a box-and-whisker chart. The box represents the middle 50% of the data values; the white line in the box represents the median. The vertical lines are the whiskers, which show the range of values, excluding any outliers (there is an option to include the outliers in the chart):
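A sketch of such a chart with outliers shown, using hypothetical placeholder values (the 2.6 in the second group is included so that there is a point for the "Outliers" specification to mark):

```wolfram
(* Hypothetical placeholder values, not real nutrition data *)
groupA = {0.2, 0.4, 0.5, 0.3, 0.7};
groupB = {0.5, 0.9, 1.1, 0.8, 2.6, 1.4};

(* The "Outliers" specification draws outlying points beyond the whiskers *)
boxes = BoxWhiskerChart[{groupA, groupB}, "Outliers",
  ChartLabels -> {"group A", "group B"}]
```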
Let’s evaluate the average iron difference for berries versus green vegetables by first checking for normal distribution:
The green vegetables iron data has a p-value below 0.05, so we reject the hypothesis that it is normally distributed. When the sample data is skewed rather than normally distributed, you can use the Mann–Whitney U test to determine whether two populations have the same location (median). It is a nonparametric test: unlike the t-test, it does not require normally distributed data:
The resulting p-value is slightly greater than our chosen significance level α of 5%. Therefore, we fail to reject the null hypothesis: the data does not show a statistically significant difference in the typical (median) iron content of berries versus green vegetables. A smooth histogram is a good way to view the overlap between the two datasets:
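The test itself can be sketched with MannWhitneyTest. The groups below are simulated from skewed (exponential) distributions as stand-ins for the iron data; the parameters are assumptions for illustration:

```wolfram
(* Simulated skewed stand-in data, not real nutrition values *)
SeedRandom[3];
groupA = RandomVariate[ExponentialDistribution[2], 12];
groupB = RandomVariate[ExponentialDistribution[1], 15];

(* Given a pair of datasets, MannWhitneyTest returns a p-value *)
mannWhitneyP = MannWhitneyTest[{groupA, groupB}]

(* True means the difference is significant at the 5% level *)
mannWhitneyP < 0.05
```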
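A sketch of the overlap plot, with simulated skewed groups standing in for the iron data:

```wolfram
(* Simulated skewed stand-in data, not real nutrition values *)
SeedRandom[3];
groupA = RandomVariate[ExponentialDistribution[2], 12];
groupB = RandomVariate[ExponentialDistribution[1], 15];

(* Both smoothed densities on one set of axes; the filling makes the overlap visible *)
overlapPlot = SmoothHistogram[{groupA, groupB}, Filling -> Axis]
```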
Use the TrimmedMean function to reduce the influence of outliers that may be skewing a result. In this example, we trim the outermost 10% of the data from each end and obtain a new mean:
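To see the effect on a small scale, the sketch below uses a made-up list of ten numbers with one obvious outlier. With a fraction of 0.1, TrimmedMean drops one value from each end of the sorted list (here, 1 and 100) before averaging:

```wolfram
(* Made-up list with one large outlier, not real nutrition data *)
values = {1, 2, 3, 4, 5, 6, 7, 8, 9, 100};

(* Ordinary mean, pulled upward by the outlier *)
plainMean = Mean[values]

(* Mean after dropping the smallest 10% and the largest 10% of the values *)
trimmedMean = TrimmedMean[values, 0.1]
```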
Hope you enjoyed this excerpt! The original post features much more nutritional analysis, such as an analysis of variance (ANOVA) of iron in meats and fish and a linear correlation of calories and fats in meats, among other things. Check out the original post here.