Predicting Food Deserts Via Population Health and Twitter Sentiment Analysis!

Karina Patel
Future Vision

--

Ever wondered what would happen to your health if you didn’t have access to grocery stores or fresh produce? Keep reading to learn about just how many people deal with this every day!

Be sure to check out my interactive webapp to visualize the results of my Twitter sentiment analysis!

Inspiration and Motivation:

Nutrition and access to food are critical components of an individual’s well-being and overall health! Unfortunately, many regions around the world lack access to fresh, quality food or the ability to consume it.

Food deserts are defined as areas across the United States, often low-income, where there is limited access to nutritious and affordable food, especially fruits, vegetables, whole grains, and low-fat milk. More specifically, food deserts are census tracts where more than 500 people or over 33 percent of the population in that tract must travel over a mile for fresh groceries.
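
As a rough sketch, the tract-level rule above can be written as a small Python function. The function name, argument names, and default thresholds are illustrative assumptions that simply encode the wording in this post; this is not the USDA's own implementation.

```python
def is_low_access_tract(population, far_from_store, min_people=500, min_share=0.33):
    """Return True if a tract meets the low-access rule described above.

    population     -- total residents in the census tract
    far_from_store -- residents who must travel over a mile for fresh groceries
    The thresholds mirror the wording in this post; all names are hypothetical.
    """
    if population <= 0:
        return False
    share = far_from_store / population
    return far_from_store > min_people or share > min_share


# Hypothetical tracts, for illustration only
print(is_low_access_tract(4000, 650))  # True: more than 500 people are far from a store
print(is_low_access_tract(900, 350))   # True: about 39% of residents are far from a store
print(is_low_access_tract(4000, 200))  # False: neither threshold is met
```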

Over 23.5 million people reside in food deserts. While there have been efforts to open grocery stores in some areas to see how eating habits and overall health change, it is sometimes too late to change pre-existing habits. Areas with relatively few grocery stores are also likely to be under-resourced and disadvantaged in other ways. Additionally, grocery chains are less likely to build in lower-income neighborhoods and often steer away from areas where other businesses have not built, out of concern about whether a store would succeed.

If food desert formation can be predicted beforehand, preventative action such as opening grocery stores could help turn at-risk regions around! Additionally, looking at health patterns across the US in relation to food deserts helps uncover how strongly these conditions are associated with poor health.

Map of the current food desert distribution in the United States

Goal/Scope:

Is it possible to predict food deserts using population health and Twitter sentiment? The goal is to use features that are not currently used by the United States Department of Agriculture Economic Research Service (USDA ERS) to predict food deserts, in the hope that other census-tract data can help flag food desert regions early on. This analysis provides information about which population health factors are correlated with food deserts and just how big an impact these conditions have on the health of individuals.

Calculating social media sentiment is a great way to find out how residents feel about different types of foods and restaurants. Twitter allows people to express opinions quickly and concisely, which makes it a great source for gathering opinion data from people across the United States quickly and at scale!

Combining population health and social media sentiment, the aim is to see how opinions and health statistics line up with food desert areas and to look at consumption trends across the U.S.

Business and Health Value:

This model would help to inform the public of at-risk areas where there is room for growth. Growth can wear many different hats and act at a variety of scales, spanning from fresh food initiatives to grocery store implementation!

Grocery stores could also benefit from this predictive modeling when deciding where to build. Opening in areas with fewer grocery stores could bring additional profit given the low competition while also benefiting the census tract by providing access to nutritious meals. Access to fresh produce and groceries allows residents to transition to healthier meals, which would benefit the overall health of the population!

Data Collection:

Health behavior data was published by the Division of Population Health at the Centers for Disease Control and Prevention (CDC). This dataset contains statistics on overall health for census tracts across the United States.

Example of fast food restaurants weighted by prevalence in Twitter data.

Social media data was also collected through the Twitter API to analyze patterns of healthy vs. unhealthy consumption. To set up this filtered search, keyword lists were created for 4 different topics: healthy foods, unhealthy foods, fast foods, and grocery stores. Finally, all tweets mentioning any of the foods or stores/restaurants on the 4 lists were pulled!

After pulling and filtering the data from Twitter, the prevalence of each restaurant was visualized: in the fast food word cluster shown above, the size of each word is proportional to how often that restaurant appeared. Each of these datasets was loaded into MongoDB and stored for pre-processing.
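
The keyword matching and prevalence counting described above might look roughly like the sketch below. The keyword lists, sample tweets, and function names are hypothetical stand-ins (the actual lists are not shown in this post), and the matching is done on plain strings rather than through the Twitter API.

```python
import re
from collections import Counter

# Hypothetical keyword lists; the lists actually used in the project are not shown here.
TOPICS = {
    "healthy_foods": ["salad", "kale", "quinoa"],
    "unhealthy_foods": ["donut", "fries", "soda"],
    "fast_food": ["mcdonalds", "taco bell", "wendys"],
    "grocery_stores": ["whole foods", "kroger", "safeway"],
}


def match_keywords(text, keywords):
    """Return the keywords that appear as whole words or phrases in the text."""
    lowered = text.lower()
    return [kw for kw in keywords if re.search(r"\b" + re.escape(kw) + r"\b", lowered)]


def categorize(text):
    """Map each topic to the keywords from its list that the tweet mentions."""
    hits = {topic: match_keywords(text, kws) for topic, kws in TOPICS.items()}
    return {topic: found for topic, found in hits.items() if found}


# Made-up tweets, for illustration only
tweets = [
    "Late night Taco Bell run again",
    "Kale salad from Whole Foods for lunch",
    "taco bell > wendys, no contest",
]

# Count how often each fast food restaurant is mentioned (the word-cluster weights)
fast_food_counts = Counter()
for tweet in tweets:
    fast_food_counts.update(categorize(tweet).get("fast_food", []))

print(categorize(tweets[1]))
print(fast_food_counts.most_common())
```

A tweet can land in more than one category here, which seems reasonable when a single tweet mentions both a food and a store.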

The target data, a label indicating whether each census tract is currently classified as a food desert, came from a US Department of Agriculture (USDA) dataset.

Data Collection and Processing:

  1. Step 1 was to merge the food desert target data with population health records using census tract as the key. Because there are over 72,000 census tracts in the United States, filtering down to census tracts in the 500 largest US cities made sense to narrow the scope without restricting it to a single state or region
  2. After merging these two datasets, the feature matrix was narrowed from over 300 features down to about 15 using correlation analysis and health intuition
  3. After running initial models on the population health data alone, Twitter data was added to the project to provide a well-rounded picture of how people feel about different food categories
  4. Over the span of 1.5 weeks, over 3 million tweets were collected, each falling into the category of healthy foods, unhealthy foods, fast food restaurants, or grocery stores. All the data was stored in MongoDB for filtering and querying through Python!
  5. Because the goal was ultimately to map the tweets back to the population health dataset, only geotagged tweets with latitude and longitude data from the United States could be used in the analysis:
    - Removed any tweets with country code other than “US” and language other than “en”
    - Filtered out the tweets without coordinate information
    - Cleaned tweet text to remove any emojis, URLs, @mentions, or retweets
  6. After cleaning and filtering down to a subset of tweets, NLTK’s VADER sentiment analyzer was used to calculate the sentiment of the cleaned text, which served as features in the model
  7. Each tweet with an associated latitude/longitude pair was next mapped back to its corresponding US census tract. The Federal Communications Commission (FCC) API returned the census block that the latitude/longitude pair belonged to, which was then converted to a census tract so the tweets could be merged with the population health data and food desert targets
  8. Many census tracts lacked any Twitter data because of the short collection period and the narrow “food” scope. In contrast, larger cities and urban areas had many tweets falling inside a single tract:
    - For tracts with over 5 tweets, the average sentiment of all tweets was taken for each of the four categories separately: healthy foods, unhealthy foods, fast food restaurants, and grocery stores
    - To account for the variance among the tracts, any tract without tweets was filled with the average sentiment of its corresponding county for the same four categories
  9. This was the final pre-processing step. The final feature matrix contained the population health features and 4 additional Twitter sentiment features, one for each of the Twitter pulls:
Features of the model in order of importance in differentiating food desert vs non-food desert
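
The tweet filtering, cleaning, tract mapping, and aggregation steps above can be sketched in plain Python as follows. The tweet field names, the emoji ranges, and the handling of tracts with 1 to 5 tweets are assumptions made for illustration. Sentiment scores are passed in as plain numbers; in the project they would come from a scorer such as the compound score of NLTK's VADER analyzer. The FIPS slicing relies on the standard layout of a 15-digit census block code (2-digit state, 3-digit county, 6-digit tract, 4-digit block).

```python
import re
from collections import defaultdict
from statistics import mean

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
# Rough emoji ranges; an assumption, not an exhaustive list
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def keep_tweet(tweet):
    """Keep US, English, geotagged, non-retweet tweets (field names are hypothetical)."""
    return (
        tweet.get("country_code") == "US"
        and tweet.get("lang") == "en"
        and tweet.get("coordinates") is not None
        and not tweet.get("text", "").startswith("RT @")
    )


def clean_text(text):
    """Strip URLs, @mentions, and emojis, then collapse whitespace."""
    for pattern in (URL_RE, MENTION_RE, EMOJI_RE):
        text = pattern.sub("", text)
    return " ".join(text.split())


def block_to_tract(block_fips):
    """A 15-digit census block FIPS code starts with its 11-digit tract code."""
    return block_fips[:11]


def tract_to_county(tract_fips):
    """The first 5 digits of a tract code are the state and county FIPS."""
    return tract_fips[:5]


def tract_sentiment(scored, all_tracts, min_tweets=5):
    """Average sentiment per tract for one tweet category, with a county fallback.

    scored     -- list of (tract_fips, sentiment) pairs for one category
    all_tracts -- every tract that needs a value
    Tracts with more than `min_tweets` tweets use their own mean. Every other
    tract gets the county mean (applying this to tracts with 1 to 5 tweets is
    an assumption), or None when the county has no tweets at all.
    """
    by_tract, by_county = defaultdict(list), defaultdict(list)
    for tract, score in scored:
        by_tract[tract].append(score)
        by_county[tract_to_county(tract)].append(score)

    result = {}
    for tract in all_tracts:
        scores = by_tract.get(tract, [])
        county_scores = by_county.get(tract_to_county(tract), [])
        if len(scores) > min_tweets:
            result[tract] = mean(scores)
        elif county_scores:
            result[tract] = mean(county_scores)
        else:
            result[tract] = None
    return result


print(clean_text("Loving this salad 🥗 @freshco https://t.co/abc"))
print(block_to_tract("060750201001023"))
```

Running `tract_sentiment` once per category (healthy foods, unhealthy foods, fast food restaurants, grocery stores) would give the four sentiment columns to join onto the population health features by tract code.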

Results:

A number of different models were run on the data to compare the performance of different binary classifiers. Comparing the ROC-AUC scores for SVM, KNN, Logistic Regression, Random Forest, and Gradient Boosting below shows that certain models performed better than others.

Receiver Operating Characteristic (ROC) Curve
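
For readers unfamiliar with the metric, ROC-AUC can be read as the probability that a randomly chosen food desert tract receives a higher predicted score than a randomly chosen non-food-desert tract. Below is a minimal pure-Python sketch of that pairwise definition. The labels and scores are made up for illustration, and in practice a library routine would likely be used on the real test set.

```python
def roc_auc(labels, scores):
    """Pairwise ROC-AUC: share of (positive, negative) pairs ranked correctly.

    Ties count as half. This is O(n_pos * n_neg), fine for a small illustration.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels (1 = food desert) and scores from two hypothetical models
labels = [1, 1, 1, 0, 0, 0]
model_a = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
model_b = [0.6, 0.4, 0.5, 0.7, 0.3, 0.5]

print(roc_auc(labels, model_a))  # 8 of 9 pairs ranked correctly
print(roc_auc(labels, model_b))  # no better than chance on this toy data
```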

The best model found was Gradient Boosting with a maximum depth of 5, a learning rate of 0.1, and 100 estimators. Predicting food deserts for the test set with this model gave the following results:

  • ROC-AUC: 0.88
    This gives an idea of how well the model can distinguish between a census tract which is a food desert and one which isn’t
  • Accuracy: 0.89
    This is the fraction of predictions our model got right
  • Recall: 0.83
    What proportion of actual food deserts were identified correctly by my model?
  • Precision: 0.86
    What proportion of food desert identifications was actually correct?
  • F1 Score: 0.84
    This score is an overall measure of a model’s accuracy that combines precision and recall. A high F1 score (close to 1) means that the model is good at minimizing false positives (model predicts food desert, isn’t actually one) and false negatives (model doesn’t predict a food desert when it is one).
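
To make these definitions concrete, here is a small sketch that computes accuracy, precision, recall, and F1 from confusion-matrix counts. The counts are made up and are not the project's actual confusion matrix; the last lines only check that the F1 score reported above is consistent with the reported precision and recall.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the metrics above from confusion-matrix counts.

    tp -- food deserts predicted as food deserts
    fp -- non-food-deserts predicted as food deserts
    fn -- food deserts the model missed
    tn -- non-food-deserts predicted as non-food-deserts
    """
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
    }


def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Made-up counts for 40 hypothetical tracts
metrics = classification_metrics(tp=8, fp=2, fn=4, tn=26)
print(metrics)

# Consistency check on the values reported in this post
print(round(f1_score(0.86, 0.83), 2))  # 0.84
```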

Interactive Visualization:

I created an interactive Flask webapp to visualize the results of my Twitter sentiment analysis.

  • Tweet Map: Map of all geotagged tweets with the category and the sentiment of the tweet
  • State Map: Summarizing healthy vs. unhealthy sentiment distribution across the U.S. by state
  • County Map: Summarizing healthy vs. unhealthy sentiment distribution across the U.S. by county

Future Steps:

  1. Break down tweets into food composition: new features for calories, protein, sugar, carbohydrate, and fat breakdown for the food(s) mentioned in the tweet
    - Allow us to approximate the measure of nutrition for a given tweet to help determine food intake across the U.S.
  2. Expand dataset: continue to pull more Twitter data to get more accurate estimates of sentiment in all census tracts
    - Some census tracts had no tweets within the tract boundaries which limits the analysis I was able to perform
  3. Diversify my queries
    - By splitting the four categories of tweets I pull into finer-grained ones, I would be able to get a clearer estimate of what sorts of foods different areas prefer
    - Additionally, pulling more types of foods each time I query Twitter’s API would give me a more expansive dataset covering the distribution of foods across the U.S.

Tools Used:

  1. Twitter API
  2. FCC API
  3. Amazon Web Services
  4. MongoDB
  5. NLTK
  6. Carto API
  7. Carto.js
  8. SQL
  9. Flask
  10. Beautiful Soup

Thanks for reading!!

Feel free to check out my interactive Flask webapp hosted on AWS and other cool projects on my GitHub and LinkedIn! Don’t hesitate to reach out if you have any questions or are interested in doing a similar project!

Have a great day and hope you enjoyed reading about food deserts :)
