Using Publicly Available Data to Build a Machine Learning Model that Predicts Obesity Rates in US Cities

Vinh Tran
Future Vision

--

Is it possible to build a machine learning model on your own using publicly available data? Of course! This article will show you how with an example.

Why Publicly Available Data?

In today's world, data is gold. Oftentimes the data we want are not readily available for a variety of reasons (they don't exist, have yet to be collected, or are proprietary). Luckily, there are plenty of publicly available datasets out there that can be used to build models and derive insights; you just have to look for them. In particular, the US government has troves of publicly available data, including from the Census Bureau and the CDC. Google came out with a search engine for publicly available data in 2018 (Google Dataset Search), and Kaggle hosts over 19,000 datasets on its website.

This article will review how I used publicly available data to build a machine learning model that predicts obesity rates in US cities, walking through the following steps:

  • Obtaining Data
  • Data Exploration and Cleaning
  • Feature Engineering and Model Selection
  • Conclusion

Obtaining Data

In this example, I want to build a model that predicts obesity rates in US cities. To do this, I need obesity rates by city (my target). I know that the CDC posts a variety of health statistics on their website, so this is where I go first. After browsing the website I see that there is a lot of information and quickly realize that obtaining data takes time! It is easy to underestimate how long this will take, so plan accordingly. Here are a few tips:

  1. Be realistic about how long it will take you. For this project it took me a couple of days (in my spare time) to figure out what model I wanted to create and then see if the data were available. If you don't know where to find the data you're looking for, devote at least three hours to searching.
  2. Document your steps. It is easy to go down a click-hole and have a million tabs open. Document any datasets or websites you come across that you *might* want to use. There’s nothing worse than realizing you had what you wanted all along and having to re-trace your steps or dig through browsing history.
  3. Understand the data. Each dataset will have some caveats, and it'll be up to you to understand them. This is a time-consuming but important step so that you are confident in whatever model you build.
  4. Take quick notes about the datasets. Providing some details will help you identify datasets of interest more easily and also let you take a step back and think about everything as a whole. Definitely note any caveats you found in step 3.

After more time than expected, I find the information I want, which is obesity prevalence rates by US city, in the 500 Cities project, a collaboration between the CDC, CDC Foundation and the Robert Wood Johnson Foundation. A major caveat to this dataset is that it only includes the 500 largest cities.

For the features, my hunch is that demographic data can predict obesity rates, so I naturally go to the Census website. After A LOT of digging, I finally find what I am looking for using the US Census Bureau QuickFacts tool. I downloaded a series of files that needed to be combined, so I used the glob library in Python:

import glob
import pandas as pd

# Combine the individual QuickFacts CSV exports into one dataframe (join aligns on the index)
glob_list = glob.glob("data/census/raw/*.csv")
frame = pd.DataFrame()
for file in glob_list:
    df = pd.read_csv(file)
    frame = frame.join(df, lsuffix='Fact', rsuffix='Fact', how='right')

All in all, obtaining the data took me three days of on-and-off searching.

Data Exploration and Cleaning

You should have done some data exploration when trying to find your datasets, but you will need to do more rigorous exploratory data analysis (EDA) to truly figure out if the data will work for you. Since it is not the focus of this article, I suggest looking at the many Medium posts about EDA. Things to look out for in publicly available data, however, include:

  1. How are the data coded? For example, is income provided in ranges or as exact values?
  2. How are missing and small values coded? Sometimes small values are not provided for data privacy reasons and are instead given a code.
  3. How were the variables created or collected, and in what year?
  4. How is your data being read in? Mixed data types can alert you to idiosyncrasies in a variable.
  5. Are there outliers? Look at a distribution.
  6. Does the data you found need to be combined? If so, how? I am combining the CDC and Census data by city and state, so I need to make sure that the city and state names in both datasets line up before they can be merged (see the sketch after this list).
  7. Do your data need to be standardized or transformed?
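
To make items 4 and 6 concrete, here is a minimal sketch of those two checks. It assumes cities is the CDC dataframe and frame is the combined Census dataframe from the earlier snippet; the column names (PlaceName, StateAbbr, City, State) are placeholders for illustration, not the actual column names in either dataset.

# Item 4: columns read in as "object" often hide mixed types,
# e.g. numbers stored alongside text codes for suppressed values.
print(cities.dtypes[cities.dtypes == "object"])

# Item 6: normalize the join keys so city and state names match across datasets.
# The column names here are hypothetical.
cities["city_key"] = cities["PlaceName"].str.strip().str.lower()
cities["state_key"] = cities["StateAbbr"].str.strip().str.upper()
frame["city_key"] = frame["City"].str.strip().str.lower()
frame["state_key"] = frame["State"].str.strip().str.upper()

combined = cities.merge(frame, on=["city_key", "state_key"], how="inner")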

The data from the Census contain codes for small values, which I set to 0. I also used quantile-based discretization to split the obesity prevalence rates from the 500 Cities data into three levels (low, medium, and high) so that I can treat this as a classification problem. I did this with the following code:

# cities is the CDC dataframe
cities["OBESITY_cut"] = pd.qcut(cities["OBESITY_AdjPrev"], 3, labels=["low", "medium", "high"])
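
The small-value cleanup mentioned above isn't shown, so here is a minimal sketch of the idea; the placeholder code "Z" is an assumption, and you should use whatever codes are documented for the files you downloaded.

# Replace the placeholder code used for suppressed/small values with 0
# ("Z" is a stand-in for the actual code in the downloaded files).
frame = frame.replace("Z", 0)

# Sanity check on the discretization: qcut should produce three roughly equal-sized classes.
print(cities["OBESITY_cut"].value_counts())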

Feature Engineering and Model Selection

Once the data are cleaned, it is time for feature engineering and model selection. Again, there are many Medium posts about this, but depending on the model(s) used (e.g., linear regression), you will need to keep in mind:

  1. What is the purpose of building the model (e.g., inference or prediction)?
  2. Are there any assumptions made by the model that your data will have to adhere to?
  3. Do you have any domain knowledge to inform variable selection?
  4. Can you perform lasso regression or look at feature importances with a random forest? (A lasso sketch follows this list.)
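
As a minimal illustration of the lasso idea in item 4 (my sketch, not the project's code), an L1-penalized logistic regression drives the coefficients of uninformative features to zero, which doubles as a form of feature selection:

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# The L1 (lasso-style) penalty zeroes out the coefficients of weak features;
# scaling first keeps the penalty comparable across features.
lasso_clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
)
lasso_clf.fit(X_train, y_train)
print(lasso_clf.named_steps["logisticregression"].coef_)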

In the Census data I downloaded there are many variables that are almost perfectly correlated with each other (e.g., a series of population variables), so I removed the near duplicates and was left with the set of features I used for modeling. I tried logistic regression, random forest, and gradient boosting classifiers; random forest yielded the highest accuracy score of 66%. I used GridSearchCV to tune my hyperparameters.
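
The near-duplicate removal isn't shown here; a minimal sketch of one common approach (mine, not necessarily the project's) drops one column from each highly correlated pair, where X stands in for the feature dataframe and the 0.95 threshold is an arbitrary choice:

import numpy as np

# Keep only the upper triangle of the absolute correlation matrix, then drop
# one feature from every pair whose correlation exceeds the threshold.
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
X = X.drop(columns=to_drop)

The grid search over the random forest hyperparameters is shown below.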

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV

rf = RandomForestClassifier()
param_grid = {
    'bootstrap': [True],
    'max_depth': [80, 90, 100, 110],
    'max_features': [2, 3],
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [8, 10, 12],
    'n_estimators': [100, 200, 300, 1000]
}
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid,
                           cv=3, n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)
rf_pred = grid_search.predict(X_test)
print(grid_search.score(X_test, y_test))
print(grid_search.best_params_)
print(classification_report(y_test, rf_pred))

I then calculated the permutation importance of each variable and saw that "Median gross rent" and "Per capita income" are the most important features in my model.
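
The permutation importance calculation isn't shown above; a minimal sketch using scikit-learn's permutation_importance on the fitted grid search (assuming X_test is a dataframe with named columns) might look like this:

from sklearn.inspection import permutation_importance

# Shuffle each feature on the test set and measure the drop in accuracy;
# a larger drop means the model relies more heavily on that feature.
result = permutation_importance(grid_search.best_estimator_, X_test, y_test,
                                n_repeats=10, random_state=42)
for name, importance in sorted(zip(X_test.columns, result.importances_mean),
                               key=lambda pair: pair[1], reverse=True):
    print(f"{name}: {importance:.3f}")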

Conclusion

Using publicly available data from the CDC and the Census Bureau, I was able to build, from beginning to end, a random forest classifier that predicts obesity rates in US cities with 66% accuracy. No matter which model you choose, the most important first step is finding and obtaining good data, and this takes time. As the saying goes, "garbage in, garbage out." Once you master this skill, you can start building models with publicly available data! Comment below with your experience finding and using publicly available data. I can also be reached on LinkedIn.

Up next: My model's 66% accuracy is OK, but I have an idea for improving it: using Twitter sentiment as an additional feature. Stay tuned for how I do it in my next article!
