Predicting Happiness with Climate and Topography
A State-by-State Study Using Linear Regression Techniques
To what extent does air quality affect mood? Will a home in the arid Rocky Mountains fulfill a family more than one in the humid southeastern swamps? Are peppy Midwesterners actually happier than their coastal counterparts?
As an aspiring data scientist with a love for all things outdoors, I will attempt to answer some of these questions. In this study, I will discuss my data acquisition process, how exactly I define “happiness”, and the techniques I used to see how we can predict happiness from climatic and topographical features.
My overall goal is to see if we can predict how happy the people are in an American state, given that state’s topography and climate.
From my own personal experience, I love fresh air, pretty views, and somewhere to swim (or maybe just to dip my feet in if it’s cold). But I don’t enjoy tons of rain, and muggy weather makes me uncomfortable. Unsurprisingly, these preferences aren’t particularly unique.
To begin this study, I set about gathering information on each American state by scraping websites using the Python BeautifulSoup package (here is a useful tutorial on getting started with BeautifulSoup). I looked at climate features such as average humidity, temperature, rainfall, and air quality, and topography features such as mean elevation, highest point, and percent area water (how much of a state is covered by lakes, ponds, and rivers).
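As a rough sketch of what that scraping looks like, here is a minimal BeautifulSoup example. The HTML snippet below (its table `id`, column names, and values) is entirely made up for illustration; the real pages I scraped were structured differently:

```python
from bs4 import BeautifulSoup

# A tiny inline HTML table standing in for a scraped page.
# The structure and numbers here are illustrative, not the real source.
html = """
<table id="climate">
  <tr><th>State</th><th>Avg Humidity (%)</th><th>Mean Elevation (ft)</th></tr>
  <tr><td>Colorado</td><td>54</td><td>6800</td></tr>
  <tr><td>Florida</td><td>74</td><td>100</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find("table", id="climate").find_all("tr")[1:]  # skip the header row

data = {}
for row in rows:
    cells = [td.get_text() for td in row.find_all("td")]
    data[cells[0]] = {"humidity": float(cells[1]), "mean_elevation": float(cells[2])}
```

From there, each state's features can be collected into one table for analysis.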
Happiness is a subjective term that means different things depending on who you ask. In order to view it in an objective light, I focused on three studies (one conducted by WalletHub and two by Gallup) that survey people in each state in an effort to rank all 50 states by a happiness index (a score from 0–100). I averaged the rankings from these three studies to come up with my own happiness index.
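The averaging step is simple. Here is a minimal sketch with made-up scores for a few states (the real index averages all three surveys across all 50 states):

```python
# Hypothetical happiness scores (0-100) from the three surveys
# (WalletHub plus the two Gallup studies); the values are invented.
survey_scores = {
    "Minnesota": [72.0, 68.0, 70.0],
    "Colorado":  [69.0, 72.0, 71.0],
    "Louisiana": [55.0, 52.0, 53.0],
}

# My happiness index is the simple mean of a state's three survey scores.
happiness_index = {
    state: sum(scores) / len(scores) for state, scores in survey_scores.items()
}
```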
Exploratory Data Analysis
An important part of any data science project is visualizing your raw data to see if it makes sense. So I dumped all features and my happiness index (bottom row) into a correlation heat-map and did some sanity checks.
A warmer color indicates a positive relationship while a cooler color indicates a negative one. Faint colors indicate a lack of a relationship. It looks like highest point correlates positively with mean elevation and humidity correlates negatively with number of clear days. Interestingly enough, rainfall has the strongest negative correlation with happiness. Neat!
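As a sketch of this sanity check, here is how a correlation matrix like the one behind the heat-map can be computed. The data below is synthetic, deliberately constructed so that rainfall and happiness move in opposite directions; seaborn's `heatmap` would render the resulting matrix as colors:

```python
import numpy as np

# Synthetic stand-ins for 50 states' worth of features.
rng = np.random.default_rng(0)
rainfall = rng.uniform(20, 60, size=50)
humidity = 0.8 * rainfall + rng.normal(0, 5, size=50)    # built to track rainfall
happiness = 80 - 0.5 * rainfall + rng.normal(0, 3, size=50)  # built to oppose rainfall

# Rows/columns: rainfall, humidity, happiness.
corr = np.corrcoef([rainfall, humidity, happiness])
# seaborn.heatmap(corr, annot=True) would draw this as a heat-map.
```

In this toy setup, `corr[0, 1]` comes out strongly positive (humidity vs. rainfall) and `corr[0, 2]` strongly negative (rainfall vs. happiness), mirroring the patterns in my real heat-map.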
As none of the correlations between my features raised any red flags, I was happy to continue with my analysis.
Creating an Initial Model
I made a baseline linear regression model from my raw data using the Python library statsmodels. Here are some useful statistics from the model:
The R-squared and adjusted R-squared values show how much of the variability in happiness this model can explain. The p-value of the F-statistic means there was a 10.9% chance of seeing a fit at least this good even if the features had no real relationship with happiness. And the high condition number (as a rule of thumb, values above roughly 1,000 are cause for concern) indicates multicollinearity (meaning that some features have high correlation with one another).
Generally speaking, this isn’t a great model. It doesn’t explain much of the variability in happiness, there’s a decent likelihood it was discovered by chance, and the multicollinearity violates the assumption that my features are independent from one another.
We can only go up from here!
In regression analysis, it can often be useful to add interaction terms when you think that the combination of features affects the target variable as opposed to the individual features themselves. I decided to add three interaction terms: humidity with rainfall, temperature with rainfall, and mean elevation with highest point.
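Mechanically, an interaction term is just the element-wise product of two feature columns. A minimal sketch, with illustrative values:

```python
import numpy as np

# Toy columns standing in for two scraped features (values are made up).
humidity = np.array([70.0, 55.0, 62.0])
rainfall = np.array([45.0, 15.0, 30.0])

# The interaction term multiplies the two features row by row,
# letting the model capture their combined effect.
humidity_x_rainfall = humidity * rainfall
```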
It can also be useful to transform your features when they demonstrate a non-linear relationship (such as an exponential relationship) with your target variable. I observed mean elevation, percent area water, and highest point to have logarithmic relationships with happiness, which you can see below:
So I compressed each of them by taking its square root (a milder alternative to a log transform that straightens out this kind of concave relationship).
Though it may be hard to tell, you can observe how the transformed data has loosely linear relationships with happiness instead of logarithmic ones.
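The transform itself is a one-liner (elevation values below are illustrative):

```python
import numpy as np

# Mean elevation in feet for a few states (made-up values).
mean_elevation = np.array([6800.0, 100.0, 2700.0])

# The square root compresses large values much more than small ones,
# which is what straightens out a concave relationship with the target.
mean_elevation_sqrt = np.sqrt(mean_elevation)
```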
Using these newly engineered features, I created a second model. Here are the summary statistics for this model compared to the baseline model:
Wow! The R-squared values both increased significantly, and the probability of the F-statistic dropped. This is a testament to the utility of feature engineering and compelling evidence that there is a substantial relationship between climate, topography, and happiness.
However, there is still a high condition number, which means that the multicollinearity problem between my features has not been solved. It actually got worse, which makes sense since my newly engineered features would not be independent from the raw features they are composed of.
At this point, I decided to take a step back and look at the three features that were most predictive of happiness: mean elevation (transformed), air quality, and percent area water (transformed). The interaction terms didn’t really seem to help!
Keeping these terms in mind, I decided to implement regularization to further refine my model:
When creating predictive models, the practice of regularization essentially boils down to penalizing the complexity of a model in an effort to enhance its ability to generalize to unseen data. An overly complex model will predict the target variable (happiness, in my case) from in-sample data well, but will perform poorly when presented with out-of-sample data.
There are two main methods of regularization for regression problems. These are known as Lasso and Ridge regression. Lasso will help by eliminating unneeded features and Ridge will reduce collinearity (here is a useful article on the important differences between the two methods). Since I was interested in both of these effects, I tried both methods.
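As a minimal sketch of the two methods (on synthetic data with arbitrary alpha values, not my actual features):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic stand-in: 50 states, 4 features; only the first two matter.
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))
y = 70 + 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 1, size=50)

# alpha controls how strongly each method penalizes large coefficients.
lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Lasso's L1 penalty can zero out useless coefficients entirely,
# while Ridge's L2 penalty only shrinks them toward zero.
print(lasso.coef_)
print(ridge.coef_)
```

On this toy data, Lasso drives the coefficients of the two irrelevant features to (or very near) zero, while Ridge keeps all four features with shrunken weights.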
I first created Lasso and Ridge models using the three most important features mentioned above (transformed mean elevation, air quality, and transformed percent area water) along with one additional feature: the humidity × rainfall interaction term. Here is a visualization comparing the performance of the two models on this set of features:
The axes of this graph are true happiness against predicted happiness, meaning that predictions from a perfect model would fall exactly on the green line. Upon inspection, it looks like Lasso predictions are consistently better towards the edges of the distribution, but Ridge predictions are better in the middle.
The vertical distance between a given prediction point and the perfect green line is the error associated with that prediction. RMSE (root mean squared error) measures the typical size of a model’s errors: it is the square root of the average squared error across all predictions (in the graph above, roughly the typical distance a prediction falls from the perfect line). RMSE is a good metric for judging a regression model’s usefulness.
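RMSE is straightforward to compute by hand. A quick worked example with made-up happiness values:

```python
import numpy as np

# Invented true vs. predicted happiness scores for four states.
true_happiness = np.array([70.0, 65.0, 58.0, 62.0])
predicted      = np.array([68.0, 67.0, 55.0, 62.0])

# Square the errors, average them, then take the square root.
rmse = np.sqrt(np.mean((true_happiness - predicted) ** 2))
```

Here the errors are 2, -2, 3, and 0, so the RMSE is sqrt((4 + 4 + 9 + 0) / 4) ≈ 2.06.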
The RMSEs for the Lasso and Ridge models visualized above are 4.77 and 4.25, respectively. This means that for a given set of features, the Lasso prediction of happiness will typically be about 4.77 points off target, while the Ridge prediction will be about 4.25 points off.
Unfortunately, the standard deviation of the happiness scores is only around 3.4 points. So these predictions are quite off: both models do worse than simply guessing the average happiness score for every state.
In the hopes of finding a better model, I tried excluding/including various features. I found that keeping the three most important features mentioned above and replacing the humidity × rainfall feature with temperature produced significantly improved models. Here is the same visualization as above, except for models with the new set of features:
The RMSEs for the Lasso and Ridge models corresponding to the new set of features were 3.33 and 3.51.
Using the Lasso model with the latest set of features, I predicted the happiness of all 50 states using their respective climate and topography features and ranked them. Below you can see my model’s prediction for the top 15 happiest states compared to the actual top 15 happiest states:
The color red indicates that my model’s predicted ranking for the state was within 5 spots of its actual ranking.
My model made some great predictions and some terrible ones as well. What does this mean?
Perhaps I could have distilled a better model. Maybe there were interaction terms I did not try that would have been particularly useful. Maybe the set of features I used was not the best for predicting happiness. Maybe I could’ve optimized my Lasso and Ridge regressions with a wider hyper-parameter search (here is an interesting source on this topic).
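Such a search might look like the following sketch, using scikit-learn's GridSearchCV on synthetic data with an assumed (and deliberately small) alpha grid; a wider grid is exactly the kind of thing that could have been explored further:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data: 50 states, 4 features.
rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))
y = 70 + 3 * X[:, 0] + rng.normal(0, 1, size=50)

# Cross-validate over a grid of penalty strengths and keep the best one.
search = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X, y)
print(search.best_params_)
```

scikit-learn also offers LassoCV and RidgeCV, which build this alpha search in directly.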
More likely however, is that the relationship between climate, topography, and happiness exists but is not that strong. The presence of predictive capability in my model does demonstrate that there is a relationship between the features I analyzed and happiness, but it does leave much variability in happiness unexplained. To have a more robust happiness predictor, we likely would have to look at features outside of climate and topography.
My hunch is that median household income, percentage of people below the poverty line, and wealth disparity would be strong predictors for overall happiness.
Furthermore, in creating a more robust model, it could be helpful to look at countries as opposed to American states. Performing linear regression on only 50 data points (50 states) is ill-advised. Predictive models tend to perform better when trained on more data.
All things considered, this model should not be used as a happiness predictor. But it does suggest that there is some relationship between climate, topography, and happiness, and that climate and topography features deserve a place in a more expansive feature set for predicting happiness.
In the future, I will incorporate monetary and demographic features (perhaps using this Kaggle dataset), and refine my happiness metric by looking for more studies. I will also take a deeper dive into the rich landscape of feature engineering, as it provided the largest performance boost I encountered during this process.
Thanks for reading my post and I hope you enjoyed it! Feel free to send any questions or comments my way. See you next time!