Simple Applications of Multiple Regression Models — Part 2: Can we Predict World Happiness?

Manali Shinde
One Datum At A Time
10 min read · Apr 11, 2018

Hello Readers!

Welcome back to my blog — this is going to be a bit of a lengthy one but I hope you stay for the full ride because it’s also a super interesting one!

In this case study, I am going to explore sklearn’s approach to constructing a predictive model and drawing conclusions from correlations. The libraries I’m using are listed below. While I haven’t included all the code for my visuals (it is a bit too long!), you can find it on my Kaggle kernel here: https://www.kaggle.com/mshinde10/predicting-world-happiness

Python Libraries Used

Thus, let us dig in — Can we determine world happiness?

Executive Summary

On March 20th, the world celebrates the International Day of Happiness. On this day in 2017, the UN also released the World Happiness Report, a ranking of which countries in the world could be considered “happy”. The report covers 155 countries across every continent. The ranking is widely followed across the globe, as it could be an indication of the quality of a country’s policy-making. Experts around the world (in economics, psychology, and foreign affairs) have noted that these scores may be a good indication of a country’s progress, but of course they are not the be-all and end-all of it.

Now, you may be wondering, how exactly can happiness be determined? The UN looked at seven different variables and surveyed the population in order to construct an overall “happiness score”. The score is based on data from the Gallup World Poll. Individuals were asked to rate their current life on a scale of 0 to 10, with 10 being the best possible life for them and 0 being the worst possible life. Each variable was then normalized and compared against Dystopia, a hypothetical country that has the lowest score in every category and therefore a lower score than the lowest-ranking country. While we cannot determine a conclusive impact of these variables on the happiness score, we can use them to get a better understanding of why some countries are ranked higher than others.

Using these seven variables, we can attempt to construct a linear regression model that may help us predict the happiness score of the 155 countries. We can then compare the predicted score to the actual score to see how accurate our model is. While we’re using these variables just to build an understanding of the ranking, we can still see which variables are highly correlated with happiness score and whether there are any differences in these variables between 2015–2017. Below are the questions that were asked and answered in this report, and an explanation of the variables used in the World Happiness Report.

Questions To Ask:

1) How can the Data be described?
2) Can we make a prediction model that will help us predict happiness score based on multiple variables?

Variables Used in Data:

Country: the country in question
Region: the region that the country belongs to (different from continent)
Economy: GDP per capita of the country; individuals rank their quality of life based on the amount they earn
Family: quality of family life, nuclear and joint family
Health: ranking of healthcare availability and average life expectancy in the country
Freedom: how much individuals are able to conduct themselves based on their free will
Trust: trust in the government to not be corrupt
Generosity: how much their country is involved in peacekeeping and global aid
Dystopia Residual: the Dystopia happiness score (1.85), i.e. the score of a hypothetical country that ranks lower than the lowest-ranking country on the report, plus the residual value of each country (the part left over from the normalization of the variables that cannot be explained)

Personal Note: The reasoning for the dystopia residual variable is unclear in the research, but my understanding is that it helps us compare a country’s happiness score against a hypothetical worse score in order to see where it ranks on the report.

1. Preparing and Describing the Data

First, we begin by importing the data into a Jupyter Notebook with the pandas library. The data is open-source data from Kaggle (link below). It comes as three separate files, in case someone wants to analyze the three years separately. However, I decided that it would be interesting to look at the data from a holistic point of view. Therefore, once I had imported the data and removed any columns that I felt were unnecessary for THIS analysis (standard error, region, etc.), I used pd.concat to combine all three data frames and look at the overall happiness rank across the past three years.
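The combining step described above can be sketched roughly as follows. This is only a minimal illustration under assumptions: the three small data frames stand in for the yearly Kaggle files (which would normally be loaded with pd.read_csv), the values and country names are made up, and combine_years is a hypothetical helper name, not something from the original notebook.

```python
import pandas as pd

# Made-up stand-ins for the three yearly files; real data would come from pd.read_csv(...).
df_2015 = pd.DataFrame({
    "Country": ["Country A", "Country B"],
    "Region": ["Region X", "Region Y"],
    "Happiness Score": [7.5, 5.4],
    "Standard Error": [0.03, 0.05],
})
df_2016 = pd.DataFrame({
    "Country": ["Country A", "Country B"],
    "Region": ["Region X", "Region Y"],
    "Happiness Score": [7.4, 5.5],
})
df_2017 = pd.DataFrame({
    "Country": ["Country A", "Country B"],
    "Happiness Score": [7.6, 5.3],
})


def combine_years(frames, drop_cols=("Region", "Standard Error")):
    """Drop columns not needed for the analysis, then stack the yearly frames."""
    # errors="ignore" lets us drop a column only where it actually exists.
    cleaned = [frame.drop(columns=list(drop_cols), errors="ignore") for frame in frames]
    # ignore_index=True gives the combined frame a fresh 0..n-1 index.
    return pd.concat(cleaned, ignore_index=True)


happiness = combine_years([df_2015, df_2016, df_2017])
print(happiness)
```

Note that pd.concat aligns on column names, so the yearly files need matching column names before they are stacked; otherwise the mismatched columns end up filled with missing values.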

This is the data at a glance. As we can see, the top 5 countries with the highest scores from 2015–2017 are Switzerland, Iceland, Denmark, Norway, and Canada (woo!).

What does the data look like from a statistical perspective? Well, the average score that a country received is around 5.37, the highest was 7.58, and the lowest was 2.69. The maximum normalized value for each category was between 0.8 and 1.7, and the minimum was around 0. While some individuals were very content with their country, others were not so much.

Happiness data head and a description of the Happiness data
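For readers following along, a head-and-describe view like the one captioned above can be produced with two pandas calls. The sketch below uses a tiny made-up data frame (the country names and numbers are placeholders, not values from the report) just to show the shape of the calls.

```python
import pandas as pd

# Made-up placeholder data; the real frame holds the combined 2015-2017 rows.
happiness = pd.DataFrame({
    "Country": ["Country A", "Country B", "Country C", "Country D"],
    "Happiness Score": [7.5, 5.4, 2.7, 6.1],
    "Economy": [1.4, 0.9, 0.1, 1.1],
})

# Highest-scoring rows first, then keep the top two.
top = happiness.sort_values("Happiness Score", ascending=False).head(2)

# describe() reports count, mean, std, min, quartiles and max for numeric columns.
summary = happiness.describe()

print(top)
print(summary)
```

The mean, min and max rows of the describe() output are where figures such as the average, lowest and highest happiness score come from.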

This visual gives us a more appealing view of where each country is placed in the World ranking report.

How to read the map: the darker colored countries (purple to blue) have the highest rating on the report (i.e. are the “happiest”), while the lighter colored countries have a lower ranking. We can clearly see that countries in Europe and the Americas generally rank higher than those in Asia and Africa.

Personal Note: for my code on the graphs please visit my Kaggle kernel linked in the introduction.

Once again, this graph shows us the countries ranked by score. The countries that are a darker red have a higher score (and thus are ranked higher), while the countries with a lighter shade have a lower score.

How are the two correlated? This may be an obvious question, but let us look at the relationship between rank and score.

This graph may be a little confusing at first, but let’s dissect it. Since happiness score determines how the country is ranked, we place happiness score as our predictor and happiness rank as the dependent variable. The lower the happiness score (less than 5), the higher the numerical rank, which means the country is placed lower on the World Happiness Report. That is, if a country’s score is 4, it will be placed at around 147 = low population happiness. The opposite is also true: the higher the score, the lower the numerical rank, and the higher the country’s position in the report. Therefore, happiness score and happiness rank have a strong negative correlation (as score increases, rank decreases), which in this case is what you are aiming for.
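The negative relationship between score and rank can also be checked numerically. Here is a small sketch with made-up scores; the rank is derived from the score in the same direction as the report (rank 1 for the highest score), and the Pearson correlation is computed with pandas.

```python
import pandas as pd

# Made-up scores for six hypothetical countries.
scores = pd.Series([7.5, 6.8, 6.1, 5.4, 4.2, 2.7], name="Happiness Score")

# Rank 1 goes to the highest score, as in the report.
ranks = scores.rank(ascending=False).astype(int).rename("Happiness Rank")

# Pearson correlation between score and rank (negative: score up, rank number down).
score_rank_corr = scores.corr(ranks)
print(round(score_rank_corr, 3))
```

Because rank is just a reordering of score, the correlation is strongly negative by construction, which is why rank adds little as a predictor.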

However, happiness rank does not really tell us anything other than where the country was placed in that year. The important deciding variable seems to be score (as it determines rank). Thus, moving forward, let us explore the correlation between the other predictors (economy, family, trust, etc.) and happiness score, and then try to construct a model that predicts score based on these variables.

2. Constructing a Predictive Model for Happiness Score (2015–2017)

To proceed with finding a predictive model, we first drop rank from our data frame as it does not really tell us anything important in the model itself. Then, we want to get an overall idea of which variables are correlating with each other strongly. Since our focus is happiness score, let’s concentrate on that column.

The darker red the square, the stronger the positive correlation, and obviously each variable has a correlation of 1 with itself. We can see that happiness score is strongly correlated with economy and health, followed by family and freedom. Thus, we should see that reflected in our model when we find the coefficients. While trust and generosity do not have a strong positive correlation with happiness score, we can see that they do have a negative correlation with it, so it would be beneficial to keep these variables in our model as well.

Correlation Heat Map
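The numbers behind a heat map like this come from a correlation matrix. The following is a minimal sketch on randomly generated toy data (three made-up component columns and a score built as their sum, which is an assumption for illustration only), showing how to get the matrix and pull out the column for happiness score.

```python
import numpy as np
import pandas as pd

# Toy data: random component values, NOT the real report data.
rng = np.random.default_rng(42)
toy = pd.DataFrame(
    rng.uniform(0, 1.5, size=(150, 3)),
    columns=["Economy", "Health", "Generosity"],
)
# Assumption for illustration: the score is the sum of the components.
toy["Happiness Score"] = toy.sum(axis=1)

# Pairwise Pearson correlations between all numeric columns.
corr_matrix = toy.corr()

# Focus on how each variable correlates with the happiness score.
score_corr = (
    corr_matrix["Happiness Score"]
    .drop("Happiness Score")
    .sort_values(ascending=False)
)
print(score_corr)
```

To draw the matrix as a heat map, one option is seaborn.heatmap(corr_matrix, annot=True); the diagonal is always 1 because every variable is perfectly correlated with itself.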

Moving on, now that we have a bit of an understanding of the relationship between variables, we can start to use sklearn to construct a model. First, we drop any categorical variables, as well as the happiness rank, since that is not something we are exploring in this report. (That being said, we could create dummy variables to look at relationships for countries.)

Then, we import sklearn’s linear regression to create something similar to a “line of best fit” for our variables. We can find the intercept and our coefficients (if you need a recap on these terms, you can check out my previous article on basic linear regression here).

Personal Note: The link that I have provided in the code is a link to the tutorial I have used to help me construct this model. You can click on the link at the end of this article.

Here we have our model. In order to get a better view of our coefficients, I’ve organized them in a data frame, so we can see each variable alongside the coefficient with which it affects our dependent variable (happiness score).

Using the model’s predict method, we can predict the happiness scores for the first 100 rows of our data. How do these predictions compare to the actual values in our data?
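Fitting the model and laying out the coefficients in a data frame can be sketched like this. The data here is randomly generated toy data, and the target is built as the rounded sum of the seven components, which is an assumption made so the example runs on its own; it is not the original notebook code.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

features = ["Economy", "Family", "Health", "Freedom",
            "Trust", "Generosity", "Dystopia Residual"]

# Toy predictors: random values, NOT the real report data.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.uniform(0, 1.5, size=(200, len(features))), columns=features)

# Assumption for illustration: score = sum of components, rounded to 3 decimals.
y = X.sum(axis=1).round(3)

# Fit an ordinary least squares model.
model = LinearRegression()
model.fit(X, y)

# One row per variable, with its fitted coefficient.
coef_table = pd.DataFrame({"Coefficient": model.coef_}, index=features)

print("Intercept:", model.intercept_)
print(coef_table)
```

With a target constructed this way, every coefficient comes out close to 1 and the intercept close to 0, which gives a useful reference point when reading a coefficient table for this kind of data.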

Here, we have a plot of our actual happiness score versus the predicted happiness score. You can see that our model is a pretty good indicator of the actual happiness score! There are very small residuals, and there is a strong positive correlation between the two variables.
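A comparison of actual and predicted values for the first 100 rows could look roughly like the sketch below. It again uses randomly generated toy data with a score built as the rounded sum of the components (an assumption for illustration), and a hypothetical comparison data frame to hold the residuals.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

features = ["Economy", "Family", "Health", "Freedom",
            "Trust", "Generosity", "Dystopia Residual"]

# Toy data: random components, score assumed to be their rounded sum.
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.uniform(0, 1.5, size=(300, len(features))), columns=features)
y = X.sum(axis=1).round(3)

model = LinearRegression().fit(X, y)

# Predict the score for the first 100 rows only.
predicted = model.predict(X.iloc[:100])

# Put actual and predicted side by side; residual = actual - predicted.
comparison = pd.DataFrame({
    "Actual": y.iloc[:100].to_numpy(),
    "Predicted": predicted,
})
comparison["Residual"] = comparison["Actual"] - comparison["Predicted"]

print(comparison.head())
```

Plotting comparison["Actual"] against comparison["Predicted"] gives the kind of scatter described above; points on the diagonal are perfect predictions. Note that these rows were also used to fit the model, so this checks fit rather than performance on unseen data.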

To do further testing on our model, we can look at the mean absolute error. This is the average absolute difference between the predicted and the actual values; the lower the score, the better our model is at making predictions. As we can see, the score for our model when all our variables are involved is very low, 8.18 x 10e-8.

Is this still the case, however, if we use just one variable? Perhaps we do not need all seven variables to predict happiness score. When we find the mean absolute error for that model, we can see that the score is around 0.77. This isn’t too bad, and we could use just one or two variables if we wanted to, but the full model gives a much better predicted happiness score, so it is probably better to use the former than the latter.
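The full-model versus single-variable comparison can be sketched as follows, using sklearn’s mean_absolute_error. As before, this runs on randomly generated toy data with the score assumed to be the rounded sum of the components, and "Economy" is picked as the single predictor purely as an example, so the error values will not match the ones quoted above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

features = ["Economy", "Family", "Health", "Freedom",
            "Trust", "Generosity", "Dystopia Residual"]

# Toy data: random components, score assumed to be their rounded sum.
rng = np.random.default_rng(2)
X = pd.DataFrame(rng.uniform(0, 1.5, size=(300, len(features))), columns=features)
y = X.sum(axis=1).round(3)

# Model 1: all seven variables.
full_model = LinearRegression().fit(X, y)
full_pred = full_model.predict(X)
mae_full = mean_absolute_error(y, full_pred)

# Model 2: a single variable (Economy chosen only as an example).
X_single = X[["Economy"]]
single_model = LinearRegression().fit(X_single, y)
mae_single = mean_absolute_error(y, single_model.predict(X_single))

# MAE by hand: the mean of |actual - predicted|.
mae_manual = np.mean(np.abs(y.to_numpy() - full_pred))

print("MAE, all variables:", mae_full)
print("MAE, one variable: ", mae_single)
```

The hand-computed value matches mean_absolute_error, which makes the definition concrete: average the absolute gaps between actual and predicted scores.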

Phew! After all this, we finally have a model that we can perhaps use:

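Happiness Score = 0.0001289 + 1.000041 × Economy + 1.000005 × Family + 0.999869 × Health + 0.999912 × Freedom + 1.000020 × Trust + 1.000006 × Generosity + 0.999972 × Dystopia Residual

Note: every coefficient is very close to 1 and the intercept is very close to 0. That is expected, because the happiness score in this data set is essentially the sum of these seven components, which is also why the mean absolute error of the full model is so small.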
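To see the equation in action, here is a small sketch that plugs component values into it by hand. The intercept and coefficients are the ones quoted in the equation; the component values for the example country are made up, and predict_score is a hypothetical helper name.

```python
# Intercept and coefficients as quoted in the model equation.
INTERCEPT = 0.0001289
COEFFICIENTS = {
    "Economy": 1.000041,
    "Family": 1.000005,
    "Health": 0.999869,
    "Freedom": 0.999912,
    "Trust": 1.000020,
    "Generosity": 1.000006,
    "Dystopia Residual": 0.999972,
}


def predict_score(components):
    """Apply the linear model: intercept + sum(coefficient * value)."""
    return INTERCEPT + sum(
        COEFFICIENTS[name] * value for name, value in components.items()
    )


# Made-up component values for a hypothetical country.
example_country = {
    "Economy": 1.4,
    "Family": 1.3,
    "Health": 0.8,
    "Freedom": 0.6,
    "Trust": 0.3,
    "Generosity": 0.3,
    "Dystopia Residual": 2.3,
}

print(predict_score(example_country))
print(sum(example_country.values()))
```

For these made-up values the prediction lands within a small fraction of a point of the plain sum of the components (7.0), since every coefficient is so close to 1.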

Conclusions

In conclusion, the happiness score can be predicted using the model above. By using sklearn, we have built a preliminary machine learning tool that will help us generate country scores, and the higher the score, the more highly ranked the happiness of that country will be. Of course, there are always further tools and analyses you can apply to this model in order to make it more accurate and easier to use. It would be beneficial to further explore a comparison between the three years in our report, and also to look at comparisons between subcontinents. Still, we have a pretty good starting point for investigating this data further.

Hopefully this overview helped those of you who are pretty new to predictive modelling (a little like me) and showed how to use the tools that sklearn provides to conduct your own analysis. For more information, be sure to check out the links below. Tune in next time for another interesting case study!

Happy Coding! :)

References


Manali Shinde
One Datum At A Time

A health informatician and aspiring health data analyst. I am a photographer, writer, dancer, and public health advocate. Join me on my journey!