Stepwise Regression & Factor Analysis of World Happiness Data Using SPSS

Victor Hew
Aug 14, 2024

Stepwise Regression Analysis of World Happiness

Countries increasingly focus on boosting citizens’ happiness. However, happiness is a broad and complex concept because it depends on how people define it, and it is shaped by many factors. Tavor et al. (2018) found that gross domestic product (GDP) per capita was the most significant positive factor affecting happiness across all populations. López-Ruiz et al. (2021) showed that freedom, a positive attitude towards neighbours and family support positively affected happiness in the Spanish population. Song et al. (2023) identified life expectancy as a positive determinant of happiness among Asian adults. Ma et al. (2022) found that government satisfaction mediated the positive relationship between the corruption perception index (CPI) and happiness in the Chinese community. This study therefore investigates the extent to which each of these factors predicts happiness.

Problem Statement

The problem addressed here is the complexity of predicting happiness from factors such as GDP, social support, health and corruption, because very few studies have predicted happiness from factors spanning different domains like society, economy and government. In addition, this study includes CPI, which is analysed far less often than social and economic factors, so it is worth determining how well CPI predicts happiness.

Research Objectives

Four objectives are outlined below:

  1. To determine whether GDP per capita would significantly and positively predict happiness.
  2. To determine whether social and family support, generosity and global sadness index would significantly and positively predict happiness.
  3. To determine whether life expectancy and freedom would significantly and positively predict happiness.
  4. To determine whether CPI and government trust would significantly and positively predict happiness.

Four corresponding hypotheses are developed below:

H1: GDP per capita would significantly and positively predict happiness.

H2: Social and family support, generosity and global sadness index would significantly and positively predict happiness.

H3: Life expectancy and freedom would significantly and positively predict happiness.

H4: CPI and government trust would significantly and positively predict happiness.

Analysis Results and Interpretations

Dataset Used

The dataset revolves around the happiness scores of different countries and the possible associated variables such as GDP per capita, family, life expectancy, freedom, generosity, government trust, global sadness index, social support, and CPI (see Figures 1 & 2).

Figure 1: The first 31 rows of the dataset used (Part 1)
Figure 2: The first 31 rows of the dataset used (Part 2)

Assumption Testing

Normality

The histogram of standardized residuals approximates a normal curve (see Figure 3), and the residual points in the normal probability (P-P) plot fall very close to the diagonal line (see Figure 4). The Kolmogorov-Smirnov test is used instead of the Shapiro-Wilk test because the sample size exceeds 50. The test shows no significant difference between the sample distribution and a normal distribution, Kolmogorov-Smirnov (792) = .029, p = .162 (see Figure 5). As the p-value is greater than .05, the assumption of normality is met.

Figure 3: Histogram of Standardized Residuals
Figure 4: P-P Plot of Standardized Residuals
Figure 5: Tests of Normality

Homoscedasticity

In the scatterplot of regression standardized residuals against regression standardized predicted values, the data points are spread uniformly and concentrated between -3 and +3 on the standardized residual axis. This indicates a constant variance of the error terms at each level of the predictors, so the assumption of homoscedasticity is met (see Figure 6).

Figure 6: Scatterplot of Regression Standardized Residual Against Regression Standardised Predicted Value

Independence of Error

The Durbin-Watson score is 1.36, which falls outside the acceptable range of 1.5 to 2.4 (see Figure 7). A value this far below 2 suggests positive correlation between adjacent residuals. Thus, the assumption of independence of errors is violated.

Figure 7: Durbin-Watson Score
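
For readers who want to see what the Durbin-Watson score measures, the sketch below computes it from a list of residuals in plain Python. The residual values are hypothetical and are not taken from the happiness dataset; the function name `durbin_watson` is likewise just illustrative.

```python
def durbin_watson(residuals):
    """Return the Durbin-Watson statistic for residuals in observation order."""
    if len(residuals) < 2:
        raise ValueError("need at least two residuals")
    # Numerator: squared differences between consecutive residuals.
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    # Denominator: sum of squared residuals.
    denominator = sum(e ** 2 for e in residuals)
    if denominator == 0:
        raise ValueError("residuals must not all be zero")
    return numerator / denominator


# Hypothetical residuals, NOT from the happiness dataset.
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]  # signs flip: negative autocorrelation
trending = [1.0, 0.9, 0.8, -0.8, -0.9, -1.0]     # neighbours similar: positive autocorrelation

print(round(durbin_watson(alternating), 2))  # well above 2
print(round(durbin_watson(trending), 2))     # well below 2
```

Values near 2 indicate little correlation between adjacent residuals, values well below 2 indicate positive correlation, and values well above 2 indicate negative correlation.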

Linearity

The relationships between happiness and GDP per capita, life expectancy, freedom, CPI and government trust can each be captured by a straight line, so the assumption of linearity is met for these predictors (see Figures 8, 9, 10, 11 & 12). However, the relationships between happiness and family, generosity, global sadness index and social support cannot be captured by a straight line, so the assumption of linearity is violated for these predictors (see Figures 13, 14, 15 & 16).

Figure 8: Scatterplot of Happiness Against GDP per Capita
Figure 9: Scatterplot of Happiness Against Life Expectancy
Figure 10: Scatterplot of Happiness Against Freedom
Figure 11: Scatterplot of Happiness Against CPI
Figure 12: Scatterplot of Happiness Against Government Trust
Figure 13: Scatterplot of Happiness Against Family
Figure 14: Scatterplot of Happiness Against Generosity
Figure 15: Scatterplot of Happiness Against Global Sadness Index
Figure 16: Scatterplot of Happiness Against Social Support

Multicollinearity

Figure 17 shows that the variance inflation factor (VIF) of every predictor in the final model is below the maximum threshold of ten, and that every tolerance value is above .10. Therefore, the assumption of no multicollinearity among the predictors is met.

Figure 17: VIF Score & Tolerance Score
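
The two thresholds above are two views of the same quantity. As a minimal sketch, assuming `r_squared` is the R² obtained by regressing one predictor on all the other predictors, tolerance is 1 - R² and VIF is its reciprocal. The function names and the R² values below are hypothetical, not values from Figure 17.

```python
def tolerance(r_squared):
    """Share of a predictor's variance NOT explained by the other predictors."""
    return 1.0 - r_squared


def vif(r_squared):
    """Variance inflation factor: the reciprocal of tolerance."""
    return 1.0 / tolerance(r_squared)


# Hypothetical R-squared values for one predictor regressed on the others.
for r2 in (0.10, 0.50, 0.90):
    print(f"R2={r2:.2f}  tolerance={tolerance(r2):.2f}  VIF={vif(r2):.2f}")
```

A VIF of ten corresponds to a tolerance of .10, which is why the two cut-offs always agree.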

Inferential Statistics & Hypotheses Testing

Figure 18 shows significant strong positive correlations between GDP per capita and happiness, r (790) = .793, p < .001, between life expectancy and happiness, r (790) = .754, p < .001, and between CPI and happiness, r (790) = .693, p < .001. Significant moderate positive correlations exist between freedom and happiness, r (790) = .544, p < .001, and between government trust and happiness, r (790) = .455, p < .001. However, the positive correlations are significant yet very weak between social support and happiness, r (790) = .193, p < .001, between global sadness index and happiness, r (790) = .174, p < .001, and between family and generosity and happiness, r (790) = .155, p < .001. Therefore, social support, global sadness index, family and generosity are excluded from further analysis.

Figure 18: Correlations Table
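
With 790 degrees of freedom, even a very weak correlation can be statistically significant. The sketch below applies the usual t-test for a Pearson correlation, t = r * sqrt(df / (1 - r^2)), to the weakest and strongest coefficients reported above. The function name `t_for_r` is illustrative only, and SPSS reports the p-values directly rather than these t-values.

```python
import math


def t_for_r(r, df):
    """t statistic for testing whether a Pearson correlation differs from zero."""
    return r * math.sqrt(df / (1.0 - r ** 2))


df = 790  # degrees of freedom reported with each correlation (n - 2)

print(round(t_for_r(0.155, df), 2))  # weakest reported r, about 4.41
print(round(t_for_r(0.793, df), 1))  # strongest reported r, about 36.6
```

Both values are far beyond the two-tailed critical value of roughly 1.96 at the .05 level, which is why significance alone is not used to decide which predictors to keep; the size of r is considered as well.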

The overall model of GDP per capita, freedom, life expectancy and government trust significantly predicts happiness, F (4, 787) = 581.94, p < .001, explaining 74.7% of the variance in happiness, R² = .75 (see Figures 19 & 20). As the p-value is less than .05, the model predicts happiness significantly better than a model with no predictors.

Figure 19: Model Summary
Figure 20: ANOVA Table
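
The F statistic and R² in the two tables are linked by F = (R² / k) / ((1 - R²) / (n - k - 1)). The sketch below recomputes F from the rounded figures in the text, taking k = 4 predictors and n = 792 cases (implied by the residual degrees of freedom, 787 = n - k - 1). The function name is hypothetical.

```python
def f_from_r2(r2, k, n):
    """Overall F statistic of a regression with k predictors and n cases."""
    explained_per_df = r2 / k
    unexplained_per_df = (1.0 - r2) / (n - k - 1)
    return explained_per_df / unexplained_per_df


# R-squared as reported in the text (rounded to three decimals).
print(round(f_from_r2(0.747, k=4, n=792), 1))  # about 580.9
```

The result is close to the reported 581.94; the small gap comes from using R² rounded to three decimals.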

Based on Figure 21, GDP per capita is a significant predictor of happiness, after controlling for freedom, life expectancy and government trust, b = 1.38, 95% CI [1.21, 1.54], t (787) = 16.32, p < .001, sr = .29. Freedom is a significant predictor of happiness, after controlling for GDP per capita, life expectancy and government trust, b = 1.82, 95% CI [1.50, 2.13], t (787) = 11.28, p < .001, sr = .20. Life expectancy is a significant predictor of happiness, after controlling for freedom, GDP per capita and government trust, b = 1.30, 95% CI [1.03, 1.58], t (787) = 9.30, p < .001, sr = .17. Government trust is a significant predictor of happiness, after controlling for freedom, life expectancy and GDP per capita, b = 0.92, 95% CI [0.49, 1.34], t (787) = 4.23, p < .001, sr = .08. The regression equation is:

Happiness = 2.46 + 1.82 (Freedom) + 1.38 (GDP per capita) + 1.30 (Life Expectancy) + 0.92 (Government Trust)
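
As a small illustration, the regression equation can be written as a function. The coefficients are those reported above; the input values are hypothetical, because the measurement scale of each predictor in the dataset is not described here.

```python
def predict_happiness(freedom, gdp_per_capita, life_expectancy, government_trust):
    """Predicted happiness score from the fitted regression equation."""
    return (
        2.46
        + 1.82 * freedom
        + 1.38 * gdp_per_capita
        + 1.30 * life_expectancy
        + 0.92 * government_trust
    )


# Hypothetical predictor values for one country (not from the dataset).
score = predict_happiness(
    freedom=0.5, gdp_per_capita=1.0, life_expectancy=0.8, government_trust=0.2
)
print(round(score, 3))  # 5.974
```

Raising one input by one unit while holding the others fixed changes the prediction by exactly that predictor's coefficient.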

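Holding the other predictors constant, a one-unit increase in freedom is associated with a 1.82-unit increase in happiness, a one-unit increase in GDP per capita with a 1.38-unit increase, a one-unit increase in life expectancy with a 1.30-unit increase, and a one-unit increase in government trust with a 0.92-unit increase. Freedom has the largest unstandardized coefficient, but these coefficients depend on each predictor's scale, so they cannot be used to rank the predictors. Judged by the semipartial correlations, GDP per capita makes the largest unique contribution (sr = .29), followed by freedom (sr = .20), life expectancy (sr = .17) and government trust (sr = .08). In terms of unique variance explained (sr²), GDP per capita accounts for about twice as much as freedom, about three times as much as life expectancy and about thirteen times as much as government trust.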
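
The semipartial correlations (sr) reported with the coefficients give a scale-free way to compare predictors: squaring sr gives the share of variance in happiness that each predictor explains uniquely. A minimal sketch using the sr values from the text (the variable names are illustrative):

```python
# Semipartial correlations as reported with the coefficients above.
sr = {
    "GDP per capita": 0.29,
    "Freedom": 0.20,
    "Life expectancy": 0.17,
    "Government trust": 0.08,
}

# Squared semipartial correlation = unique variance explained by each predictor.
unique_variance = {name: value ** 2 for name, value in sr.items()}

largest = max(unique_variance, key=unique_variance.get)
for name, share in sorted(unique_variance.items(), key=lambda item: -item[1]):
    ratio = unique_variance[largest] / share
    print(f"{name}: {share:.1%} unique variance ({largest} is {ratio:.1f}x this)")
```

These unique shares add up to far less than the model's R² because the predictors are correlated and much of the explained variance is shared among them.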

Therefore, H1 and H3, that GDP per capita, life expectancy and freedom would be significant positive predictors of happiness, are supported. However, H4, that CPI and government trust would be significant positive predictors of happiness, is only partially supported: government trust is a significant predictor, whereas CPI is not included in any model. H2, that generosity, global sadness index and social and family support would be significant positive predictors of happiness, is not supported.

Figure 21: Coefficients Table

Conclusions & Recommendations

To conclude, GDP per capita, freedom, life expectancy and government trust are significant positive predictors of happiness, whereas social and family support, generosity, global sadness index and CPI are not. Governments can introduce socioeconomic and health initiatives to boost GDP, freedom and the life expectancy of citizens so that they can gain more public support. The study’s limitations include the violation of the assumptions of linearity and independence of errors, which may affect the reliability and validity of the results, so the results should be interpreted with caution. One possible improvement is bootstrapping, which uses repeated resampling of the data to estimate statistics such as means and standard errors more precisely.

Factor Analysis of World Happiness

Purpose for Inclusion of Continuous Variables

One reason for including continuous variables is data reduction. The dataset has nine continuous independent variables, and variables with similar attributes form complex interrelationships. Reducing the dimensionality of the data helps uncover and understand the relationships underlying groups of variables. It also yields at least one factor, which shrinks the dataset and makes the model more interpretable. Another reason is multicollinearity. The predictors in this dataset may be very strongly correlated with each other, so factor analysis can group redundant, strongly correlated predictors into factors that replace the original variables.

Reasons for Exclusion of Non-continuous Variables

One reason for excluding non-continuous independent variables concerns correlation. Factor analysis relies on Pearson’s r to compute the correlation coefficients between variables, which reveal the patterns in the data. Moreover, continuous variables are a prerequisite for checking the assumptions of multicollinearity and of similar factor structures across different sample factor solutions, so that sampling adequacy and the sufficiency of the correlations can be measured before proceeding to analyses such as communalities and eigenvalues. The second reason concerns the inherent characteristics of non-continuous variables. Such data lack the linearity and the meaningful ordering of values that continuous variables have, which can make the factor loadings hard to interpret.

SPSS Outputs

Figure 22: Correlation Matrix (Original)
Figure 23: Correlation Matrix (Without ‘Family’)
Figure 24: Kaiser-Meyer-Olkin (KMO) and Bartlett’s Test
Figure 25: Communalities
Figure 26: Eigenvalues
Figure 27: Scree Plot
Figure 28: Unrotated Component Matrix
Figure 29: Rotated Component Matrix

Grouping Continuous Independent Variables into Factors

Analysis Results and Interpretations — Part 1

Communalities

A communality, or estimated common variance, is the proportion of a variable’s variance that is shared with the other variables through the common factors; each variable should reach a minimum of 0.5. Figure 25 indicates that 85.4% of the variance in social support is shared with GDP per capita, life expectancy, freedom, government trust, CPI, generosity and global sadness index. The lowest communality belongs to government trust, with 62.5% of its variance shared with the other seven variables. Still, all variables have communalities greater than 0.5, suggesting that their variances are well explained by the common factors.
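
In a principal component solution, a variable's communality can be computed as the sum of its squared loadings on the retained factors. The sketch below shows the arithmetic with hypothetical loadings; they are not the loadings behind Figure 25.

```python
def communality(loadings):
    """Sum of squared loadings of one variable across the retained factors."""
    return sum(loading ** 2 for loading in loadings)


# Hypothetical loadings of one variable on three retained factors.
example_loadings = [0.6, 0.5, 0.3]

h2 = communality(example_loadings)
print(round(h2, 2))  # 0.7
print("meets 0.5 minimum" if h2 >= 0.5 else "below 0.5 minimum")
```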

Eigenvalue

An eigenvalue is the sum of the squared loadings of all variables on a given component and represents the portion of the total variance that the component explains. Under the latent root criterion, a factor must have an eigenvalue of at least one to be retained. Figure 26 indicates that Factor 1 has an eigenvalue of 3.158 and explains 39.48% of the total variance, Factor 2 has an eigenvalue of 1.833 and explains 22.92%, and Factor 3 has an eigenvalue of 1.148 and explains 14.35%. The eigenvalues of the subsequent factors are below one, so they are not retained. The scree plot supports this, with the curve flattening from Factor 4, the inflection point (see Figure 27). From the fourth factor onwards, unique variance outweighs common variance. Therefore, only the first three factors are retained in the analysis.
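
Because the variables are standardised, each contributes one unit of variance, so the percentage of variance explained by a factor is its eigenvalue divided by the number of variables (eight here, after dropping family). A minimal sketch that recomputes the percentages from the three reported eigenvalues and applies the latent root criterion; the fourth eigenvalue is a hypothetical placeholder, since the text only says it is below one.

```python
N_VARIABLES = 8  # variables in the analysis after 'family' was dropped

# First three eigenvalues are from the text; the fourth is HYPOTHETICAL.
eigenvalues = [3.158, 1.833, 1.148, 0.700]


def percent_variance(eigenvalue, n_variables=N_VARIABLES):
    """Percentage of total variance explained by one factor."""
    return 100.0 * eigenvalue / n_variables


# Latent root (Kaiser) criterion: retain factors with eigenvalue above one.
retained = [ev for ev in eigenvalues if ev > 1.0]

for number, ev in enumerate(retained, start=1):
    print(f"Factor {number}: eigenvalue {ev}, {percent_variance(ev):.1f}% of variance")

cumulative = sum(percent_variance(ev) for ev in retained)
print(f"Retained factors explain {cumulative:.1f}% in total")
```

Percentages recomputed from the rounded eigenvalues can differ from the SPSS output in the second decimal place.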

Analysis Results and Interpretations — Part 2

Pre-Factor Analysis — Factorability Improvement

The first factorability check is correlation. The original correlation matrix shows that the largest correlation coefficient is between family and social support (r = -.870), whose magnitude falls outside the acceptable range of 0.3 to 0.85 (see Figure 22). Therefore, family is dropped and the analysis is rerun. In the rerun correlation matrix (see Figure 23), the strongest correlations are between GDP per capita and life expectancy (r = .775), freedom and CPI (r = .483), government trust and CPI (r = .620), CPI and GDP per capita (r = .704), generosity and freedom (r = .306), and global sadness index and social support (r = -.655). Overall, all variables are adequately correlated with one another.

The second check is the measure of sampling adequacy. Figure 24 shows a KMO value of .683, which is above the minimum of 0.5. This indicates that the partial correlations among the variables are small relative to their zero-order correlations, so sampling adequacy is met. For Bartlett’s Test of Sphericity, the p-value is < .001, which is below the .05 significance level. This indicates that the correlations between variables are sufficient. Therefore, the eight variables are interrelated, and factor analysis is appropriate.
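
To make the KMO value less of a black box, the sketch below computes it for a toy case with only three variables, where each partial correlation can be written in closed form. KMO is the sum of squared correlations divided by the sum of squared correlations plus the sum of squared partial correlations. The three correlations are hypothetical and the function names are illustrative; with eight variables, as in this analysis, the partial correlations would have to control for all the other variables and are left to SPSS.

```python
import math


def partial_r(r_xy, r_xz, r_yz):
    """Correlation between x and y after removing the influence of z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def kmo_three(r12, r13, r23):
    """Overall KMO for exactly three variables."""
    partials = [
        partial_r(r12, r13, r23),  # variables 1 and 2, controlling for 3
        partial_r(r13, r12, r23),  # variables 1 and 3, controlling for 2
        partial_r(r23, r12, r13),  # variables 2 and 3, controlling for 1
    ]
    sum_r2 = sum(r ** 2 for r in (r12, r13, r23))
    sum_p2 = sum(p ** 2 for p in partials)
    return sum_r2 / (sum_r2 + sum_p2)


# Hypothetical correlations among three variables (not from Figure 23).
print(round(kmo_three(0.6, 0.5, 0.4), 3))  # about 0.658
```

The smaller the partial correlations are relative to the ordinary correlations, the closer KMO gets to 1.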

Cross-Loading

Cross-loading occurs when a variable has substantial loadings (> 0.5) on more than one factor in the unrotated component matrix. Figure 28 shows that generosity loads 0.532 on Factor 2 and 0.63 on Factor 3, which indicates a cross-loading problem for generosity across the second and third factors.
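
The cross-loading check can be expressed as a simple rule over a loading matrix. In the sketch below, the generosity loadings on Factors 2 and 3 are the ones quoted from Figure 28; every other number, and the function name, is a hypothetical placeholder.

```python
def find_cross_loadings(loadings, threshold=0.5):
    """Map each cross-loaded variable to the factors (1-based) it loads on."""
    flagged = {}
    for variable, row in loadings.items():
        # Use the absolute value so strong negative loadings also count.
        factors = [i + 1 for i, value in enumerate(row) if abs(value) > threshold]
        if len(factors) > 1:
            flagged[variable] = factors
    return flagged


unrotated = {
    # Factor 2 and 3 loadings are from the text; the Factor 1 value is hypothetical.
    "Generosity": [0.10, 0.532, 0.63],
    # Entirely hypothetical variable that loads cleanly on one factor.
    "Example variable": [0.80, 0.20, 0.10],
}

print(find_cross_loadings(unrotated))  # {'Generosity': [2, 3]}
```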

Minimizing Cross-Loading

Orthogonal rotation can reduce cross-loading. It rotates the factor axes about the origin while keeping them at right angles, so the factors remain uncorrelated. Using Varimax, which maximises the variance of the squared loadings within each factor and so simplifies the columns of the matrix, the rotated component matrix shows no cross-loading (see Figure 29). GDP per capita, life expectancy and CPI load most strongly on Factor 1, global sadness index and social support on Factor 2, and freedom, government trust and generosity on Factor 3. An alternative way to reduce cross-loading is oblique rotation, in which the axes are not kept at right angles and the factors are allowed to correlate.

References

López-Ruiz, V. R., Huete-Alcocer, N., Alfaro-Navarro, J. L., & Nevado-Peña, D. (2021). The relationship between happiness and quality of life: A model for Spanish society. PLOS ONE, 16(11), e0259528. https://doi.org/10.1371/journal.pone.0259528

Ma, J., Guo, B., & Yu, Y. (2022). Perception of official corruption, satisfaction with government performance, and subjective wellbeing — An empirical study from China. Frontiers in Psychology, 13, 748704. https://doi.org/10.3389/fpsyg.2022.748704

Song, C. F., Tay, P. K. C., Gwee, X., Wee, S. L., & Ng, T. P. (2023). Happy people live longer because they are healthy people. BMC Geriatrics, 23(1), 440. https://doi.org/10.1186/s12877-023-04030-w

Tavor, T., Gonen, L. D., Weber, M., & Spiegel, U. (2018). The effects of income levels and income inequalities on happiness. Journal of Happiness Studies, 19(7), 2115–2137. https://doi.org/10.1007/s10902-017-9911-9
