Correlation of COVID-19 with 2020 Happiness Perception Report

Luis Rafael Arce
MCD-UNISON
Dec 16, 2020 · 8 min read

The recent pandemic caused by the SARS-CoV-2 virus has affected the entire world population in different ways. More than 71 million people have been confirmed infected, and more than 1.6 million have died from complications of COVID-19. The actions and procedures implemented by governments and their populations are a key factor in preventing losses.

The World Happiness Report, on the other hand, is based on a survey carried out every year from 2015 to date in countries around the world. The survey focuses on the degree of happiness perceived by the citizens of these countries. The happiness score is explained by the following factors:

• GDP per capita

• Healthy Life Expectancy

• Social support

• Freedom to make life choices

• Generosity

• Corruption Perception

• Residual error

I decided to run an analysis to visualize whether there is a relationship between some aspects of the happiness perception report and the number of people infected with COVID-19, which in turn is related to government public policies. The analysis was performed using two datasets obtained from Kaggle (https://www.kaggle.com/):

1. World Happiness Report 2020, by Michael Londeen https://www.kaggle.com/londeen/world-happiness-report-2020?select=WHR20_DataForFigure2.1.csv

2. Covid19 Daily Updates, by Gabriel Preda https://www.kaggle.com/gpreda/coronavirus-2019ncov

Analysis

The analysis was done in Python in a Jupyter Notebook.

Importing libraries

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
%matplotlib inline
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px

Importing 2020 HPR CSV

The first step was to load the CSV of the happiness perception report into a pandas DataFrame.

The dataset has information on 153 countries, with 20 columns of indicators to review.
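The notebook cell for this step is not reproduced in the post. A minimal sketch of what it might look like, assuming the file name `WHR20_DataForFigure2.1.csv` from the Kaggle link above and the variable name `datahappy` used later on. A tiny in-memory CSV with made-up values stands in for the real file so the snippet runs on its own:

```python
import io
import pandas as pd

# Stand-in for the Kaggle file; names and values are made up for illustration.
sample_csv = io.StringIO(
    "Country name,Regional indicator,Ladder score,Social support\n"
    "Country A,Region 1,7.1,0.95\n"
    "Country B,Region 2,5.4,0.80\n"
)

# With the real data this would be:
# datahappy = pd.read_csv("WHR20_DataForFigure2.1.csv")
datahappy = pd.read_csv(sample_csv)

# (rows, columns) of the loaded DataFrame, followed by a preview.
print(datahappy.shape)
print(datahappy.head())
```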

Next, we drop some columns that we don’t need.
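The post does not list which columns were dropped. A sketch, assuming the columns kept are the ones that appear later in the PCA feature list plus the country and region. The sample frame and its two extra columns are only illustrative:

```python
import pandas as pd

# Columns assumed to be kept (taken from the feature list used later in the post).
keep = [
    "Country name", "Regional indicator", "Logged GDP per capita",
    "Social support", "Healthy life expectancy",
    "Freedom to make life choices", "Generosity",
    "Perceptions of corruption", "Ladder score in Dystopia",
]

# Tiny made-up frame with two extra columns standing in for the ones dropped.
raw = pd.DataFrame({name: ["x"] for name in keep})
raw["Ladder score"] = [7.0]
raw["Standard error of ladder score"] = [0.03]

# Drop every column that is not in `keep`.
datahappy = raw.drop(columns=[c for c in raw.columns if c not in keep])
print(datahappy.columns.tolist())
```

Listing the columns to keep, rather than the ones to drop, makes the result independent of how many extra columns the source file carries.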

Exploring our new DataFrame as well as searching for empty cells

datahappy.isnull().sum().values

A brief explanation of each column of the DataFrame

Country name: the name of the country

Regional indicator: Region the Country belongs to

Social support: Social support (or having someone to count on in times of trouble) is the national average of the binary responses (either 0 or 1) to the Gallup World Poll (GWP) question “If you were in trouble, do you have relatives or friends you can count on to help you whenever you need them, or not?”

Healthy life expectancy: Healthy life expectancies (HLE) at birth are based on data extracted from the World Health Organization’s (WHO) Global Health Observatory data repository. The data at the source are available for the years 2000, 2005, 2010, 2015 and 2016. To match this report’s sample period (2005–2019), interpolation and extrapolation are used.

Freedom to make life choices: Freedom to make life choices is the national average of responses to the GWP question “Are you satisfied or dissatisfied with your freedom to choose what you do with your life?”

Perceptions of corruption: The measure is the national average of the survey responses to two questions in the GWP: “Is corruption widespread throughout the government or not” and “Is corruption widespread within businesses or not?” The overall perception is just the average of the two 0-or-1 responses. In case the perception of government corruption is missing, we use the perception of business corruption as the overall perception. The corruption perception at the national level is just the average response of the overall perception at the individual level.

Importing COVID19 CSV

Dropping some columns that we will not use in the analysis

datacovid.drop(['Province/State','Latitude', 'Longitude', 'Date'], axis = 1, inplace=True)

Searching for NaN cells in the new DataFrame. Some cells have no data, but they will not affect the study.

count_nan = len(datacovid) - datacovid.count()  # number of empty cells per column
print(count_nan)

A brief explanation of each column of the DataFrame

Country/Region: the name of the country

Confirmed: COVID-19 confirmed cases by country between 2020-01-22 and 2020-12-09.

Recovered: COVID-19 recovered cases by country between 2020-01-22 and 2020-12-09.

Deaths: COVID-19 deaths by country between 2020-01-22 and 2020-12-09.

The correlation charts between the Confirmed, Recovered and Deaths columns, together with the earlier exploration, show that there are some negative values where there should be none.

Finding negative data in the corresponding columns.

datacovid[(datacovid['Confirmed'] < 0)]
datacovid[(datacovid['Recovered'] < 0)]
datacovid[(datacovid['Deaths'] < 0)]

This data was probably entered incorrectly. I will not take it into account in the analysis, so I remove it from the DataFrame.
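The removal step is not shown in the post. One possible way to do it, sketched on a small made-up frame that reuses the column names from the filters above: build a boolean mask of rows with any negative count and keep the rest. Rows with missing values are left in place, since the post treats them as harmless.

```python
import pandas as pd

# Made-up daily records; one row has an impossible negative count.
datacovid = pd.DataFrame({
    "Country/Region": ["A", "A", "B", "B"],
    "Confirmed": [10, 12, 5, -3],
    "Recovered": [1, 2, 0, 1],
    "Deaths": [0, 1, 0, 0],
})

count_cols = ["Confirmed", "Recovered", "Deaths"]

# True for rows where at least one of the count columns is negative.
has_negative = (datacovid[count_cols] < 0).any(axis=1)

# Keep only the rows without negative counts.
datacovid = datacovid[~has_negative].reset_index(drop=True)
print(datacovid)
```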

The COVID-19 data contains daily records of detected cases, so in order to join both DataFrames it has to be grouped by country, summing the records of each one. The Country/Region column is also renamed to Country name, so that both DataFrames share the column used for the join. The result is a DataFrame with 143 rows and 12 columns: 143 countries appear in both DataFrames.

# Rename Column
datacovid.rename(columns = {'Country/Region':'Country name'}, inplace = True)
#Grouping data
grouped_data = datacovid.groupby(['Country name'], as_index = False).sum()
#Join Dataframes
joined_df = grouped_data.merge(datahappy)
joined_df
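Called without arguments, `merge` performs an inner join on the shared column, so countries whose names are spelled differently in the two sources are dropped silently. A sketch of one way to list them, using made-up frames that reuse the variable names from the code above:

```python
import pandas as pd

# Made-up stand-ins for the grouped COVID-19 data and the happiness data.
grouped_data = pd.DataFrame({"Country name": ["A", "B", "C"], "Confirmed": [22, 5, 9]})
datahappy = pd.DataFrame({"Country name": ["A", "B", "D"], "Social support": [0.9, 0.8, 0.7]})

# An outer join with indicator=True adds a `_merge` column recording
# whether each row came from the left frame, the right frame, or both.
check = grouped_data.merge(datahappy, on="Country name", how="outer", indicator=True)
unmatched = check.loc[check["_merge"] != "both", ["Country name", "_merge"]]
print(unmatched)

# The inner join keeps only the countries present in both frames.
joined_df = grouped_data.merge(datahappy, on="Country name", how="inner")
print(joined_df)
```

Reviewing `unmatched` before the final join shows whether any of the missing countries could be recovered by harmonizing names.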

Correlation Charts

Chart of Confirmed vs Social support; the size of each bubble is proportional to the number of registered deaths.

Chart of Confirmed vs Freedom to make life choices; the size of each bubble is proportional to the number of registered deaths.

Chart of Confirmed vs Healthy life expectancy; the size of each bubble is proportional to the number of registered deaths.
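As a numeric complement to the charts, the Pearson correlation between Confirmed and each of the three factors can be computed directly. This is a sketch, not part of the original notebook: random made-up data stands in for `joined_df`, so the printed values say nothing about the real relationship.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for joined_df, using the column names from the post.
rng = np.random.default_rng(0)
n = 50
joined_df = pd.DataFrame({
    "Confirmed": rng.integers(1_000, 1_000_000, size=n),
    "Social support": rng.uniform(0.3, 1.0, size=n),
    "Freedom to make life choices": rng.uniform(0.3, 1.0, size=n),
    "Healthy life expectancy": rng.uniform(45, 77, size=n),
})

factors = ["Social support", "Freedom to make life choices", "Healthy life expectancy"]

# Pearson correlation of each factor with the confirmed-case totals.
corr_with_confirmed = joined_df[factors].corrwith(joined_df["Confirmed"])
print(corr_with_confirmed)
```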

Principal Components Analysis

To obtain better information about our data, I decided to do a Principal Component Analysis (PCA) to reduce the number of components and graph their behavior.

The first step is to separate the features, in this case the columns from “Confirmed” to “Ladder score in Dystopia”, from the target, “Regional indicator”, which we will use to group the records.

# Separate the features
x = joined_df.loc[:, 'Confirmed':'Ladder score in Dystopia'].values
# Separate the target
y = joined_df.loc[:, ['Regional indicator']].values

With StandardScaler from scikit-learn we standardize the features: the mean of each feature is removed and the feature is scaled so that its variance is equal to 1.

# Standardize the features
from sklearn.preprocessing import StandardScaler
x = StandardScaler().fit_transform(x)
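To make the standardization concrete, here is a small check on a made-up matrix that `StandardScaler` matches the manual formula (x - mean) / std, computed per column with the population standard deviation:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix: 4 samples, 2 features on very different scales.
x = np.array([[1.0, 1000.0], [2.0, 3000.0], [3.0, 2000.0], [4.0, 6000.0]])

scaled = StandardScaler().fit_transform(x)

# Same result by hand: subtract the column mean and divide by the
# population standard deviation (NumPy's default, ddof=0).
manual = (x - x.mean(axis=0)) / x.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0), scaled.var(axis=0))
```

Without this step, columns with large magnitudes such as the case counts would dominate the principal components.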

Performing PCA with SciKit-Learn.

from sklearn.decomposition import PCA
pca = PCA(n_components=3)
principalComponents = pca.fit_transform(x)

The result is a 143 x 3 matrix: 143 rows, one per country, and 3 principal components. A new DataFrame is then created from this matrix, adding the target (Regional indicator) as a column.
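The cell that builds this DataFrame is not shown. A sketch of how it might look, assuming the name `pca_df` and the column names `PC1`, `PC2`, `PC3` that the plotting code further down relies on. Random made-up data replaces the real features so the snippet runs on its own:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Made-up stand-ins for the standardized features and the region labels.
rng = np.random.default_rng(0)
x = StandardScaler().fit_transform(rng.normal(size=(6, 5)))
regions = pd.Series(["Region 1", "Region 2"] * 3, name="Regional indicator")

principalComponents = PCA(n_components=3).fit_transform(x)

# One row per country, one column per component, plus the target column.
pca_df = pd.DataFrame(principalComponents, columns=["PC1", "PC2", "PC3"])
pca_df["Regional indicator"] = regions.values
print(pca_df.shape)
```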

Chart of the percentage of variance, showing how much each component contributes. Together, the 3 components account for 80.7% of the variance.

# Show the explained variance ratio of the 3 principal components
print(pca.explained_variance_ratio_)
# Sum of the explained variance ratios of the 3 principal components
print(sum(pca.explained_variance_ratio_))
exp_var_cumul = np.cumsum(pca.explained_variance_ratio_)
px.area(
    x=range(1, exp_var_cumul.shape[0] + 1),
    y=exp_var_cumul,
    labels={"x": "# Components", "y": "Explained Variance"}
)

PCA Loadings Table

features = ['Confirmed', 'Recovered', 'Deaths',
'Logged GDP per capita', 'Social support', 'Healthy life expectancy',
'Freedom to make life choices', 'Generosity',
'Perceptions of corruption', 'Ladder score in Dystopia']
loadingsTable = pd.DataFrame(pca.components_.T * np.sqrt(pca.explained_variance_),
columns=['PC1', 'PC2', 'PC3'], index=features)
loadingsTable

Loadings: correlations between the individual variables and the principal components

PC1 focuses mostly on the variables from the happiness perception report, PC2 leans more toward the COVID-19 variables, and PC3 is a mixture of both.

z = pca.components_.T * np.sqrt(pca.explained_variance_)
fig = go.Figure(data=go.Heatmap(
z=z.T,
x=features,
y=['PC1', 'PC2', 'PC3'],
colorscale='Viridis'))
fig.show()

Visualize all Principal Components

labels = {
str(i): f"PC {i+1} ({var:.1f}%)"
for i, var in enumerate(pca.explained_variance_ratio_ * 100)
}
fig = px.scatter_matrix(
principalComponents,
labels=labels,
dimensions=range(3),
color=pca_df['Regional indicator']
)
fig.update_traces(diagonal_visible=False)
fig.show()

3-D Chart to Visualize all Principal Components

import plotly.express as px
fig = px.scatter_3d(pca_df, x='PC1', y='PC2', z='PC3',
color='Regional indicator', size_max=18,
opacity=0.7)
fig.show()

Conclusions

Most countries behave similarly with respect to the happiness-report factors considered, except for a few that stand out because they have many confirmed cases of infection or a large number of deaths.

Repository & Dashboard Link

Repository of this analysis

Dashboard Link

https://rafastoievsky.github.io/Happiness-VS-Covid-Dashboard/

Luis Rafael Arce
MCD-UNISON

Industrial Engineer, full-stack web developer, master’s student in data science