Q#91: Income inequality and happiness
Suppose you are given two datasets as shown below:
- Data on the Gini coefficient (proxy for income inequality). This is a number between 0 and 1, where 0 corresponds to perfect equality (i.e. everyone has the same income) and 1 corresponds to perfect inequality (one person has all the income and everyone else has none). You can read more about the Gini coefficient on Wikipedia here
- Data containing a Happiness Score from the World Happiness Report. The score represents a weighted average across a number of variables and ranges from 1–10, where 10 is perfectly happy. You can read more about the World Happiness Report on Wikipedia here
Given this data, determine if there is a correlation between income inequality (field: ‘current’ in Gini index dataset) and happiness (field: ‘overall_score’ in Happiness dataset). You’ll only want to keep records that exist in both datasets (there are many countries/regions in the World Happiness Report that do not have a measured Gini index).
TRY IT YOURSELF
ANSWER
Alright, today's task is to evaluate whether there is a relationship between two variables that do not live in the same table, something that occurs all too often in data science. Let’s not panic; we have been here before. We can break the problem into steps and deliver our result as a visual.
So, it goes without question that the first step is to use pandas to load the data.
import pandas as pd
gini_index = pd.read_csv('https://raw.githubusercontent.com/erood/interviewqs.com_code_snippets/master/Datasets/gini_index.csv')
happiness_index = pd.read_csv('https://raw.githubusercontent.com/erood/interviewqs.com_code_snippets/master/Datasets/happiness_index.csv')
Alright, next let's merge the data. The question clearly states that not all rows in one table exist in the other, and we are only interested in the overlap, so we can do an inner merge on the two dataframes using pd.merge(). Note that the key column is named 'country' in the Gini dataset and 'country_or_region' in the Happiness dataset, hence the left_on and right_on arguments.
# Inner join the data
combined_df = pd.merge(gini_index, happiness_index, left_on='country', right_on='country_or_region', how='inner')
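Before trusting the inner join, it can help to see how many rows each side loses. The sketch below is one way to do that, using two tiny made-up dataframes (gini_toy and happiness_toy are hypothetical stand-ins, and their values are invented for illustration) with the same column names as the real datasets. An outer merge with indicator=True tags every row as coming from the left table, the right table, or both.

```python
import pandas as pd

# Hypothetical toy data; the values are made up purely for illustration
gini_toy = pd.DataFrame({
    'country': ['A', 'B', 'C'],
    'current': [0.25, 0.40, 0.55],
})
happiness_toy = pd.DataFrame({
    'country_or_region': ['B', 'C', 'D', 'E'],
    'overall_score': [6.1, 5.2, 7.0, 4.4],
})

# An outer merge keeps every row; indicator=True adds a '_merge' column
# with the values 'left_only', 'right_only', or 'both'
audit = pd.merge(
    gini_toy, happiness_toy,
    left_on='country', right_on='country_or_region',
    how='outer', indicator=True,
)
match_counts = audit['_merge'].value_counts()
print(match_counts)

# The inner merge keeps exactly the rows flagged 'both'
inner = pd.merge(
    gini_toy, happiness_toy,
    left_on='country', right_on='country_or_region',
    how='inner',
)
print(len(inner))
```

If the 'left_only' or 'right_only' counts on the real data look surprisingly high, it may be worth checking for spelling differences in country names between the two sources.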
Now, let's answer the question. We can use the .corr() method in pandas to get the correlation; by default it computes the Pearson correlation coefficient.
# Calculate the correlation coefficient
correlation_coefficient = combined_df['current'].corr(combined_df['overall_score'])
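To make it clearer what that number means, here is a minimal sketch that computes Pearson's r by hand and compares it with the pandas result. The toy dataframe and its values are made up for illustration. It also computes a rank-based (Spearman) correlation as Pearson's r on the ranks, which is a reasonable cross-check when the relationship may be monotonic but not linear; that check is an addition here, not part of the original solution.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are made up purely for illustration
toy = pd.DataFrame({
    'current': [0.25, 0.30, 0.40, 0.50, 0.60],
    'overall_score': [7.2, 6.8, 6.0, 5.5, 4.9],
})

x = toy['current'].to_numpy()
y = toy['overall_score'].to_numpy()

# Pearson's r: sum of products of deviations from the mean,
# divided by the square root of the product of the summed squared deviations
x_c = x - x.mean()
y_c = y - y.mean()
r_manual = (x_c * y_c).sum() / np.sqrt((x_c ** 2).sum() * (y_c ** 2).sum())

# The same quantity via pandas (Pearson is the default method)
r_pandas = toy['current'].corr(toy['overall_score'])

# Spearman's rho computed as Pearson's r on the ranks of each column
rho = toy['current'].rank().corr(toy['overall_score'].rank())

print(r_manual, r_pandas, rho)
```

A value near -1 or +1 indicates a strong linear relationship, and a value near 0 indicates little linear relationship. Correlation alone does not tell us whether inequality causes lower happiness.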
Finally, since we are great data scientists, let's show the result in a visualization to get our point across. In interviews this can earn significant bonus points: much of our work involves explaining results to someone else, and a picture is worth a thousand words. We can use the seaborn package to make a scatter plot with a regression line.
import seaborn as sns
import matplotlib.pyplot as plt
# Create a scatter plot of the merged data
plt.figure(figsize=(10, 6))
sns.scatterplot(data=combined_df, x='current', y='overall_score')
# Overlay a fitted linear regression line on the scatter plot
sns.regplot(data=combined_df, x='current', y='overall_score', scatter=False, color='red')
# Set plot title and labels
plt.title(f'Scatter Plot of Gini Coefficient vs Happiness Score\nCorrelation: {correlation_coefficient:.2f}')
plt.xlabel('Gini Coefficient')
plt.ylabel('Happiness Score')
# Display the plot
plt.show()