Towards equality and well-being in Yucatán — Fab City Yucatán

Published in Fab City Yucatán · 8 min read · Mar 28, 2019

How close is a city to being resilient, self-sufficient and guaranteeing opportunity and well-being to its citizens so that they can be the masters of their own destiny?

This is the overarching question that the Fab City global initiative is trying to answer. Our goal is to make cities at least 50% self-sufficient by 2054. You can read more about this project in the whitepaper or on the official Medium blog.

Fab City Resilience Index

I am convinced that, by better understanding what well-being means, we can improve our own quality of life. The Fab City initiative is using a monitoring system to track cities’ and citizens’ well-being, based on the OECD’s well-being and Better Life Index. If we, as a Fab City Network, can quantify well-being on a global scale, we can hold ourselves accountable for making our own cities and regions more sustainable. A number of projects take similar approaches. For example, the city of Santa Monica won Bloomberg’s Mayors Challenge by developing its own well-being index. Other well-known examples include Dubai and Bhutan, which have implemented happiness indices.

But before making a radical change, one needs to know the current state of a system in depth. One approach I like is to visualize data to get a feel for the structure of the system. At Fab Lab Yucatán, we have been working to develop data science skills so that we can contribute to both Fab City Yucatán and the global initiative. In this article, I want to share some basic data analysis techniques using the OECD well-being index data. This is by no means an exhaustive analysis; my intention is to give you a flavor of the workflow you can build with Python. You can download the notebook in which I made the analysis here, and the data here.

Well-being index from the OECD

The OECD has been working to map out global well-being data and has a very nice visualization engine to explore it (see the figure below for the state of well-being in Yucatán). They have also released their data openly for people to analyze and derive public policies from. Before proceeding, I’d like to emphasize that these are global data, min-max normalized across the whole dataset, so a score only describes where a region sits relative to the others, and countries might not be readily comparable. You could imagine, for example, that people in a given country have a lower income but also fewer expenses. Indeed, the analysis of large-scale well-being data is a research topic in its own right; if you want to know more, I suggest reading this paper. To dig deeper into what each variable measures and how the normalization works, please refer to this document. As an example, let’s look at the well-being data available for the state of Yucatán in México.

Wellbeing index for the state of Yucatán, México.

We can see that Yucatán scores relatively high in safety, life satisfaction, environment, and community. However, we still have a long way to go in terms of housing infrastructure, education, income, and health.
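To make the min-max normalization mentioned earlier more concrete, here is a minimal sketch of the textbook formula. The 0 to 10 scale, the function name `min_max_score`, and the income values are assumptions for illustration; the exact OECD procedure is described in the methodology document linked above and may differ in its details.

```python
import numpy as np

def min_max_score(values, scale=10.0):
    """Rescale raw values to the range [0, scale] with min-max normalization."""
    values = np.asarray(values, dtype=float)
    lowest, highest = values.min(), values.max()
    return scale * (values - lowest) / (highest - lowest)

# Hypothetical raw incomes for four regions (made-up numbers, arbitrary units).
raw_income = [4000, 12000, 20000, 36000]
scores = min_max_score(raw_income)
print(scores)
```

Because every score depends on the minimum and maximum of the whole dataset, adding or removing a single extreme region shifts the scores of all the others, which is one reason to compare them with care.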

Reproducible data analysis

In this post we’re going to do exploratory data analysis to better understand the relationship between different variables associated with well-being. The well-being data takes into account variables like education, income, accessibility to services, community, and others to make a holistic “radar” of the quality of life in a given place. We will also see that the variance of these variables could be a metric for inequality in a country or region.

First, we’re going to import some Python libraries and load our dataset into the Jupyter notebook.
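The loading step might look roughly like the sketch below. The column names and values are placeholders, not the real OECD file layout, and the small inline CSV only stands in for the downloaded file so that the example runs on its own.

```python
import io

import pandas as pd

# Stand-in for the downloaded file. With the real data you would pass the
# file path to pd.read_csv instead. Column names and values are made up.
csv_text = """Region,Country,Continent,Education,Income
Region A,Country 1,America,3.1,1.2
Region B,Country 1,America,8.9,7.5
Region C,Country 2,Europe,7.4,5.0
Region D,Country 3,Asia,9.2,4.1
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())    # first rows, to check that the columns were parsed
print(df.dtypes)    # the indicator columns should be numeric
```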

For example, we know from prior intuition that education and income levels are positively correlated in Yucatán; that is, the well educated tend to have high levels of income, and vice versa. Below, we share an interactive plot showing just that, built with the amazing HoloViews library. In this plot, each dot represents a single municipality in Yucatán.

Interactive visualization of illiteracy vs poverty

We can go ahead and see if this observation also shows up in the regional OECD data. To do this, we’ll use the Seaborn function jointplot. This function can draw a regular scatter plot, a hexagonal-bin plot, or a 2-D KDE plot. We’ll choose the latter, with the contours set to 15 levels. In this plot, the darker the color, the more densely populated that region of the plot. Moreover, we get 1-D KDEs on both axes to visualize the individual distributions.

Education vs Income 2-D KDE plot. We can see that education is bimodal (marginal density on top), and income is close to multimodal, with only a few countries having high levels of income.

Voilà. Indeed, our data show that education and income are correlated (Pearson’s r > 0.7). Moreover, we can see that the education index distribution is bimodal, with a small left tail. That is, most regions in this dataset have good education levels. However, these observations are likely biased, because the dataset doesn’t include a big portion of the world where there is still a lot to do in terms of education. If you’re interested in how the quality of education has evolved, please refer to this excellent article from ourworldindata.org.
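For readers who want to check a Pearson coefficient themselves, the sketch below computes it twice: once from its definition and once with pandas. The five data points are invented for illustration and are not taken from the OECD dataset.

```python
import numpy as np
import pandas as pd

# Made-up scores for five regions, only to illustrate the calculation.
df = pd.DataFrame({
    "Education": [2.0, 4.0, 5.0, 7.0, 9.0],
    "Income": [1.0, 3.5, 3.0, 6.0, 8.0],
})

# Pearson's r from its definition: the sum of products of the centered
# values divided by the square root of the product of their sums of squares.
x = df["Education"] - df["Education"].mean()
y = df["Income"] - df["Income"].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

# The same quantity with pandas (Pearson is the default method).
r_pandas = df["Education"].corr(df["Income"])

print(r_manual, r_pandas)
```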

To continue our quest to visualize the relationship between education and income levels, let’s plot individual data points colored by continent.

Education vs Income level scatterplot. Individual data points represent regions according to the OECD.

Now the story gets more interesting! In the plot above, individual data points represent individual regions in the OECD (for example, in Mexico the regions are states). We can see from the data that, in the American continent, regions form clusters with different levels of education and income. Some regions, like those in the US and Canada, have very high education and income levels; others, which correspond to Mexico and Chile, have low to medium levels. In this sense, we could say that, in terms of these two variables, the American continent is quite unequal.

Variance as a measure of inequality

Now we can continue our analysis by plotting the distribution of education grouped by continent. I like violin plots for visualizing distributions: a violin plot is essentially a KDE mirrored around a central axis. I also like to mark the quartiles of each distribution to get a better sense of the data. A more “unbiased” way to visualize distributions using Seaborn is a swarmplot, because you can see the individual data points. Another option for directly comparing distributions while still plotting every point is to use ECDFs (empirical cumulative distribution functions).

Distributions of education indices across continents.

We can see that America and Asia have the distributions with the biggest spread. We can also see that almost all distributions extend below 0 and above 10, but that is just an artifact of the KDE smoothing the distribution; the underlying scores stay within that range.
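The smoothing artifact, and the ECDF alternative mentioned earlier, can be illustrated with a few lines of NumPy. This is a hand-rolled sketch, not what Seaborn does internally: the helper names, the bandwidth, and the scores are all assumptions chosen for illustration.

```python
import numpy as np

def gaussian_kde_density(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate at each grid point."""
    samples = np.asarray(samples, dtype=float)
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

def ecdf(samples):
    """Return the sorted samples and the fraction of data at or below each."""
    x = np.sort(np.asarray(samples, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

# Made-up scores, all inside the 0-10 range.
scores = np.array([0.0, 0.5, 2.0, 6.5, 9.0, 10.0])

# The KDE places a bell curve on every point, so some density leaks
# outside the range that the data actually cover.
grid = np.linspace(-3, 13, 161)
density = gaussian_kde_density(scores, grid, bandwidth=1.0)
step = grid[1] - grid[0]
mass_outside = density[(grid < 0) | (grid > 10)].sum() * step

# The ECDF uses only the observed values, so it never leaves their range.
x, y = ecdf(scores)

print(f"Approximate KDE mass outside [0, 10]: {mass_outside:.2f}")
print(f"ECDF x-range: {x.min()} to {x.max()}")
```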

Now, we can go ahead and ask: how unequal are different regions in the world on a continental scale? We’ve previously seen from the scatterplot that in terms of education, America is quite unequal.

We can compute this “unevenness” or inequality by measuring the variance of our distributions. Let’s take education as an example.
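With pandas, this amounts to a group-by followed by a variance. The sketch below uses invented scores and assumed column names (`Continent`, `Education`); note that `var()` in pandas returns the sample variance (it divides by n - 1).

```python
import pandas as pd

# Made-up education scores, chosen only to illustrate the computation.
df = pd.DataFrame({
    "Continent": ["America"] * 4 + ["Europe"] * 4 + ["Australia"] * 2,
    "Education": [1.5, 3.0, 8.5, 9.5, 6.5, 7.0, 7.5, 8.0, 8.0, 8.5],
})

# Sample variance of the education score within each continent,
# sorted from most to least spread out.
variance = (
    df.groupby("Continent")["Education"]
    .var()
    .sort_values(ascending=False)
)
print(variance)
```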

Variance of education across regions in different continents.

We can see that America is indeed the continent with the biggest spread in its distribution, and thus has the highest variance. Meanwhile, Australia has the smallest variance, as it only represents the countries of Australia and New Zealand.

Now, let’s loop over the pandas DataFrame columns to visualize different variables across continents.
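The looping pattern itself can be sketched without any plotting. In the example below, each pass computes the per-continent quartiles of one indicator, the same landmarks a violin plot can mark; in the notebook, the loop body would draw a plot instead. The column names and values are placeholders.

```python
import pandas as pd

# Made-up scores for a few indicators, only to show the looping pattern.
df = pd.DataFrame({
    "Continent": ["America", "America", "America", "Europe", "Europe", "Europe"],
    "Education": [2.0, 5.0, 9.0, 6.0, 7.0, 8.0],
    "Income": [1.0, 4.0, 8.0, 4.0, 5.0, 6.0],
    "Safety": [3.0, 6.0, 9.0, 8.0, 9.0, 10.0],
})

# Every column except the grouping column is treated as an indicator.
indicator_columns = [col for col in df.columns if col != "Continent"]

summaries = {}
for column in indicator_columns:
    # One row per continent, one column per quartile.
    quartiles = df.groupby("Continent")[column].quantile([0.25, 0.5, 0.75])
    summaries[column] = quartiles.unstack()
    print(f"\n{column}")
    print(summaries[column])
```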

Regional distributions of OECD indicators grouped by continents

Correlation between different variables

Another interesting question to ask of the data is how the different variables relate to one another. A simple approach is to calculate the correlation between them, which is easy to compute with Python.

To get a sense of the possible correlations between the different variables, we can make a PairGrid using Seaborn. The PairGrid is an all-vs-all plot that uses the columns in a DataFrame to visually inspect possible relationships. It also plots the distributions of individual columns on the diagonal. Here we specify violinplots on the diagonal, kdeplots in the upper triangle of the grid, and scatterplots in the lower triangle. Notice that the upper and lower triangles show the same pairs of variables, just with different representations.

Pairplot of the OECD wellbeing variables

Finally, we can compute the correlation matrix to get the Pearson correlation between every pair of indices. We can then visualize this matrix using a clustermap, which is a heatmap whose rows and columns are ordered by a hierarchical clustering procedure. This is a simple way to inspect whether some well-being indices are highly correlated with each other and form clusters. These clusters could help governments and civil society tackle these issues in a holistic way.
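The correlation matrix itself is a one-liner in pandas. The sketch below builds it on synthetic data, in which two columns share a common component by construction, and then lists each pair of indicators once, strongest correlation first. The column names and the random data are assumptions; the hierarchical ordering that a clustermap adds is not reproduced here.

```python
import numpy as np
import pandas as pd

# Synthetic indicators: Education and Income share a common component,
# Safety is independent. Made-up data, for illustration only.
rng = np.random.default_rng(42)
n = 200
base = rng.normal(size=n)
df = pd.DataFrame({
    "Education": base + rng.normal(scale=0.5, size=n),
    "Income": base + rng.normal(scale=0.5, size=n),
    "Safety": rng.normal(size=n),
})

# Pearson correlation between every pair of columns.
corr = df.corr(method="pearson")

# Keep the upper triangle so that each pair appears only once,
# then order the pairs by absolute correlation.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()
order = pairs.abs().sort_values(ascending=False).index
pairs = pairs.reindex(order)

print(corr.round(2))
print(pairs.round(2))
```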

Correlation matrix for the regional data.

From top to bottom, we can see three clusters: one linking safety and health; a second linking jobs, environment, and life satisfaction; and a final, bigger cluster linking income, housing, education, and accessibility to services. These clusters match our previous intuition that education and income are correlated, and they associate other factors that could be important for approaching systematic solutions. For example, you could imagine that increasing sociability through community activities in a neighborhood could raise levels of innovation and thus lead to better income. These highly complex problems have to be approached both from the bottom up, with techniques such as citizen science, and from the top down, with tools like data analysis.

That’s it! We hope this article helps the Fab City community and other projects analyze their data using simple data science tools. If you have any questions, please leave them in the comments below. If you liked this post, you can follow the Fab City Yucatán publication on Medium, or contact us directly at fabcityyucatan@gmail.com.

As I’ve already mentioned, if you’re interested in the guts of this dataset (which indices it contains and how they were normalized), you can read more about it here.