Progress Towards Reducing Global Inequality
TrueCue and Women+Data Hackathon 2021 Submission — Team Violet
- Technical Solution
- Future Thoughts
- Team Violet
When we came across the TrueCue and Women+Data Hackathon, what drew most of us to the project was that it was aimed specifically at women and centred on a globally important topic: sustainability. It is a topic that affects our daily lives in different forms across many sectors. Sustainability doesn’t just refer to climate change; it also means putting in place long-term strategies that will improve access to education, build resilient infrastructure and so much more.
As women with backgrounds in different fields, either trying to break into a data-based world or having just entered it, we found it really interesting to see an event aimed at getting women more comfortable with data and offering an opportunity to connect and learn from each other.
The task was very open-ended and left largely to the choice of each group. We were given a dataset that contained information on how countries across the globe were progressing with respect to the United Nations (UN) Sustainable Development Goals (SDGs) over time (2000–2018). This data had been pulled together by TrueCue from multiple sources. It was a bit daunting given how many choices and routes were available. There are 17 SDGs and the dataset contained 54 columns of information for 173 countries. But after going away and doing some research, it was clear that one goal resonated with all of us: Sustainable Development Goal 10, reduce inequality within and among countries. Those affected by this goal include those who earn below the median income, those without employment, those without voting rights, those with less socio-economic power, and refugees.
With the scope narrowed down to one goal, we looked at the dataset we were given. We found that we did not have enough data relevant to our SDG choice. After much discussion and research, we decided to create our own measure for SDG 10 that would be more accurate than the existing measures our research had found, encapsulating both economic and social inequalities.
Overview of the Dataset
Our final dataset most closely represents equality metrics associated with targets 2, 3, and 4 under SDG10. These broadly relate to promoting inclusion, ensuring equal outcomes, and providing economic and social protections. We wanted to procure as much data as possible relating to the specific metrics underlying the UN’s conception and measurement of SDG10 so that the data would represent it more holistically and effectively. It was interesting and encouraging to see so many metrics regarding the social, economic, and political empowerment of women.
The final dataset consisted of 2540 rows and 18 columns. In the table below, we have listed all the column names and their meanings.
To build this dataset, we first cleaned the dataset given to us during the hackathon. We only used data taken from the World Bank. This came in a wide format: each row represents one country and the columns correspond to the values in each year. Since our unit of analysis is country-year, we had to convert the format from wide to long. We reshaped the data using the melt function so that each country has multiple rows of data, one for each year. We then enriched it with data retrieved from the V-Dem Institute, shown in bold in the table.
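The wide-to-long step can be sketched as follows. This is a minimal illustration that assumes pandas and its `melt` method; the write-up does not say which library's melt function we used, and the country names, column labels and values here are made up.

```python
import pandas as pd

# Synthetic wide-format table: one row per country, one column per year.
# Country names, the "country" label and all values are invented for illustration.
wide = pd.DataFrame({
    "country": ["Country A", "Country B"],
    "2000": [0.41, 0.35],
    "2001": [0.40, 0.36],
    "2002": [0.39, None],
})

# Reshape to long format so that each row is one country-year observation.
long_df = wide.melt(id_vars="country", var_name="year", value_name="value")

# Year labels start out as column names (strings); convert them to integers.
long_df["year"] = long_df["year"].astype(int)
long_df = long_df.sort_values(["country", "year"]).reset_index(drop=True)

print(long_df)
```

Each country now contributes one row per year, which is the country-year unit of analysis described above.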
Before merging the datasets together, we replaced some of the country names, for example, ‘Czechia’ with ‘Czech Republic’ to ensure consistency throughout the final dataset. Factor Analysis does not allow for missing values so we only kept the complete cases for the analysis.
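A rough sketch of the harmonise, merge and filter sequence, again assuming pandas. The table names, column names and values are hypothetical, and which source used which spelling of a country name is also an assumption.

```python
import pandas as pd

# Hypothetical country-year tables standing in for the two sources.
# All values are invented placeholders, not real statistics.
world_bank = pd.DataFrame({
    "country": ["Czech Republic", "Czech Republic", "Thailand"],
    "year": [2005, 2006, 2006],
    "indicator_a": [0.30, None, 0.40],
})
vdem = pd.DataFrame({
    "country": ["Czechia", "Czechia", "Thailand"],
    "year": [2005, 2006, 2006],
    "indicator_b": [1.0, 1.0, 1.0],
})

# Map alternative spellings onto one canonical name before merging,
# otherwise the same country would fail to match across sources.
name_map = {"Czechia": "Czech Republic"}
vdem["country"] = vdem["country"].replace(name_map)

# Join on the country-year key.
merged = world_bank.merge(vdem, on=["country", "year"], how="inner")

# Keep complete cases only: drop any row with a missing value.
complete = merged.dropna().reset_index(drop=True)

print(complete)
```

Without the renaming step, an inner join would silently drop every row for the mismatched country, so it is worth checking row counts before and after the merge.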
We did not immediately plan to go down the route of creating our own index. The idea arose when we were exploring pre-existing indexes. We found that the Gini coefficient, which we obtained from the initial dataset, only covered income inequality within countries. As such, it was not a good measure of inequality, which comes in so many different forms (see SDG10 Targets and Indicators). There were also many missing values for the Gini Index in the TrueCue dataset that we could not recreate or impute without skewing our findings.
Another consideration was the Human Development Index (HDI) from the United Nations Development Programme. It measures both social and economic factors, such as the number of people per doctor, literacy rate and gross national income per capita, to list a few. However, a common criticism of the HDI is that it is too simple and that financial factors still hold too much weight within it, so we developed our own index to see if it could measure inequality any better.
First, we tried to use Principal Component Analysis (PCA), but that did not work because the first component explained less than 50% of the variance, so we tried Factor Analysis (FA) instead. Factor Analysis is an unsupervised machine learning technique that reduces a large number of variables down to a few important factors.
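The check that pushed us from PCA towards FA can be illustrated with a small sketch. It uses NumPy only and runs on synthetic data with two independent latent drivers; the matrix, its dimensions and the noise level are assumptions chosen for illustration, not our actual indicators.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the indicator matrix: 200 country-years x 6 indicators,
# generated from two independent latent drivers plus noise.
latent = rng.normal(size=(200, 2))
mixing = np.array([[1.0, 0.9, 0.8, 0.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0, 1.0, 0.9, 0.8]])
X = latent @ mixing + 0.3 * rng.normal(size=(200, 6))

# Standardise each indicator so that scale differences do not dominate.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via singular value decomposition of the standardised matrix.
_, singular_values, _ = np.linalg.svd(Z, full_matrices=False)

# Squared singular values are proportional to the variance along each component.
explained_ratio = singular_values**2 / np.sum(singular_values**2)

pc1_share = explained_ratio[0]
print(f"PC1 explains {pc1_share:.0%} of the variance")
print("Cumulative share:", np.round(np.cumsum(explained_ratio), 2))
```

When several independent drivers sit behind the indicators, as in this toy example, no single component can capture most of the variance, which is a hint that a multi-factor description may suit the data better.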
Using FA, the most important variables fell into 3 distinct categories: access, which describes exclusion by gender, socioeconomic status, urban-rural location, political group or social group; outcomes, which describes politico-economic conditions such as political equality, freedom from forced labour, free movement within the country and the proportion of GDP; and voting, which describes a citizen’s right to vote. Together, these 3 factors explain 73% of the variance in the data.
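For readers who want to try something similar, here is a minimal sketch of a three-factor model using scikit-learn's `FactorAnalysis`. The data is synthetic, and the varimax rotation and the way the variance share is computed (sum of squared loadings divided by the number of standardised indicators) are assumptions for illustration, not a record of our exact settings.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)

# Synthetic stand-in: 300 country-years x 9 indicators driven by 3 latent factors.
latent = rng.normal(size=(300, 3))
# Block structure: indicators 0-2 load on factor 0, 3-5 on factor 1, 6-8 on factor 2.
mixing = np.kron(np.eye(3), np.array([[1.0, 0.9, 0.8]]))  # shape (3, 9)
X = latent @ mixing + 0.4 * rng.normal(size=(300, 9))

# Standardise so every indicator has zero mean and unit variance.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Fit a 3-factor model; the varimax rotation is an assumed choice.
fa = FactorAnalysis(n_components=3, rotation="varimax", random_state=0)
scores = fa.fit_transform(Z)   # factor scores, one row per observation
loadings = fa.components_.T    # one row per indicator, one column per factor

# Share of total variance attributed to each factor.
variance_share = (loadings**2).sum(axis=0) / Z.shape[1]

# For each indicator, the factor on which it loads most strongly.
dominant = np.abs(loadings).argmax(axis=1)

print("Variance share per factor:", np.round(variance_share, 2))
print("Total share:", round(float(variance_share.sum()), 2))
print("Dominant factor per indicator:", dominant)
```

Grouping indicators by their dominant factor is one way to arrive at named categories such as access, outcomes and voting; the naming itself remains a judgment call.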
Our analysis made progress towards solving the problem of effectively measuring inequality worldwide, with respect to the UN’s definition of inequality. Using our case study on Thailand, we were able to show that our index can be used both to indicate areas for improvement when it comes to inequality and to identify dips and rises in inequality that may warrant further investigation.
In the case of Thailand, we saw a significant drop in our index measure in 2006. Through research, we were able to determine that this was due to a military takeover that eliminated voting rights.
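Spotting such dips can be partly automated. The sketch below flags large year-on-year changes in an index series; the function name, the threshold and the values are all hypothetical and do not reproduce our results for Thailand or any other country.

```python
def flag_jumps(series, threshold):
    """Return (year, change) pairs where the year-on-year change is at least threshold in size."""
    years = sorted(series)
    flagged = []
    for prev, curr in zip(years, years[1:]):
        change = series[curr] - series[prev]
        if abs(change) >= threshold:
            flagged.append((curr, round(change, 3)))
    return flagged


# Made-up index values for a single hypothetical country.
index_by_year = {2003: 0.62, 2004: 0.63, 2005: 0.61, 2006: 0.38, 2007: 0.45, 2008: 0.47}

print(flag_jumps(index_by_year, threshold=0.1))
```

A flagged year is only a prompt for further research; working out why the index moved still needs the kind of background reading described above.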
For our work on the index, we were awarded the Best Advanced Analytics/Predictive Modelling award by the panel of judges at the finals.
There are many avenues that each member would have liked to explore. However, we were hindered by the lack of available data, which was our biggest limitation. For example, there was not enough information on factors such as a country’s technological development (Solow Residual) or details on the disabled population, nor was there any way to process qualitative information such as the types of laws within a country.
Moreover, the lack of data is severe in certain parts of the world, specifically Africa, the Caribbean and the Pacific Islands. This highlights the fact that the international community often neglects these regions or does not pay as much attention to them. Although regime change, natural disasters or other unpredictable or uncontrollable events might hamper the collection of the data needed, the past 20 years have shown the need to thoroughly understand and recognise what is going on in these regions.
One of the biggest lessons we have learnt from this project is to be prepared to find alternative solutions, as we had to when we first attempted to construct the index using PCA. Ideally, the first principal component (PC1) would have accounted for at least 70% of the variation, in which case we could simply have combined the factor loading values of PC1 into a single index using linear regression. However, that did not go according to plan and we had to look for alternative techniques to construct the index.
If we could do the project again, we would focus more energy and resources on familiarising ourselves with the assumptions and data requirements of FA. For example, the indicators have to be in the same unit of measurement and should not be ordinal or categorical variables. This would have saved us a lot of time on finding replacement variables.
We believe that, with more data and some tweaking of the model, our index can be used to improve inequality measurement in the future. It would be worth exploring grouping the Goal 10 indicators into socio-political, intra-country economic, and inter-country economic. However, the lack of data is a huge impediment; we would also need to account for inequality indicators 1, 5, 8 (a), 9 (b) and 10 (c), which are more closely related to inter-country economic measures such as trade.