Predicting Inequality in the United States | A Machine Learning Exploration

Analiz Cabrera
Published in The Startup · 7 min read · Sep 27, 2020

By Analiz Cabrera & Sindhu Srinath

“No two leaves are alike, and yet there is no antagonism between them or between the branches on which they grow” — Gandhi

The United States, one of the most powerful countries in the world, has for decades been home to a diverse population spanning different races, ethnicities, genders, and classes. To us, the legend of America has been that of a land where dreams come true, where working hard lets you rise toward endless possibilities.

We are two immigrants who came to the United States full of hope, strongly believing in education as a way to aim higher and find better opportunities. Having recently graduated with a Master’s degree in Supply Chain from MIT, we came away with a different view of the social dynamics in this country, one that was not close to what we envisioned when we immigrated.

If the United States is one of the strongest economies in the world, why are there such large disparities in income across the population? According to World Bank and U.S. Census data, the United States has the highest income inequality among the G7 countries. With this in mind and with our engineering lenses on, we decided to dig into the data and apply the machine learning skills we learned at MIT to identify the features that could predict income inequality in the United States at the state and county level.

To measure and compare income inequality, we used the Gini index. The index ranges from 0 to 1, where 0 indicates perfect equality and 1 indicates that one person makes all the income within a group. While we do not think perfect equality is necessarily the target, the United States has potential to reduce its current index (0.47) when compared to the average index (0.33) among the G7 countries.
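To make the scale concrete, here is a minimal sketch of how a Gini index can be computed from a list of individual incomes. It is not the calculation behind the Census figures quoted above, which we took as published; the function name and the sample incomes are assumptions made purely for illustration.

```python
def gini_index(incomes):
    """Gini index of a list of non-negative incomes (hypothetical helper)."""
    values = sorted(incomes)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        raise ValueError("need at least one positive income")
    # Rank-weighted form: G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n,
    # where x_i are incomes sorted ascending and i runs from 1 to n.
    weighted = sum(rank * x for rank, x in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Made-up incomes, for illustration only.
equal = gini_index([50_000] * 4)                 # everyone earns the same
concentrated = gini_index([0, 0, 0, 200_000])    # one person earns everything
print(equal, concentrated)
```

With only four people, the fully concentrated case gives (n - 1) / n = 0.75 rather than exactly 1; the upper bound approaches 1 as the group grows.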

For our modeling, we hypothesized that factors such as demographics, race, education, and federal & state spending on education could be predictors of income inequality in the United States. Why? Because we started from the personal premise that education helps to overcome income inequality. The amount the federal government and the states spend on education contributes to increasing the number of people with higher education, and the diversity of races in the United States makes the social system more complex.

With our modeling, we wanted to understand income inequality by analyzing the data to identify the most accurate predictors among those stated above. We did not aim to answer why income inequality exists, suggest measures to fix it, or critique the current situation. We are conscious that answering the “why” and “how” would involve a more complex approach, and that offering a critique would require profound knowledge of the subject beyond digging into data and running algorithms.

Before sharing all the details, let us reveal our main finding: race features predict income inequality with higher accuracy than education level, federal and state spending, or other demographic elements. Specifically, we found that, based on the proportion of Black and White population within a state, the machine learning model could predict whether the state is above or below the median of income inequality with an accuracy of 96%.

Our utopian view was proven wrong by our model: we overestimated education as the main predictor of inequality in the United States.

Before we move into the prediction model and results, we want to share some insights about the population characteristics of the country. We hope you learn as much as we did but, most importantly, we hope these facts increase awareness and ideally lead to informed action.

What we learned by analyzing the data

1. Hispanics or Latinos are the largest minority in the United States, representing 18% of the total population. They are also one of the minorities with a lower percentage of population with higher education[1].

[1] Population with Bachelor’s degree or higher

According to Census data, in 2018 the United States had ~330M inhabitants, of whom Hispanics and/or Latinos constituted 18.3%, with the Black population not far behind at ~40M people. Asians made up ~5% of the population, and the remaining ~4% comprised other races, American Indians & Alaska Natives, and Native Hawaiians & other Pacific Islanders (Figure 1).

Figure 1: Percentage of population by race, 2018

2. Asians[2] have the highest percentage of population with higher education within their race

[2] Includes Far East, Southeast Asia, and the Indian subcontinent

We divided the number of people with higher education[1] by the total population aged 25 or over within each race and expressed the result per 1,000 people (Figure 2).
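The arithmetic behind this per-1,000 figure is simple; a minimal sketch follows. The function name and the counts are hypothetical, chosen only to show the calculation, and are not the Census counts we worked with.

```python
def per_thousand(degree_holders, population_25_plus):
    """People with a bachelor's degree or higher per 1,000 people aged 25 or over."""
    if population_25_plus <= 0:
        raise ValueError("population must be positive")
    return 1000 * degree_holders / population_25_plus


# Hypothetical counts for one group, for illustration only.
rate = per_thousand(degree_holders=5_270, population_25_plus=10_000)
print(rate)
```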

We noted that Asians, although they constituted only ~5% of the population, were the most educated[1], with 527 per 1,000 people. Furthermore, they represented 10% of the total educated population[1].

After Asians, the Black population was the second most educated[1] minority (Figure 3), representing 9% of the total educated population; however, for every 1,000 educated White people there were only ~600 educated Black people.

For Hispanics and Latinos, the number of educated people was lower, at only 187 per 1,000. They represented 10% of the total educated population and, compared to the White population, there were only ~500 educated Latinos per 1,000 educated White people. However, this is the group that has shown the highest growth in the number of educated people over the last five years, with an increase of 24% between 2013 and 2018 (Figure 3).

Figure 2: No. of people with higher education per 1,000 at age 25 or over, 2018
Figure 3: Growth in educated population per race per 1,000 at age 25 or over, 2013 vs 2018

3. Massachusetts has the largest percentage of population with higher education (45%) and West Virginia the lowest (21%); both states are above the median of income inequality

Based on data from 2018, 33% of the US population aged 25 or over holds a bachelor’s degree or higher. Between 2013 and 2018, this figure increased by around 4% in total. However, the population with higher education is not spread evenly across the United States.

The Northeast region was above the national average share of population with higher education, while 70% of the southern states were below average. The graph below shows how the population with higher education is spread across states.

Figure 4: Percentage of population with higher education at age 25 or over by state, 2018

In addition, federal and state education spending per student has increased 23% in the last five years, but inequality has remained almost constant.

Prediction model and results

To create the prediction model, we used supervised machine learning algorithms in Python; that is, the algorithms learned from a pre-labeled training dataset.

We used data for every feature from 2015 to 2018 at the state & county level and defined two categorical targets (Figure 5):

(1) Below or equal to the median: States or counties are classified in this category if their Gini index is below or equal to the national median

(2) Above the median: States or counties are classified in this category if their Gini index is above the national median

Figure 5: Classification Illustrated
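The labeling rule above can be sketched in a few lines. This is an illustration under stated assumptions, not our original code: the function name and the region names are hypothetical, the Gini values are made up, and the cutoff here is simply the median of the values supplied.

```python
from statistics import median


def label_by_median(gini_by_region):
    """Assign class 1 (below or equal to the median) or class 2 (above the median)."""
    cutoff = median(gini_by_region.values())
    return {
        region: (2 if gini > cutoff else 1)
        for region, gini in gini_by_region.items()
    }


# Made-up Gini values for four hypothetical regions.
sample = {"A": 0.42, "B": 0.45, "C": 0.47, "D": 0.50}
labels = label_by_median(sample)
print(labels)
```

Because the cutoff is the median, the two classes come out roughly balanced, which makes plain accuracy a reasonable metric for comparing models.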

As mentioned before, the features considered were related to demographics, race, education, and federal & state spending on education (Table 1).

Table 1: Description of features considered

We randomly split the dataset into training & testing data and ran the algorithms with different combinations of features. We then compared their performance based on testing accuracy.
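For readers new to this workflow, here is a minimal sketch of a random split and of test accuracy using only the Python standard library. The helper names, the 25% test fraction, and the seed are assumptions for illustration; this post does not state the split ratio we used.

```python
import random


def train_test_split_rows(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and hold out a fraction for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(true_labels, predicted_labels):
    """Share of predictions that match the true labels."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)


rows = list(range(20))  # stand-in for 20 labeled observations
train, test = train_test_split_rows(rows)
print(len(train), len(test))
print(accuracy([1, 2, 2, 1], [1, 2, 1, 1]))
```

Measuring accuracy on the held-out rows, rather than on the rows used for training, is what makes the comparison between feature combinations meaningful.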

The accuracy of the models (Table 2) showed that a voting or random forest classifier considering only race features predicted income inequality with the highest test accuracy: 96%.

Table 2: Test accuracy by algorithm and features
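A comparison of this kind can be sketched with scikit-learn as follows. This is not our original code: the data is synthetic, the two feature columns are placeholders, and the estimators combined inside the voting classifier, along with every hyperparameter, are assumptions, since this post does not list them.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in data: two placeholder features and a binary target.
X = rng.uniform(0, 1, size=(400, 2))
noise = rng.normal(0, 0.05, size=400)
y = (X[:, 0] - 0.5 * X[:, 1] + noise > 0.25).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    # Assumed ensemble members; hard voting takes the majority class.
    "voting": VotingClassifier(
        estimators=[
            ("forest", RandomForestClassifier(n_estimators=100, random_state=0)),
            ("logit", LogisticRegression(max_iter=1000)),
            ("knn", KNeighborsClassifier(n_neighbors=5)),
        ],
        voting="hard",
    ),
}

# Fit each model on the training split and record its test accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)

print(scores)
```

Repeating the loop over different column subsets of `X` is one way to produce a grid of accuracies by algorithm and feature set.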

We reviewed in more detail the effect of the distinct races on the prediction. We identified the percentages of Black and White population as the features with the greatest impact on the classification of the states.

As seen in Figure 6, the states with more inequality (class 2, above the median of inequality) had a higher percentage of Black population, at least 8%.

Figure 6: Classification of states by percentage of black population

The opposite happened in states with a higher percentage of White population (Figure 7). The higher the percentage of White population, the lower the probability of a state being classified above the median of inequality.

Figure 7: Classification of states by percentage of white population

The results we gathered are consistent with current social concerns in the United States. Even though we were aware of these concerns from reading the news and from conversations with our friends and professional circles, seeing the data depict reality was a wake-up call.

We know governments, corporations, educational institutions, and other social entities are working to transform for the better. We hope the actions they are taking focus on bringing tangible solutions. However, we believe it is also on us to take a stance, to contribute in a way that matters, to look beyond ourselves, and to work towards the country we dream it to be.
