Adult Census Income — Analysis

Ali Asghar Aamir · Published in Data Warriors · 10 min read · Nov 11, 2019
source: https://justcreative.com/2019/07/16/passive-income-ideas/

In ancient times, the ability to predict the future was called precognition. Nowadays we call it machine learning. The rapid improvement in computer performance and an increase in storage abilities have allowed us to dabble in this art.

I recently stumbled across this dataset and thought of exercising my computational witchcraft abilities. This dataset intrigued me because of its diversity and richness — data from a person’s level of education to their spouse being in the Armed Forces.

However, there is one big issue; this dataset is fairly old. It was extracted from the 1994 Census bureau database. Although I might not be able to apply my conclusions here to the current generation, it would be a good exercise for my machine learning spells.

The dataset contains information about the annual incomes of people from 42 different countries, but the vast majority (90%) of entries are from the United States. The runner-up is Mexico at 2%, leaving only 8% for the other 40 countries.

Therefore, I thought of fine-tuning my spells by filtering the dataset to only include the United States.

Now let’s begin our adventure.

Motivation:

The main objective of our project is to predict whether a given person, based on their attributes, earns more than $50k per annum.

Simple, right?

So, let’s dig right in…

We begin by cleaning the data, then move on to exploratory data analysis of the dataset. Following that, we prepare the data for our machine learning model and train the model using that data.

Sorry for the bombardment, but let’s try to keep it as simple as possible and break it down into parts.

1. The Dataset:

The dataset contains 32,561 entries with a total of 15 columns representing different attributes of the people. Here’s the list:

  1. Age: Discrete (from 17 to 90)
  2. Work class (Private, Federal-Government, etc.): Nominal (9 categories)
  3. Final Weight (the number of people the census believes the entry represents): Discrete
  4. Education (the highest level of education obtained): Ordinal (16 categories)
  5. Education Number (the number of years of education): Discrete (from 1 to 16)
  6. Marital Status: Nominal (7 categories)
  7. Occupation (Transport-Moving, Craft-Repair, etc.): Nominal (15 categories)
  8. Relationship in family (Unmarried, Not-in-family, etc.): Nominal (6 categories)
  9. Race: Nominal (5 categories)
  10. Sex: Nominal (2 categories)
  11. Capital Gain: Continuous
  12. Capital Loss: Continuous
  13. Hours (worked) per week: Discrete (from 1 to 99)
  14. Native Country: Nominal (42 countries)
  15. Income (whether or not an individual makes more than $50,000 annually): Boolean (≤$50k, >$50k)
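In case you want to follow along, here’s a minimal sketch of loading the data with pandas. The file name adult.csv and the dot-separated column names (like native.country) are assumptions based on the Kaggle release of this dataset; adjust them to match your copy.

```python
import pandas as pd

# Load the 1994 census extract; "adult.csv" is an assumed file name
df = pd.read_csv("adult.csv")

print(df.shape)              # (32561, 15)
print(df.columns.tolist())

# Optionally restrict to United States entries, as discussed above
us_df = df[df["native.country"] == "United-States"]
```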

2. Data Cleaning:

To start off, the data seems to be already partially pre-processed: missing values are consistently denoted by a question mark (i.e. “?”), and there are no null values in any of the columns.

Missing Values: Missing values are represented by “?” in this dataset. Let’s check how many of those question marks each column has.

Counting Missing Values
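The original snippet is shown as an image; a pandas one-liner along these lines does the same job, assuming the df frame loaded above.

```python
# "?" marks a missing value; count the occurrences per column
question_marks = (df == "?").sum()
print(question_marks[question_marks > 0])
```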

Okay, so we have our results. There are three columns with some missing values:

  • workclass = 1836 missing
  • occupation = 1843 missing
  • native.country = 583 missing

Hmm, the counts of missing values in the workclass and occupation columns seem pretty close. That can’t be a coincidence, can it? Let’s put our detective hats on again:

Number of data points with missing occupation and workclass

Since the intersection of data points with missing occupation and missing workclass equals the number of data points with missing workclass, we know that wherever workclass is missing, occupation is missing too.
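Here’s a quick sketch of how that overlap can be verified; the counts in the comments are the ones reported above.

```python
workclass_missing = df["workclass"] == "?"
occupation_missing = df["occupation"] == "?"

print(workclass_missing.sum())   # 1836
print(occupation_missing.sum())  # 1843

# Rows missing both values — equal to the workclass count above
print((workclass_missing & occupation_missing).sum())  # 1836
```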

Now we need a strategy to deal with the missing values. (Note: We’ll only show the method for dealing with missing values in the workclass column. However, the same strategies will apply to the other two columns with missing values.)

Dealing with Missing Values:

Method 1: Boolean Column

Marking data points with missing workclass

We make a separate column (workclass.missing) for missing values in workclass, marking 1 (true) if workclass is missing and 0 if it is not. This lets the machine learning algorithm learn when the value is missing, so it can assign less weight to the workclass column for such rows (e.g. when there is a 1 in the “workclass.missing” column, the algorithm can discount the corresponding workclass value).
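In code, Method 1 is essentially a one-liner; workclass.missing is the column name used above.

```python
# Method 1: flag rows whose workclass is missing (1 if "?", 0 otherwise)
df["workclass.missing"] = (df["workclass"] == "?").astype(int)
```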

Method 2: Machine Learning

Using machine learning to predict the missing values

We use machine learning to predict the missing values.

First, we separate the data points which have missing values for workclass (“test_data”) from those which do not (“train_data”). We then fit a model on “train_data” and use it on “test_data” to predict the missing values in the workclass column.

We will use different machine learning algorithms: Logistic Regression, Random Forest Classifier, KNeighbors Classifier, and Decision Tree Classifier. For now, we are using predictions made via Random Forest Classifier. However, we may change our choice of the algorithm depending on future results.

Random forest classifier
Decision Tree classifier
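Since the snippets above are images, here’s a minimal sketch of the Random Forest variant. The one-hot encoding step and the scikit-learn defaults are assumptions, not necessarily what the original code used.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# One-hot encode every column except the one we are imputing
features = pd.get_dummies(df.drop(columns=["workclass", "workclass.missing"],
                                  errors="ignore"))

train_mask = df["workclass"] != "?"
train_data = features[train_mask]    # rows with a known workclass
test_data = features[~train_mask]    # rows where workclass is "?"

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(train_data, df.loc[train_mask, "workclass"])

# Replace the "?" entries with the model's predictions
df.loc[~train_mask, "workclass"] = rf.predict(test_data)
```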

We haven’t calculated the accuracy of these algorithms yet. We will do that later on.

3. Exploratory Data Analysis:

So now we have the cleaned data. Let’s jump right into exploratory data analysis to observe the relationship between income and other variables.

Correlations: We start by computing the correlation of income with the other numerical variables. Since the income column is not numerical, we one-hot encode it using dummy variables.
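A sketch of how such a heatmap can be produced, assuming the income column holds the strings “<=50K” and “>50K” as in the Kaggle release.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Dummy-encode income into a 0/1 column so it can join the correlation matrix
numeric = df.select_dtypes(include="number").copy()
numeric["income>50K"] = (df["income"] == ">50K").astype(int)

sns.heatmap(numeric.corr(), annot=True, cmap="coolwarm")
plt.show()
```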

Heatmap: Correlations between numerical columns

Interesting. Let’s analyze the correlations:

  • Hours per week has a weak positive linear relationship with income (correlation coefficient 0.23)
  • Education number has a weak positive linear relationship with income (correlation coefficient 0.34)
  • Age has a weak positive linear relationship with income (correlation coefficient 0.23)

This makes sense. A person working more hours per week would likely earn more. Similarly, a person with more education would earn more. Age could be a factor too, since there won’t be many 17-year-olds earning more than $50k a year.

Let’s further analyze these variables.

Age vs Income

We can observe that the median age of people earning more than $50k is significantly greater than that of people earning less than $50k. So, older people are more likely than their younger counterparts to earn more than $50k a year.
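The median comparison can be reproduced with a simple groupby; this sketch assumes the df frame from earlier.

```python
import matplotlib.pyplot as plt

# Median age per income group
print(df.groupby("income")["age"].median())

# Boxplot mirroring the figure above
df.boxplot(column="age", by="income")
plt.show()
```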

Let’s move to the relationship of hours worked per week with income.

Hours worked per week vs Income

This plot shows that people who put more time per week into their work appear to earn more. Additionally, there are many outliers in both groups, which indicates high variation within each group (which makes sense, since some hectic jobs pay less and some easy jobs pay more). The interquartile range is also much smaller for those who earn less, meaning that people who earn less than $50k per year show less spread in the hours they work per week.

Moving on, we observe the trend of income against the education level which gives us interesting results.

Education:

People with a college degree earn more than people without one
Rate of proportion earning >$50k increases after 12 years of education

We see two interesting things here:

  1. Only a small proportion of people with less than 12 years of education earn more than $50k a year. This proportion increases almost linearly after 12 years of education.
  2. The intersection of the two lines indicates that after 14 years of education, more than 50% of the people earn >$50k a year.
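A sketch of how the proportion curve above can be computed; education.num is the Kaggle name for the years-of-education column.

```python
import matplotlib.pyplot as plt

# Proportion of people earning >$50k at each years-of-education level
prop = (df["income"] == ">50K").groupby(df["education.num"]).mean()

prop.plot(marker="o")
plt.xlabel("Years of education")
plt.ylabel("Proportion earning >$50k")
plt.show()
```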

So, apart from a college degree, what else can you do to increase your chances of earning >$50k?

Occupation and Workclass:

Almost 50% of Executive Managers and Professors earn >$50k

Start by going for an executive managerial position or becoming a professor (hopefully with a specialization).

We know what occupation to choose, now let’s see which workclass earns more.

Income vs work class

From the plot, a positive association can be seen between earning more and being self-employed with an incorporated business or working in a bureaucratic position. Time to start your own business!

You are now the executive manager of your own company, should you also get married?

Marital Status:

From the graph above (income against marital status), we see that people married to someone in the armed forces are likely to have an income exceeding $50k. But only 47 people had a spouse in the armed forces, so the sample size is too small to draw any conclusions.

So, being the data scientists we are, we decided to group people into married and unmarried, for more plausible associations.

(Keep in mind, we did not count people who are married but not currently living with their spouse as part of the married group.)
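A sketch of that grouping; the marital.status categories in the code are the ones found in the standard version of this dataset.

```python
# Spouse-present categories count as married; "Married-spouse-absent"
# deliberately stays on the unmarried side, per the note above
married = {"Married-civ-spouse", "Married-AF-spouse"}
df["married"] = df["marital.status"].isin(married).astype(int)
```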

Only a small proportion of unmarried people earn >$50k a year

Add being married to the list of things that could possibly lead you to earn >$50k a year (happy wife, happy life?)

Moving on, we do the same for income vs race.

Race:

Income vs Race
White people are almost twice as likely to earn >$50k as compared to non-white people

Next, let’s compare the differences in income between genders.

Gender:

Men are thrice as likely to earn >$50k a year

Men are far more likely than their female counterparts to earn more than $50k a year. More work could be done with this data to analyze the reasons behind this income inequality, but we leave that for another day.

We have seen the relation of different attributes with the income level. So what combinations of groups will allow you to buy that Lamborghini you’ve always wanted?

The holy-grail:

  • Above Bachelors and hard workers (working more than 40hrs per week)
  • White and male
  • White and self-employed male

So, boys and girls, make sure to be white and male, or white and male AND self-employed to drive that Lamborghini.

If that’s not possible, just get a Master’s degree or above, and work hard. Easy Peasy :)

And voilà, this concludes our EDA segment. I hope you guys had fun. Bye. No, wait! Sorry for the bad humor. We’ve got a lot of stuff left, so hold on tight and let’s move on to the machine learning module.

4. Machine Learning:

We performed multiple iterations of machine learning on our data. We added extra columns for the new features (obtained from the combinations of columns shown above) and tried dropping columns that seemed related (i.e. education number and education level). However, we did not see a significant difference in our results. Here we show our results on three cleaned versions of the data: 1. missing values dropped, 2. identifier columns added for missing values, and 3. missing values predicted.

We used multiple classifier algorithms for predicting the income level on all three versions of our cleaned data. Below we show the results from the best classifier for each.

1. Results with dropped missing values:
The AdaBoost Classifier gave the best result for the data with rows containing missing values dropped. Note: across all classifier algorithms, dropping missing values gave the worst accuracy.
Prediction Using AdaBoost on data with dropped rows

2. With an extra column for missing values (Boolean column):
The AdaBoost Classifier gave the best results for the data in which we had added identifier columns for missing values.

Prediction Using AdaBoost on data with identifier columns

3. With predicted missing values (using Classifiers):

We started off with logistic regression but did not get a decent result, so we had to employ other classifiers:

List of classifiers tested
Accuracy of individual classifiers
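The comparison loop might look something like this. The 80/20 split, the one-hot encoding, and the default hyperparameters are assumptions for the sake of a runnable sketch.

```python
import pandas as pd
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Encode features and target from the cleaned frame
X = pd.get_dummies(df.drop(columns=["income"]))
y = (df["income"] == ">50K").astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(),
    "KNeighbors": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(),
    "AdaBoost": AdaBoostClassifier(),
    "Gradient Boosting": GradientBoostingClassifier(),
}

# Fit each classifier and report its held-out accuracy
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(f"{name}: {clf.score(X_test, y_test):.3f}")
```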

From the plot above, it can be seen that the highest accuracy was obtained with the Gradient Boosting Classifier.

Gradient Boosting Classifier on data with predicted missing values
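A sketch of the final evaluation, reusing the split from the previous sketch; scikit-learn’s classification_report conveniently prints the precision and recall discussed below.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report

gbc = GradientBoostingClassifier(random_state=0)
gbc.fit(X_train, y_train)

# classification_report prints precision, recall, and f1 per class
print(classification_report(y_test, gbc.predict(X_test)))
```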

These results are pretty good compared to older work done on this dataset. The best accuracy we could find reported for it was 86%, and our model gave slightly better accuracy. Our precision and recall are not the best, though, and more work needs to be done to improve them.

Moving on, it can be seen that across all classifiers, predicting the missing values gave us the best accuracy. However, with the Gradient Boosting Classifier, almost all cleaning methods gave similar accuracy.

Best accuracy obtained from different methods of dealing with missing values

And finally, ladies and gentlemen, thank you for tuning in, we hope you had a fun and learning experience. Until next time. Godspeed!

;)
