Exploratory and Predictive Data Analysis of the Adult income data-set

Kumar Shashwat
Published in Analytics Vidhya
4 min read · Jan 21, 2020

This data-set, hosted on the UCI ML repository, contains demographic details such as age, gender, and race for around 45,000 individuals. I’m going to share my approach for predicting whether an individual’s income exceeds 50K or not.

For this project, I used Python and Jupyter Notebook. In the beginning, I imported the relevant libraries (before God created the heaven and the earth). Apart from that, set the size of visualizations as per your choice. I’ve used a probabilistic method to tackle the categorical variables. The entire code of this project can be found here.

import pandas as pd
import numpy as np
import sklearn as sk
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams["figure.figsize"] = (20,10)

Let’s import the training data-set and get started. We’ll have to add the column labels manually, as this data-set doesn’t ship with a header row.

data = pd.read_csv('adult.data')
data.info()
data.columns = ['age','work-class','fnlwgt','education','edu-num','marital', 'occup','relatnip','race','sex','gain','loss','hours','citizenship','>50k']

The output of the above cell. It can be inferred that we don’t have to worry too much about missing values.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32560 entries, 0 to 32559
Data columns (total 15 columns):
39 32560 non-null int64
State-gov 32560 non-null object
77516 32560 non-null int64
Bachelors 32560 non-null object
13 32560 non-null int64
Never-married 32560 non-null object
Adm-clerical 32560 non-null object
Not-in-family 32560 non-null object
White 32560 non-null object
Male 32560 non-null object
2174 32560 non-null int64
0 32560 non-null int64
40 32560 non-null int64
United-States 32560 non-null object
<=50K 32560 non-null object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
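A side note on the column names in the output above (39, State-gov, 77516, …): by default, pd.read_csv treats the first row of the file as a header, so the first individual’s record becomes the column labels and is lost (hence 32560 entries rather than 32561). A minimal sketch of a safer read, passing the labels up front via names — shown here on two inline sample rows rather than the actual file:

```python
import io
import pandas as pd

cols = ['age', 'work-class', 'fnlwgt', 'education', 'edu-num', 'marital',
        'occup', 'relatnip', 'race', 'sex', 'gain', 'loss', 'hours',
        'citizenship', '>50k']

# Two sample rows in the adult.data format (the file has no header line).
sample = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical,"
    " Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse,"
    " Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K\n"
)

# header=None keeps the first row as data; names assigns the labels up front.
data = pd.read_csv(sample, header=None, names=cols)
print(len(data))  # both rows survive
```

In the real project, you would pass the path to adult.data instead of the StringIO object.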

Since the last column has an object type, we need to figure out the unique labels used and then map them to an integer type.

data['>50k'].unique()

The Corresponding Output:

array([' <=50K', ' >50K'], dtype=object)

Now we’ll map the label corresponding to an income less than or equal to 50k to 0 and the other one to 1. We’ll also get the info again and check if the column’s datatype has been changed to int.

data['>50k'] = data['>50k'].map({' <=50K':0,' >50K':1})
data.info()

As you can infer from the output, the datatype of this column has been changed to a 64-bit integer type.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32560 entries, 0 to 32559
Data columns (total 15 columns):
age 32560 non-null int64
work-class 32560 non-null object
fnlwgt 32560 non-null int64
education 32560 non-null object
edu-num 32560 non-null int64
marital 32560 non-null object
occup 32560 non-null object
relatnip 32560 non-null object
race 32560 non-null object
sex 32560 non-null object
gain 32560 non-null int64
loss 32560 non-null int64
hours 32560 non-null int64
citizenship 32560 non-null object
>50k 32560 non-null int64
dtypes: int64(7), object(8)
memory usage: 3.7+ MB

Now, I wrote a function aimed at getting the probability of income exceeding 50K for each value of a categorical variable. For example: what’s the probability of income exceeding 50K for each sex, each citizenship status, and so on? To accomplish this, I calculated the mean of the ‘>50k’ column within each group. Since the only possible values are 0 and 1, the mean corresponds to the probability of getting 1.

def rep_mean(var, data):
    """Return {category: P(>50k | category)} for the given column."""
    d = dict()
    for obj in data[var].unique():
        d[obj] = data[data[var] == obj]['>50k'].mean()
    return d

Let’s check this function on the work-class column, as it happens to be a categorical one.

temp1 = rep_mean('work-class',data)
temp1

The output returns a dictionary.

{' Self-emp-not-inc': 0.2849271940181031,
' Private': 0.21867289390200917,
' State-gov': 0.27216653816499614,
' Federal-gov': 0.38645833333333335,
' Local-gov': 0.29479216435738176,
' ?': 0.10403050108932461,
' Self-emp-inc': 0.557347670250896,
' Without-pay': 0.0,
' Never-worked': 0.0}
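As an aside, pandas can produce the same dictionary in one line with groupby — the mean of a 0/1 column within each group is exactly the conditional probability. A minimal sketch on a toy frame (on the real data the call would be data.groupby('work-class')['>50k'].mean().to_dict()):

```python
import pandas as pd

# Toy frame mimicking the data: a categorical column and the 0/1 target.
df = pd.DataFrame({
    'work-class': ['Private', 'Private', 'State-gov', 'Private', 'State-gov'],
    '>50k':       [0, 1, 1, 0, 0],
})

# Mean of the 0/1 target per category = P(>50k | category).
probs = df.groupby('work-class')['>50k'].mean().to_dict()
print(probs)
```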

This dictionary can be mapped to the work-class column.

data['work-class'] = data['work-class'].map(temp1)
data.info()

As you can infer from the output, its datatype has been changed to a 64-bit floating-point type.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32560 entries, 0 to 32559
Data columns (total 15 columns):
age 32560 non-null int64
work-class 32560 non-null float64
fnlwgt 32560 non-null int64
education 32560 non-null object
edu-num 32560 non-null int64
marital 32560 non-null object
occup 32560 non-null object
relatnip 32560 non-null object
race 32560 non-null object
sex 32560 non-null object
gain 32560 non-null int64
loss 32560 non-null int64
hours 32560 non-null int64
citizenship 32560 non-null object
>50k 32560 non-null int64
dtypes: float64(1), int64(7), object(7)
memory usage: 3.7+ MB

I did the same for the other categorical (‘object’) columns. The entire implementation can be found here. I also created a few derived variables by adding the values of some columns and by summing the encoded probabilities weighted by their correlation coefficients.

data['net'] = data['gain']+data['loss']
x = data.corr()
x

I chose not to display the correlation table here, as it wasn’t rendering clearly. I grouped the variables below together because they were fairly correlated with one another.
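To pick which columns to combine, one can sort the correlations against the target rather than scanning the whole table. A small sketch on a toy frame (hypothetical values, not the real data-set):

```python
import pandas as pd

# Toy frame: probability-encoded features plus the 0/1 target.
df = pd.DataFrame({
    'education': [0.2, 0.5, 0.7, 0.3],
    'occup':     [0.1, 0.6, 0.8, 0.2],
    'hours':     [40, 50, 60, 35],
    '>50k':      [0, 1, 1, 0],
})

# Correlation of every column with the target, strongest first.
corr_with_target = df.corr()['>50k'].drop('>50k').sort_values(ascending=False)
print(corr_with_target)
```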

data['thresh1'] = 1000*(0.368866*data['education'] + 0.351885*data['occup'])
data['thresh2'] = 1000*(0.447396*data['marital'] + 0.453578*data['relatnip'])

I tested a lot of algorithms and finally chose Gradient Boosting, even though it didn’t score the best on the training data. However, it doesn’t suffer from overfitting the way Decision Trees and Random Forests do. The entire code and output can be found here on GitHub.
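For reference, a minimal sketch of this final step with scikit-learn’s GradientBoostingClassifier — shown on synthetic features, since the full pipeline (the probability encoding plus the threshold columns) lives in the notebook:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded feature matrix and the 0/1 target.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

clf = GradientBoostingClassifier(random_state=42)
clf.fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))
print(f"held-out accuracy: {acc:.3f}")
```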
