Exploratory and Predictive Data Analysis of the Adult income data-set

Kumar Shashwat
Published in Analytics Vidhya
4 min read · Jan 21, 2020

This data-set, hosted on the UCI ML repository, contains demographic details such as age, gender, and race for around 45,000 individuals. I’m going to share my approach for predicting whether an individual’s income exceeds 50K or not.

For this project, I used Python and Jupyter Notebook. In the beginning, I imported the relevant libraries (before God created the heaven and the earth). Apart from that, set the size of visualizations as per your choice. I’ve used a probabilistic method to tackle the categorical variables. The entire code of this project can be found here.

import pandas as pd
import numpy as np
import sklearn as sk
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams["figure.figsize"] = (20,10)

Let’s import the training data-set and get started. We’ll have to add the column labels manually, as this data-set doesn’t ship with a header row.

data = pd.read_csv('adult.data')
data.info()
data.columns = ['age','work-class','fnlwgt','education','edu-num','marital', 'occup','relatnip','race','sex','gain','loss','hours','citizenship','>50k']

The output of the above cell. It can be inferred that we don’t have to worry too much about missing values.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32560 entries, 0 to 32559
Data columns (total 15 columns):
39 32560 non-null int64
State-gov 32560 non-null object
77516 32560 non-null int64
Bachelors 32560 non-null object
13 32560 non-null int64
Never-married 32560 non-null object
Adm-clerical 32560 non-null object
Not-in-family 32560 non-null object
White 32560 non-null object
Male 32560 non-null object
2174 32560 non-null int64
0 32560 non-null int64
40 32560 non-null int64
United-States 32560 non-null object
<=50K 32560 non-null object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
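A side note on the column names in the output above (39, State-gov, 77516, …): by default, pd.read_csv treats the first row of the file as a header, so the first individual’s record becomes the column labels and is lost (hence 32560 entries rather than 32561). A minimal sketch of a safer read, passing the labels up front via names — shown here on two inline sample rows rather than the actual file:

```python
import io
import pandas as pd

cols = ['age', 'work-class', 'fnlwgt', 'education', 'edu-num', 'marital',
        'occup', 'relatnip', 'race', 'sex', 'gain', 'loss', 'hours',
        'citizenship', '>50k']

# Two sample rows in the adult.data format (the file has no header line).
sample = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical,"
    " Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse,"
    " Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K\n"
)

# header=None keeps the first row as data; names assigns the labels up front.
data = pd.read_csv(sample, header=None, names=cols)
print(len(data))  # both rows survive
```

In the real project, you would pass the path to adult.data instead of the StringIO object.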

Since the last column has an object type, we need to figure out the unique labels used and then map them to an integer type.

data['>50k'].unique()

The Corresponding Output:

array([' <=50K', ' >50K'], dtype=object)

Now we’ll map the label corresponding to an income less than or equal to 50k to 0 and the other one to 1. We’ll also get the info again and check if the column’s datatype has been changed to int.

data['>50k'] = data['>50k'].map({' <=50K':0,' >50K':1})
data.info()

As you can infer from the output, the datatype of this column has been changed to a 64-bit integer type.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32560 entries, 0 to 32559
Data columns (total 15 columns):
age 32560 non-null int64
work-class 32560 non-null object
fnlwgt 32560 non-null int64
education 32560 non-null object
edu-num 32560 non-null int64
marital 32560 non-null object
occup 32560 non-null object
relatnip 32560 non-null object
race 32560 non-null object
sex 32560 non-null object
gain 32560 non-null int64
loss 32560 non-null int64
hours 32560 non-null int64
citizenship 32560 non-null object
>50k 32560 non-null int64
dtypes: int64(7), object(8)
memory usage: 3.7+ MB

Now, I wrote a function aimed at getting the probability of income exceeding 50K for each value of a categorical variable. For example: what’s the probability of income exceeding 50K for each sex, each citizenship status, and so on? To accomplish this, I calculated the mean of the ‘>50k’ column within each group. Since the only possible values are 0 and 1, the mean corresponds to the probability of getting 1.

def rep_mean(var, data):
    """Return {category: P(>50k | category)} for the given column."""
    d = dict()
    for obj in data[var].unique():
        d[obj] = data[data[var] == obj]['>50k'].mean()
    return d

Let’s check this function on the work-class column, as it happens to be a categorical one.

temp1 = rep_mean('work-class',data)
temp1

The output returns a dictionary.

{' Self-emp-not-inc': 0.2849271940181031,
' Private': 0.21867289390200917,
' State-gov': 0.27216653816499614,
' Federal-gov': 0.38645833333333335,
' Local-gov': 0.29479216435738176,
' ?': 0.10403050108932461,
' Self-emp-inc': 0.557347670250896,
' Without-pay': 0.0,
' Never-worked': 0.0}
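As an aside, pandas can produce the same dictionary in one line with groupby — the mean of a 0/1 column within each group is exactly the conditional probability. A minimal sketch on a toy frame (on the real data the call would be data.groupby('work-class')['>50k'].mean().to_dict()):

```python
import pandas as pd

# Toy frame mimicking the data: a categorical column and the 0/1 target.
df = pd.DataFrame({
    'work-class': ['Private', 'Private', 'State-gov', 'Private', 'State-gov'],
    '>50k':       [0, 1, 1, 0, 0],
})

# Mean of the 0/1 target per category = P(>50k | category).
probs = df.groupby('work-class')['>50k'].mean().to_dict()
print(probs)
```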

This dictionary can be mapped to the work-class column.

data['work-class'] = data['work-class'].map(temp1)
data.info()

As you can infer from the output, its datatype has been changed to a 64-bit floating-point type.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32560 entries, 0 to 32559
Data columns (total 15 columns):
age 32560 non-null int64
work-class 32560 non-null float64
fnlwgt 32560 non-null int64
education 32560 non-null object
edu-num 32560 non-null int64
marital 32560 non-null object
occup 32560 non-null object
relatnip 32560 non-null object
race 32560 non-null object
sex 32560 non-null object
gain 32560 non-null int64
loss 32560 non-null int64
hours 32560 non-null int64
citizenship 32560 non-null object
>50k 32560 non-null int64
dtypes: float64(1), int64(7), object(7)
memory usage: 3.7+ MB

I did the same for the other categorical (‘object’) columns. The entire implementation can be found here. I also created a few derived variables by adding the values of some columns and by summing the encoded probabilities weighted by their correlation coefficients.

data['net'] = data['gain']+data['loss']
x = data.corr()
x

I chose not to display the correlation table here, as it wasn’t rendering clearly. I grouped the variables below together because they were fairly correlated with one another.
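To pick which columns to combine, one can sort the correlations against the target rather than scanning the whole table. A small sketch on a toy frame (hypothetical values, not the real data-set):

```python
import pandas as pd

# Toy frame: probability-encoded features plus the 0/1 target.
df = pd.DataFrame({
    'education': [0.2, 0.5, 0.7, 0.3],
    'occup':     [0.1, 0.6, 0.8, 0.2],
    'hours':     [40, 50, 60, 35],
    '>50k':      [0, 1, 1, 0],
})

# Correlation of every column with the target, strongest first.
corr_with_target = df.corr()['>50k'].drop('>50k').sort_values(ascending=False)
print(corr_with_target)
```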

data['thresh1'] = 1000*(0.368866*data['education'] + 0.351885*data['occup'])
data['thresh2'] = 1000*(0.447396*data['marital'] + 0.453578*data['relatnip'])

I tested a lot of algorithms and finally chose Gradient Boosting, even though it didn’t score the best on the training data. However, it doesn’t suffer from overfitting the way Decision Trees and Random Forests do. The entire code and output can be found here on GitHub.
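For reference, a minimal sketch of this final step with scikit-learn’s GradientBoostingClassifier — shown on synthetic features, since the full pipeline (the probability encoding plus the threshold columns) lives in the notebook:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded feature matrix and the 0/1 target.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

clf = GradientBoostingClassifier(random_state=42)
clf.fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))
print(f"held-out accuracy: {acc:.3f}")
```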
