Adult Census Income dataset: Using multiple machine learning models

Ada Johnson · Published in Analytics Vidhya · Sep 27, 2020 · 6 min read

We have all heard that data science is the ‘sexiest job of the 21st century’. It may therefore be surprising that the concept of neural networks was laid down over half a century ago, long before the world was flooded with data. Even before the term ‘machine learning’ was coined, Donald Hebb created a model based on brain cell interaction in his 1949 book ‘The Organization of Behavior’. The book presents Hebb’s theories on neuron excitement and communication between neurons.

Hebb wrote, “When one cell repeatedly assists in firing another, the axon of the first cell develops synaptic knobs (or enlarges them if they already exist) in contact with the soma of the second cell.” Translating Hebb’s concepts to artificial neural networks, his model can be described as a way of altering the relationships between artificial neurons (also referred to as nodes) as individual neurons change. Arthur Samuel of IBM coined the phrase “machine learning” in 1959.

Analyzing the data

The Adult Census Income dataset is available on Kaggle and in the UCI Machine Learning Repository. The data was extracted from the 1994 Census Bureau database by Ronny Kohavi and Barry Becker (Data Mining and Visualization, Silicon Graphics). The prediction task is to determine whether a person makes over $50K a year.

Dataset: https://www.kaggle.com/uciml/adult-census-income

Using Python and several visualizations, I have attempted to fit four machine learning models and find the one that best describes the data.

There are three steps to working with data: Data, Discovery, and Deployment.

DATA
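A minimal sketch of loading the data with pandas (the file name adult.csv is an assumption; use whatever name your Kaggle download has):

import pandas as pd

# Load the Kaggle CSV; missing values appear as the literal string '?'
df = pd.read_csv('adult.csv')
print(df.head())

The first five rows look like this: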

   age workclass  fnlwgt     education  education.num marital.status
0   90         ?   77053       HS-grad              9        Widowed
1   82   Private  132870       HS-grad              9        Widowed
2   66         ?  186061  Some-college             10        Widowed
3   54   Private  140359       7th-8th              4       Divorced
4   41   Private  264663  Some-college             10      Separated

          occupation   relationship   race     sex  capital.gain
0                  ?  Not-in-family  White  Female             0
1    Exec-managerial  Not-in-family  White  Female             0
2                  ?      Unmarried  Black  Female             0
3  Machine-op-inspct      Unmarried  White  Female             0
4     Prof-specialty      Own-child  White  Female             0

   capital.loss  hours.per.week native.country income
0          4356              40  United-States  <=50K
1          4356              18  United-States  <=50K
2          4356              40  United-States  <=50K
3          3900              40  United-States  <=50K
4          3900              40  United-States  <=50K

DISCOVERY

Data preprocessing


The discovery phase is where we attempt to understand the data. It may require cleaning, transformation, and integration. The following snippets walk through the data preprocessing steps.

The dataset contained missing values in both numerical and categorical columns. The categorical values were both nominal and ordinal, and the data had redundant columns as well.

Since the missing values were represented by ‘?’, they were replaced with NaN and the affected rows were removed. The dependent column, ‘income’, which is to be predicted, was mapped to 0 and 1, converting the problem to a dichotomous classification problem. One redundant column, ‘education.num’, an ordinal representation of ‘education’, was removed as well.
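A minimal sketch of these steps in pandas, continuing from the loading snippet above:

import numpy as np

# Replace the '?' placeholders with NaN and drop the affected rows
df = df.replace('?', np.nan).dropna()

# Encode the target: 1 for '>50K', 0 for '<=50K'
df['income'] = (df['income'] == '>50K').astype(int)

# Drop the redundant ordinal copy of 'education'
df = df.drop(columns=['education.num'])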

Now that unnecessary data points and redundant attributes have been removed, the next step is to select the set of attributes that actually contribute to predicting income.

To check the correlation between a binary variable and a continuous variable, the point-biserial correlation has been used. After applying the test, ‘fnlwgt’, which showed a negative correlation, was dropped.
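A sketch of the test with scipy, using the 0/1 income column created above (the list of numerical columns is taken from the dataset preview):

from scipy.stats import pointbiserialr

numeric_cols = ['age', 'fnlwgt', 'capital.gain', 'capital.loss', 'hours.per.week']
for col in numeric_cols:
    r, p = pointbiserialr(df['income'], df[col])
    print(f'{col}: r={r:.3f}, p={p:.3g}')

# 'fnlwgt' shows a negative correlation with income, so drop it
df = df.drop(columns=['fnlwgt'])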

For feature selection, all the numerical columns are kept except ‘fnlwgt’. For the categorical variables, the chi-square statistic is used; it measures the association between two categorical variables.

First, the categorical variables are one-hot encoded (dummies are generated) and the numerical values are normalized to the range [0, 1]. This simply puts all the features on the same scale: if the scales of different features are wildly different, it can have a knock-on effect on the model’s ability to learn (depending on the method used). Putting features on a common scale implicitly weights them all equally.
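A sketch of the encoding and min-max scaling with pandas and scikit-learn, continuing from the frame above (the column lists are assumptions based on the dataset preview):

from sklearn.preprocessing import MinMaxScaler

categorical_cols = ['workclass', 'education', 'marital.status', 'occupation',
                    'relationship', 'race', 'sex', 'native.country']
numeric_cols = ['age', 'capital.gain', 'capital.loss', 'hours.per.week']

# One-hot encode the categorical columns
X = pd.get_dummies(df.drop(columns=['income']), columns=categorical_cols)
y = df['income']

# Scale the numerical columns to [0, 1]
X[numeric_cols] = MinMaxScaler().fit_transform(X[numeric_cols])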

There were 103 attributes in total, including the numerical variables. After feature selection, 65 attributes remain.
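One way to run the chi-square selection with scikit-learn (a sketch; k=65 simply mirrors the final count reported above, and the selector is applied to the full encoded matrix for simplicity):

from sklearn.feature_selection import SelectKBest, chi2

# chi2 needs non-negative inputs, which holds after one-hot encoding
# and min-max scaling
selector = SelectKBest(chi2, k=65).fit(X, y)
X = X[X.columns[selector.get_support()]]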

This dataset is a typical example of class imbalance, as the following charts show.

The pie chart clearly shows that more than 50% of the dataset belongs to a single class. This problem is handled using SMOTE (Synthetic Minority Oversampling Technique).
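A sketch using the imbalanced-learn package; note that in practice SMOTE is usually applied to the training split only, so treat this as an illustration of the call rather than a full pipeline:

from collections import Counter
from imblearn.over_sampling import SMOTE

# Oversample the minority class until both classes are balanced
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y_resampled))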

DEPLOYMENT

As mentioned above, four models are fitted below. The data is split into training and test sets in an 80–20 ratio for logistic regression and naive Bayes, and 70–30 for the decision tree and random forest.
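A sketch of the two splits with scikit-learn (the random_state values are arbitrary choices, not taken from the original notebook):

from sklearn.model_selection import train_test_split

# 80/20 split for logistic regression and naive Bayes
X_train80, X_test20, y_train80, y_test20 = train_test_split(
    X_resampled, y_resampled, test_size=0.2, random_state=42)

# 70/30 split for the decision tree and random forest
X_train70, X_test30, y_train70, y_test30 = train_test_split(
    X_resampled, y_resampled, test_size=0.3, random_state=42)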

Logistic Regression

The sigmoid function in orange

The foremost model for predicting a dichotomous variable is logistic regression. The logistic function is a sigmoid, σ(t) = 1 / (1 + e^(-t)), which takes any real input t and outputs a value between zero and one that can be interpreted as a probability.

After fitting the model, we compute its accuracy. I generated the confusion matrix, and the model does reasonably well. All the models are compared at the end.
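A sketch of the fit with scikit-learn, using the 80/20 split:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train80, y_train80)

y_pred = logreg.predict(X_test20)
print(accuracy_score(y_test20, y_pred))
print(confusion_matrix(y_test20, y_pred))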

Naive Bayes

A naive Bayes classifier assumes that the presence (or absence) of a particular feature is unrelated to the presence (or absence) of any other feature, given the class variable. It is “naive” because this independence assumption rarely holds exactly in practice.
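A sketch using Gaussian naive Bayes; the variant actually used is not stated, so GaussianNB here is an assumption:

from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()
nb.fit(X_train80, y_train80)
print(nb.score(X_test20, y_test20))   # test accuracy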

Decision Tree

A decision tree is a branched flowchart showing multiple pathways for potential decisions and outcomes. The tree starts with what is called a decision node, which signifies that a decision must be made. From the decision node, a branch is created for each of the alternative choices under consideration.
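A sketch with scikit-learn on the 70/30 split, using default hyperparameters since none are given:

from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train70, y_train70)
print(tree.score(X_test30, y_test30))   # test accuracy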

Random Forest

Random Forests are a combination of tree predictors where each tree depends on the values of a random vector sampled independently with the same distribution for all trees in the forest. The basic principle is that a group of “weak learners” can come together to form a “strong learner”.
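A sketch with scikit-learn; the number of trees is an arbitrary choice here, not taken from the original notebook:

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train70, y_train70)
print(forest.score(X_test30, y_test30))   # test accuracy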

To form a comparative study across all the models, I have used one common measure: the ROC curve. The following code constructs it.
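A sketch with scikit-learn and matplotlib, shown here for the random forest; the same pattern applies to the other models:

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

y_prob = forest.predict_proba(X_test30)[:, 1]
fpr, tpr, _ = roc_curve(y_test30, y_prob)
print(roc_auc_score(y_test30, y_prob))   # area under the curve

plt.plot(fpr, tpr, label='Random forest')
plt.plot([0, 1], [0, 1], linestyle='--', label='Chance')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()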

A comparative study of the above models with respect to accuracy, precision, recall, and ROC score is computed to support the final decision.
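One way to compute the comparison table (a sketch; each model is evaluated on its own test split, as described above):

from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)

models = {'Logistic regression': (logreg, X_test20, y_test20),
          'Naive Bayes': (nb, X_test20, y_test20),
          'Decision tree': (tree, X_test30, y_test30),
          'Random forest': (forest, X_test30, y_test30)}

for name, (model, X_te, y_te) in models.items():
    y_pred = model.predict(X_te)
    y_prob = model.predict_proba(X_te)[:, 1]
    print(name,
          round(accuracy_score(y_te, y_pred), 3),
          round(precision_score(y_te, y_pred), 3),
          round(recall_score(y_te, y_pred), 3),
          round(roc_auc_score(y_te, y_prob), 3))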

From the table above, random forest gives the best accuracy and ROC score.

All the ROC curves are shown below.

Random forest covers the maximum area and hence is the better model. I have not tried neural networks on this problem: with only 30K-plus data points, I felt a network would overfit the data. To improve further, more complex ensemble methods could be used. Also, as Ockham’s Razor says, “the simplest explanation is most likely the right one”.

Detailed report on the project is available in my kaggle notebook.

If you like this, you might like my other articles. Do check them out.

Please let me know if there is any part I could have done better.

Thanks for Reading!!
