Machine Learning for Income Prediction

Tarun Kumar
9 min read · Feb 4, 2022


A machine learning case study using a random forest classifier.

In this project, we predict whether a person’s income is above or below 50K using features such as age, education, and occupation. The dataset is the Adult Census Income dataset from Kaggle, which contains 32,561 rows and 15 features and can be downloaded here. We will also build a web application using Flask and deploy it on Heroku.

Photo by Alexander Mils on Unsplash

Photo by rupixen.com on Unsplash

Motivation:

Building such predictive models can help us better understand a country’s population and the factors that influence income and, in turn, the economy. Governments can study these factors and act on them to support economic growth.

Understanding the problem:

The dataset contains the label we have to predict: the dependent feature ‘income’. This feature is categorical with two classes, income at most 50K and income above 50K, so the problem is a supervised binary classification task.

Step 0: Import libraries and dataset

All the standard libraries, including numpy, pandas, matplotlib, and seaborn, are imported in this step. We use numpy for numerical operations, pandas for data frames, and matplotlib and seaborn for plotting. The dataset is loaded with pandas’ read_csv() function.

# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Importing dataset
dataset = pd.read_csv('adult.csv')

Step 1: Descriptive analysis

# Preview dataset
dataset.head()

# Shape of dataset
print('Rows: {} Columns: {}'.format(dataset.shape[0], dataset.shape[1]))

Output:
Rows: 32561 Columns: 15

# Features data-type
dataset.info()

Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   age             32561 non-null  int64
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64
 3   education       32561 non-null  object
 4   education.num   32561 non-null  int64
 5   marital.status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital.gain    32561 non-null  int64
 11  capital.loss    32561 non-null  int64
 12  hours.per.week  32561 non-null  int64
 13  native.country  32561 non-null  object
 14  income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB

# Statistical summary
dataset.describe().T

# Check for null values
round((dataset.isnull().sum() / dataset.shape[0]) * 100, 2).astype(str) + ' %'

Output:
age 0.0 %
workclass 0.0 %
fnlwgt 0.0 %
education 0.0 %
education.num 0.0 %
marital.status 0.0 %
occupation 0.0 %
relationship 0.0 %
race 0.0 %
sex 0.0 %
capital.gain 0.0 %
capital.loss 0.0 %
hours.per.week 0.0 %
native.country 0.0 %
income 0.0 %
dtype: object

# Check for '?' in dataset
round((dataset.isin(['?']).sum() / dataset.shape[0]) * 100, 2).astype(str) + ' %'

Output:
age 0.0 %
workclass 5.64 %
fnlwgt 0.0 %
education 0.0 %
education.num 0.0 %
marital.status 0.0 %
occupation 5.66 %
relationship 0.0 %
race 0.0 %
sex 0.0 %
capital.gain 0.0 %
capital.loss 0.0 %
hours.per.week 0.0 %
native.country 1.79 %
income 0.0 %
dtype: object

# Checking the counts of label categories
income = dataset['income'].value_counts(normalize=True)
round(income * 100, 2).astype('str') + ' %'

Output:
<=50K 75.92 %
>50K 24.08 %
Name: income, dtype: object

Observations:

1. The dataset doesn’t have any null values, but it contains missing values in the form of ‘?’, which need to be handled during preprocessing.

2. The dataset is imbalanced: in the dependent feature ‘income’, 75.92% of the values correspond to income less than 50K and 24.08% to income more than 50K.

Step 2: Exploratory Data Analysis

2.1 Univariate Analysis:
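
The univariate plots were rendered as images in the original post. A minimal sketch of how such plots might be produced, assuming a couple of representative columns (‘age’ and ‘education.num’), is shown below:

# Sketch: distributions of single features (column choices are assumptions)
fig, axes = plt.subplots(1, 2, figsize=(14, 4))
sns.histplot(dataset['age'], bins=20, ax=axes[0])           # age distribution
sns.countplot(x='education.num', data=dataset, ax=axes[1])  # years of education
plt.show()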

2.2 Bivariate Analysis:
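
Similarly, the bivariate plots could be sketched roughly as follows (the feature pairs shown here are assumptions, not necessarily the ones plotted in the original post):

# Sketch: features plotted against the income label (pairs are assumptions)
fig, axes = plt.subplots(1, 2, figsize=(14, 4))
sns.countplot(x='sex', hue='income', data=dataset, ax=axes[0])
sns.boxplot(x='income', y='age', data=dataset, ax=axes[1])
plt.show()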

2.3 Multivariate Analysis:

Pair plot of dataset
Heatmap of the correlation matrix
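
The pair plot and correlation heatmap above could be reproduced roughly like this; the sketch restricts the heatmap to the numeric columns, whereas the original presumably also included the encoded ‘income’ column:

# Pair plot of the numeric features, colored by income
sns.pairplot(dataset, hue='income')
plt.show()

# Heatmap of the correlation matrix of the numeric columns
plt.figure(figsize=(10, 8))
sns.heatmap(dataset.select_dtypes(include='number').corr(), annot=True, cmap='coolwarm')
plt.show()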

Observations:

1. In this dataset, most people are young, white, male high-school graduates with 9 to 10 years of education who work 40 hours per week.

2. From the correlation heatmap, we can see that the dependent feature ‘income’ is most strongly correlated with age, number of years of education, capital gain, and hours worked per week.

Step 3: Data preprocessing

The missing values appear as ‘?’. We first replace them with NaN and then fill each affected column with its most frequent value (mode) using fillna().

dataset = dataset.replace('?', np.nan)

# Checking null values
round((dataset.isnull().sum() / dataset.shape[0]) * 100, 2).astype(str) + ' %'

Output:
age 0.0 %
workclass 5.64 %
fnlwgt 0.0 %
education 0.0 %
education.num 0.0 %
marital.status 0.0 %
occupation 5.66 %
relationship 0.0 %
race 0.0 %
sex 0.0 %
capital.gain 0.0 %
capital.loss 0.0 %
hours.per.week 0.0 %
native.country 1.79 %
income 0.0 %
dtype: object

columns_with_nan = ['workclass', 'occupation', 'native.country']

for col in columns_with_nan:
    dataset[col].fillna(dataset[col].mode()[0], inplace = True)

The object (categorical) columns need to be encoded as numbers before they can be used by the model. This can be done with LabelEncoder from sklearn’s preprocessing module.

from sklearn.preprocessing import LabelEncoder

for col in dataset.columns:
    if dataset[col].dtypes == 'object':
        encoder = LabelEncoder()
        dataset[col] = encoder.fit_transform(dataset[col])

The dataset is then split into X, which contains all the independent features, and Y, which contains the dependent feature ‘income’.

X = dataset.drop('income', axis = 1) 
Y = dataset['income']

Feature selection helps counter multicollinearity and reduces the risk of overfitting. Feature importances can be computed easily with an ExtraTreesClassifier; the indices printed below follow the column order of X.

from sklearn.ensemble import ExtraTreesClassifier

selector = ExtraTreesClassifier(random_state = 42)
selector.fit(X, Y)

feature_imp = selector.feature_importances_

for index, val in enumerate(feature_imp):
    print(index, round((val * 100), 2))

Output:
0 15.59
1 4.13
2 16.71
3 3.87
4 8.66
5 8.04
6 7.27
7 8.62
8 1.47
9 2.84
10 8.83
11 2.81
12 9.64
13 1.53

The six features with the lowest importances (workclass, education, race, sex, capital.loss, and native.country) are dropped.

X = X.drop(['workclass', 'education', 'race', 'sex', 'capital.loss', 'native.country'], axis = 1)

Feature scaling standardizes the features so that they are on a comparable scale, which helps the model learn patterns. This can be done with StandardScaler from sklearn’s preprocessing module.

from sklearn.preprocessing import StandardScaler

for col in X.columns:
    scaler = StandardScaler()
    X[col] = scaler.fit_transform(X[col].values.reshape(-1, 1))

The dependent feature ‘income’ is highly imbalanced: 75.92% of the values correspond to income below 50K and only 24.08% to income above 50K. Left untreated, this imbalance leads to a low F1 score. Since the dataset is relatively small, we oversample the minority class with RandomOverSampler from the imbalanced-learn library.

round(Y.value_counts(normalize=True) * 100, 2).astype('str') + ' %'

Output:
0    75.92 %
1    24.08 %
Name: income, dtype: object

from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state = 42)
X_resampled, Y_resampled = ros.fit_resample(X, Y)

round(Y_resampled.value_counts(normalize=True) * 100, 2).astype('str') + ' %'

Output:
1 50.0 %
0 50.0 %
Name: income, dtype: object

The data is split into training and testing sets in an 80:20 ratio using the train_test_split() function.

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X_resampled, Y_resampled, test_size = 0.2, random_state = 42)

print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("Y_train shape:", Y_train.shape)
print("Y_test shape:", Y_test.shape)

Output:
X_train shape: (39552, 8)
X_test shape: (9888, 8)
Y_train shape: (39552,)
Y_test shape: (9888,)

Step 4: Data Modelling

Random Forest Classifier:

from sklearn.ensemble import RandomForestClassifier
ran_for = RandomForestClassifier(random_state = 42)
ran_for.fit(X_train, Y_train)

Y_pred_ran_for = ran_for.predict(X_test)

Understanding the Algorithm:

Random forest is a supervised learning algorithm used for both classification and regression. It is a bagging ensemble method that builds multiple decision trees, each learning from the data independently of the others. For classification, the final prediction is chosen by majority voting over the trees.

Random forests are flexible and usually achieve high accuracy because they reduce overfitting by combining the results of many decision trees. They perform well even on large datasets and remain reasonably accurate when the data contains many missing values. However, they are more complex and computationally expensive than a single decision tree, which makes model building slower, and they are harder to interpret and less intuitive than a single tree.

This algorithm has some important parameters, including max_depth, max_features, n_estimators, and min_samples_leaf. n_estimators sets the number of trees used to build the forest. max_features limits how many features are considered when splitting a node in each tree. max_depth sets the maximum depth of each decision tree, and min_samples_leaf sets the minimum number of samples required at a leaf node.
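
As an illustration, these parameters can be passed to the classifier as keyword arguments; the values below are example settings only, not the ones used in this project:

# Example hyperparameter values, for illustration only
rf_example = RandomForestClassifier(
    n_estimators = 100,     # number of trees in the forest
    max_depth = 10,         # maximum depth of each tree
    max_features = 'sqrt',  # features considered at each split
    min_samples_leaf = 5,   # minimum samples required at a leaf node
    random_state = 42
)
# rf_example.fit(X_train, Y_train) would train it just like the model above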

Step 5: Model Evaluation

In this step, we evaluate the model using two metrics: accuracy_score and f1_score. Accuracy is the ratio of correct predictions to the total number of predictions and tells us how often the model is right overall. The F1 score is the harmonic mean of precision and recall; the higher its value, the better the model. We report the F1 score alongside accuracy because the original dataset is imbalanced.

from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

print('Random Forest Classifier:')
print('Accuracy score:', round(accuracy_score(Y_test, Y_pred_ran_for) * 100, 2))
print('F1 score:', round(f1_score(Y_test, Y_pred_ran_for) * 100, 2))

Output:
Random Forest Classifier:
Accuracy score: 92.6
F1 score: 92.93

Step 6: Hyperparameter Tuning

We tune the hyperparameters of our random forest classifier using RandomizedSearchCV, which samples a fixed number of parameter combinations at random instead of exhaustively searching the full grid, avoiding unnecessary computation. We search for the best values of ‘n_estimators’ and ‘max_depth’.

from sklearn.model_selection import RandomizedSearchCV

n_estimators = [int(x) for x in np.linspace(start = 40, stop = 150, num = 15)]
max_depth = [int(x) for x in np.linspace(40, 150, num = 15)]

param_dist = {
    'n_estimators' : n_estimators,
    'max_depth' : max_depth,
}

rf_tuned = RandomForestClassifier(random_state = 42)

rf_cv = RandomizedSearchCV(estimator = rf_tuned, param_distributions = param_dist, cv = 5, random_state = 42)
rf_cv.fit(X_train, Y_train)

rf_cv.best_score_

Output:
0.9131271105332539

rf_cv.best_params_

Output:
{'n_estimators': 40, 'max_depth': 102}

rf_best = RandomForestClassifier(max_depth = 102, n_estimators = 40, random_state = 42)
rf_best.fit(X_train, Y_train)

Y_pred_rf_best = rf_best.predict(X_test)

print('Random Forest Classifier:')
print('Accuracy score:', round(accuracy_score(Y_test, Y_pred_rf_best) * 100, 2))
print('F1 score:', round(f1_score(Y_test, Y_pred_rf_best) * 100, 2))

Output:
Random Forest Classifier:
Accuracy score: 92.77
F1 score: 93.08

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(Y_test, Y_pred_rf_best)
Heatmap of the confusion matrix
from sklearn.metrics import classification_report
print(classification_report(Y_test, Y_pred_rf_best))

Output:
              precision    recall  f1-score   support

           0       0.97      0.88      0.92      4955
           1       0.89      0.98      0.93      4933

    accuracy                           0.93      9888
   macro avg       0.93      0.93      0.93      9888
weighted avg       0.93      0.93      0.93      9888

After hyperparameter tuning, the model achieves its best results: an accuracy score of 92.77 and an F1 score of 93.08.

Step 7: Model Deployment

To deploy the model, we built a web application using the Flask micro-framework and hosted it on Heroku.

Flask WebApp
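
A minimal sketch of what such a Flask app could look like, assuming the trained model was saved with pickle as ‘model.pkl’ and that an index.html form posts the eight selected features (the file name, field names, and routes here are assumptions, not the exact code of this project):

import pickle
import numpy as np
from flask import Flask, request, render_template

app = Flask(__name__)

# Assumed file name: the trained random forest saved with pickle
model = pickle.load(open('model.pkl', 'rb'))

@app.route('/')
def home():
    return render_template('index.html')

@app.route('/predict', methods=['POST'])
def predict():
    # Read the eight selected features from the submitted form (assumed field names)
    features = [float(x) for x in request.form.values()]
    prediction = model.predict(np.array(features).reshape(1, -1))[0]
    result = 'Income > 50K' if prediction == 1 else 'Income <= 50K'
    return render_template('index.html', prediction_text=result)

if __name__ == '__main__':
    app.run(debug=True)

In practice, the form inputs would also need to be encoded and scaled the same way as the training data before calling predict().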

Future work:

  • The dataset is large enough to experiment with neural networks, such as a simple artificial neural network, which may yield better performance; see the sketch below.
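
For instance, a minimal sketch of such an experiment with scikit-learn’s MLPClassifier (the architecture and hyperparameters below are illustrative assumptions, not tested settings):

from sklearn.neural_network import MLPClassifier

# Illustrative architecture: two hidden layers of 64 and 32 units
mlp = MLPClassifier(hidden_layer_sizes = (64, 32), max_iter = 300, random_state = 42)
mlp.fit(X_train, Y_train)

Y_pred_mlp = mlp.predict(X_test)
print('Accuracy score:', round(accuracy_score(Y_test, Y_pred_mlp) * 100, 2))
print('F1 score:', round(f1_score(Y_test, Y_pred_mlp) * 100, 2))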
