Machine Learning for Income Prediction

Tarun Kumar
9 min read · Feb 4, 2022


A machine learning case study using a random forest classifier.

In this project, we predict whether a person’s income is above or below 50K using features such as age, education, and occupation. The dataset is the Adult Census Income dataset from Kaggle, which contains 32,561 rows and 15 features and can be downloaded here. We will also build a web application using Flask and deploy it on Heroku.

Photo by Alexander Mils on Unsplash

Photo by rupixen.com on Unsplash

Motivation:

Building such predictive models can help us better understand a country’s population and the factors that influence income and, in turn, the economy. Governments can study these factors and act on them to support economic growth.

Understanding the problem:

The dataset contains the label we have to predict: the dependent feature ‘income’. This feature is categorical with two classes, income at most 50K and income above 50K, so the problem is a supervised binary classification task.

Step 0: Import libraries and dataset

All the standard libraries, including numpy, pandas, matplotlib, and seaborn, are imported in this step. We use numpy for numerical operations, pandas for data frames, and matplotlib and seaborn for plotting. The dataset is loaded with pandas’ read_csv() function.

# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Importing dataset
dataset = pd.read_csv('adult.csv')

Step 1: Descriptive analysis

# Preview dataset
dataset.head()

# Shape of dataset
print('Rows: {} Columns: {}'.format(dataset.shape[0], dataset.shape[1]))

Output:
Rows: 32561 Columns: 15

# Features data-type
dataset.info()

Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   age             32561 non-null  int64
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64
 3   education       32561 non-null  object
 4   education.num   32561 non-null  int64
 5   marital.status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital.gain    32561 non-null  int64
 11  capital.loss    32561 non-null  int64
 12  hours.per.week  32561 non-null  int64
 13  native.country  32561 non-null  object
 14  income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB

# Statistical summary
dataset.describe().T

# Check for null values
round((dataset.isnull().sum() / dataset.shape[0]) * 100, 2).astype(str) + ' %'

Output:
age 0.0 %
workclass 0.0 %
fnlwgt 0.0 %
education 0.0 %
education.num 0.0 %
marital.status 0.0 %
occupation 0.0 %
relationship 0.0 %
race 0.0 %
sex 0.0 %
capital.gain 0.0 %
capital.loss 0.0 %
hours.per.week 0.0 %
native.country 0.0 %
income 0.0 %
dtype: object

# Check for '?' in dataset
round((dataset.isin(['?']).sum() / dataset.shape[0]) * 100, 2).astype(str) + ' %'

Output:
age 0.0 %
workclass 5.64 %
fnlwgt 0.0 %
education 0.0 %
education.num 0.0 %
marital.status 0.0 %
occupation 5.66 %
relationship 0.0 %
race 0.0 %
sex 0.0 %
capital.gain 0.0 %
capital.loss 0.0 %
hours.per.week 0.0 %
native.country 1.79 %
income 0.0 %
dtype: object

# Checking the counts of label categories
income = dataset['income'].value_counts(normalize=True)
round(income * 100, 2).astype('str') + ' %'

Output:
<=50K 75.92 %
>50K 24.08 %
Name: income, dtype: object

Observations:

1. The dataset doesn’t have any null values, but it contains missing values in the form of ‘?’, which need to be handled during preprocessing.

2. The dataset is imbalanced: in the dependent feature ‘income’, 75.92% of the values correspond to income less than 50K and 24.08% to income more than 50K.

Step 2: Exploratory Data Analysis

2.1 Univariate Analysis:
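
The univariate plots were rendered as images in the original post. A minimal sketch of how such plots might be produced, assuming a couple of representative columns (‘age’ and ‘education.num’), is shown below:

# Sketch: distributions of single features (column choices are assumptions)
fig, axes = plt.subplots(1, 2, figsize=(14, 4))
sns.histplot(dataset['age'], bins=20, ax=axes[0])           # age distribution
sns.countplot(x='education.num', data=dataset, ax=axes[1])  # years of education
plt.show()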

2.2 Bivariate Analysis:
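
Similarly, the bivariate plots could be sketched roughly as follows (the feature pairs shown here are assumptions, not necessarily the ones plotted in the original post):

# Sketch: features plotted against the income label (pairs are assumptions)
fig, axes = plt.subplots(1, 2, figsize=(14, 4))
sns.countplot(x='sex', hue='income', data=dataset, ax=axes[0])
sns.boxplot(x='income', y='age', data=dataset, ax=axes[1])
plt.show()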

2.3 Multivariate Analysis:

Pair plot of dataset
Heatmap of the correlation matrix
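
The pair plot and correlation heatmap above could be reproduced roughly like this; the sketch restricts the heatmap to the numeric columns, whereas the original presumably also included the encoded ‘income’ column:

# Pair plot of the numeric features, colored by income
sns.pairplot(dataset, hue='income')
plt.show()

# Heatmap of the correlation matrix of the numeric columns
plt.figure(figsize=(10, 8))
sns.heatmap(dataset.select_dtypes(include='number').corr(), annot=True, cmap='coolwarm')
plt.show()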

Observations:

1. In this dataset, most people are young, white, male high-school graduates with 9 to 10 years of education who work 40 hours per week.

2. From the correlation heatmap, we can see that the dependent feature ‘income’ is most strongly correlated with age, number of years of education, capital gain, and hours worked per week.

Step 3: Data preprocessing

The missing values appear as ‘?’. We first replace them with NaN and then fill each affected column with its most frequent value (mode) using fillna().

dataset = dataset.replace('?', np.nan)

# Checking null values
round((dataset.isnull().sum() / dataset.shape[0]) * 100, 2).astype(str) + ' %'

Output:
age 0.0 %
workclass 5.64 %
fnlwgt 0.0 %
education 0.0 %
education.num 0.0 %
marital.status 0.0 %
occupation 5.66 %
relationship 0.0 %
race 0.0 %
sex 0.0 %
capital.gain 0.0 %
capital.loss 0.0 %
hours.per.week 0.0 %
native.country 1.79 %
income 0.0 %
dtype: object

columns_with_nan = ['workclass', 'occupation', 'native.country']

for col in columns_with_nan:
    dataset[col].fillna(dataset[col].mode()[0], inplace = True)

The object (categorical) columns need to be encoded as numbers before they can be used by the model. This can be done with LabelEncoder from sklearn’s preprocessing module.

from sklearn.preprocessing import LabelEncoder

for col in dataset.columns:
    if dataset[col].dtypes == 'object':
        encoder = LabelEncoder()
        dataset[col] = encoder.fit_transform(dataset[col])

The dataset is then split into X, which contains all the independent features, and Y, which contains the dependent feature ‘income’.

X = dataset.drop('income', axis = 1) 
Y = dataset['income']

Feature selection helps counter multicollinearity and reduces the risk of overfitting. Feature importances can be computed easily with an ExtraTreesClassifier; the indices printed below follow the column order of X.

from sklearn.ensemble import ExtraTreesClassifier

selector = ExtraTreesClassifier(random_state = 42)
selector.fit(X, Y)

feature_imp = selector.feature_importances_

for index, val in enumerate(feature_imp):
    print(index, round((val * 100), 2))

Output:
0 15.59
1 4.13
2 16.71
3 3.87
4 8.66
5 8.04
6 7.27
7 8.62
8 1.47
9 2.84
10 8.83
11 2.81
12 9.64
13 1.53

The six features with the lowest importances (workclass, education, race, sex, capital.loss, and native.country) are dropped.

X = X.drop(['workclass', 'education', 'race', 'sex', 'capital.loss', 'native.country'], axis = 1)

Feature scaling standardizes the features so that they are on a comparable scale, which helps the model learn patterns. This can be done with StandardScaler from sklearn’s preprocessing module.

from sklearn.preprocessing import StandardScaler

for col in X.columns:
    scaler = StandardScaler()
    X[col] = scaler.fit_transform(X[col].values.reshape(-1, 1))

The dependent feature ‘income’ is highly imbalanced: 75.92% of the values correspond to income below 50K and only 24.08% to income above 50K. Left untreated, this imbalance leads to a low F1 score. Since the dataset is relatively small, we oversample the minority class with RandomOverSampler from the imbalanced-learn library.

round(Y.value_counts(normalize=True) * 100, 2).astype('str') + ' %'

Output:
0    75.92 %
1    24.08 %
Name: income, dtype: object

from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state = 42)
X_resampled, Y_resampled = ros.fit_resample(X, Y)

round(Y_resampled.value_counts(normalize=True) * 100, 2).astype('str') + ' %'

Output:
1 50.0 %
0 50.0 %
Name: income, dtype: object

The data is split into training and testing sets in an 80:20 ratio using the train_test_split() function.

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X_resampled, Y_resampled, test_size = 0.2, random_state = 42)

print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("Y_train shape:", Y_train.shape)
print("Y_test shape:", Y_test.shape)

Output:
X_train shape: (39552, 8)
X_test shape: (9888, 8)
Y_train shape: (39552,)
Y_test shape: (9888,)

Step 4: Data Modelling

Random Forest Classifier:

from sklearn.ensemble import RandomForestClassifier
ran_for = RandomForestClassifier(random_state = 42)
ran_for.fit(X_train, Y_train)

Y_pred_ran_for = ran_for.predict(X_test)

Understanding the Algorithm:

Random forest is a supervised learning algorithm used for both classification and regression. It is a bagging ensemble method that builds multiple decision trees, each learning from the data independently of the others. For classification, the final prediction is chosen by majority voting over the trees.

Random forests are flexible and usually achieve high accuracy because they reduce overfitting by combining the results of many decision trees. They perform well even on large datasets and remain reasonably accurate when the data contains many missing values. However, they are more complex and computationally expensive than a single decision tree, which makes model building slower, and they are harder to interpret and less intuitive than a single tree.

This algorithm has some important parameters, including max_depth, max_features, n_estimators, and min_samples_leaf. n_estimators sets the number of trees used to build the forest. max_features limits how many features are considered when splitting a node in each tree. max_depth sets the maximum depth of each decision tree, and min_samples_leaf sets the minimum number of samples required at a leaf node.
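
As an illustration, these parameters can be passed to the classifier as keyword arguments; the values below are example settings only, not the ones used in this project:

# Example hyperparameter values, for illustration only
rf_example = RandomForestClassifier(
    n_estimators = 100,     # number of trees in the forest
    max_depth = 10,         # maximum depth of each tree
    max_features = 'sqrt',  # features considered at each split
    min_samples_leaf = 5,   # minimum samples required at a leaf node
    random_state = 42
)
# rf_example.fit(X_train, Y_train) would train it just like the model above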

Step 5: Model Evaluation

In this step, we evaluate the model using two metrics: accuracy_score and f1_score. Accuracy is the ratio of correct predictions to the total number of predictions and tells us how often the model is right overall. The F1 score is the harmonic mean of precision and recall; the higher its value, the better the model. We report the F1 score alongside accuracy because the original dataset is imbalanced.

from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

print('Random Forest Classifier:')
print('Accuracy score:', round(accuracy_score(Y_test, Y_pred_ran_for) * 100, 2))
print('F1 score:', round(f1_score(Y_test, Y_pred_ran_for) * 100, 2))

Output:
Random Forest Classifier:
Accuracy score: 92.6
F1 score: 92.93

Step 6: Hyperparameter Tuning

We tune the hyperparameters of our random forest classifier using RandomizedSearchCV, which samples a fixed number of parameter combinations at random instead of exhaustively searching the full grid, avoiding unnecessary computation. We search for the best values of ‘n_estimators’ and ‘max_depth’.

from sklearn.model_selection import RandomizedSearchCV

n_estimators = [int(x) for x in np.linspace(start = 40, stop = 150, num = 15)]
max_depth = [int(x) for x in np.linspace(40, 150, num = 15)]

param_dist = {
    'n_estimators' : n_estimators,
    'max_depth' : max_depth,
}

rf_tuned = RandomForestClassifier(random_state = 42)

rf_cv = RandomizedSearchCV(estimator = rf_tuned, param_distributions = param_dist, cv = 5, random_state = 42)
rf_cv.fit(X_train, Y_train)

rf_cv.best_score_

Output:
0.9131271105332539

rf_cv.best_params_

Output:
{'n_estimators': 40, 'max_depth': 102}

rf_best = RandomForestClassifier(max_depth = 102, n_estimators = 40, random_state = 42)
rf_best.fit(X_train, Y_train)

Y_pred_rf_best = rf_best.predict(X_test)

print('Random Forest Classifier:')
print('Accuracy score:', round(accuracy_score(Y_test, Y_pred_rf_best) * 100, 2))
print('F1 score:', round(f1_score(Y_test, Y_pred_rf_best) * 100, 2))

Output:
Random Forest Classifier:
Accuracy score: 92.77
F1 score: 93.08

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(Y_test, Y_pred_rf_best)
Heatmap of the confusion matrix
from sklearn.metrics import classification_report
print(classification_report(Y_test, Y_pred_rf_best))

Output:
              precision    recall  f1-score   support

           0       0.97      0.88      0.92      4955
           1       0.89      0.98      0.93      4933

    accuracy                           0.93      9888
   macro avg       0.93      0.93      0.93      9888
weighted avg       0.93      0.93      0.93      9888

After hyperparameter tuning, the model achieves its best results: an accuracy score of 92.77 and an F1 score of 93.08.

Step 7: Model Deployment

To deploy the model, we built a web application using the Flask micro-framework and hosted it on Heroku.

Flask WebApp
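
A minimal sketch of what such a Flask app could look like, assuming the trained model was saved with pickle as ‘model.pkl’ and that an index.html form posts the eight selected features (the file name, field names, and routes here are assumptions, not the exact code of this project):

import pickle
import numpy as np
from flask import Flask, request, render_template

app = Flask(__name__)

# Assumed file name: the trained random forest saved with pickle
model = pickle.load(open('model.pkl', 'rb'))

@app.route('/')
def home():
    return render_template('index.html')

@app.route('/predict', methods=['POST'])
def predict():
    # Read the eight selected features from the submitted form (assumed field names)
    features = [float(x) for x in request.form.values()]
    prediction = model.predict(np.array(features).reshape(1, -1))[0]
    result = 'Income > 50K' if prediction == 1 else 'Income <= 50K'
    return render_template('index.html', prediction_text=result)

if __name__ == '__main__':
    app.run(debug=True)

In practice, the form inputs would also need to be encoded and scaled the same way as the training data before calling predict().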

Future work:

  • The dataset is large enough to experiment with neural networks, such as a simple artificial neural network, which may yield better performance; see the sketch below.
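
For instance, a minimal sketch of such an experiment with scikit-learn’s MLPClassifier (the architecture and hyperparameters below are illustrative assumptions, not tested settings):

from sklearn.neural_network import MLPClassifier

# Illustrative architecture: two hidden layers of 64 and 32 units
mlp = MLPClassifier(hidden_layer_sizes = (64, 32), max_iter = 300, random_state = 42)
mlp.fit(X_train, Y_train)

Y_pred_mlp = mlp.predict(X_test)
print('Accuracy score:', round(accuracy_score(Y_test, Y_pred_mlp) * 100, 2))
print('F1 score:', round(f1_score(Y_test, Y_pred_mlp) * 100, 2))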
