Income Classification — Data Science Project guide 2

Laxmi Kumari
11 min read · Jul 18, 2022


Welcome to the third data science project guide in our series. In this project, we will be working with census income data.

Introduction

Economic census data is extremely important as trade associations, chambers of commerce, and businesses rely on this information for economic development, business decisions, and strategic planning. Source: https://www.census.gov/programs-surveys/economic-census/guidance/data-uses.html

There are many use cases directly linked to economic census data that you can find on the above website. These range from opening a new small business and expanding an existing business to re-evaluating a larger business based on median household income. Governments also use income census data to promote economic development and shape tax policies. So, we will use the 1994 income census data to build a classification model that businesses and governments can leverage to solve the kinds of problems mentioned above.

About the data

This data set is taken from the UCI Machine Learning Repository. Source: https://archive.ics.uci.edu/ml/datasets/Census+Income. It is also made available on GitHub. There are 14 feature columns and 48,842 instances, with different factors affecting the target column income. Out of the 14 independent variables, 8 are categorical/ordinal.

Dataframe — head

Exploratory Data Analysis (EDA)

EDA is one of the most important parts of any data science project, and there are a number of steps that can be carried out during EDA. We will cover a few of them in this blog.

  1. Data Cleaning:

The first step of any data science project is to inspect and clean the data so that it can be fed into machine learning (ML) models. It was observed that there are some ? entries in the dataframe. To handle them, we can replace all the ? entries with NumPy NaN as shown below:

import numpy as np

income_data = income_data.replace(to_replace="?", value=np.nan)

Once the ? values are replaced, you can observe the number of missing values in the columns using:

income_data.info()

You can observe the output of the above code block to identify the columns with missing values. Once the columns are identified, you can impute the missing values in categorical columns with the mode, as shown here:

Workclass — frequency dist

The frequency distribution of the workclass column gives you its mode, the category with the highest frequency, which is Private in this case. You can get the number of missing values in any column using the isna().sum() method in pandas; a short sketch of both checks follows below:

workclass — missing values
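As a minimal sketch (assuming the dataframe is loaded as income_data and the ? entries have already been replaced with NaN), the two checks above can be run like this:

# Frequency distribution of the workclass column; the top category is its mode
print(income_data['workclass'].value_counts())

# Number of missing values in the workclass column
print(income_data['workclass'].isna().sum())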

There are 2799 missing values in the workclass column which can be imputed using the fillna method in pandas as shown below:

income_data['workclass'].fillna('Private', inplace = True)

The steps for cleaning the other columns are not shown here but are available on GitHub.

2. Data Visualization:

Visualizing data gives us insights that can be used for making inferences as well as for model building. We will write a custom function using the matplotlib and seaborn libraries to plot the distribution of each column against the target variable income_above_50K, which has been derived from income as 0 for income ≤ 50K and 1 for income > 50K.

import matplotlib.pyplot as plt
import seaborn as sns

def plotting(column):
    if income_data[column].dtype != 'int64':
        # Categorical column: count plot split by the target variable
        f, axes = plt.subplots(1, 1, figsize=(15, 5))
        sns.countplot(x=column, hue='income_above_50K', data=income_data)
        plt.xticks(rotation=90)
        plt.suptitle(column, fontsize=20)
        plt.show()
    else:
        # Continuous column: histogram per target class
        g = sns.FacetGrid(income_data, row="income_above_50K", margin_titles=True, aspect=4, height=3)
        g.map(plt.hist, column, bins=100)
        plt.show()

The custom function plotting defined above can be used with any column in the data (continuous or categorical) to plot its distribution. We show a couple of the plots here; the rest can be found on GitHub.

education

It can be observed from the above plot that the Masters and Doctorate groups have a higher percentage of people in the income above 50K category than the other education levels, which is expected since these degrees tend to lead to better compensation.

occupation

In the above plot, it is clear that the Exec-managerial and Prof-specialty occupations have a higher ratio of people with income above 50K to people with income below 50K. The other roles tend to earn less than these two white-collar jobs.

Feature Engineering

Feature engineering is yet another important step before moving on to the model building phase. The process of converting raw columns into meaningful features that can be fed into the machine learning (ML) model is called feature engineering. For example, you cannot feed a categorical column directly into a model: ML models are based on matrix multiplication and optimization and work on numbers, not characters, so we need to convert the categorical columns into numerical features. We can transform the categorical columns using the two methods shown below:

  1. Mapping categorical data:

Categorical values in the columns can be mapped easily using the map function on the pandas column as shown below:

income_data['is_female']=income_data['sex'].map({'Male':0, 'Female':1})

The Male category of the sex column is mapped to 0 and the Female category is mapped to 1.

2. One hot encoding

We can use one hot encoding to convert categorical columns into numerical features using pandas get_dummies() method. If there are several categories in a column (for example 16 categories in the column education , as you have seen in the visualization section) it is better to reduce the number of categories using the map method before applying one hot encoding.

income_data['education'] = income_data['education'].map({
    'Preschool': 'level_1', '1st-4th': 'level_1', '5th-6th': 'level_1', '7th-8th': 'level_1',
    '9th': 'level_1', '10th': 'level_1', '11th': 'level_1', '12th': 'level_1', 'HS-grad': 'level_1',
    'Prof-school': 'level_2', 'Assoc-acdm': 'level_2', 'Assoc-voc': 'level_2', 'Some-college': 'level_2',
    'Bachelors': 'level_3', 'Masters': 'level_3', 'Doctorate': 'level_3'})

We have converted the 16 categories of the education column into 3 categories, and we will now apply the get_dummies method for one hot encoding on this column.

cols_education = pd.get_dummies(income_data['education'], prefix= 'education')
income_data[cols_education.columns] = cols_education
income_data.drop('education', axis = 1, inplace = True)

This will create 3 new columns for the 3 levels of education created above, and then we drop the original education column.

The transformation of the rest of the categorical variables is similar and is shown on GitHub.

3. Splitting features and target variable

It is important to split the dataset into the features used for training the model and the target variable used as the ground truth by the ML models. While splitting the data into features and target, we can also split it into train and test sets for training the ML models and evaluating them, as shown below:

from sklearn.model_selection import train_test_split

X = income_data.drop('income_above_50K', axis=1)
y = income_data['income_above_50K']
X_train_org, X_test_org, Y_train, Y_test = train_test_split(X, y, random_state=0)

The target variable income_above_50K (the converted income column) is stored as y, and all the other variables from the dataframe income_data are stored as X. The feature set X is then split into X_train_org for training and X_test_org for testing. Similarly, y is split into Y_train for training the ML model and Y_test for evaluating it.

4. Scaling/normalizing the features

Scaling or normalizing adjusts the range of the independent variables (features). By scaling/normalizing the independent variables, we ensure that each feature contributes approximately proportionately to the final distance. Besides being a prerequisite step for PCA (principal component analysis), it also makes gradient descent converge much faster compared to unscaled features. We can leverage MinMaxScaler() from the sklearn library for scaling the features as shown below:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train_org)
X_test = scaler.transform(X_test_org)

Model Building

Now that the income census dataset has been cleaned, processed and transformed into features, we are ready to build a predictive model on this dataset. This is the most interesting step in a data science project where we get to build the model on the dataset and see how it can help us with the predictions.

  1. Model fitting:

However, the model fitting itself is the smallest step in a data science project. We can start with a simple baseline; the code block below fits a support vector classifier with a linear kernel, and a logistic regression baseline is sketched right after it:

from sklearn.svm import SVC

svm_lin = SVC(kernel='linear', probability=True)
svm_lin.fit(X_train, Y_train)
Y_pred_svm_lin = svm_lin.predict(X_test)
Y_predProb_svm_lin = svm_lin.predict_proba(X_test)

The SVC (Support Vector Classifier) model is imported from the sklearn.svm module. It is then fit on the training set using the fit() method. Finally, the predict() method is used to get predictions from the model and the predict_proba() method to get probabilities for the predicted classes.
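A logistic regression baseline, mentioned above as the simplest starting point, can be fit in the same way. This is only a minimal sketch on the same scaled features; the variable names log_reg, Y_pred_log and Y_predProb_log are illustrative and not part of the original notebook:

from sklearn.linear_model import LogisticRegression

# Illustrative logistic regression baseline on the same scaled features
log_reg = LogisticRegression(max_iter=1000, random_state=0)
log_reg.fit(X_train, Y_train)
Y_pred_log = log_reg.predict(X_test)
Y_predProb_log = log_reg.predict_proba(X_test)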

2. Feature Selection:

The number of features can be reduced either by using dimensionality reduction techniques like PCA, by regularization methods like Ridge, or by selecting important features. We will use the feature_importances_ attribute of a tree-based classifier to get the most important features as shown below:

# ada_rf is a fitted tree-based ensemble trained earlier (its definition is in the GitHub notebook)
importances = ada_rf.feature_importances_
std = np.std([tree.feature_importances_ for tree in ada_rf.estimators_], axis=0)
feature_names = [f"feature {i}" for i in X.columns]
ada_rf_importances = pd.Series(importances, index=feature_names)

In the above code block, we use feature_importances_ to get the importance of each feature, which is then stored in a pandas Series. We can also plot the feature importances as shown below:
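A minimal sketch of how such a plot can be produced from the Series above (the exact styling of the figure below may differ):

# Bar plot of mean feature importances, with the spread across estimators as error bars
fig, ax = plt.subplots(figsize=(12, 5))
ada_rf_importances.plot.bar(yerr=std, ax=ax)
ax.set_title("Feature importances")
ax.set_ylabel("Mean importance")
plt.xticks(rotation=90)
fig.tight_layout()
plt.show()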

Important Features

It can be observed from the above plot that age, capital_loss, capital_gain, education_num and hours_per_week are the most important features for predicting income. A data dictionary of the important features is shown below:

Important column description

3. Model Evaluation:

To evaluate model performance, we plot the confusion matrix, print metric scores, and plot the ROC (Receiver Operating Characteristic) curve, the PR (Precision-Recall) curve, and the PR vs ROC curve. The dataset is imbalanced, so you should not rely on accuracy alone, as it is a biased indicator of model performance on imbalanced data. Since this is a repetitive process for all the models that we train, we define some helper functions to carry out these tasks for all the models without re-writing any code, as shown below:

from sklearn.metrics import (confusion_matrix, accuracy_score, precision_score,
                             recall_score, f1_score, roc_curve, precision_recall_curve)

def plot_ConfusionMatrix(Y_true, Y_pred):
    plt.title('Confusion Matrix')
    conf_matrix = confusion_matrix(Y_true, Y_pred)
    sns.heatmap(conf_matrix, annot=True, xticklabels=['N', 'Y'], yticklabels=['N', 'Y'], fmt='.5g')

def scores(Y_true, Y_pred):
    accuracy = accuracy_score(Y_true, Y_pred)
    print("Accuracy :: {}".format("%.2f" % accuracy))
    precision = precision_score(Y_true, Y_pred)
    print("Precision :: {}".format("%.2f" % precision))
    recall = recall_score(Y_true, Y_pred)
    print("Recall :: {}".format("%.2f" % recall))
    tn, fp, fn, tp = confusion_matrix(Y_true, Y_pred).ravel()
    specificity = tn / (tn + fp)
    print("Specificity :: {}".format("%.2f" % specificity))
    fscore = f1_score(Y_true, Y_pred)
    print("F1score :: {}".format("%.2f" % fscore))

In the above code block, the first function plot_ConfusionMatrix uses the confusion_matrix function from the sklearn.metrics library along with matplotlib and seaborn to plot a confusion matrix given the true and predicted values. The second function scores takes the true and predicted values as input and prints the accuracy, precision, recall, specificity and F1 score using the corresponding functions from sklearn.metrics.

We can also define a helper function for plotting the curves as shown below:

def curves(Y_true, Y_pred):
    plt.figure(figsize=(11, 5))
    plt.subplot(121)
    # ROC CURVE
    false_pr, true_pr, _ = roc_curve(Y_true, Y_pred[:, 1])
    plt.title('ROC (Receiver Operating Characteristics) curve')
    plt.plot([0, 1], [0, 1], '--')
    plt.plot(false_pr, true_pr, '.-', label='Roc_curve')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.legend()
    plt.subplot(122)
    # PR CURVE
    plt.title('Precision-Recall Curve')
    precision, recall, _ = precision_recall_curve(Y_true, Y_pred[:, 1])
    plt.plot(recall, precision, '.-', label='PR CURVE')
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.legend()
    plt.tight_layout()
    plt.show()
    # PR v/s ROC CURVE
    plt.title('PR v/s ROC CURVE')
    plt.plot(false_pr, true_pr, '.-', label='Roc_curve')
    plt.plot(recall, precision, '.-', label='PR CURVE')
    plt.legend()
    plt.show()

The function curves defined above takes the actual labels as Y_true and the predicted class probabilities (the output of predict_proba) as Y_pred. It plots the ROC curve, the PR curve, and the PR vs ROC curve, all with respect to the positive class; a usage sketch follows below.
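For example, a minimal sketch of how these helpers can be called for the linear SVM trained earlier (assuming Y_test, Y_pred_svm_lin and Y_predProb_svm_lin are still in scope):

# Confusion matrix and metric scores on the held-out test set
plot_ConfusionMatrix(Y_test, Y_pred_svm_lin)
plt.show()
scores(Y_test, Y_pred_svm_lin)

# curves expects the full predict_proba output; column 1 is the positive class
curves(Y_test, Y_predProb_svm_lin)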

An example output for random forest classifier is shown below. You can find the full code and output for all the models on GitHub.

Random Forest

In the above output, we can observe the metric values as well as the confusion matrix plot. You can choose recall or precision as your key metric for evaluating a model on an imbalanced dataset, but the choice depends on the cost function (whether a false negative or a false positive is more expensive). When the cost function is unknown, it is safe to use the F1 score as the key evaluation metric. The F1 score for the random forest is 0.66, which makes it one of the better models we have built on this dataset. Similarly, the output of the curves function for the random forest classifier is shown below:

Evaluation curves

4. Ensemble methods:

An ensemble method is a machine learning technique that combines several base models in order to produce one optimal predictive model. There are several ensemble methods, such as voting classifiers, bagging, pasting, AdaBoost and gradient boosting. We will show a few examples here, and you can find the rest in our GitHub repo.

A voting classifier is an ML estimator which trains several base models and makes its prediction by aggregating the findings of each base estimator. An example application of a voting classifier on the income census data is shown below:

Voting classifier

In the above code block, we use svm_rbf (an SVM with a Radial Basis Function (RBF) kernel), svm_lin (the linear SVM) and KNN_model (K Nearest Neighbors) as our base estimators. Then, using the VotingClassifier class from the sklearn.ensemble library, we define the voting classifier with soft voting, which predicts the class label based on the argmax of the sums of the predicted probabilities. It can be observed that the F1 score for this model is 0.65, which is slightly better than each of the individual models: svm_rbf (0.60), svm_lin (0.64) and KNN_model (0.60).
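Since the voting classifier code appears above only as a screenshot, here is a minimal sketch of how it could be written; the svm_rbf and KNN_model base estimators are assumed to be defined as described, and their exact hyperparameters are in the GitHub repo:

from sklearn.ensemble import VotingClassifier
from sklearn.neighbors import KNeighborsClassifier

# Assumed base estimators; the tuned hyperparameters are in the GitHub repo
svm_rbf = SVC(kernel='rbf', probability=True)
KNN_model = KNeighborsClassifier()

# Soft voting averages the predicted class probabilities of the base estimators
voting_clf = VotingClassifier(
    estimators=[('svm_rbf', svm_rbf), ('svm_lin', svm_lin), ('knn', KNN_model)],
    voting='soft')
voting_clf.fit(X_train, Y_train)
Y_pred_voting = voting_clf.predict(X_test)
scores(Y_test, Y_pred_voting)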

We have already covered bagging and pasting in the first data science project guide, so we are going to look into AdaBoost here. An AdaBoost classifier is a meta-estimator that begins by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset, with the weights of incorrectly classified instances adjusted so that subsequent classifiers focus more on difficult cases. Below is an application of AdaBoostClassifier on top of a DecisionTreeClassifier:

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

ada_clf_Dec = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=200,
                                 algorithm="SAMME.R", learning_rate=0.5, random_state=0)
ada_clf_Dec.fit(X_train, Y_train)
y_pred_Dec = ada_clf_Dec.predict(X_test)

The output of this Adaboost classifier is shown below:

Adaboost — decision tree

The F1 score of AdaBoost on decision trees is 0.68, which is an improvement over the other single classifiers we have seen so far. You can try out AdaBoost, bagging, pasting and gradient boosting on this and the rest of the classifiers; a gradient boosting sketch is shown below, and the full code and application are available on GitHub.
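For instance, a minimal gradient boosting sketch on the same features; the hyperparameters and the variable names gb_clf and Y_pred_gb here are illustrative, not the values tuned in the repo:

from sklearn.ensemble import GradientBoostingClassifier

# Illustrative hyperparameters; the tuned values are in the GitHub repo
gb_clf = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                    max_depth=3, random_state=0)
gb_clf.fit(X_train, Y_train)
Y_pred_gb = gb_clf.predict(X_test)
scores(Y_test, Y_pred_gb)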

Conclusion

We have covered all the steps required in a data science project. We started with collecting the income census data, then moved through EDA, feature engineering and model building, and finally evaluated the models on test data. The best model has a high accuracy of 86% and an F1 score of 0.68. We found that the number of years of education, capital gain, age and hours per week are some of the important factors in predicting income. Policy makers can leverage this information, for example by allocating more budget to education to improve per capita income, and any business can leverage this model to predict income and make informed decisions based on it.

This should serve as a guide for working on a data science project. Thank you for reading our blog, and if you want to continue learning, follow the next blog on Customer Churn prediction.

Github Link: https://github.com/Laxmisankritya/House-price-prediction

Thanks to Shivam Solanki for co-authoring the blog series.

Please feel free to leave your comments and suggestions below and connect with us on LinkedIn.

https://www.linkedin.com/in/shivam-solanki-2b288319/

