Salary Prediction Classification

Dec 13, 2023

Overview

This project analyzes census data from the 1994 Census database. The dataset consists of fourteen predictor variables and one target variable, salary, which indicates whether or not a person makes over 50K a year.

In this project, Python was the language of choice, although R could certainly have been used as well. I personally find Python better suited than R here, since the regression analysis portion of this project relies on machine learning techniques that I find easier to apply in Python.

Data was obtained from Kaggle, an online website that hosts various data science competitions. The following is the link to the CSV file that was used for this project: https://www.kaggle.com/datasets/ayessa/salary-prediction-classification

Objective

The objective of this project is to predict whether or not a person makes over 50K a year, based on an employee’s occupation, education, marital status, background, and other employee attributes. To achieve this objective, logistic regression models will be used. Moreover, since the dataset is large, a train-test split methodology will be implemented as well.

Review of Data Sources

The data that was used for this assignment (salary.csv) was provided by Kaggle and the pandas library in Python was used to load the data into the dataframe: salary_data (Salary Dataset).

The dataframe contains 15 columns and 32561 rows. None of the columns contained null values, so imputation was not required. However, the columns were renamed and the categorical columns were converted to numerical columns so that they could be used as predictor variables.
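The loading and inspection steps described above can be sketched as follows. This is a minimal sketch, not the original notebook code: the helper name load_salary_data, the in-memory sample rows, and the renaming rule (dropping hyphens from column names) are assumptions made for illustration.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for salary.csv so the sketch runs anywhere.
# The rows and the subset of columns shown here are illustrative assumptions.
SAMPLE_CSV = """age,workclass,education-num,marital-status,hours-per-week,salary
39,State-gov,13,Never-married,40,<=50K
50,Self-emp-not-inc,13,Married-civ-spouse,13,<=50K
52,Self-emp-not-inc,9,Married-civ-spouse,45,>50K
"""


def load_salary_data(source):
    """Read the CSV, rename the columns, and count nulls per column."""
    salary_data = pd.read_csv(source, skipinitialspace=True)
    # Assumed renaming rule: drop hyphens, e.g. hours-per-week -> hoursperweek.
    salary_data.columns = [name.replace("-", "") for name in salary_data.columns]
    null_counts = salary_data.isnull().sum()
    return salary_data, null_counts


salary_data, null_counts = load_salary_data(io.StringIO(SAMPLE_CSV))
print(salary_data.shape)
print(null_counts)
```

With the real file, the same helper would be called as load_salary_data("salary.csv"), assuming the CSV sits in the working directory.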

To convert the categorical columns to numerical columns, a mapping was implemented in which a unique integer is assigned to each category of a categorical variable. For example, the categorical variable marital-status has seven categories: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse. Therefore, a value of 1 is assigned to the category Married-civ-spouse, a value of 2 is assigned to the category Divorced, a value of 3 is assigned to the category Never-married, etc.

The next step was to perform exploratory data analysis and the following table summarizes the variables that are present in the employee salary prediction dataframe.
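As an illustration of this kind of mapping, here is a minimal sketch using pandas. The first three codes come from the example above; the codes 4 through 7 simply follow the order in which the categories are listed and are an assumption, as is the helper name encode_column.

```python
import pandas as pd

# Codes 1-3 follow the text; codes 4-7 are assumed to follow the listed order.
MARITAL_STATUS_CODES = {
    "Married-civ-spouse": 1,
    "Divorced": 2,
    "Never-married": 3,
    "Separated": 4,
    "Widowed": 5,
    "Married-spouse-absent": 6,
    "Married-AF-spouse": 7,
}


def encode_column(series, codes):
    """Replace each category label with its integer code."""
    encoded = series.str.strip().map(codes)
    if encoded.isnull().any():
        # Fail loudly rather than silently producing NaN for unmapped labels.
        unknown = sorted(set(series[encoded.isnull()]))
        raise ValueError(f"unmapped categories: {unknown}")
    return encoded.astype(int)


example = pd.Series(["Divorced", " Never-married", "Married-civ-spouse"])
encoded_example = encode_column(example, MARITAL_STATUS_CODES)
print(encoded_example.tolist())
```

Note that integer codes impose an ordering on categories that have none, so statistics such as the mean of an encoded column depend on the arbitrary choice of codes.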

Exploratory Data Analysis (EDA)

EDA was the next step of this project, with the goal of getting a better understanding of the data at large. The EDA comprised three components: descriptive statistics, histograms, and correlation analysis. For the purposes of this article, I will focus on the histograms and the correlation analysis, since both were instrumental in the subsequent regression analysis portion of this project.

Histograms were generated to better understand the underlying distribution of the independent variables while correlation analysis was instrumental in determining the predictor variables that will ultimately be used to predict whether or not a person makes over 50K a year. In particular, the EDA focused on the following aspects of the employee salary prediction dataset:

Occupation (work class (Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked), occupation (Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces), hours per week)
Education (education (Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool), education num)
Marital Status (marital status (Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse), relationship (Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried))
Employee Background (age, race (White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black), sex, native country (United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US (Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands))
Other attributes (capital gain, capital loss)
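The descriptive statistics and histogram counts for a single column can be sketched as follows. This is a minimal sketch on made-up values, not the project's data; the function name summarize_distribution and the choice of three bins are illustrative assumptions.

```python
import numpy as np
import pandas as pd


def summarize_distribution(values, bins=10):
    """Return mean, sample standard deviation, skewness, and histogram counts."""
    s = pd.Series(values, dtype="float64")
    counts, edges = np.histogram(s.to_numpy(), bins=bins)
    return {
        "mean": s.mean(),
        "std": s.std(),    # pandas uses the sample standard deviation (ddof=1)
        "skew": s.skew(),  # > 0 suggests a right (positive) skew
        "counts": counts.tolist(),
        "edges": edges.tolist(),
    }


# Made-up values with most mass on the left and a long right tail.
summary = summarize_distribution([1, 1, 1, 1, 2, 2, 3, 7], bins=3)
print(summary["mean"], summary["skew"], summary["counts"])
```

A positive skew value together with counts that fall off toward the right is what the histograms below describe as a positively (right) skewed distribution.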

Histograms

Occupation

The variable work class seems to mostly resemble a positively (right) skewed distribution, the variable hours per week worked seems to mostly resemble a normal distribution, and the variable occupation seems to mostly resemble a multimodal distribution. The means of the variables work class, occupation, and hours per week worked are 3.30997, 25.666411, and 40.437456 respectively while the standard deviations are 1.225728, 3.386119, and 12.347429 respectively. The distributions of the variables imply the following:

In the distribution of work class, most employees fall under work class category 3 (Private), with a long tail of employees who fall under other work class categories (4 (Federal-gov), 5 (Local-gov), 6 (unknown work class category), 7 (Self-emp-inc))
In the distribution of hours per week worked, most employees work about 40 hours per week, with some employees who work fewer hours per week and some employees who work more hours per week
In the distribution of occupation, multiple peaks are present. In particular, peaks are present at approximately 2.5, 5, and 7.5

Education

The variable education seems to mostly resemble a positively (right) skewed distribution while the variable education num seems to mostly resemble a normal distribution. The means of the variables education and education num are 4.424465 and 10.080679 respectively while the standard deviations are 3.453582 and 2.572720 respectively. The distributions of the variables imply the following:

In the distribution of education, most employees fall under education categories 2 (HS-grad) and 6 (Some-college), with a long tail of employees who fall under other education categories
In the distribution of education num, the most common value is nine years of education, with some employees who have fewer years of education and some employees who have more years of education

Marital Status

The variables marital status and relationship both seem to mostly resemble a positively (right) skewed distribution. The means of the variables marital status and relationship are 2.083781 and 2.542397 respectively while the standard deviations are 1.251381 and 1.437431 respectively. The distributions of the variables imply the following:

In the distribution of marital status, most employees fall under marital status category 2 (Married-civ-spouse), with a long tail of employees who fall under other marital status categories (3 (Divorced), 4 (Married-spouse-absent), 5 (Separated), 6 (Married-AF-spouse), 7 (Widowed))
In the distribution of relationship, most employees fall under relationship category 2 (Husband), with a long tail of employees who fall under other relationship categories (3 (Wife), 4 (Own-child), 5 (Unmarried), 6 (Other-relative))

Employee Background

The variables age, race, and native country all seem to mostly resemble a positively (right) skewed distribution while for the variable sex, the distribution seems to be a bit unclear.

The mean of the variables age, race, sex, and native country are 38.581647, 1.221707, 1.330795, and 2.290317 respectively while the standard deviations are 13.640433, 0.627348, 0.470506, and 5.045373 respectively. The distributions of the variables age, race, sex, and native country imply the following:

In the distribution of age, most employees are approximately 40 years old, with a long tail of employees who are older
In the distribution of race, most employees fall under racial category 1 (White), with a long tail of employees who fall under other racial categories (2 (Black), 3 (Asian-Pac-Islander), 4 (Amer-Indian-Eskimo))
In the distribution of native country, most employees fall under native country category 0 (United-States), with a long tail of employees who fall under other categories as well
In the distribution of sex, the majority of employees are male while a good number of employees are female as well

Correlation Analysis

Correlation matrices were generated to better understand the relationship between the variables of interest and the dependent (response) variable (salary), which represents whether a person makes over 50K a year. The correlation matrices will also be crucial in determining which variables of interest best predict whether a person makes over 50K a year. In other words, the correlation matrices will be used to determine which variables of interest will end up being the independent variables in the regression model.

It's also worth noting that variables with a correlation greater than 0.3 or less than -0.3 were treated as suitable variables for predicting whether a person makes over 50K a year, since a correlation of 0.3 indicates a moderate positive relationship while a correlation of -0.3 indicates a moderate negative relationship. This threshold is not a hard and fast rule for choosing independent variables, but the correlation values serve as a useful guideline for choosing appropriate predictor variables.
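The 0.3 guideline can be sketched as a small filter over the correlation matrix. This is a minimal sketch on made-up numbers; the helper name select_by_correlation and the toy columns are assumptions, and the target is assumed to be encoded as 0 (not over 50K) and 1 (over 50K).

```python
import pandas as pd


def select_by_correlation(df, target, threshold=0.3):
    """Keep predictors whose Pearson correlation with the target exceeds the threshold in absolute value."""
    correlations = df.corr()[target].drop(target)
    return correlations[correlations.abs() > threshold]


# Made-up data: educationnum tracks salary, noise is uncorrelated with it.
toy = pd.DataFrame({
    "educationnum": [9, 9, 10, 10, 13, 13, 14, 16],
    "noise": [1, 2, 1, 2, 1, 2, 1, 2],
    "salary": [0, 0, 0, 0, 1, 1, 1, 1],
})
selected = select_by_correlation(toy, target="salary")
print(selected)
```

Because the categorical predictors here are arbitrary integer codes, their Pearson correlations with salary should be read with caution; a low value does not by itself rule a variable out.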

Occupation

For occupation, the correlation between the dependent variable salary and the independent variables work class, occupation, and hours per week was determined. The correlation values seem to indicate a weak positive relationship between hours per week and salary. Likewise, the correlation values seem to indicate a negligible relationship between occupation and salary as well as a negligible relationship between work class and salary. Therefore, the variables work class, occupation, and hours per week don't seem to be good predictor variables of salary.

Education

For education, the correlation between the dependent variable salary and the independent variables education and education num was determined. The correlation values seem to indicate a moderate positive relationship between education num and salary and a negligible relationship between education and salary. Therefore, the variable education num is a good predictor variable of salary while the variable education isn't a good predictor variable of salary.

Marital Status

For marital status, the correlation between the dependent variable salary and the independent variables marital status and relationship was determined. The correlation values seem to indicate a negligible relationship between relationship and salary as well as no relationship (zero correlation) between marital status and salary. Therefore, neither marital status nor relationship is a good predictor variable of salary.

Employee Background

For employee background, the correlation between the dependent variable salary and the independent variables age, race, sex, and native country was determined. For the variables age and sex, the correlation values seem to indicate a weak positive relationship between salary and age as well as a weak negative relationship between salary and sex. Meanwhile, the correlation values seem to indicate a negligible relationship between race and salary as well as a negligible relationship between native country and salary. Therefore, none of these variables are good predictor variables of salary.

Regression Analysis

Now that the EDA portion has been completed, the last step is to perform a regression analysis in order to determine the best performing model and ultimately which model best predicts salary, the dependent (response) variable. Based on the results of the correlation analysis, the following variables were chosen as independent (predictor) variables for predicting whether a person makes over 50K a year: ‘workclass’, ‘occupation’, ‘hoursperweek’, ‘education’, ‘educationnum’, ‘maritalstatus’, ‘relationship’, ‘age’, ‘race’, ‘sex’, ‘nativecountry’. A total of eleven such independent variables were chosen to predict whether a person makes over 50K a year.

As part of the regression analysis, a total of four initial models were created to predict whether a person makes over 50K a year:

Model 1 — predict whether a person makes over 50K a year using workclass, occupation, hours per week
Model 2 — predict whether a person makes over 50K a year using education and number of years of education
Model 3 — predict whether a person makes over 50K a year using marital status and relationship
Model 4 — predict whether a person makes over 50K a year using age, race, sex, and native country
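The four initial models can be written down as feature groups. This is a minimal sketch: the column names are the ones listed above, while the dictionary name FEATURE_GROUPS and its keys are illustrative assumptions.

```python
# One entry per initial model; column names follow the renamed dataframe columns.
FEATURE_GROUPS = {
    "model_1": ["workclass", "occupation", "hoursperweek"],
    "model_2": ["education", "educationnum"],
    "model_3": ["maritalstatus", "relationship"],
    "model_4": ["age", "race", "sex", "nativecountry"],
}

# Flatten the groups to confirm that every chosen predictor is used exactly once.
all_features = [name for group in FEATURE_GROUPS.values() for name in group]
print(len(all_features), len(set(all_features)))
```

Keeping the groups in one place makes it easy to loop over them and fit one logistic regression per group on the same train-test split.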

In addition, seven additional models were created in order to evaluate whether initial model performance could be improved (models 6, 7, 8, 9, 11, 12, 14). In total, eleven models were created to predict whether a person makes over 50K a year. To evaluate model performance, accuracy, AUC, and AIC were used, and the model with the highest accuracy, the highest AUC, and the lowest AIC will ideally be chosen as the model of choice for predicting whether a person makes over 50K a year. A high accuracy indicates that the model classified a large share of observations correctly, while a high AUC indicates that the model separates the two classes well. In addition, a low AIC indicates a good model fit.
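To make the three metrics concrete, here is a minimal sketch in plain Python that computes each one from true labels and predicted probabilities. The function names, the 0.5 cutoff, and the four made-up observations are assumptions for illustration; the original analysis presumably relied on library routines instead.

```python
import math


def accuracy(y_true, y_prob, cutoff=0.5):
    """Share of observations whose thresholded prediction matches the label."""
    predictions = [1 if p >= cutoff else 0 for p in y_prob]
    return sum(int(p == y) for p, y in zip(predictions, y_true)) / len(y_true)


def auc(y_true, y_prob):
    """Probability that a random positive scores above a random negative (ties count half)."""
    positives = [p for p, y in zip(y_prob, y_true) if y == 1]
    negatives = [p for p, y in zip(y_prob, y_true) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in positives for b in negatives)
    return wins / (len(positives) * len(negatives))


def aic(y_true, y_prob, n_params):
    """AIC = 2k - 2 ln L, where L is the Bernoulli likelihood of the fitted probabilities."""
    log_likelihood = sum(math.log(p) if y == 1 else math.log(1 - p)
                         for p, y in zip(y_prob, y_true))
    return 2 * n_params - 2 * log_likelihood


# Made-up labels and fitted probabilities.
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]
acc_value = accuracy(y_true, y_prob)
auc_value = auc(y_true, y_prob)
aic_value = aic(y_true, y_prob, n_params=2)
print(acc_value, auc_value, round(aic_value, 4))
```

Since the log-likelihood is a sum over observations, AIC grows with the number of rows, so it is mainly useful for comparing models fit to the same data rather than as an absolute score.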

Moreover, the classification report will be used to evaluate model performance with respect to other metrics such as precision, recall, F1 score, and support, while the confusion matrix will be used to evaluate model performance with respect to the number of true positives, true negatives, false positives, and false negatives. Since a lot of models were formulated as part of the analysis, only the best performing models with respect to accuracy and AIC will be highlighted. It's worth noting that a train-test split (70% train, 30% test) was used for this project since the employee salary dataset was large (32561 rows × 15 columns).
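The 70/30 split can be sketched with scikit-learn's train_test_split. This is a minimal sketch that uses placeholder arrays of the same length as the dataset rather than the real features; the random_state value is an arbitrary assumption, since the seed used in the original analysis is not stated.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 32561  # number of rows in the salary dataframe

# Placeholder features and labels; only the row count matters for this sketch.
X = np.arange(n_rows).reshape(-1, 1)
y = np.zeros(n_rows, dtype=int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42  # random_state is an assumed seed
)
print(len(X_train), len(X_test))
```

With test_size=0.3, scikit-learn rounds the test set up to 9,769 rows, which agrees with the support counts in the classification report (7,455 + 2,314 = 9,769), leaving 22,792 rows for training.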

Initial Models

Model Accuracy, AUC, AIC

Model 2 is the best performing model with respect to accuracy while model 4 is the best performing model with respect to AUC and AIC. Model 2 has the highest accuracy compared to the other models (78.23%) while model 4 has the highest AUC (0.722670) and lowest AIC (32788.53) compared to the other models. An accuracy of 78.23% for model 2 implies that model 2 made the correct prediction for whether a person makes over 50K a year 78.23% of the time. An AUC of 0.722670 for model 4 implies acceptable discrimination, meaning that model 4 does a decent job of correctly classifying observations into categories. An AIC of 32788.53, although the lowest of the four initial models, is still large, which suggests a poor model fit.

With regard to accuracy, models 3, 1, and 4 all have similar model performance, and with regard to AUC, only model 2 has an AUC similar to that of model 4. Meanwhile, models 1 and 3 have a significantly lower AUC compared to models 4 and 2, and such AUC values indicate poor discrimination, meaning that models 1 and 3 do a poor job of correctly classifying observations into categories. With regard to AIC, models 3, 1, and 2 all have a similar model fit, but a worse model fit when compared to model 4.

It's worth noting that the worst performing models with respect to accuracy, AUC, and AIC are models 4, 3, and 2 respectively. Model 4 has an accuracy of 74.77%, model 3 has an AUC of 0.557392, and model 2 has an AIC of 37,609.33.

Classification Report

Model 2

Precision: Out of all the employees that the model predicted would make more than 50k, 62% actually did.

Recall: Out of all the employees that actually did make more than 50k, the model predicted this outcome correctly for only 21% of those employees.

F1-Score: Since this value isn’t very close to 1, it tells us that the model does a poor job of predicting whether a person makes over 50K a year.

Support: These values simply tell us how many employees belonged to each class in the test dataset. We can see that among the employees in the test dataset, 7,455 employees did not make more than 50k per year while 2,314 employees did make more than 50k per year

Model 4

Precision: Out of all the employees that the model predicted would make more than 50k, 32% actually did.

Recall: Out of all the employees that actually did make more than 50k, the model predicted this outcome correctly for only 6% of those employees.

F1-Score: Since this value isn’t very close to 1, it tells us that the model does a poor job of predicting whether a person makes over 50K a year.

Confusion Matrix

Model 2

Number of true positive (predicted = true, actual = true) predictions: 491
Number of true negative (predicted = false, actual = false) predictions: 7,151
Number of false positive (predicted = true, actual = false) predictions: 304
Number of false negative (predicted = false, actual = true) predictions: 1,823

Model 4

Number of true positive (predicted = true, actual = true) predictions: 140
Number of true negative (predicted = false, actual = false) predictions: 7,164
Number of false positive (predicted = true, actual = false) predictions: 291
Number of false negative (predicted = false, actual = true) predictions: 2,174
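The four counts for each model can be cross-checked against the classification report and the reported accuracy. The following is a minimal sketch that takes "makes over 50K" as the positive class, so that the 2,314 employees who actually made more than 50k split into true positives and false negatives, and the 7,455 who did not split into true negatives and false positives. The helper name classification_metrics is illustrative and not part of the original analysis.

```python
def classification_metrics(tp, fp, fn, tn):
    """Derive precision, recall, F1, and accuracy from confusion matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Positive class = makes over 50K; counts taken from the confusion matrices.
model_2 = classification_metrics(tp=491, fp=304, fn=1823, tn=7151)
model_4 = classification_metrics(tp=140, fp=291, fn=2174, tn=7164)

for name, metrics in (("Model 2", model_2), ("Model 4", model_4)):
    print(name, {key: round(value, 4) for key, value in metrics.items()})
```

Under this reading the counts reproduce the precision (62% and 32%), recall (21% and 6%), and accuracy (78.23% and 74.77%) reported for models 2 and 4, and they show that both models predict the over-50K class for only a small share of the employees who actually earn that much.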

Additional Models

Model Accuracy, AUC, AIC

Model 6 is the best performing model with respect to accuracy, AUC, and AIC. Model 6 has an accuracy of 80.4%, an AUC of 0.804931, and an AIC of 27669.74. An accuracy of 80.4% implies that model 6 made the correct prediction for whether a person makes over 50K a year 80.4% of the time. An AUC of 0.804931 implies excellent discrimination, meaning that model 6 does a good job of correctly classifying observations into categories. An AIC of 27669.74, although the lowest of all the models, is still large, which suggests a poor model fit.

With regard to accuracy, models 12, 7, 11, 9, 8, and 4 all perform similarly to model 6, and with regard to AUC, model 12 is similar to model 6 as well. Meanwhile, models 9, 7, 11, 14, and 8 all have much lower AUC values than models 6 and 12. Their AUC values imply acceptable discrimination, meaning that models 9, 7, 11, 14, and 8 do a decent job of correctly classifying observations into categories, while models 6 and 12 have AUC values that suggest excellent discrimination, meaning that models 6 and 12 do a good job of correctly classifying observations into categories.

It's worth noting that the worst performing models with respect to accuracy, AUC, and AIC are models 14 and 8. Model 14 has an accuracy of 74.45% while model 8 has an AUC of 0.705008 and an AIC of 32998.07.
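The verbal labels attached to AUC values throughout this article can be sketched as a small lookup. The cutoffs below (0.7, 0.8, 0.9) are an assumption based on a commonly used rule of thumb and are chosen to be consistent with how the AUC values are described here; the function name describe_auc is illustrative.

```python
def describe_auc(auc_value):
    """Map an AUC value to a rule-of-thumb label (assumed cutoffs: 0.7, 0.8, 0.9)."""
    if auc_value < 0.7:
        return "poor"
    if auc_value < 0.8:
        return "acceptable"
    if auc_value < 0.9:
        return "excellent"
    return "outstanding"


# AUC values reported in this article for models 3, 4, 8, and 6.
reported = {"model_3": 0.557392, "model_4": 0.722670, "model_8": 0.705008, "model_6": 0.804931}
labels = {name: describe_auc(value) for name, value in reported.items()}
print(labels)
```

These bands are a convention rather than a statistical test, so a model sitting just above a cutoff, such as model 6, is not meaningfully different from one sitting just below it.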