Person’s Income Prediction and Analysis using Python

Dipika Parbatsinh Pawar
Published in Clique Community
7 min read · Jan 16, 2021
[Fig: 1] Image Source: https://justcreative.com/2019/07/16/passive-income-ideas/

Introduction:

Predictive analysis is used to identify trends and behaviors so that we can make decisions about unknown events. Beyond data analysis, it combines data collection, pattern analysis, predictive modeling, and analysis of the predictions themselves. Coming back to the topic: in the end, what we all want is money. Income prediction matters both for understanding which combinations of age, education, marital status, and other attributes go with a higher income, and for business purposes. Now that the importance of predicting a person’s income is clear, here are the steps through which we can achieve the result.

[Fig: 2] The 8 steps of the pipeline of Predictive analysis

Step 1: Data collection

In the first step, there are two options for selecting a dataset:

i) We already have a problem statement and want to find data for it.

ii) We already have a dataset and want to retrieve information from it.

Here we have a person’s income prediction dataset with 48842 rows × 15 columns.

import pandas as pd

# Read data from the Drive folder
data = pd.read_csv('Drive/My Drive/adult.csv')
data
data.dtypes
[Fig: 3] Overview of the Dataset
[Fig: 4] Datatype of Each Column
  • Explanation of some columns: workclass is categorical, with values such as Private, Local-gov, etc.
  • fnlwgt represents the final weight, i.e. the census sampling weight of the record
  • education is the highest level of education the person has attained
  • The target column, income, is categorized as >50K or ≤50K

Step 2: The Usecase

  • The most important step of data analysis is forming use cases (research questions). These questions vary from person to person, and different use cases lead to different answers from the analysis.
  • Good use cases can lead to a better consumer experience and help a business analyse consumers’ buying habits, sales patterns, etc.
  • After forming the questions, sort them by importance and pick the relevant ones.

i) Prediction of “Income” from the best-correlated features among age, work-class, education, marital-status, gender, etc.

ii) Relation between education and income

iii) Age-wise marital status. Many more use cases can be formed.

Step 3: Data Cleaning

  • To work on our use cases, it is important to clean the data, because this improves data quality and leads to better prediction results.
  • We can clean the data by removing missing or null entries, columns/rows that contain only null entries, redundant columns/rows, columns that have only one unique value, etc.
  • Here the null entries appear in the form of “?”. Also, educational-num is just an encoded version of the education column, so it is an unnecessary repetition.
for col in data.columns:
    # Rows in which this column holds the "?" placeholder
    indexNames = data[data[col] == "?"].index
    # Delete these row indexes from the DataFrame
    data.drop(indexNames, inplace=True)
data
# Remove the redundant column
data = data.drop("educational-num", axis=1)
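The same cleaning can also be written without an explicit loop. This is only a sketch on a small made-up frame (the rows are invented for illustration, not taken from adult.csv), and it assumes, as above, that missing entries are stored as the string "?".

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset; all values are invented for illustration
toy = pd.DataFrame({
    "age": [25, 38, 28, 44],
    "workclass": ["Private", "?", "Local-gov", "Private"],
    "education": ["11th", "HS-grad", "Assoc-acdm", "Some-college"],
    "educational-num": [7, 9, 12, 10],
    "occupation": ["Machine-op-inspct", "Farming-fishing", "?", "Sales"],
})

# Treat "?" as a missing value, then drop every row that contains one
cleaned = toy.replace("?", np.nan).dropna()

# Drop the column that only repeats "education" in numeric form
cleaned = cleaned.drop(columns="educational-num")

print(cleaned.shape)
```

Converting "?" to NaN first has the side benefit that the usual pandas tools for missing data (isna, dropna, fillna) work on the frame afterwards.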

To find the most relevant features, the table below gives the correlation score of each column with the target column.

[Fig: 5] Correlation Table in Descending order of absolute correlation
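The code behind this table is not shown in the article. One possible way to build such a ranking, assuming the categorical columns are first turned into integer codes, is sketched below on invented data. The use of pd.factorize and the toy column values are choices made for this illustration only.

```python
import pandas as pd

# Invented rows; the real table is computed on the cleaned adult.csv data
toy = pd.DataFrame({
    "age": [25, 38, 28, 44, 52, 19],
    "education": ["HS-grad", "Bachelors", "HS-grad", "Masters", "Bachelors", "11th"],
    "hours-per-week": [40, 50, 40, 45, 60, 20],
    "income": ["<=50K", ">50K", "<=50K", ">50K", ">50K", "<=50K"],
})

# Give every non-numeric column integer codes so a correlation can be computed
codes = toy.copy()
for col in codes.columns:
    if not pd.api.types.is_numeric_dtype(codes[col]):
        codes[col] = pd.factorize(codes[col])[0]

# Pearson correlation of each feature with the target,
# ordered by absolute value from strongest to weakest
corr = codes.drop(columns="income").corrwith(codes["income"])
corr_table = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(corr_table)
```

Keep in mind that a correlation on integer codes of an unordered category depends on how the codes were assigned, so such a score is a rough guide rather than an exact measure.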

Step 4: Data Analysis

  • We have already created the use cases and cleaned the data, so here are a few analyses for those use cases.
  • An analysis can be presented as text, a table, a graph, etc.
  • Which type of visualization represents the data well depends on the combination of attribute types (numerical, categorical).

Education is a categorical variable, so a bar chart is the best choice for displaying its distribution.

[Fig: 6] Distribution of Education wrt Income
  • A person’s education plays the most important role in their income (from the correlation table, [Fig: 5]).
  • This graph shows how many records fall in >50K and ≤50K for each level of highest education.
  • As we can see, people whose highest degree is Bachelors have the highest count in the >50K income group.
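The counts behind such a bar chart can be tabulated with pd.crosstab. The rows below are invented for illustration; on the real data the same call would take data['education'] and data['income'], assuming those are the column names.

```python
import pandas as pd

# Invented records, only to show the shape of the result
toy = pd.DataFrame({
    "education": ["Bachelors", "HS-grad", "Bachelors", "Masters", "HS-grad", "Bachelors"],
    "income": [">50K", "<=50K", ">50K", ">50K", "<=50K", "<=50K"],
})

# Rows: education level, columns: income class, cells: number of records
counts = pd.crosstab(toy["education"], toy["income"])
print(counts)

# counts.plot(kind="bar") would draw the grouped bar chart (requires matplotlib)
```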

Workclass is a categorical variable, and a pie chart is the best choice for analysing its proportional distribution across the whole dataset.

[Fig: 7] Portion distribution of workclass
  • This pie chart shows the share of each workclass in the whole data.
  • It gives insight into how work is distributed in the income prediction dataset. For example, 73.7% of people are in the private sector.
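The percentages shown in a pie chart like this one are simply normalized value counts. A minimal sketch on made-up values (the real shares come from the workclass column of the dataset):

```python
import pandas as pd

# Ten invented records
workclass = pd.Series(
    ["Private", "Private", "Local-gov", "Private", "Self-emp-inc",
     "Private", "Local-gov", "Private", "Private", "State-gov"],
    name="workclass",
)

# Share of each category as a percentage of all records
share = (workclass.value_counts(normalize=True) * 100).round(1)
print(share)
```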

Age is numeric and marital-status is categorical. To relate a numeric feature to a categorical one, distribution plots are used.

[Fig: 8] Age wise marital status Usecase
  • Here is a plot to understand how marital-status relates to different age groups.
  • This violin plot shows the density of each marital-status with respect to age.
  • For example, the never-married group is densest in the 20–30 age range.
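A violin plot is a picture of the age distribution inside each group, so the same reading can be cross-checked numerically with per-group quantiles. The sketch below uses invented ages and assumes the column names age and marital-status.

```python
import pandas as pd

# Invented ages per marital-status group
toy = pd.DataFrame({
    "age": [22, 25, 28, 41, 47, 53, 36, 60],
    "marital-status": ["Never-married", "Never-married", "Never-married",
                       "Married-civ-spouse", "Married-civ-spouse", "Married-civ-spouse",
                       "Divorced", "Divorced"],
})

# Quartiles of age within each marital-status group
summary = toy.groupby("marital-status")["age"].quantile([0.25, 0.5, 0.75]).unstack()
print(summary)
```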

Step 5: Data Preprocessing(Preparation)

  • To get better results from training the model, we first need to preprocess the data.
  • This includes encoding the categorical attributes to make them numerical, normalizing the numerical attributes, and selecting the best-related features.

Normalization

The main purpose is to bring the numerical values of all columns onto the same scale.

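from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()  # not used below; the loop applies the same min-max scaling by hand
numerical = ['age','capital-gain','capital-loss','hours-per-week']

for i in numerical:
    # Shift the column so its minimum is 0 ...
    data[i] -= data[i].min()
    # ... then divide by the new maximum so it ends up in [0, 1]
    data[i] /= data[i].max()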
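Subtracting the minimum and then dividing by the new maximum is the min-max formula (x − min) / (max − min), which is also what scikit-learn's MinMaxScaler computes with its default settings. The check below, on invented values, is a small sketch of that equivalence.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Invented values for two of the numerical columns
toy = pd.DataFrame({
    "age": [25, 38, 28, 44],
    "hours-per-week": [40, 50, 40, 30],
})
numerical = ["age", "hours-per-week"]

# Manual min-max scaling: (x - min) / (max - min), column by column
col_min = toy[numerical].min()
col_max = toy[numerical].max()
manual = (toy[numerical] - col_min) / (col_max - col_min)

# The same transformation with scikit-learn
scaled = MinMaxScaler().fit_transform(toy[numerical])

print(np.allclose(manual.to_numpy(), scaled))
```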

Encoding:

Encoding converts each categorical variable into integer codes from 0 to n−1, where n is the total number of categories. This matters because numeric values are easy to process and most ML models need numeric input and output.

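# One-hot encode the features. The target is left out so that it does not
# leak into the inputs (assumes the target column is named "income").
features_final = pd.get_dummies(data.drop("income", axis=1))

# Label-encode every non-numerical column of data:
# each category gets an integer code in order of first appearance
for col in data.columns:
    if col not in numerical:
        temp = {}
        for i, category in enumerate(data[col].unique()):
            temp[category] = i
        data[col] = data[col].map(temp)

# Encoded target used for training
income = data["income"]

encoded = list(features_final.columns)
print("{} total features after one-hot encoding".format(len(encoded)))
encoded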
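The code above touches on two different encodings, so it may help to see them side by side on a single invented column. This is only an illustration of the difference, not a statement about which one the final model used.

```python
import pandas as pd

# Invented column with three categories
toy = pd.DataFrame({"workclass": ["Private", "Local-gov", "Private", "State-gov"]})

# Label encoding: one integer per category, in order of first appearance
mapping = {cat: i for i, cat in enumerate(toy["workclass"].unique())}
label_encoded = toy["workclass"].map(mapping)

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(toy["workclass"])

print(label_encoded.tolist())
print(list(one_hot.columns))
```

Label encoding keeps a single column but implies an order between categories, while one-hot encoding avoids that implied order at the cost of more columns.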

Step 6: Model Training

  • Now we are ready to train a model. Our target column, income, is categorical, so classification algorithms are used because we want the output to fall into a category.
  • After trying a few classification algorithms in scikit-learn, such as SVM, Regression, Naive Bayes, Random Forest, etc., here are some results in terms of accuracy.
  • For these models, the inputs are the selected most-correlated features and the output is the predicted income (>50K / ≤50K).

Split the dataset 70–30 into training and test sets

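from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(features_final, income, test_size=0.30, random_state=1)
# Oversample the minority class in the training set only.
# fit_resample replaces fit_sample, which newer imbalanced-learn releases no longer provide.
X_train, Y_train = SMOTE().fit_resample(x_train, y_train)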
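To see what a 70–30 split does to an imbalanced target, here is a small sketch on synthetic data. The stratify=y argument is an addition for this illustration and is not used in the code above; it keeps the class ratio identical in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 200 records, 3 invented features, 75% / 25% class imbalance
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.array([0] * 150 + [1] * 50)

# 70/30 split; stratify=y preserves the class ratio in train and test
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

print(len(x_train), len(x_test))
print(np.bincount(y_train), np.bincount(y_test))
```

Any oversampling such as SMOTE should then be applied to the training part only, so that the test set still reflects the real class balance.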

Decision Tree Algorithm

from sklearn import tree
from sklearn.metrics import accuracy_score

dt = tree.DecisionTreeClassifier(criterion='entropy',
                                 min_samples_split=8, max_depth=10)
dt = dt.fit(X_train, Y_train)
y_pred_dt = dt.predict(x_test)
y_train_score_dt = dt.predict(X_train)
# Test accuracy, then training accuracy, as percentages
print(accuracy_score(y_test, y_pred_dt) * 100)
print(accuracy_score(Y_train, y_train_score_dt) * 100)

Random Forest Algorithm

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rf = RandomForestClassifier(min_samples_split=20)
rf.fit(X_train, Y_train)
y_pred = rf.predict(x_test)
y_train_pred = rf.predict(X_train)  # predictions on the training set
# Test accuracy, then training accuracy, as percentages
print(accuracy_score(y_test, y_pred) * 100)
print(accuracy_score(Y_train, y_train_pred) * 100)

Step 7: Performance Evaluation

After implementing those 5 algorithms, Random Forest performs best. The best model is chosen not only on accuracy but also on precision, recall, and the true positive/negative counts (classification report), etc.

[Fig: 9] 1 and 0 denote income >50K and ≤50K
  • This is the confusion matrix for the Random Forest implementation.
  • The test dataset has 13567 records in total, out of which 8687 are classified as TP, and so on.
[Fig: 10] Classification Report
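The figures above come from the trained model. As a sketch of how such numbers are obtained in scikit-learn, the example below uses ten invented labels (not the article's results) and derives precision and recall from the confusion matrix.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Invented labels: 1 means income >50K, 0 means income <=50K
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)  # of the records predicted as 1, how many really are 1
recall = tp / (tp + fn)     # of the records that really are 1, how many were found

print(cm)
print(precision, recall)
print(classification_report(y_true, y_pred))
```

On an imbalanced target, a model can reach a high accuracy while still missing many of the >50K records, which is why recall and precision are worth reading alongside accuracy.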

Step 8: Final predict analysis

  • Here is the final outcome as a gender-wise comparison of real and predicted income.
  • Here income = 0 means ≤50K and 1 means >50K.
  • In the graphs below, predictions are wrong more often for males than for females.
[Fig: 11] Comparison of predicted and real income gender wise
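A comparison like this can be backed by a misclassification rate per gender. The sketch below uses invented predictions and hypothetical column names (gender, actual, predicted); the numbers are not the article's results.

```python
import pandas as pd

# Invented real and predicted labels for eight people
results = pd.DataFrame({
    "gender": ["Male", "Male", "Male", "Male", "Female", "Female", "Female", "Female"],
    "actual": [1, 0, 1, 0, 0, 0, 1, 0],
    "predicted": [0, 0, 1, 1, 0, 0, 1, 1],
})

# Flag every record whose prediction differs from the real label
results["wrong"] = results["actual"] != results["predicted"]

# Share of misclassified records within each gender
error_rate = results.groupby("gender")["wrong"].mean()
print(error_rate)
```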

Conclusion:

According to the correlation table, only some of the features are directly related to income, and the selection of those columns plays an important role in model accuracy. Random Forest suits this dataset best, and the misprediction rate is higher for males than for females. A dataset will not always be clean or small, so we need a method that generalizes to every kind of dataset. The workflow explained here step by step is one you can map onto your own dataset. Finally, the right visualization depends on the combination of attribute types (numerical, categorical), and the choice of classification algorithm depends entirely on the dataset.

Full code can be found here. The implementation uses Python exclusively.


Dipika Parbatsinh Pawar
Student at Ahmedabad University | Data Science Enthusiast | Backend Developer