Classification and visualization of individuals’ annual income using US Census Bureau data of 05/19/96
Introduction
This report aims to answer four business questions:
- Identify the most influential variables in classifying whether an individual’s annual income exceeds 50k USD, based on US census data.
- Determine how these variables influence the classification process.
- Are the variables Sex and Race important for determining either class in this data set?
- How are the people in each class distributed by native country in this data set?
The data set comes from the University of Toronto (URL: http://www.cs.toronto.edu/~delve/data/adult/desc.html ), as listed in: https://github.com/caesar0301/awesome-Public-datasets#machine-learning
Once the rules of the decision tree are obtained, we implement them in Tableau Public ( https://public.tableau.com ) to visualize the behavior of the variables and to better understand how the classification process decides that annual income exceeds 50k USD.
Data cleaning and model generation
We show the script in the R language ( https://cran.r-project.org/ ) used for data cleaning and model generation.
Training set

Notice that some variables are not well formatted, and some values are ? (NA).

Now the variables are well formatted.
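As a hedged illustration of this cleaning step, here is a minimal Python sketch (standard library only, not the report's R script). It trims whitespace, treats ? as a missing value, casts numeric fields, and drops incomplete rows. The sample rows, the reduced column list, and the choice to drop rows with missing values are assumptions made for illustration.

```python
import csv
import io

# Hypothetical sample in the raw layout: values are separated by ", "
# and missing entries are written as "?".
RAW = """39, State-gov, Bachelors, 13, <=50K
54, ?, HS-grad, 9, <=50K
52, Self-emp-inc, HS-grad, 9, >50K
"""
COLUMNS = ["age", "workclass", "education", "education_num", "class"]
NUMERIC = {"age", "education_num"}


def clean(raw_text):
    """Strip whitespace, map "?" to None, cast numeric fields, drop incomplete rows."""
    rows = []
    for fields in csv.reader(io.StringIO(raw_text)):
        if not fields:
            continue  # skip blank lines
        values = [f.strip() for f in fields]
        record = {c: (None if v == "?" else v) for c, v in zip(COLUMNS, values)}
        if None in record.values():
            continue  # assumption: rows with any missing value are discarded
        for name in NUMERIC:
            record[name] = int(record[name])
        rows.append(record)
    return rows


cleaned = clean(RAW)
print(len(cleaned), cleaned[0])
```

Stripping before comparing with "?" matters because the raw values carry a leading space after each comma.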

A summary of the training data set is shown below.

Notice that our training data set matches the summary given in the file adult.names.
The dependent variable (class) is unbalanced: class ‘<=50K’ has 22654 observations (75.11%) and ‘>50K’ has 7508 observations (24.89%).
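The percentages follow directly from the class counts. The short Python sketch below (not the report's R code) recomputes them; the helper name class_shares is hypothetical.

```python
from collections import Counter


def class_shares(labels):
    """Return {label: (count, percentage rounded to 2 decimals)}."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, round(100 * n / total, 2)) for label, n in counts.items()}


# Class counts of the training set as reported in the text.
labels = ["<=50K"] * 22654 + [">50K"] * 7508
shares = class_shares(labels)
print(shares)  # {'<=50K': (22654, 75.11), '>50K': (7508, 24.89)}
```

With roughly three quarters of the observations in one class, a model that always predicts ‘<=50K’ would already have an error near 24.89%, which is a useful baseline when reading the error rates reported later.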
Let’s build a decision tree.

Notice that relationship is the most important variable. Let’s look at the decision tree in more detail.

Notice that the variables relationship and marital_status have the same importance; nevertheless, relationship is the one used. Our tree also uses capital_gain and education_num (question 2 is answered by interpreting this decision tree). Variables such as Sex and Race are not important for the classification (question 3 is answered by interpreting this decision tree).
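To make the rules concrete, the following Python sketch encodes a tree of this shape. It is reconstructed only from the thresholds quoted in the Visualization and Conclusions sections (education_num >= 12, capital_gain >= 5096, capital_gain >= 7074). The leaf labels are assumptions, since the tree figure is not reproduced in the text; in particular, the ‘>50K’ label for married people with education_num >= 12 is assumed.

```python
MARRIED = {"Husband", "Wife"}


def classify(relationship, education_num, capital_gain):
    """Assumed reconstruction of the decision tree rules; not the fitted R model."""
    if relationship in MARRIED:
        if education_num >= 12:
            return ">50K"  # assumption: this leaf predicts '>50K'
        # Married with lower education: the lower capital_gain threshold applies.
        return ">50K" if capital_gain >= 5096 else "<=50K"
    # Not-in-family, Other-relative, Own-child, Unmarried: higher threshold.
    return ">50K" if capital_gain >= 7074 else "<=50K"


print(classify("Husband", 9, 6000))    # married, capital_gain between thresholds
print(classify("Own-child", 9, 6000))  # same values, different relationship
```

Only relationship, education_num and capital_gain appear as arguments, which mirrors the observation that Sex and Race are not used by the tree.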
Calculating confusion matrix

Calculating error

The error is around 15.89%.
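For readers who want to see how these two quantities relate, here is a minimal Python sketch (not the report's R code). The label lists are made-up toy data, not the census results, and the function names are hypothetical.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count (actual, predicted) pairs."""
    return Counter(zip(actual, predicted))


def error_rate(actual, predicted):
    """Share of observations whose predicted class differs from the actual class."""
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return wrong / len(actual)


# Toy data for illustration only.
actual = ["<=50K", "<=50K", "<=50K", ">50K", ">50K", "<=50K", ">50K", "<=50K"]
predicted = ["<=50K", "<=50K", ">50K", ">50K", "<=50K", "<=50K", ">50K", "<=50K"]

matrix = confusion_matrix(actual, predicted)
print(matrix[("<=50K", "<=50K")], matrix[("<=50K", ">50K")])
print(matrix[(">50K", "<=50K")], matrix[(">50K", ">50K")])
print(error_rate(actual, predicted))  # 0.25
```

The error is the sum of the off-diagonal cells of the confusion matrix divided by the total number of observations.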
ROC curve and Area Under Curve (AUC)

Judging by the confusion matrix, error, ROC curve and AUC, the model generated by the decision tree performs well (though not exceptionally) on the training set. Let’s see how the decision tree behaves on the test data set.
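As a reminder of what the AUC measures, the Python sketch below computes it as the probability that a randomly chosen ‘>50K’ observation receives a higher score than a randomly chosen ‘<=50K’ observation, counting ties as one half. This pairwise formulation is a standard equivalent of the area under the ROC curve, not necessarily how the R script computed it, and the scores are toy values.

```python
def auc(scores, labels, positive=">50K"):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == positive]
    neg = [s for s, y in zip(scores, labels) if y != positive]
    if not pos or not neg:
        raise ValueError("need at least one observation of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy predicted probabilities of '>50K', for illustration only.
scores = [0.9, 0.8, 0.3, 0.3, 0.1]
labels = [">50K", "<=50K", ">50K", "<=50K", "<=50K"]
print(auc(scores, labels))  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes. The pairwise loop is quadratic, so it is only meant for small illustrations.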
Test set

As with the training set, some variables are not well formatted, and some values are ? (NA).

Now the variables are well formatted.

A summary of the test data set is shown below.

Notice that our test data set matches the summary given in the file adult.names.
Again, the dependent variable (class) is unbalanced: class ‘<=50K’ has 11360 observations (75.43%) and ‘>50K’ has 3700 observations (24.57%).
Calculating the confusion matrix using the decision tree generated from the training data set.

Calculating error

The error on the test set is around 16.10%, similar to the 15.89% obtained on the training data set.
ROC curve and Area Under Curve (AUC)

Judging by the confusion matrix, error, ROC curve and AUC, the model generated by the decision tree performs well (though not exceptionally) on the test set, similar to its behavior on the training data set.
Storyboards
Having seen the summaries of the training and test data sets and the behavior of the decision tree, we can make a storyboard of the future visualizations.


With the storyboards of the future visualizations done, we can implement them easily.
Visualization
Using Tableau Public, we implement our decision tree on the training data set.
All the variables identified are in a Tableau dashboard called Data. In this dashboard you can filter by filten (filten=1 if education_num >=12, else filten=0, as in the decision tree), filt1 (filt1=1 if capital_gain >=5096, else filt1=0) and filt2 (filt2=1 if capital_gain >=7074, else filt2=0).
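The three filters are simple threshold flags. As a hedged sketch, they could be computed per observation as follows in Python; the function name filter_flags is hypothetical, and the actual dashboard defines them inside Tableau.

```python
def filter_flags(education_num, capital_gain):
    """Compute the three 0/1 dashboard filters from the thresholds in the text."""
    return {
        "filten": int(education_num >= 12),
        "filt1": int(capital_gain >= 5096),
        "filt2": int(capital_gain >= 7074),
    }


print(filter_flags(13, 6000))  # {'filten': 1, 'filt1': 1, 'filt2': 0}
```

Since 7074 is greater than 5096, filt2=1 always implies filt1=1.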
You can interact with the dashboard “Data” in Tableau Public at the following URL:
https://public.tableau.com/profile/alexander.molero#!/vizhome/Proy1/data
It is also shown in the next figure.

For example, suppose you want to see this part of the decision tree:

In the next visualization we implement these rules in Tableau Public.

In the variable relationship, select all values except “Husband” and “Wife” and activate only filt2=1; the result corresponds to the selected part of the decision tree.
The different visualizations in Tableau Public show how variables such as education_num, relationship and capital_gain (with filters) influence the classification process, answering question 2. You’re invited to explore this influence yourself! We will continue with another example.
Additionally, in the worksheet map we can filter by the native country of the people in the data set, answering question 4. For example, if we select all countries except United States, green circles show the class ‘>50K’ and red circles show the class ‘<=50K’. The next figure shows this.
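The counts behind such a map can be sketched as a group-by over native country and class. The Python example below uses made-up records and a hypothetical helper, class_counts_by_country, with an exclude argument that mimics deselecting a country in the filter. The spelling "United-States" follows the raw data file, and the counts are not the real census distribution.

```python
from collections import Counter, defaultdict


def class_counts_by_country(records, exclude=()):
    """Count observations per income class for each native country."""
    table = defaultdict(Counter)
    for country, income_class in records:
        if country in exclude:
            continue  # country deselected in the filter
        table[country][income_class] += 1
    return {country: dict(counts) for country, counts in table.items()}


# Made-up records for illustration only.
records = [
    ("United-States", "<=50K"),
    ("United-States", ">50K"),
    ("Mexico", "<=50K"),
    ("Mexico", "<=50K"),
    ("India", ">50K"),
    ("India", "<=50K"),
]
without_us = class_counts_by_country(records, exclude={"United-States"})
print(without_us)
```

Each country then gets one circle per class, sized by the corresponding count.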

Another example, using our decision tree:

In Tableau Public it looks like this:

In the variable relationship, select only “Husband” and “Wife” and activate only filten=1; the result corresponds to the selected part of the decision tree.
Again, in the worksheet map, if we select only United States, green circles show the class ‘>50K’ and red circles show the class ‘<=50K’. The next figure shows this.

Conclusions
We combine the best of two worlds: statistical analysis, in the form of a classification model (using R), and visualization of the behavior of the data and the model (using Tableau Public). We can show that these kinds of solutions are useful for obtaining faster answers to business questions.
Regarding the training data set, it is interesting to notice that a person who is married (Husband or Wife) and has education_num<12 needs a lower capital_gain (5096) to be classified as exceeding 50k USD than a person who is Not-in-family, Other-relative, Own-child or Unmarried, who needs a higher capital_gain (7074). In this sense, a married person is more likely to exceed an annual income of 50k USD.
These kinds of conclusions can be obtained in Tableau Public by interacting with the variables.
