MITx: 15.071x Earnings prediction from demographics

The aim of this investigation was to use logistic regression and classification and regression tree (CART) models to predict whether an individual earned more than $50,000 a year. The demographic information available was census data from 1994, which included:

  • Age
  • Work classification
  • Level of education
  • Marital status
  • Status in household
  • Race
  • Sex
  • Capital gains and losses for the year
  • Weekly work hours
  • Native country

The data was split into a training set and a test set using the caTools package, with a split ratio of 0.6.
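
A minimal sketch of this step, assuming the data ships as a file named census.csv and that the outcome variable is over50k (the seed value here is illustrative):

library(caTools)

census = read.csv("census.csv")

# Split 60/40, stratified on the outcome variable
set.seed(2000)
split = sample.split(census$over50k, SplitRatio = 0.6)
train = subset(census, split == TRUE)
test = subset(census, split == FALSE)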

Using a logistic regression model to predict earnings from all the other variables in the data,

logModel1 = glm(over50k ~ ., data = train, family = "binomial")
summary(logModel1)

we see that all variables except race and native country are significant. Predictions were then made on the test set using this model. With a threshold of 0.5, the accuracy of the model was ~85.11%, compared to a baseline accuracy of 75.94%.

logPred = predict(logModel1, newdata = test, type = "response")
table(test$over50k, logPred >= 0.5)

# Baseline: the proportions of the two outcome classes in the test set
prop.table(table(test$over50k))
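
The accuracy figures quoted above are simply the fraction of correct predictions, i.e. the sum of the diagonal of the confusion matrix divided by the number of test observations (confMat is just an illustrative name):

confMat = table(test$over50k, logPred >= 0.5)
sum(diag(confMat)) / nrow(test)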

Next, classification tree methods were used to predict the earnings variable. Building a model using all the independent variables in the dataset, we obtain a tree with 4 splits.

library(rpart)
library(rpart.plot)

CARTModel1 = rpart(over50k ~ ., data = train, method = "class")
prp(CARTModel1)

The accuracy of this model on the test data was ~84.43%.

CART1pred = predict(CARTModel1, newdata = test, type = "class")
table(test$over50k, CART1pred)

Random forest model

Due to the computationally intensive nature of random forest models, and the size of the data involved (31,978 observations), a sample of 2,000 observations was taken from the training set to build the model. When making predictions on the test set, the accuracy of this model was ~85%.
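
A minimal sketch of how this model might have been built, assuming the randomForest package and that over50k is stored as a factor (the seed and the name trainSmall are illustrative):

library(randomForest)

# Downsample the training set to keep the computation tractable
set.seed(1)
trainSmall = train[sample(nrow(train), 2000), ]

# With a factor outcome, randomForest builds a classification forest
rForest1 = randomForest(over50k ~ ., data = trainSmall)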

rForestPred = predict(rForest1, newdata = test)
table(test$over50k, rForestPred)

# Accuracy: correct predictions over the number of test observations
(9437 + 1435) / nrow(test)

Since random forest models build a large collection of trees, the resulting model is less interpretable than a single tree. Hence, other metrics were used to assess the importance of the variables in the model, such as the number of splits each variable was involved in,

vu = varUsed(rForest1, count = TRUE)
vusorted = sort(vu, decreasing = FALSE, index.return = TRUE)
dotchart(vusorted$x, names(rForest1$forest$xlevels[vusorted$ix]))

or by measuring the “impurity”, which relates to how homogeneous the various buckets are. The more important a variable is, the more it contributes to the reduction in impurity.

varImpPlot(rForest1)

Classification trees with cp selected via k-fold cross-validation

The complexity parameter “cp” can be used to control the complexity of the tree: a very small value can lead to a large number of splits, and a large value to very few. k-fold cross-validation is a technique that can assist in selecting an optimal value for cp. Using 10-fold cross-validation over cp values from 0.002 to 0.1, we obtain a suggested cp value of 0.002.
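
A sketch of the cross-validation setup used by the train() call below, assuming the caret package; the 0.002 step size for the cp grid is an assumption, since only the 0.002 to 0.1 range is given:

library(caret)

# 10-fold cross-validation
tr.control = trainControl(method = "cv", number = 10)

# Candidate cp values from 0.002 to 0.1 (step size assumed)
cartGrid = expand.grid(.cp = seq(0.002, 0.1, 0.002))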

tr = train(over50k ~ ., data = train, method = "rpart", trControl = tr.control, tuneGrid = cartGrid)
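
The suggested cp value can then be read off the fitted object:

# The cp value with the best cross-validated accuracy
tr$bestTune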

When we build a new CART model using this value of cp, the accuracy is ~85.76% compared to ~84.43% for the previous classification model. However, this comes at the cost of increased complexity, since there are now 18 splits in the model.

CARTModel2 = rpart(over50k ~ ., data = train, method = "class", cp = 0.002)
CART2pred = predict(CARTModel2, newdata = test, type = "class")
table(test$over50k, CART2pred)
prp(CARTModel2)

In some cases, simple, interpretable models might be preferable to more complex ones, even if the latter offer a slight improvement in accuracy. In any case, the classification tree models here can be more easily interpreted than the logistic regression model by someone without a background in statistics or mathematical modelling.



Originally published at resolvereshyam.wordpress.com on May 17, 2016.