Good-Fast-Cheap: DSI 7 Hackathon 4/29/19

Derek Steffan
4 min read · May 9, 2019


Fairly self-explanatory.

This project and the subsequent blog post were written in collaboration with:

Maithili Joshi, Derek Steffan, & Remy Shea from Seattle; James Lovejoy & Jorian Stuckey from Denver.

The Project

The project-management Venn diagram says that a project can only ever be two of three things: good, fast, or cheap. It can be good and fast, good and cheap, or fast and cheap, but it will never be all three at the same time. This assignment aimed to look at how that trade-off works in an applied data science setting. We had our choice of algorithm for our predictions and our choice of features; however, we had to use the “cheap” dataset.

The Data

The data we used came from the US Census Bureau and included age, job sector, level of education, occupation type, relationship status, gender, capital gain/loss, hours worked per week, native country, investment information, and whether or not the person’s employment was full time. With this data, we were tasked with predicting whether a person’s income is greater than $50,000 a year. More specifically: given these features, can we predict the probability that someone’s income is greater than $50,000?

Our Restrictions

As team ‘cheap’, we were restricted in the number of data points we had relative to the other teams. With approximately 6,500 rows to work with, versus more than 32,000 for the other teams, we were working with the low-budget version of the data. We were not restricted in the number of features or the types of models we could use.

Our Process

The first step was to examine the data and determine our number of null values, which in this case were recorded as “ ?”. Much of the data also contained extra whitespace, which needed to be stripped before any further processing could take place. Other than that, data cleaning was fairly straightforward. We engineered a few features describing each person’s investments, and then moved on to handling our missing data.
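A minimal sketch of that cleaning step, assuming pandas and the standard Census Bureau “adult” column names (the file path and the exact engineered features here are illustrative, not our literal code):

```python
import numpy as np
import pandas as pd

# Load the "cheap" census extract (path is hypothetical).
df = pd.read_csv("cheap_census.csv")

# Strip stray whitespace from string columns, then treat "?" as a true missing value.
df = df.apply(lambda col: col.str.strip() if col.dtype == "object" else col)
df = df.replace("?", np.nan)

# Example investment features built from capital gain/loss.
df["net_capital"] = df["capital-gain"] - df["capital-loss"]
df["has_investments"] = (df["net_capital"] != 0).astype(int)
```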

We noticed that the missing data came from three columns: working class, occupation, and native country. Given that our constraint was the number of samples available to us, it was incredibly important that we retain as many data points as we could in our training set while avoiding lazy or irresponsible imputation methods. We settled on the pattern submodel approach: categorize each observation by its pattern of missing data, then model each pattern separately (a short sketch of the tagging step follows the list below). We found three main categories:

  • The overwhelming majority of our data, about 93% (6033/6513), was well behaved and had no missing data.
  • Roughly 2% (115/6513) of our training data lacked both the `workclass` and `occupation` variable information despite having information in the `native-country` column.
  • Conversely, around 5% (358/6513) of our observations had `workclass` and `occupation` information, but none for `native-country`.
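Tagging those patterns takes only a few lines of pandas (a sketch continuing from the cleaning snippet above; the labels match the names we gave the patterns):

```python
# Boolean masks for the three missingness patterns described above.
complete = df[["workclass", "occupation", "native-country"]].notna().all(axis=1)
missing_work = df["workclass"].isna() & df["occupation"].isna() & df["native-country"].notna()
missing_country = df["workclass"].notna() & df["occupation"].notna() & df["native-country"].isna()

df.loc[complete, "pattern"] = "pattern_1"
df.loc[missing_work, "pattern"] = "pattern_2"
df.loc[missing_country, "pattern"] = "pattern_3"

# Sanity check: pattern counts, plus any rows that fit none of the three.
print(df["pattern"].value_counts(dropna=False))
```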

We rather unceremoniously gave these patterns the names `pattern_1`, `pattern_2`, and `pattern_3`, respectively. The seven remaining data points, which fell into two other subpatterns, were folded into pattern 2 by dropping data or mean-imputing.

The core concept of the pattern submodel approach is that an independent prediction or classification model is constructed and trained for each pattern of missing data. In the testing set, each observation is then evaluated by the model trained on data with the same missingness pattern.
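In code, the idea looks roughly like this (a sketch rather than our exact pipeline; it assumes the `pattern` column from above, a `train` DataFrame, a binary `income` target, and hypothetical per-pattern feature lists in which each submodel only sees the columns its pattern actually has):

```python
import pandas as pd
from xgboost import XGBClassifier

# Hypothetical feature lists: each pattern's model only uses columns that pattern has.
features_by_pattern = {
    "pattern_1": ["age", "hours-per-week", "workclass", "occupation", "native-country"],
    "pattern_2": ["age", "hours-per-week", "native-country"],
    "pattern_3": ["age", "hours-per-week", "workclass", "occupation"],
}

# Train one independent classifier per missingness pattern.
submodels = {}
for pattern, cols in features_by_pattern.items():
    subset = train[train["pattern"] == pattern]
    X = pd.get_dummies(subset[cols])  # one-hot encode the categoricals
    submodels[pattern] = (XGBClassifier().fit(X, subset["income"]), X.columns)

def predict_by_pattern(test):
    """Score each test row with the submodel trained on its missingness pattern."""
    preds = pd.Series(index=test.index, dtype=float)
    for pattern, (model, cols) in submodels.items():
        rows = test[test["pattern"] == pattern]
        X = pd.get_dummies(rows[features_by_pattern[pattern]]).reindex(columns=cols, fill_value=0)
        preds.loc[rows.index] = model.predict(X)
    return preds
```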

Although our pattern-submodel XGBoost did not outperform the model wherein the columns with missing data were simply dropped (86% accuracy on the testing data as opposed to 87%), this strategy for handling missing data is likely a more responsible use of stakeholder resources when the number of features is limited, and here the performance was comparable.

Alongside the pattern submodeling and XGBoost work, other group members engineered additional features: the median income for each native country, and relationship status reduced to single/not single and actively partnered (not separated). Neither produced much of an effect, and neither was used in any of the final models.

We experimented with many models, including logistic regression, decision trees, KNN, gradient boosting, SVC, bagged decision trees, and bagged logistic regression, as well as with stacked models (logistic regression, decision trees, and XGBoost). We also took a stab at clustering algorithms (k-means and DBSCAN), with no more success than XGBoost with pattern submodeling alone.
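For reference, a stacked model along those lines can be sketched with scikit-learn’s `StackingClassifier` (assuming an already one-hot-encoded `X_train`/`y_train`; the hyperparameters are illustrative):

```python
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

# Three base learners feeding a logistic-regression meta-learner.
stack = StackingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier(max_depth=5)),
        ("xgb", XGBClassifier()),
    ],
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X_train, y_train)
print(stack.score(X_test, y_test))
```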

Our Predictions

Our preliminary models included logistic regression, decision trees, KNN, and gradient boosting. We optimized for accuracy here, as misclassifying a negative or a positive does not have any particularly harmful implications. Our baseline accuracy was roughly 75%, meaning we would also have to deal with imbalanced classes.
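That baseline is just the majority-class share, which one line makes explicit (assuming the binary `income` column from earlier):

```python
# Predicting the majority class for every row scores roughly 75% accuracy.
print(df["income"].value_counts(normalize=True))
```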

Perhaps surprising no one, XGBoost performed the best. We tried a stacking method using logistic regression, decision trees, and XGBoost, but in the interest of simplicity (and time) we settled on XGBoost alone with principal component analysis (PCA). This model gave us a test accuracy of about 86%.
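A sketch of that final setup as a scikit-learn pipeline (the component count and XGBoost hyperparameters are illustrative, and `X_train`/`y_train` are assumed to be the one-hot-encoded features and binary target):

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

# Scale, reduce dimensionality with PCA, then classify with XGBoost.
final_model = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=20)),
    ("xgb", XGBClassifier(n_estimators=200, max_depth=4)),
])
final_model.fit(X_train, y_train)
print(final_model.score(X_test, y_test))  # test accuracy, about 0.86 in our case
```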
