Reflections on my First Kaggle Competition

The Women in Data Science 2019 Datathon is Underway

Now that the new Women in Data Science (WiDS) competition is in progress, I want to share my experience with last year's competition in the hope that it encourages more people to compete this year. The competition ends February 27, and there is still time for you to get involved. You must form your team by February 21.

Image credit: https://www.widsconference.org/datathon.html

In February 2018, Kaggle hosted a competition sponsored by the Women in Data Science conference, called the WiDS Datathon. The mission of this competition was to help poor women in India. I heard about it from my wonderful mentor Susan Malaika and immediately signed up. I had known about Kaggle competitions before, but never had the time to participate. For this one I decided to find the time, and that decision proved very helpful for my career. After many years developing code for IBM SPSS analytic components, I finally got an opportunity to try my hand at some practical applications of machine learning. I used my favorite tools, IBM SPSS Statistics 25 and IBM SPSS Modeler, to help me with that work.

Also joining the team were my colleagues Shatabdi Choudhury and Chunling Zhang, who had worked with me at SPSS for many years. They brought with them valuable experience from earlier Kaggle competitions, experience with open source Python models, and the idea of “stacking” models, which I had not heard of before. I should probably add that this competition requires at least half of each team's members to be women, which was not a problem for us.

Our mission was to build a machine learning model to predict whether the user was a woman, based on various attributes specified during a mobile phone signup transaction. Here are the steps we took to complete this task!

Reviewing The Data

The data supplied to us by Kaggle had more than 17 thousand rows and about 1500 columns. There was a document describing the fields, but it had many problems: some data columns had no corresponding descriptions, some descriptions had no corresponding columns in the data, and other descriptions listed some categories while the actual data had more values. After some contestants complained about the problems, a second, slightly improved version was provided, but it still had problems.

When I first looked at the data, it seemed to be mostly numeric, so I thought “Oh, good, maybe we can use a Discriminant model!” Then I read the descriptions and learned that most of those numeric values, ranging from 1 to 32 or more in some fields, represented categories, so they could not be treated as continuous. Only a few fields were truly continuous.

One other observation I made quickly was that fields AA5 and AA6 had complementary missing values (in each row exactly one of them had a missing value and the other a valid value), and their valid values did not overlap: AA5 had values 1 through 5 and AA6 had 6 through 8, see Figure 1 below. Thus, it made sense to combine those two variables into a new one with all valid values. I called it “NewAA”. This merge also made sense based on their descriptions. I think this is an example of a data preparation step that likely won’t be automated soon, as a human touch seems to be required to find such feature combinations.

Figure 1. Complementary values in AA5 and AA6
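For readers who work in Python rather than SPSS, the same merge can be sketched with pandas. This is only an illustration: the toy values below are made up, and only the field names AA5, AA6 and NewAA come from the text above.

```python
import pandas as pd

# Toy stand-in for the competition data; the values are invented for illustration.
df = pd.DataFrame({
    "AA5": [1.0, None, 3.0, None, 5.0],
    "AA6": [None, 6.0, None, 8.0, None],
})

# Confirm the pattern first: in every row exactly one of the two fields is valid.
exactly_one_valid = df["AA5"].notna() ^ df["AA6"].notna()
assert exactly_one_valid.all()

# Take AA5 where it is valid, otherwise fall back to AA6.
df["NewAA"] = df["AA5"].combine_first(df["AA6"]).astype(int)
print(df)
```

Checking the complementary pattern before merging matters: if any row had both fields valid, combine_first would silently keep the AA5 value and discard AA6.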

Further analysis of the data revealed that a number of fields had all missing or all constant values, and several others had very few valid values. Additionally, many fields had almost constant values. Clearly, all those fields had to be removed. Given the large number of fields in the data set, manually finding and removing all of them would have been very time-consuming and error-prone. Fortunately, IBM SPSS Statistics includes the command VALIDATE DATA (which internally uses the Feature Selection component that my colleague Leonid Feygin and I implemented in the early 2000s). This option can be found in the Data menu, under Validation, then Validate Data.

Data Exploration and Preparation

Once we specified our “analysis variables”, we could select the criteria for variables to be considered unsuitable for further analysis, as shown in Figure 2 below.

Figure 2. Selecting criteria for filtering out unhelpful fields.
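The kind of screening described above can be approximated outside SPSS. The sketch below is a rough pandas analogue, not the VALIDATE DATA algorithm: the function name find_unhelpful_columns and both thresholds are hypothetical choices made for this example.

```python
import pandas as pd

def find_unhelpful_columns(df, max_missing=0.7, max_single_value=0.95):
    """Return names of columns that are mostly missing or (almost) constant.

    Both thresholds are illustrative assumptions, not SPSS defaults.
    """
    flagged = []
    for name in df.columns:
        col = df[name]
        missing_share = col.isna().mean()
        valid = col.dropna()
        if valid.empty or missing_share > max_missing:
            flagged.append(name)
            continue
        # Share of the valid rows taken up by the single most common value.
        top_share = valid.value_counts(normalize=True).iloc[0]
        if top_share > max_single_value:
            flagged.append(name)
    return flagged

toy = pd.DataFrame({
    "all_missing": [None] * 10,
    "constant": [1] * 10,
    "mostly_missing": [2, None, None, None, None, None, None, None, None, 3],
    "useful": [1, 2, 3, 1, 2, 3, 1, 2, 3, 1],
})
dropped = find_unhelpful_columns(toy)
cleaned = toy.drop(columns=dropped)
print(dropped)
```

With about 1500 columns, returning the list of flagged names (rather than dropping silently) makes it easy to review what is being removed before committing to it.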

As a part of data exploration we also used several convenient procedures, such as:

· DESCRIPTIVES computed valid counts, means, standard deviations, minima and maxima for all numeric fields;

· FREQUENCIES reported counts for each value of each variable (results can be less helpful on continuous fields with a large number of distinct values);

· CTABLES allowed us to get various reports on category combinations of several variables.
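For comparison, rough pandas counterparts of these three procedures are shown below. The column names (age, region, is_female) and the data are hypothetical, and the outputs are much plainer than the SPSS reports.

```python
import pandas as pd

# Hypothetical toy data; the column names are assumptions for this example.
toy = pd.DataFrame({
    "age": [23, 35, 41, 29, 35],
    "region": [1, 2, 2, 3, 1],
    "is_female": [1, 0, 1, 1, 0],
})

# Like DESCRIPTIVES: valid count, mean, standard deviation, minimum and maximum.
summary = toy["age"].agg(["count", "mean", "std", "min", "max"])

# Like FREQUENCIES: how many rows carry each value of a field.
region_counts = toy["region"].value_counts().sort_index()

# Like a simple CTABLES report: counts for combinations of two fields.
region_by_target = pd.crosstab(toy["region"], toy["is_female"])

print(summary, region_counts, region_by_target, sep="\n\n")
```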

The Automatic Data Preparation (ADP) procedure helps with predictor selection, missing value imputation, and supervised merging of categories in categorical predictors that have many categories. I specified the target and all predictors (except the ID variable and AA5 and AA6, which had been merged), see Figure 3 below.

Figure 3. Selecting fields for automatic data preparation.

In the Settings tab, we can specify criteria for excluding fields based on percentages of missing or constant values, and set the maximum number of categories allowed in a categorical predictor (see Figure 4 below).

We set the measurement levels (continuous or categorical) for the variables based on their descriptions. Since we didn’t want ADP to change the measurement levels based on the number of distinct values, we turned that feature off.

Figure 4. Specifying variable filtering criteria for automatic data preparation.

The “Improve Data Quality” menu let us control outlier treatment and missing value replacement, see Figure 5 below. Unfortunately, there is only one replacement option per measurement level: the mode for a nominal field, the median for an ordinal one, and the mean for a continuous one. I wish there were an option to specify the value used to replace missing values, but that might require an entry for each field, which is not easy to design. I felt it necessary to perform my own missing value replacement based on the contents of each field and the meanings of its categories. For some categorical fields it made the most sense to create a new category for the missing values; for others there was already a category that essentially described a missing value, so the system missing value could be mapped to it. For certain continuous predictors the missing value was logically mapped to 0, while for others mean substitution was a better choice.

Figure 5. Replacing missing values with automatic data preparation.

To do the custom missing value treatment, I first ran the procedure with the standard missing replacement for inputs and the option to paste resulting syntax, then I manually edited the syntax to use whatever missing replacement I wanted. Considering the large number of predictors this was a lot of work, but it proved useful.
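The idea of a separate replacement rule per field can be sketched in pandas as a dictionary of fill values. This is an illustration only, not the edited SPSS syntax: the field names, the -1 and 99 codes, and the data are all hypothetical.

```python
import pandas as pd

# Hypothetical fields, one for each of the four situations described above.
toy = pd.DataFrame({
    "occupation": [1.0, None, 3.0, 2.0],    # categorical, no existing "unknown" code
    "phone_type": [2.0, 99.0, None, 1.0],   # categorical, 99 already means "don't know"
    "num_loans": [0.0, 2.0, None, 1.0],     # continuous, missing logically means zero
    "income": [100.0, None, 300.0, 200.0],  # continuous, no natural default
})

# One replacement rule per field, chosen from the meaning of the field.
fill_rules = {
    "occupation": -1,                # new category reserved for missing values
    "phone_type": 99,                # reuse the existing "don't know" category
    "num_loans": 0,                  # missing mapped to zero
    "income": toy["income"].mean(),  # mean substitution
}
prepared = toy.fillna(value=fill_rules)
print(prepared)
```

Keeping the rules in one dictionary also makes it straightforward to apply exactly the same replacements to a new dataset before scoring; in that case the mean should be the one computed on the training data, not recomputed on the new data.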

Tips and Tricks

A very helpful feature is supervised category merging. As I mentioned, many categorical fields had large numbers of categories. The supervised category merging feature (see Figure 6 below) finds groups of categories that relate to the target field in a similar fashion, so we can get a categorical predictor with 3 to 5 categories instead of 30 or more. Such predictors can result in smaller models that are faster to build and score, and smaller models usually generalize better to new data.

Figure 6. Supervised category merging in automatic data preparation.
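To make the idea concrete, here is a deliberately simplified sketch of supervised category merging. It is not the algorithm ADP uses: it simply groups categories whose share of positive targets falls into the same fixed band. The data, the three bands, and the column names are assumptions for this example.

```python
import pandas as pd

# Toy data: six original categories and a binary target.
toy = pd.DataFrame({
    "category": ["a", "a", "b", "b", "c", "c", "d", "d", "e", "e", "f", "f"],
    "target":   [1,   1,   1,   1,   0,   0,   0,   0,   1,   0,   0,   1],
})

# Share of positive targets within each original category.
rate = toy.groupby("category")["target"].mean()

# Categories whose rates fall into the same band become one merged category.
bands = pd.cut(
    rate,
    bins=[0.0, 1 / 3, 2 / 3, 1.0],
    labels=["low", "mid", "high"],
    include_lowest=True,
).astype(str)

# Replace each original category by the label of its band.
toy["merged"] = toy["category"].map(bands)
print(toy.drop_duplicates("category"))
```

Here six categories collapse into three. A real implementation would also account for category sizes, since a rate computed from a handful of rows is unreliable, and would learn the grouping on the training data only.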

In terms of models, it’s important to randomly split the data into training and test sets, and to evaluate every model we build on the test set to make sure that it generalizes well and does not overfit.
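A minimal sketch of this check with scikit-learn is shown below. It uses synthetic data from make_classification and an unconstrained decision tree as a stand-in model, chosen only because it overfits visibly; none of this reflects the actual competition data or models.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared data.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Random 70/30 split, stratified so both parts keep the same class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# An unconstrained tree memorizes the training data.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

A large gap between the two numbers is the warning sign: the model has learned details of the training rows that do not carry over to new data.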

Having the syntax for data transformations was also very useful because to submit our solution we needed to get predictions (scores) for the new dataset, and it is necessary to apply the same data preparation steps to the new data before pushing it into the model(s) for scoring.

Building the Models

Once the data was prepared, we built a number of models on it, always checking the accuracy on both training and test data. Some models, like Naïve Bayes or the Radial Basis Function (RBF) neural network, produced very low accuracy even on training data, so we did not consider them further. Building a Multi-Layer Perceptron (MLP) neural network on the entire set of potential predictors was impractical due to the huge number of weights required, so I built such models separately on three subsets of the predictors, then collected the most important predictors from each of those models and built a model on those. This gave me a relatively accurate model. Another model that seemed to work relatively well was the CHAID tree. Unfortunately, IBM SPSS Statistics does not include options to create boosted models, so I used IBM SPSS Modeler to build a boosted CHAID tree and a boosted MLP, and those worked relatively well. Early on I also tried some less traditional models, such as generalized association rules in Statistics and Decision List in Modeler. Both of these found a very strong rule that a person identifying as a homemaker is most likely a woman, but other rules had much lower confidence.

In order to generate predictions on the new data as required for the submission, I exported PMML from each of the models I built, then scored the new data using it. You can read more about PMML in my recent blog.

Shatabdi had recently taken modern machine learning courses, so she built multiple XGBoost and LightGBM models on the prepared data that I gave her. We also built stacked models following Chunling’s suggestions, and those got the highest scores for us. We ended up in the top 14% of the competition. Not bad, considering that we started late and were also busy with our work and families. Participation in this competition gave us a better understanding of real-world ML problems, of the importance and difficulty of data preparation, and of the most effective modeling approaches. We enjoyed the process and learned a lot.
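For readers who, like me, had not met stacking before: several base models are trained, and a second-level model then learns how to combine their predictions. The sketch below illustrates the idea with scikit-learn's StackingClassifier on synthetic data. It is not our actual setup: the base models here are scikit-learn stand-ins rather than XGBoost or LightGBM, and every parameter is an assumption made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared data.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# First level: two different tree ensembles.
base_models = [
    ("gbm", GradientBoostingClassifier(n_estimators=50, random_state=0)),
    ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
]

# Second level: logistic regression trained on cross-validated
# predictions of the base models (cv=5), which limits leakage.
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X_train, y_train)

test_acc = stack.score(X_test, y_test)
print(f"stacked test accuracy: {test_acc:.3f}")
```

Stacking tends to help most when the base models make different kinds of errors, which is one reason to mix model families rather than stack several near-identical models.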

It’s Not Too Late!

The WiDS 2019 Datathon is under way: it started on January 29 and runs until February 27, and teams must be formed by February 21. This year the competition focuses on image processing, so we will probably need to learn more about deep learning and apply the theory to this very specific problem. Chunling has left IBM, so we will have new teammates. We are all pretty busy, but we don’t want to miss this opportunity to see practical applications of all this interesting theory. Winners will be announced at the WiDS Conference at Stanford University on March 4, 2019. I look forward to reporting back on our experience with the competition after it ends!