Data science: From the Steel Industry to Kaggle
I have just finished my first Kaggle competition and… It was quite disappointing.
I have been working as a data scientist in the heavy steel industry for two years now. Big machinery, dust, dirt, but also a lot of very challenging data. We face quality problems, predictive maintenance, and production set-up analysis and optimization. For each of those problems we really need to sit down, understand it, discuss internally which approach could be best, interact with the production experts and finally make all the modelling decisions.
It is an absorbing job, with a lot of questions to be answered, so there is not usually room for other projects. However, this time I was too intrigued by Kaggle competitions, so I took some time off and, along with a colleague, entered Porto Seguro's Safe Driver Prediction competition.
I went there looking for a big challenge, and what I found was just a huge optimization problem where people were squeezing the algorithms down to the third decimal. By the end of the competition, hundreds of people had submitted scores between 0.285 and 0.290. There were several open kernels at that time obtaining a public score* of 0.286 with a cross-validation standard deviation of 0.005. That means that, depending on the validation fold, their scores could range approximately from 0.28 to 0.29, so the score differences depended on how each model classified a small handful of points. People were just optimizing the noise to climb the public leaderboard!
*For those who have never taken part in a Kaggle competition: there are two different phases. During the competition there is a public test set that competitors can use to assess how their models are performing. However, the final ranking depends on the behavior of those models on a private dataset that nobody has seen before.
Despite my slight disappointment, as with any other experience in life, I also came away with some valuable lessons and insights. In this article I aim to summarize my impressions, going through some things I really missed from a data science challenge and the main differences I found compared with what I usually do when facing real problems in my daily job.
Is it just machine learning?
“In this competition, you will predict the probability that an auto insurance policy holder files a claim”. This was the only description of the problem. Clear.
Besides, it was explained that there were four different groups of features, each with its own identifier, and three different types of variables: binary, categorical and ordinal. Finally, missing values were stored as -1, and the evaluation metric was the normalized Gini coefficient, well known in economics. And that's it!
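Since the metric drives every decision in a competition like this, it is worth pinning down. Below is a minimal NumPy sketch of the normalized Gini coefficient as it is commonly computed in Kaggle kernels: the raw Gini of the predictions, divided by the Gini of a perfect ordering. The function names are mine, not the competition's:

```python
import numpy as np

def gini(actual, pred):
    # sort targets by predicted score, descending; ties broken by original order
    pred = np.asarray(pred, dtype=float)
    order = np.lexsort((np.arange(len(pred)), -pred))
    a = np.asarray(actual, dtype=float)[order]
    n = len(a)
    # cumulative share of positives captured, compared with a random ordering
    cum = a.cumsum() / a.sum()
    return cum.sum() / n - (n + 1) / (2.0 * n)

def normalized_gini(actual, pred):
    # 1.0 means a perfect ranking, 0.0 means no better than random
    return gini(actual, pred) / gini(actual, actual)

print(normalized_gini([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))  # perfect ranking -> 1.0
```

A score around 0.29, as in this competition, shows how far even the best models were from a clean separation of the two classes.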
Moreover, due to the intrinsic nature of the data, it was almost impossible to extend this information with external sources. Everything had to be inferred from the data itself. So, let's go with that.
In this Kaggle competition we got a clean but masked dataset where no expert knowledge could be used. The problem came with a tiny description, but not enough information to understand its logical flow. Thus, it made no sense to try to understand the real-life problem to assess whether, for instance, there was any hierarchy or logical structure in the data that could be exploited to model it effectively. The target was well defined, and the goal was to get the highest possible score on the normalized Gini coefficient. Interpretability and extracting insights from the data had no room there: pure prediction.
As many people do, we started with a traditional kick-start: data visualization, checking linear correlations between predictors and between predictors and the target variable, understanding the general numbers of the problem, etc. We were facing a highly imbalanced problem (a claim ratio below 4%), where the highest correlation between two predictors was around 0.7, and it was not possible to reduce the input space significantly, at least in a straightforward manner using techniques such as Principal Component Analysis (PCA).
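A first pass of that kind can be sketched in a few lines. The data below is synthetic, standing in for the masked competition table (one deliberately correlated feature pair, a roughly 4% positive class); only the structure of the check is meant to carry over:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy stand-in for the training table: 1000 rows, 5 anonymous features
X = rng.normal(size=(1000, 5))
X[:, 1] = 0.7 * X[:, 0] + 0.72 * rng.normal(size=1000)  # a correlated pair (~0.7)
y = (rng.random(1000) < 0.04).astype(int)               # ~4% positive class

print("claim ratio:", y.mean())

corr = np.corrcoef(X, rowvar=False)   # feature-feature correlation matrix
np.fill_diagonal(corr, 0.0)           # ignore the trivial self-correlations
print("max |corr| between predictors:", np.abs(corr).max())
```

On the real data, the same two numbers (class ratio and strongest pairwise correlation) were what told us that resampling would matter and that PCA would not buy much.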
With no usable expert knowledge or interpretation of the problem, feature engineering became a trial-and-error process. Feature selection could be approached iteratively, or by relying on the feature importance scores from algorithms such as Random Forests or boosted trees.
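As a sketch of that second route, here is how a Random Forest's impurity-based importances can drive a first selection pass. The dataset is a toy one (only feature 0 actually carries signal), not the competition data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 6))
# only feature 0 is informative for this hypothetical target
y = (X[:, 0] + 0.5 * rng.normal(size=2000) > 1.0).astype(int)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# rank features by mean decrease in impurity; drop the tail in later iterations
ranked = np.argsort(forest.feature_importances_)[::-1]
print("features ranked by importance:", ranked)
```

Without a domain story behind the anonymous columns, a ranking like this is about the only principled way to decide what to discard.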
Moreover, such an imbalanced problem was a pretty good candidate for ensembles combining different algorithms and imbalance ratios. Ensembles and stacks combine the outputs of several algorithms, each working with the same or different sampling ratios, where each of them might extract distinct patterns. Combining, for instance, gradient-boosted trees with neural networks averages the outputs of two algorithms that may draw different decision boundaries when modelling the data, likely leading to better overall performance than either algorithm could achieve on its own.
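A minimal illustration of blending, using made-up out-of-fold predictions from two hypothetical models: since the Gini metric depends only on the ordering of the scores, rank averaging is a common, scale-robust way to combine models whose outputs live on different scales.

```python
import numpy as np

# hypothetical out-of-fold predictions from two diverse models
pred_gbm = np.array([0.10, 0.80, 0.55, 0.30])  # e.g. gradient-boosted trees
pred_nn  = np.array([0.20, 0.70, 0.45, 0.40])  # e.g. a neural network

def rank_average(*preds):
    # replace each score by its normalised rank, then average the rankings
    ranks = [p.argsort().argsort() / (len(p) - 1) for p in preds]
    return np.mean(ranks, axis=0)

blend = rank_average(pred_gbm, pred_nn)
print(blend)  # ordering agreed on by both models survives the blend
```

Stacking goes one step further, training a second-level model on such out-of-fold predictions instead of simply averaging them.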
Therefore, there were many degrees of freedom to tune: the sampling rate, the categorical encoding, the handling of missing values, the architecture of the ensembles, the hyperparameter optimization strategy. Although experience could play an important role in deciding which combinations were more likely to succeed, it all came down to a huge optimization problem.
So the question is whether this is still data science, or just machine learning…
The Kaggle community as a massive genetic algorithm
A genetic algorithm is a metaheuristic inspired by natural selection that can be used to solve both constrained and unconstrained optimization problems. Any genetic algorithm starts by initialising a population according to a specific set-up. From that moment onwards, the population “evolves” from one iteration to the next using three main operations:
- Selection: Choose those members that will become “parents” for the next generation.
- Cross-over: Combine parents to create children.
- Mutation: Modify certain parts of the parents to create children.
These three operations are very problem-dependent, but they can be clearly identified in the Kaggle community.
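For concreteness, here is a minimal genetic algorithm in Python implementing all three operations. It maximises the toy “OneMax” fitness (the count of 1-bits in a bitstring); this is a generic sketch of the metaheuristic, not tied to any Kaggle problem:

```python
import random

random.seed(0)

def onemax(bits):
    # toy fitness: number of 1-bits in the individual
    return sum(bits)

def evolve(n_bits=20, pop_size=30, generations=40, mut_rate=0.05):
    # initialisation: a random population of bitstrings
    pop = [[random.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        # selection: keep the fitter half as parents
        pop.sort(key=onemax, reverse=True)
        parents = pop[: pop_size // 2]
        children = []
        while len(children) < pop_size - len(parents):
            # cross-over: splice two parents at a random cut point
            p1, p2 = random.sample(parents, 2)
            cut = random.randrange(1, n_bits)
            child = p1[:cut] + p2[cut:]
            # mutation: flip each bit with a small probability
            child = [b ^ 1 if random.random() < mut_rate else b for b in child]
            children.append(child)
        pop = parents + children
    return max(pop, key=onemax)

best = evolve()
print("best fitness:", onemax(best))  # converges to (or near) the optimum of 20
```

Read "population" as the set of public kernels, "fitness" as the leaderboard score, and the parallel in the next paragraphs follows naturally.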
The initialisation takes place just a few hours or days after the competition starts, when the first kernels appear. The community quickly takes a peek at them, trying to figure out how everyone else is approaching the problem. While there are no clear directions, the community behaves like an initial random search over the huge space of ideas.
This changes once the first results are published. A few ideas then take the lead and are embraced by the community as the main candidates for success. They are selected to become parents of the next generations, and people start using them as the foundations for any new development.
From then on, the other two operations take over: cross-overs and mutations of successful kernels are continuously published. If, for example, an encoding strategy appeared in an early successful kernel, it is easy to find it repeated in later kernels from other contributors. At this stage, people try to combine good ideas from other kernels with their own.
Although this process allows the community to move forward quickly towards the best possible result, it also shows how some people just take previous solutions as solid foundations without asking themselves how those solutions were reached. I will illustrate this “blind” behavior with a couple of examples I found in Porto Seguro's competition:
- There was a popular kernel where the owner used a new variable created by combining two other features. Nothing surprising so far. However, in the comments section another user asked why he was using that variable and how he had concluded it was worth creating. The owner just answered that he had seen it in a previous kernel. Finally, the other user commented that he had tried removing this created variable and the overall performance of the kernel had actually improved.
- Many kernels used a categorical-variable encoding that had been published in one of the first successful kernels. We had not seen that encoding before, so we compared it with others, and it did not really improve the results. We finally chose the encoding with the lowest dimensionality.
Of course, those who act like this are probably not a representative sample of the community; most people analyze other kernels critically to see whether some ideas can be integrated into their own framework. It was just surprising to find this “blind” trust in others' work repeated throughout the forum.
Actually, there are too many things to test: encodings, algorithms, sampling rates, ensembles, etc. So it makes a lot of sense to look at what other people are doing and build on their work instead of starting from scratch. This evolutionary behavior turns the community into a massive genetic algorithm that uses all the previously optimised solutions to refine, at each iteration, the final solution to the problem.
Main differences with a real industrial problem
I like to define data science as the combination of skills and knowledge that allows one to extract insights from data to solve complex problems. So, applying machine learning algorithms, coding the scripts and using the available statistical and mathematical tools is just a small part of the big picture. Do not misread me: it is very important to study, test and get practical experience with as many tools as possible. The more you know and the broader your experience, the better your decisions will be in each situation.
However, the most important thing for solving a real industrial project is to properly understand and model the problem itself. If you fail at that, then all the algorithms and all the mathematical knowledge probably become useless.
This is why we spend most of our time understanding our problems: How do all the pieces interact? Which data should be gathered? How can we sample and merge heterogeneous data sources? What is the best target variable and how can we construct it? Is it better to get a more interpretable solution or a more accurate one? And so on.
These are just a handful of immediate questions that help in understanding the problem, but many others keep arising throughout the analysis. Then, depending on our understanding of the problem and the chosen approach, we can select one or more algorithms to test. Sometimes those algorithms are well known and can be found in open machine learning libraries. Other times, we need to investigate whether existing algorithms can be adapted or implemented to fit our approach even better.
Besides, the goal of the project can be as general as a root-cause analysis of the quality deviations of a certain product. In that case we would aim to reduce those deviations by finding important patterns in the data. A specific score metric is meaningless there, and so is pure prediction. We are usually dealing with a trade-off between performance and interpretability, and trying to remove any possible bias towards well-known problematic patterns so that we can focus on those that are still unknown.
For all these reasons, unlike in Kaggle's challenge, the success of an industrial project does not usually rely on the use of a specific algorithm but on all the prior problem-modelling work. Hence, this Kaggle competition was a completely different story.