The Art of Data Science
Data science is as much an art as a science. Building a top-performing model on clean datasets with well-defined goals is super cool.
But the real test is dealing with the messy real world, full of biases, dirty data, and fuzzy or unspecified goals. To survive you need more than a good model. You need sharp intuition, good communication skills, scientific skepticism and a lot of tricks.
Bias, unreliable data, inconsistencies and other difficulties are not always obvious, but the sooner you spot them, the less time you waste on a useless model.
I sort these problems into three categories:
A) data collection and data cleaning issues,
B) hidden biases,
C) model validation and stability.
Here are some of the most common problems in category A:
1. Unbalanced data. Problem: you have vastly more negative cases than positive ones (for instance in fraud detection, where positives are typically about 0.01% of cases). Solution: there is no simple solution, but subsampling the majority class and oversampling the minority class are common tricks. My advice: i) pre-train an unsupervised model to capture the latent distribution of the data and fine-tune it with labeled data, for instance with a Deep Belief Network (DBN); ii) use zero-shot learning techniques.
2. More may be less. Problem: adding more attributes does not always guarantee a better model. Why? Because you are adding extra complexity to the model and increasing the search space. And don’t count on regularization to fix this: it may not help. Solution: start with as few features as possible and gradually increase the complexity of the model.
3. Curse of dimensionality. This is a well-known problem: as you increase the number of variables, the size of the search space increases exponentially. For instance, you have 25 000 products on an e-commerce website, but 85% of sales come from less than 1% of them. What do you do? Build a 25 000 x N matrix, where N is the number of users visiting your website? Never! Solution: start with that 1% and go from there. Maybe later you can create two models for different categories: the top sellers that account for 85% of sales, and the rest.
4. Training data. Problem: can the model learn anything at all? Are there inconsistent labels (a non-unique mapping from inputs to outputs) that will make the problem too hard to solve? Solution: before writing a single line of modeling code, do some descriptive analytics and check whether there is any signal to be learned. Sometimes simple tricks are very effective, like model inversion (swapping the inputs with the outputs) to create a one-to-one mapping.
5. Pre-processing and post-processing. Solution: simple tricks, like inverting the data or taking its logarithm, can make a huge difference.
6. Dealing with many attributes. Problem: some variables can have thousands of levels, like postcodes. Algorithms like Random Forest either cannot handle these types of variables or will be strongly biased by them. Solution: apply any trick that reduces the number of attributes or categories. If it’s a classification problem, a simple solution is to aggregate categories by class using a simple k-means algorithm.
7. The many faces of overfitting. Problem: there are many ways in which a model can be overfitted or not properly tested. Cross-validation is always mandatory, but it does not necessarily put you on the safe side. Solution: always be skeptical about predictions. Validate them under different conditions. Test how the model behaves with inputs slightly outside the learned domain. Test the robustness of the model and make sure the real data is as close as possible to the training set.
8. Data is non-stationary. Problem: the data used to train the model differs from the test data and changes over time, for instance when marketing campaigns start targeting new types of customers. Solution: apply a stationarity test as often as possible to check whether the probability density distribution is still the same as the one present in the training data.
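To make the subsampling trick from item 1 concrete, here is a minimal sketch. It assumes binary labels where 1 marks the rare class; the function name `rebalance`, the `ratio` parameter and the toy numbers are mine, chosen only for illustration. It keeps every positive case and randomly drops negatives until there are at most `ratio` negatives per positive.

```python
import random

def rebalance(rows, labels, ratio=1.0, seed=0):
    """Randomly undersample the majority class (label 0).

    Keeps every positive (label 1) and at most `ratio` negatives per positive.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(neg), int(ratio * len(pos)))
    keep = pos + rng.sample(neg, n_keep)
    rng.shuffle(keep)  # avoid having all positives at the front
    return [rows[i] for i in keep], [labels[i] for i in keep]

# Toy data (hypothetical): 10 positives among 10 000 rows.
labels = [1] * 10 + [0] * 9990
rows = list(range(len(labels)))
X_bal, y_bal = rebalance(rows, labels, ratio=3.0)
print(sum(y_bal), len(y_bal))  # positives kept, total rows kept
```

Remember that a model trained on the rebalanced set sees a very different class ratio from production, so predicted probabilities usually need recalibrating, and the evaluation set should keep the original ratio.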
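The descriptive check suggested in item 4 can start very small. The sketch below looks for inconsistent labels, meaning identical inputs that show up with different outputs. The helper name `conflicting_inputs` and the toy rows are hypothetical, and it assumes each row can be turned into a hashable tuple.

```python
from collections import defaultdict

def conflicting_inputs(rows, labels):
    """Return the inputs that appear with more than one distinct label."""
    seen = defaultdict(set)
    for x, y in zip(rows, labels):
        seen[tuple(x)].add(y)
    return {x: ys for x, ys in seen.items() if len(ys) > 1}

# Toy data (hypothetical): the input (1, "red") is labeled both ways.
rows = [(1, "red"), (2, "blue"), (1, "red"), (3, "green")]
labels = ["cat", "dog", "dog", "cat"]
conflicts = conflicting_inputs(rows, labels)
print(conflicts)
```

If a large share of the inputs ends up in `conflicts`, no model can map them one-to-one, and it is worth revisiting the features or the labeling before training anything.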
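One possible reading of the k-means trick in item 6, as a rough sketch: compute the share of positive labels for each level of the variable, then cluster the levels on that single number so that thousands of postcodes collapse into a handful of groups. This is my interpretation, not necessarily the only one. The function names, the evenly spaced initial centres and the toy postcodes are all assumptions made for illustration.

```python
from collections import defaultdict

def target_rates(categories, labels):
    """Fraction of positive labels observed for each category level."""
    counts = defaultdict(lambda: [0, 0])  # level -> [positives, total]
    for c, y in zip(categories, labels):
        counts[c][0] += y
        counts[c][1] += 1
    return {c: p / n for c, (p, n) in counts.items()}

def kmeans_1d(values, k, n_iter=50):
    """Plain Lloyd's algorithm on scalars; returns one cluster index per value."""
    lo, hi = min(values), max(values)
    # Spread the initial centres evenly over the observed range.
    centres = [lo + (hi - lo) * (j + 0.5) / k for j in range(k)]
    assign = [0] * len(values)
    for _ in range(n_iter):
        # Assignment step: each value goes to its nearest centre.
        assign = [min(range(k), key=lambda j: abs(v - centres[j])) for v in values]
        # Update step: each centre moves to the mean of its members.
        for j in range(k):
            members = [v for v, a in zip(values, assign) if a == j]
            if members:  # keep the old centre if a cluster is empty
                centres[j] = sum(members) / len(members)
    return assign

# Toy data (hypothetical): four postcodes, binary labels.
postcodes = ["A1"] * 4 + ["A2"] * 4 + ["B1"] * 4 + ["B2"] * 4
labels = [1, 1, 1, 1,  1, 1, 1, 0,  0, 0, 0, 0,  0, 0, 0, 1]
rates = target_rates(postcodes, labels)
levels = list(rates)
groups = dict(zip(levels, kmeans_1d([rates[c] for c in levels], k=2)))
print(groups)  # maps each postcode to its group index
```

The rates must be computed on the training data only. Otherwise the grouping leaks label information into validation, which is one more of the faces of overfitting from item 7.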
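For item 8, one simple way to compare the live distribution of a feature with the training one is a two-sample Kolmogorov-Smirnov test. That is my choice of test for this sketch, one option among several, and it looks at one numeric feature at a time. The function names are hypothetical, and the threshold uses the standard large-sample approximation, so treat it as a rough alarm, not an exact p-value.

```python
import math
import random
from bisect import bisect_right

def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of samples a and b."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

def has_drifted(train, live, alpha=0.05):
    """Two-sample KS check using the large-sample critical value."""
    n, m = len(train), len(live)
    threshold = math.sqrt(-math.log(alpha / 2) / 2) * math.sqrt((n + m) / (n * m))
    return ks_statistic(train, live) > threshold

# Toy data (hypothetical): the live feature has shifted by one standard deviation.
rng = random.Random(0)
train = [rng.gauss(0.0, 1.0) for _ in range(2000)]
shifted = [rng.gauss(1.0, 1.0) for _ in range(2000)]
print(has_drifted(train, shifted))
```

Run on a schedule for each important feature, a check like this flags when new campaigns start bringing in customers the model has never seen. With many features and frequent checks, expect some false alarms at `alpha=0.05`.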
(to be continued)