The secrets to “good” machine learning : a few practical life hacks

1- Start with descriptive statistics & graphics :

Univariate, bivariate, multivariate (PCA). Do some dataviz and try answering these questions about your data :

  • What is your data ?
  • What does it look like ? Does it have a special shape ?
  • Are there any obvious structures or relationships ?
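
To make this first look concrete, here is a minimal sketch using pandas and scikit-learn. The iris dataset is only a stand-in picked for illustration (it is not the data this post has in mind), and the choice of two principal components is an assumption.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in dataset (an assumption for this sketch): swap in your own DataFrame.
df = load_iris(as_frame=True).frame          # feature columns plus a "target" column
features = df.drop(columns="target")

# What is your data? Size, types, missing values.
print(df.shape)
print(df.dtypes)
print(df.isna().sum())

# Univariate: summary statistics for each column.
print(features.describe())

# Bivariate: pairwise linear correlations.
print(features.corr())

# Multivariate: PCA on standardized features, keeping two components.
scaled = StandardScaler().fit_transform(features)
pca = PCA(n_components=2)
components = pca.fit_transform(scaled)
explained = pca.explained_variance_ratio_
print(explained)
```

Plotting the two columns of `components` against each other (with matplotlib, for example) is a natural next step to spot shapes and clusters.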

2- Do a fast check for leakage

Data leaks are still poorly understood even though they’re actually fairly common. They can cause some funny model behaviour, like good performance during development and poor accuracy in production. Here’s a very nice resource on the topic :
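
Whatever you end up reading on the topic, one fast check can be sketched in a few lines: score each feature on its own and flag any that predicts the target suspiciously well. This is only one heuristic among several, not a complete leakage audit. The helper name `suspicious_features`, the 0.95 AUC threshold, and the toy columns (`age`, `monthly_spend`, `refund_issued`) are all hypothetical, made up for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier


def suspicious_features(X, y, threshold=0.95, cv=5):
    """Hypothetical helper: return features whose single-column
    cross-validated ROC AUC reaches `threshold` (an arbitrary cut-off)."""
    flagged = {}
    for col in X.columns:
        clf = DecisionTreeClassifier(max_depth=3, random_state=0)
        auc = cross_val_score(clf, X[[col]], y, cv=cv, scoring="roc_auc").mean()
        if auc >= threshold:
            flagged[col] = auc
    return flagged


# Toy data, invented for the example: "refund_issued" is almost a copy of the target.
rng = np.random.default_rng(0)
n = 500
y = pd.Series(rng.integers(0, 2, size=n), name="churned")
X = pd.DataFrame({
    "age": rng.normal(40, 10, size=n),
    "monthly_spend": rng.normal(50, 15, size=n) + 5 * y,
    "refund_issued": y + rng.normal(0, 0.01, size=n),
})

flagged = suspicious_features(X, y)
print(flagged)
```

A flagged feature is not proof of a leak. It is a prompt to ask whether that value would really be available at prediction time.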

3- Keep the right features only

Selecting the right features in your data will make the difference between passable performance with long training times and great accuracy with short training times. The steps here are the following :

  • Remove Redundant Features. Some features don’t offer any new information. Thou shalt delete them.
  • Rank Features by importance (using Random Forest for example) to understand which variables are most strongly linked to the one you are trying to predict.
  • Use a Feature Selection procedure to eliminate useless data. Parsimony is awesome :) you will make your model simple and avoid overfitting. Keep the model generalizable.
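
The three steps above can be sketched with scikit-learn as follows. This is a rough illustration under stated assumptions: the data is synthetic, the `f0_copy` column is deliberately planted as a redundant feature, and both the 0.95 correlation cut-off and the "median" importance threshold are arbitrary choices you would tune for your own data.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data (an assumption for this sketch) with one planted redundant column.
X_arr, y = make_classification(
    n_samples=400, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])
X["f0_copy"] = X["f0"] * 2.0  # carries no new information

# Step 1: remove redundant features (one of each highly correlated pair).
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
redundant = [c for c in upper.columns if (upper[c] > 0.95).any()]
X_reduced = X.drop(columns=redundant)

# Step 2: rank the remaining features by random forest importance.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_reduced, y)
ranking = pd.Series(
    forest.feature_importances_, index=X_reduced.columns
).sort_values(ascending=False)
print(ranking)

# Step 3: keep only features at or above the median importance.
selector = SelectFromModel(forest, prefit=True, threshold="median")
selected = list(X_reduced.columns[selector.get_support()])
print(selected)
```

Impurity-based importances are only one way to rank features. Permutation importance or recursive feature elimination are reasonable alternatives.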

Here is an excellent and very clear story on the subject :)

4- Choose your model carefully :

Test many of them. Every model you run will tell you a different story. Stop and listen to it. They are all interesting.

Look at the coefficients. Look at the metrics. Did they change? How much ?

When you pause to do this, you can make better decisions on the model to run next.
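
Here is a minimal sketch of that loop with scikit-learn: cross-validate a couple of models, look at the metrics and coefficients, then score the best one a single time on a sample it has never seen. The breast cancer dataset, the two candidate models, and accuracy as the metric are assumptions chosen for illustration only.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in dataset (an assumption): replace with your own X and y.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Set a test sample aside before trying any model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

# Look at the metrics: 5-fold cross-validation on the training sample only.
cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
    cv_scores[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")

# Look at the coefficients: largest standardized effects in the linear model.
logit = models["logistic_regression"].fit(X_train, y_train)
coefs = pd.Series(logit[-1].coef_[0], index=X.columns)
print(coefs.sort_values(key=abs, ascending=False).head())

# Score the best model once on the held-out sample.
best_name = max(cv_scores, key=cv_scores.get)
test_accuracy = models[best_name].fit(X_train, y_train).score(X_test, y_test)
print(best_name, test_accuracy)
```

If the held-out score is far below the cross-validation score, treat that as a story worth listening to before moving on.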

  • Always check if your model can be generalized. Then, check again. Obviously, you must always test on a separate data sample.
  • Get yourself the right toolbox : caret in R and scikit-learn (sklearn) for Pythonistas are essentials that will get you there quite fast !
  • Keep learning from the right sources.
  • The research question is central, keep it in mind. Especially when you have a large and rich data set, it’s very easy to get lost or distracted. There are so many interesting relationships you can find. Months later, you’ve tested every possible predictor but you’re not making any real progress.

Keep the focus on your destination: the research question. Write it out on a post-it and stick it on your desk !

You’re all set now. Don’t forget to leave a comment and give ❤️