An Extended Version Of The Scikit-Learn Cheat Sheet

You probably know the famous scikit-learn algorithm cheat sheet. It is a kind of decision tree that helps you figure out which machine learning algorithm to choose, depending on the type of problem you have: classification, regression, etc.

Now that I’m doing “real life” data science within an organization, and not only “challenge” data science, I realize that many steps have to be completed before this cheat sheet can be applied. So, for fun, I added some preliminary stages, which I called “complication”…

Legal Clearance

Before doing anything with data, you must ask yourself: am I authorized to do it? The question is easy, but the answer is not. It may depend on the country, the domain, the usage, etc.

Generally, when you use collected data for a “primary” use (e.g. Uber for booking a cab), there is no special problem. It becomes legally trickier when you start using the same collected data for a “secondary” use (e.g. Uber using your data to understand your habits, as described here).

Data Access

To do data science, you need data. Makes sense.

However, getting the data is sometimes painful, particularly in large organizations with legacy IT. Data is generally not centralized, but fragmented across several places. What’s more, for security reasons, accessing the data may be challenging: you must be authorized, enter passwords, or go through firewalls.

Data Understanding

Data can be taken “as is”, without understanding what lies behind it, and considered coldly as a bitstream. It’s possible to do good machine learning on non-explicit data. Look for example at some Kaggle challenges, where you don’t know the meaning of the given variables (sometimes you don’t even know whether they are categorical or numerical, or they are hashed for confidentiality reasons).
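When the variables are anonymous, a first guess at their type can be automated. The sketch below is only a heuristic of my own, not something taken from a Kaggle challenge or from scikit-learn: the function name `guess_column_kind` and the threshold `max_categories` are assumptions chosen for illustration.

```python
def guess_column_kind(values, max_categories=20):
    """Guess whether an anonymous column is 'numerical' or 'categorical'.

    Heuristic (an assumption, not a rule): a column whose values do not all
    parse as numbers is categorical; a column of whole numbers with few
    distinct values is treated as categorical codes as well.
    """
    # Ignore missing entries, represented here as None or empty strings.
    non_missing = [v for v in values if v not in ("", None)]
    if not non_missing:
        return "unknown"
    try:
        numbers = [float(v) for v in non_missing]
    except (TypeError, ValueError):
        # At least one value is not a number (for example a hashed string).
        return "categorical"
    all_integers = all(n.is_integer() for n in numbers)
    if all_integers and len(set(numbers)) <= max_categories:
        return "categorical"
    return "numerical"


print(guess_column_kind(["a1f", "9c2", "a1f"]))
print(guess_column_kind(["0.12", "3.5", "7.25"]))
```

Such a guess is only a starting point: an integer column with few distinct values could just as well be a count, so it is worth checking by eye.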

However, understanding the meaning of the data can help in two ways:

making a model that is interpretable: people are reassured by a predictive model they can understand, as mentioned in a previous post
guiding your intuition: it will help in the feature engineering work, to create smarter variables

But accessing the data dictionary is not always easy. Let’s say you want to know what this column named GRAG_PPK_NEW2 is. You have to become a data detective, searching for clues in dozens of Excel files, or finding someone in the company who knows about it and questioning them…

Data Cleaning

When you finally manage to get the precious data, it is often not very clean. Before applying machine learning to it, you must put it in a suitable format.

This is well known to be a very time-consuming task, representing a large part of an overall data science project.

Dates, for example, are particularly curious fields, where we realize how imaginative the human brain can be.
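To give an idea of what cleaning dates can look like, here is a minimal sketch that tries several formats in turn using only the Python standard library. The list `CANDIDATE_FORMATS` and the function `parse_date` are hypothetical names, and the formats are assumptions: your data will probably need others.

```python
from datetime import datetime

# Hypothetical list of formats, tried in order; extend it to match your data.
CANDIDATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y", "%Y%m%d"]


def parse_date(text, formats=CANDIDATE_FORMATS):
    """Return a date for the first format that matches, or None otherwise."""
    cleaned = text.strip()
    for fmt in formats:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            # This format does not fit; try the next one.
            continue
    return None


for raw in ["2016-03-07", " 07/03/2016", "20160307", "last Monday"]:
    print(repr(raw), "->", parse_date(raw))
```

Note the assumption hidden in "%d/%m/%Y": a value like 03/07/2016 is read day first. If the source system writes the month first, the result is silently wrong, which is exactly why dates deserve attention.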

By the way, what is the question?

Data scientists love solving complex problems, whether they are useful or not.

That’s why, before running the algorithms and bringing out the big guns, it’s good practice to double-check that everybody is aligned on the target.

What do I want to predict? On the whole population, or on a sub-segment? How do I convert the model performance into value ($$$)? Am I trying to improve an existing model, or to invent a brand new one? …
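For the question of converting model performance into value, one simple approach is to attach a gain or a cost to each cell of the confusion matrix of a binary classifier. The sketch below assumes such a setting; the function `expected_value` and all the figures are invented for illustration and do not come from a real project.

```python
def expected_value(tp, fp, fn, tn, gain_tp, cost_fp, cost_fn, gain_tn=0.0):
    """Total monetary value of a classifier from its confusion-matrix counts.

    tp, fp, fn, tn: numbers of true/false positives and negatives.
    gain_* and cost_*: value per case, in the currency of your choice.
    """
    return tp * gain_tp - fp * cost_fp - fn * cost_fn + tn * gain_tn


# Hypothetical retention campaign: every number here is made up.
value = expected_value(
    tp=80, fp=40, fn=20, tn=860,
    gain_tp=100.0,  # a customer correctly targeted and retained
    cost_fp=10.0,   # an offer sent to someone who would have stayed anyway
    cost_fn=100.0,  # a customer lost because the model missed them
)
print(value)
```

Agreeing on these gains and costs with the business side is often harder than computing the sum, but it makes the comparison between two models much more concrete than an accuracy score.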

So, you have reached this stage?

Congrats! Now, have fun with scikit-learn!

For fresh data stories, you can follow me on Twitter: @chris_bour