Thanks Emmanuel. Nice framework and good tips.
Aamir Ahmed

Thank you Aamir!

  1. Unsupervised methods can be tackled in the same way once you have defined a good way to judge their performance. If you are clustering items, for example, how will you judge whether the clustering works from a product standpoint? Once you are there, the rest flows nicely.
  2. I personally am gradually moving away from notebooks for anything beyond prototyping, as I have found it easier to iterate, test, and deploy using other tools.
  3. In general, yes. How you decide what goes into each split is crucial, as we describe in the article. The actual percentage breakdown changes based on how much data you have.
  4. No matter how complex the model is, the answer is often in the data. Look at your input data, your pre-processed data, your post-processed data, your post-processed labels, etc…
  5. I would look at errors to see which dataset has limiting factors. In most cases, runtime per example does not vary between train and test.
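
To make point 1 concrete, here is a minimal sketch of pairing an intrinsic clustering metric with a manual product check. It assumes scikit-learn; the toy 2D data is a stand-in for whatever item representation you actually cluster.

```python
# Sketch: judge a clustering both by an intrinsic metric (silhouette)
# and by surfacing examples per cluster for a product-level eyeball test.
# Assumes scikit-learn; the data below is synthetic for illustration.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Toy data: three loose groups of items (e.g. item embeddings).
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0, 3, 6)])

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Intrinsic quality: how well-separated the clusters are (closer to 1 is better).
print(f"silhouette: {silhouette_score(X, labels):.2f}")

# Product-standpoint check: pull a few members of each cluster and ask
# whether grouping them together would actually make sense to a user.
for cluster_id in sorted(set(labels)):
    members = X[labels == cluster_id][:3]
    print(f"cluster {cluster_id}: sample members\n{members}")
```

Once you have a judgment like this that you trust, you can iterate on the clustering the same way you would on a supervised model.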
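
For points 4 and 5, a simple way to put this into practice is to compare error rates across splits and pull out the misclassified examples to inspect by hand. This sketch assumes scikit-learn; the synthetic data and logistic regression are stand-ins for your own pipeline.

```python
# Sketch: per-split error analysis. Compare error rates on train vs. test;
# if one split's error is much higher, that dataset is the limiting factor,
# and the misclassified examples themselves are what to look at next.
# Assumes scikit-learn; data and model are illustrative stand-ins.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LogisticRegression().fit(X_train, y_train)

for name, Xs, ys in [("train", X_train, y_train), ("test", X_test, y_test)]:
    preds = model.predict(Xs)
    mistakes = Xs[preds != ys]  # the concrete examples worth inspecting
    print(f"{name}: error rate {np.mean(preds != ys):.2f}, "
          f"{len(mistakes)} examples to inspect")
```

Looking at the actual mistaken examples (their inputs, pre-processed features, and labels) is usually far more informative than the aggregate numbers alone.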