Measure twice, cut once — Questions before starting with machine learning

Semantics3
The Ecommerce Intelligencer
4 min read · Feb 28, 2017

With many articles promoting machine learning as the magic bullet to solve tough problems, the pressure has never been higher for data scientists and software developers to deliver impressive AI systems for their own tasks.

However, the path to being a successful practitioner is a long journey and it is better to be aware of the pitfalls and tread lightly before rushing in.

In this post, I would like to list the questions that I prefer to ask at the start of every project. More often than not, clearing these upfront has helped save valuable time in later development stages.

Firstly, what do you want?

As obvious as it may sound, the most important step is defining a valid problem statement. Starting from a generic directive, “we want to improve sales”, is not good enough unless you are also able to list the metrics by which you will measure yourself.

Listing the available tools also helps you develop a feel for the problem. While "what" questions are the most common at this stage, it is also useful to ask "why" questions.

1. What problem are you trying to solve?

  • We want to forecast our monthly sales using visitor metrics, or we want to identify churning customers based on their buying behavior.

2. Why do you think this will help?

  • Will optimizing these metrics help? How will identifying a correlation between education level and T-shirt size translate into an actionable task?

3. What existing problems are most similar?

  • Is this a straightforward classification problem? Or should I cast this as a regression task over specific features?

This process also helps you refine your skills as a data scientist. Casting irregular requirements into tractable optimization goals is a valuable skill, and one that most tutorials and coding workshops do not teach directly.

I have come to believe that the biggest value an expert provides is in their ability to pattern match a given problem against their learnt repository of past experiences. By identifying reliable starting points, and close parallels to previous problems, valuable progress can often be achieved with just minor tweaks to existing solutions.
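To make the sales forecasting example from the list above concrete, here is a minimal sketch of what "listing the metrics by which you will measure yourself" could look like. The monthly numbers are made up for illustration, and the choice of a straight-line fit and mean absolute error is an assumption on my part, not a recommendation for any particular dataset. The point is only that a stated metric and a naive baseline give you something to compare against.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical monthly data: site visitors and units sold.
visitors = [1200, 1500, 1100, 1800, 2100, 1700, 2400, 2000]
sales = [300, 380, 290, 450, 520, 430, 610, 500]

# Hold out the last two months to check the forecast.
train_x, test_x = visitors[:6], visitors[6:]
train_y, test_y = sales[:6], sales[6:]

slope, intercept = fit_line(train_x, train_y)
model_preds = [slope * x + intercept for x in test_x]

# Naive baseline: always predict the average of the training months.
baseline_preds = [sum(train_y) / len(train_y)] * len(test_y)

model_mae = mean_absolute_error(test_y, model_preds)
baseline_mae = mean_absolute_error(test_y, baseline_preds)
print(f"model MAE: {model_mae:.1f}, baseline MAE: {baseline_mae:.1f}")
```

If the model cannot beat the naive baseline on the metric you agreed on, that is worth knowing before investing in anything more elaborate.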

What do you have?

If you wish to increase the visibility of your blog, but do not measure the number of visitors, then you have problems bigger than not using machine learning.

For every problem that you wish to solve, it is worth the time to make sure that you have the necessary data behind it. Also, collecting more data could drastically improve the options available for later stages.

A skill expected of most data scientists is the ability to develop an intuition about the underlying data. Even simple questions like the ones below can help you avoid painful realizations later.

1. What assumptions were made during collection?

  • We only have self-reported income numbers from college graduates, but want to estimate salary growth

2. Are the samples representative of the real world?

  • We surveyed people with high-speed internet access, but would like to predict global internet coverage

3. Are the features measured distinctly useful or redundant?

  • We have the temperature in Celsius, Fahrenheit and Kelvin. Can you forecast local weather patterns from just these?
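The third question in the list above can often be checked mechanically. The sketch below flags pairs of features that are almost perfectly correlated, using the temperature example: the same readings in three units, plus a humidity column. All values, the `redundant_pairs` helper, and the 0.999 threshold are hypothetical choices for illustration. A high correlation only catches linear redundancy, so treat this as a first pass rather than a complete check.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.

    Assumes neither sequence is constant (non-zero variance).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def redundant_pairs(features, threshold=0.999):
    """Return pairs of feature names whose absolute correlation meets the threshold."""
    pairs = []
    for (name_a, col_a), (name_b, col_b) in combinations(features.items(), 2):
        if abs(pearson(col_a, col_b)) >= threshold:
            pairs.append((name_a, name_b))
    return pairs


# Hypothetical readings: one temperature series in three units, plus humidity.
celsius = [12.0, 15.5, 9.0, 21.0, 18.5, 25.0]
features = {
    "celsius": celsius,
    "fahrenheit": [c * 9 / 5 + 32 for c in celsius],
    "kelvin": [c + 273.15 for c in celsius],
    "humidity": [80.0, 40.0, 75.0, 55.0, 90.0, 35.0],
}

flagged = redundant_pairs(features)
print(flagged)
```

Here the three temperature columns flag each other, while humidity does not: three columns, but only one independent measurement of temperature.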

What solutions would be acceptable?

As in many other cases, you can’t always get what you want. Being able to identify trade-offs is another important skill to make sure you don’t waste time chasing fictional whales.

I have often seen engineers hesitant to share their not-yet-perfect solutions. They are also the people who are frequently surprised by the level of manual work that non-technical people put up with. You might consider a model that reduces manual tagging by 20% a failure, but it might still be a huge improvement for the actual end users. Again, ask, or you will never know.

This section also ties in to the earlier part about defining your problem statement clearly. If you have spent sufficient time reasoning about your problem statement, these questions should be fairly straightforward. If not, it is back to the drawing board.

Before you fire up your engines

This is just my take on aspects that I found useful to deliberate on before embarking on a solution. While the examples here are specific to machine learning, the processes might still be ripe for transfer learning to other domains as well. Remember, you are almost always forced to pick two from { fast, good, cheap } — weigh them carefully before jumping in.

So, for your next project, before firing up your editor of choice, spend some lead time just thinking. The questions that you end up asking might present whole new angles on your problem — leading you to develop better solutions, one question at a time.

Written by Ramanan Balakrishnan and the Semantics3 Team in Bengaluru, Singapore, and San Francisco
