Asking the right question — advice for beginning data scientists

Mr Data Insight
3 min readOct 5, 2018

--

Enthusiastic new data scientists often wonder how they can improve their models, which fancy programming language they should focus on, or how many layers they need for a deep neural network. These are all valid concerns, but hardly the most important things to worry about when you are just starting out. Not only in the beginning, but through your entire career, the most important aspect of modelling is actually the value proposition: does it make sense to spend time and resources on a given business problem?

Part of the equation is defining what you are actually trying to model and what you expect to achieve. Classic business cases such as ‘which customer will churn’, ‘what is the expected price for this house’ or ‘how many clients will buy this product’ all have one thing in common: the end goal of the business is not knowing that exact number, the end goal is optimizing revenue and to do that, they will use your model output to guide business decisions.

You are not alone

At this point I like to refer to my favorite quote from the Netflix series ‘Lost in Space’:

Information does not exist in isolation. There’s consequences. Context.

Your model will always be used in context. It is important to know which context. Your model is not the end goal, there is always a reason for its existence.

From context understanding to asking the right questions

So why do you need to know the context: to ask the right question. I shall clarify this point with a few examples.

Lets say you want to predict housing market prices, why? Do you want to buy low to sell high and become a realtor? In that case the question you should be modelling is the difference between how high you can expect this house to sell and the actual price you expect to buy this house for right now (hopefully before this house enters the real estate public market).

Or do you just want a house to live in yourself? Perhaps you value a garden and a close proximity to public transport. In that case you should model a cost function which takes into account the ratio of the expected price and how well it matches your preferences. Note that in this case, the actual housing price accuracy becomes less strict.

Or you could try to cluster an entire city into lower-class, middle-class and upper-class streets. In this case, the actual value of the house doesn’t even matter anymore, you only need to categorize them into buckets or clusters.

Conclusion

By asking the right question, you can make your model work better in the overall business context. When the question you were trying to solve closely matches the context your model will be used in, the overall performance will be so much better than if you invested a lot more resources into getting an absolute perfect model that models the wrong question.

As a side benefit, if you can reduce the accuracy requirements, you may get away with smaller datasets and use the free time you gain for doing other experiments

Happy modelling!

--

--