Three reasons why a Good Data Scientist should be a Good Listener

Bemali Wickramanayake
4 min read · May 6, 2020

A shortcut for feature identification in any data science problem

Feature identification and feature selection are among the most crucial and tedious parts of putting a machine learning model into practice.

What if you were asked to mine for water on 10 hectares of land? You have the cutting-edge technology, and you are an expert in mining. However, you do not exactly know where to dig.

This is exactly the situation you are in as a data scientist when you are asked to solve a business problem with your expertise.

Using machine learning and artificial intelligence to solve business problems can give any business an edge, but it is a very long shot if you lack context and common sense about how that business behaves.

When we practice as data scientists, we usually start by playing with whatever data set is available to us. We can mechanically run statistical feature exploration and establish which features to use in our model.

However, in solving real-world problems, you need to literally ‘mine’ for data.

Let us return to our original example of digging for water. Without the help of the business, one option is to start digging every square foot. That will not only make things complicated and time-consuming but also wear out your expensive machinery, not to mention burn you out as the expert digger.

Alternatively, you could ask the people who have lived on the land questions such as:

  • Where have you seen plants grow often?
  • Have you ever dug anywhere on the land and found the soil getting moist as you went deeper?
  • Did you ever have a well here before? If so, whereabouts?
  • Have you ever come across rock underground?

These questions by no means discount your expertise. Rather, they help you understand the context of the business problem and pick up important clues about where to dig, saving time and effort for both you and your clients and, more importantly, increasing your chances of getting a good result.

Before even unpacking your gear, sit down and listen to the stories the business has to tell you.

1. Before even looking at the data, understand the context

A few years ago, when I was the business user of a churn prediction engine, I helped the team of data scientists uncover the early signs of a churning customer and explained the crude methods we used to detect churn. That information helped them immensely in establishing the initial set of features to train their model.

The tables turned five years later, when I had to develop a model that predicts fraudulent or high-risk merchants on a payment platform.

Before getting into the data, I listened to all their stories. There were only six such cases in the platform's entire history (although they had cost the business heavily), but understanding how each of them behaved helped me establish the red flags that could identify similar merchants in the future. Without that help, I would have had to test a vast range of variables and features, and chances are I would not have a successful result to this day.

2. Context can be different from business to business

You can be a seasoned data scientist with many years of experience solving real-world problems. Still, it is imperative that you listen to the business.

Not every industry behaves the same, and the same goes for every business.

If you are going to develop a churn prediction engine for a client, listen to the problem carefully, even if you have developed dozens of such models before.

The reasons their customers churn may stem from a situation unique to that business. The stories they tell you will guide you towards the exact data points (features) that will help you build a better model.

3. Starting with too many features will overcomplicate your model and cause overfitting

Even when you have a feature set readily available, narrowing it down will help you build a better-performing model.

When you use too many features, most supervised models tend to overfit. That is, the model performs very well on the training data but fails to predict accurately on test data. Models do this by memorizing quirks and noise in the training data, and that becomes easier the more features we provide relative to the number of examples.
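A minimal sketch of this effect, under stated assumptions: the data is synthetic, the model is a simple one-nearest-neighbour classifier chosen for illustration (not a method from any project described here), and all function names are hypothetical. Only one feature carries information about the label; the rest are pure noise.

```python
import random

def make_rows(n_rows, n_noise, rng):
    """Synthetic rows: one informative feature plus n_noise pure-noise features."""
    rows = []
    for _ in range(n_rows):
        label = rng.choice([0, 1])
        signal = rng.gauss(2.0 * label, 1.0)  # shifts with the label
        noise = [rng.gauss(0.0, 1.0) for _ in range(n_noise)]  # unrelated to the label
        rows.append(([signal] + noise, label))
    return rows

def predict_1nn(train, x):
    """Predict the label of the closest training row (squared Euclidean distance)."""
    nearest = min(train, key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x)))
    return nearest[1]

def accuracy(train, rows):
    """Fraction of rows whose label the 1-NN model predicts correctly."""
    return sum(predict_1nn(train, x) == y for x, y in rows) / len(rows)

def train_and_test_accuracy(n_noise, seed=0):
    """Fit on 200 rows, then score on those rows and on 300 unseen rows."""
    rng = random.Random(seed)
    rows = make_rows(500, n_noise, rng)
    train, test = rows[:200], rows[200:]
    return accuracy(train, train), accuracy(train, test)

for n_noise in (0, 60):
    train_acc, test_acc = train_and_test_accuracy(n_noise)
    print(f"noise features={n_noise:3d}  train={train_acc:.2f}  test={test_acc:.2f}")
```

Because a nearest-neighbour model stores every training row, its training accuracy stays perfect no matter how many features it is given. The accuracy on unseen rows is the number to watch, and it typically falls as noise features crowd out the one useful signal.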

One way of narrowing the feature set is to test subsets of features.

The easier way is to get the business to pick the features they think will matter. Although that will not take you the whole way, it will take you at least seventy percent of it.
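Testing subsets of features could look roughly like the sketch below. It is an illustration under assumptions, not a recipe: the feature names are hypothetical, the data is synthetic, and the nearest-class-mean classifier is just the simplest model that keeps the example self-contained.

```python
import random
from itertools import combinations

# Hypothetical feature names for a churn problem; the last two are pure noise.
FEATURES = ["tenure", "complaints", "noise_a", "noise_b"]

def make_rows(n_rows, rng):
    """Synthetic customers: two features depend on churn, two do not."""
    rows = []
    for _ in range(n_rows):
        churned = rng.choice([0, 1])
        features = {
            "tenure": rng.gauss(-1.5 * churned, 1.0),
            "complaints": rng.gauss(1.5 * churned, 1.0),
            "noise_a": rng.gauss(0.0, 1.0),
            "noise_b": rng.gauss(0.0, 1.0),
        }
        rows.append((features, churned))
    return rows

def class_means(rows, label, cols):
    """Mean of each selected column over the rows with the given label."""
    members = [x for x, y in rows if y == label]
    return [sum(x[c] for x in members) / len(members) for c in cols]

def holdout_accuracy(train, valid, cols):
    """Fit class means on train, score nearest-mean predictions on valid."""
    centers = {label: class_means(train, label, cols) for label in (0, 1)}

    def predict(x):
        return min(
            centers,
            key=lambda label: sum((x[c] - m) ** 2 for c, m in zip(cols, centers[label])),
        )

    return sum(predict(x) == y for x, y in valid) / len(valid)

def best_subset(train, valid, features):
    """Score every non-empty subset; ties go to the smaller subset."""
    scored = [
        (holdout_accuracy(train, valid, cols), cols)
        for size in range(1, len(features) + 1)
        for cols in combinations(features, size)
    ]
    return max(scored, key=lambda item: (item[0], -len(item[1])))

rng = random.Random(0)
rows = make_rows(600, rng)
train, valid = rows[:300], rows[300:]
score, cols = best_subset(train, valid, FEATURES)
print(f"best subset: {cols}  validation accuracy: {score:.2f}")
```

Exhaustive search doubles in cost with every extra feature, so it is only practical for a short candidate list. The winning score is also optimistic, because the same validation rows were used to choose the subset, so a separate test set should confirm it.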

Data science is not a science you can master with a set of hard-and-fast rules. It is somewhat of an art. It takes a great deal of listening and empathy toward the problem you are trying to tackle. Resonating with and internalizing the problem can only help a good data scientist think outside the box and understand where to go and mine.

Bemali Wickramanayake

A business strategist and a self-taught data visualization expert. Runs a business helping other businesses make better decisions with data. And a reader.