Supervised machine learning for consultants: Part 2

Joe Feldman
Cervello, a Kearney Company
8 min read · Nov 10, 2020

Clean data: the foundation of machine learning

In my first article, I broke down some of the machine learning terminology that’s thrown around these days, distinguishing between AI, machine learning, and predictive modeling as related but separate entities. I spoke about the differences between unsupervised and supervised machine learning and separated models from algorithms. The goal was to convince you that using supervised machine learning for prediction might be worth your while.

But here’s the catch: just because machine learning is better doesn’t make it easy to implement. There is no magical open-source software where you can throw in any old data and get powerful models built with machine learning. Using machine learning properly requires a lot of attention to detail, but I promise you the reward is worth the work.

In this post, I’ll discuss how clean data is the foundation of any machine learning. Ready for another drawn-out metaphor? Any tasty dish starts with great ingredients, right? As we’ll see, getting your data in the right format — clean data — is the most crucial aspect to effectively use machine learning, just like cooking with fresh ingredients is the best way to make delicious food.

What are we cooking?

So, let’s start from the beginning. You’re handed a data set, and you’re asked to build a model to predict some quantity in that data set. This might be for a client or for an internal project. Either way, a model that can offer strong predictions will optimize your business process in some way.

To bring these concepts to life, I’ll be using an example data set that is publicly available. It comprises the historical sales for a product in 200 markets along with the manufacturer’s advertising budgets for TV, radio, and newspaper in each market (see table 1).

Table 1: Sales and Advertising Spend by Market
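If you’d like to follow along in code, here is a minimal sketch of loading a data set like this with Python’s pandas library. The file name Advertising.csv and the column names are assumptions for illustration, not something specified in this article.

    import pandas as pd

    # Each row is one market: advertising spend on TV, radio, and newspaper,
    # plus the sales recorded in that market.
    # (File name and column names are assumed for illustration.)
    ads = pd.read_csv("Advertising.csv")
    print(ads.head())   # peek at the first few markets
    print(ads.shape)    # (number of markets, number of columns)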

Suppose we’re on the marketing team at this company. Our task is to predict product sales based on advertising spend. An effective marketing campaign could be built using this model. Since we need to build a predictive model, we can home in on supervised machine learning for this problem.

This means we need a little math. The good news is that we’re not doing the calculations; that’s what the computer is for. However, we are going to need to understand at a high level what we want the computer to do. So, let’s break down the model we’re trying to build with this equation:
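sales = f(TV spend, radio spend, newspaper spend) + ε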

Translation: The model that we’re going to build is a function that takes as inputs the spend in each medium for a particular market and spits out the prediction of sales for that market.

But I’ve left out that pesky epsilon

The epsilon is the random error: we’re never going to be able to perfectly predict sales (or else we wouldn’t be working consultants’ hours), so we acknowledge that our model for sales is subject to some random deviation, or noise, that we can’t explain.

So, the machine-learning task seems simple enough: let’s learn the function that makes that random error as small as possible by finding patterns in the data. The way we’re going to learn that function is with an algorithm. But most algorithms in supervised machine learning require very specific structures in the data before they can be implemented. Before you start cooking up your model, make sure you have all the right ingredients.

The shopping list of ingredients for machine learning

1. We need complete observations in our data. For every observation of sales, we need matching observations for each of the drivers that we plan to use to predict sales (see table 2):

Table 2: A Data Set with Missing Observations

The reason for this is simple. Computers are stubborn. Our algorithm tells the computer, in no uncertain terms, that the function it’s going to learn takes all of the drivers as inputs. In the above snippet of the data set, we see that there are several missing values, signified by NA entries. If there are incomplete or missing observations in the data, our algorithm can’t use them, and they must be discarded.
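As a concrete sketch, pandas can report and drop incomplete observations like this (reusing the hypothetical ads DataFrame from above):

    # Count missing (NA) values in each column, then drop any market
    # whose row is incomplete so every observation has all of its drivers.
    print(ads.isna().sum())
    ads_complete = ads.dropna()
    print(f"Kept {len(ads_complete)} of {len(ads)} markets")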

2. We need a lot of data. The true advantage of machine learning lies in the algorithm’s ability to recognize nuanced patterns in data. In our case, the algorithm is learning a potentially complex relationship between sales and advertising spend in different mediums. So far, I’ve only shown a snippet of the dataset, but in the full set, there are 200 observations (or rows), which correspond to the 200 markets for which sales and advertising spend were recorded.

This is a solid amount of data, but we can always use more. As humans, we make important decisions by collecting as much data as possible. That logic carries over to machine learning. The more data, the more confident the algorithm is in its recognition of intricate patterns, and the better the model is. While there is no cutoff, live by this maxim: “The more, the merrier, the better.”

We can grow our data set in two ways. First, we can collect the same information in more markets, making our table longer; unfortunately, this is not always possible. Second, we can make the data set wider.

To make it wider, we can collect information on additional drivers, such as whether there was a promotion for the product in a given market. Again, this may be difficult since you will have to comb external data sources for such information.

The more common way to widen the data set is through feature engineering, which means we can create additional drivers by taking functions of existing drivers. In table 3, I have engineered the feature (driver) “total spend” by simply adding TV, radio, and newspaper spend.

Table 3: Total Spend — A Newly Engineered Feature
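A minimal sketch of that step in pandas, assuming the spend columns are named TV, radio, and newspaper, might look like this:

    # Engineer a new driver from existing ones: total advertising spend.
    ads_complete["total_spend"] = (
        ads_complete["TV"] + ads_complete["radio"] + ads_complete["newspaper"]
    )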

Feature engineering is a crucial skill in both supervised and unsupervised machine learning. If we can cleverly engineer new drivers from existing data, we save time in data collection while simultaneously giving our algorithm more information that will help in our estimation of that unknown function.

Although there are an infinite number of features that you could engineer, we need to be careful. As a general rule, and for most practical applications, we must make sure that when using supervised machine learning, our data set is much longer than it is wide — we should always have many more observations than drivers. We should also only engineer drivers that will contribute information to the model we’re going to learn, which brings us nicely to the final item on the list.

3. Make sure the drivers are really drivers. It’s vital that the drivers we have identified actually exhibit some sort of relationship with sales. If we input into our model a bunch of drivers with no empirical relationship to sales, such as average height of customers in the market, we can’t hope to build a good model with machine learning.

The simplest way to confirm the presence of meaningful input-output relationships is to plot inputs against outputs. On one hand, a random, cloud-like scatter indicates no relationship (plot 1). On the other, the presence of a noteworthy relationship will be clear: there will be some structure to the plots (plot 2).

Plot 1: No Relationship between Avg. Customer Height and Sales

The exploratory plot confirms we should not use average customer height as an input into our model. Below, we see that there is an association between advertising spend and sales in each medium: if you spend more on advertising, you generally see an increase in sales. This relationship is strongest with TV spend and weakest with radio spend.

Plot 2: Noteworthy Relationships between Advertising-Spend and Sales
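If you want to produce these exploratory plots yourself, a minimal sketch with matplotlib, under the same column-name assumptions as before, might look like this:

    import matplotlib.pyplot as plt

    # Scatter each candidate driver against sales to eyeball the relationship.
    drivers = ["TV", "radio", "newspaper"]
    fig, axes = plt.subplots(1, len(drivers), figsize=(12, 4), sharey=True)
    for ax, driver in zip(axes, drivers):
        ax.scatter(ads_complete[driver], ads_complete["sales"], alpha=0.6)
        ax.set_xlabel(f"{driver} spend")
    axes[0].set_ylabel("sales")
    plt.tight_layout()
    plt.show()

A structured, upward-sloping scatter suggests a driver worth keeping; a shapeless cloud suggests it adds little.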

Someone has to chop the onion

So, I’ll be honest: data collection and formatting are the least sexy and most boring aspects of this process. But good data is also the essential ingredient in powerful machine learning. In this post, I’ve presented a particularly nice data set to motivate the process of data collection and feature engineering. There were no missing observations (besides the ones I deviously planted for the purpose of example), and the drivers exhibited clear relationships with the prediction target.

When you decide to use machine learning for your prediction task, I can almost guarantee that the first data set you look at will not be as manicured as the one I’ve used here. But don’t fret. Spend time getting your data set into the right form as I’ve laid out above, and machine learning will bear the fruits of that labor. Remember: always opt for more data, make sure you have complete observations, and cleverly yet cautiously engineer more features to bring information into your model.

If you’re reading this blog, you probably already had some idea of the importance of data in the business world. Now, you understand the value of good, clean, and thorough data within the context of machine learning. At this point, we can commence building our model, and that starts with — gasp — an algorithm. In my next post, “Algorithms: the engine behind machine learning,” I’ll break down these recipes that seem to run the world. Then, the fun really begins.

About Cervello, a Kearney company

Cervello is a data and analytics consulting firm and part of Kearney, a leading global management consulting firm. We help our leading clients win by offering unique expertise in data and analytics, and in the challenges associated with connecting data. We focus on performance management, customer and supplier relationships, and data monetization and products, serving functions from sales to finance. Find out more at Cervello.com.

Joe Feldman is a third-year Ph.D. student in the Department of Statistics at Rice University and a Data Scientist for Cervello, a Kearney Company.