Building Machine Learning based Products

Part II: The Data Science Process

Jon Hurlock
15 min read · Dec 12, 2016
Damian Young playing a Data Scientist in Netflix’s House of Cards

This is the second post in a series on building machine learning based products.

In Part I, I outlined an example of a machine learning based product, as well as the differences between analytics/business intelligence, data science and machine learning.

In this post I will explain the fundamental steps a data scientist may take to build a machine learning based product. This may act as a guide for those working with data scientists, such as product managers, to understand the process a data scientist may go through whilst trying to build part of a product. It will also introduce the reader to some common terminology and concepts used by data scientists.

Finally, in Part III, I will explain some of the steps towards productionising a solution, and how you may wish to test your product.

I’ve tried to present this content in an accessible way. If you have any comments, suggestions or queries please feel free to add a comment.

Scenario Recap

In Part I, I described a scenario whereby we have an objective to try and increase the conversion rate for an online shopping site.

A related products module on an e-commerce site

We briefly defined a probable machine learning product as a related products module, whereby we are trying to understand how we can offer a user great alternative products (clothing items) which they are likely to purchase, and which are in some way similar to the product they are currently viewing.

An Introduction to the Process

There are several steps a data scientist will take as part of their process. Some of these steps are quicker and easier than others; some have to be repeated depending on data availability, data quality and the output of previous steps. It is also important to note that in some cases you will have to abandon the solution entirely, depending on data availability, data quality and the results derived from this process.

This is an exploratory and creative process.

Whilst this process borrows tools and approaches from software engineering, it is important to remember it is not software engineering. It is therefore important not to treat it as a traditional software engineering process.

There are six main parts of the process, which are listed below. In the remainder of this post I will explain each part in more detail, and how it relates to the previous parts of the process.

  1. Understanding the Requirements
  2. Research Appropriate Models / Algorithms
  3. Obtain Data and Understand the Data Sources
  4. Feature Generation & Dimensionality Reduction
  5. Train a Model
  6. Evaluate your Model

Step 1: Understanding the Requirements

The first and most important step is to understand the requirements: what is the problem we are tackling, and why is it important to solve? It is very important to understand what behaviour we want to influence or allow, and to be able to quantify this objective. For example, we want to increase conversion rate by x%.

Setting expectations for tasks via xkcd

This should all be understood before proceeding. This should be clearly defined and communicated by the product manager / product owner / delivery manager to the data scientist. A discussion should take place between these parties to make sure a shared understanding is achieved, and that the objective is feasible from a technical perspective.

Step 2: Research Appropriate Models / Algorithms

There is no point re-inventing the wheel.

Like any good engineer, a good data scientist should be able to look at a large problem and break it down into several smaller problems. By breaking a large problem down into smaller ones, a data scientist should be able to see the “type” of problem they are facing, and what kinds of models and algorithms are applicable.

A cheat sheet for data scientists, using scikit-learn based on the type of problem they are facing, and the data at hand.

There are several types of problem which machine learning lends itself to; in the above image we can see some of them. Depending on the amount, type and quality of data available, the data scientist can choose from some of the methods listed above. It is important to note that the above is not a complete list, and it should be used only as a guide.

It is also important to see how others are approaching similar problems. This is normally done by looking at academic publications, but also from research blogs such as those listed below:

Depending on the industry you are currently in, there are probably more or less relevant blogs. For instance, in the fashion space the following two are highly relevant:

Whilst these blogs probably won’t give you all the details, they may act as a good starting place to see what others are doing in the industry or space that you are operating in.

At this point, I would recommend that you and your data scientists sit down so that they can take you through what others are doing in the industry and propose an approach to solving the problem. They should do this in as much detail as possible.

The above can be a big ask of your data scientist(s), though it is an important thing for you both to do.

The reason it can be difficult for a data scientist is that there are many decisions to make throughout the process of building a model, and those decisions will be made based on the results of the steps described in the rest of this article.

Creating Tickets or a Plan of Action

I would recommend that at this point you generate a list of tickets, or a workflow, with some estimation of complexity or time-boxing (for research-based tasks) around the proposed tasks.

I would highly recommend having a “backstop” ticket. This is a ticket with a size of 0, which indicates that an MVP has been delivered. I would then recommend you generate the tickets needed to get you to your backstop ticket.

You should sit down with your data scientists and get them to estimate tickets in terms of the complexity of getting to the backstop, not necessarily days to complete.

This will allow you to see how your data scientist is progressing through their process. It also lets the data scientist understand their strengths and weaknesses; for instance, they may excel at certain parts of the process and take a little longer at others.

By including other data scientists in the process, a shared understanding of the complexity and the work can be created. Alternative suggestions for approaching the problem can be brought forward here.

By creating a backlog of sized work, you will be able to see a possible route to delivering the data science part of this work, and you will also be able to give delivery estimates once your data scientist starts their work.

Plan of Action for Generating Related Products Which Will Increase Conversion

To deliver the desired product we need to know which products are similar. We might also want to know which products convert well, and display a combination of the two.

We could use techniques such as collaborative filtering, content-based filtering or a hybrid of the two to understand which products are similar. For ease of explanation we will use collaborative filtering, even though it is highly likely to underperform against a hybrid approach.

Collaborative Filtering — A Really Brief Explanation

Collaborative filtering is used to suggest items based on a user’s past interactions, and the interactions between items and other users.

Say we have the following table showing brands people have viewed on a fashion site:

| User / Brand | Burberry | Michael Kors | Nike | Adidas |
|--------------|----------|--------------|------|--------|
| Alice | Y | Y | N | N |
| Bob | N | N | Y | Y |
| Chloe | Y | N | Y | N |
| Denise | ? | Y | ? | ? |

In the above table we can see that Alice has viewed the following brands: Burberry and Michael Kors (indicated by Y), though she has not viewed products from Nike and Adidas (indicated by N).

Imagine we are trying to suggest brands to Denise (she features in the bottom row). We know that Denise has viewed Michael Kors before.

However, we need to suggest a relevant brand for her. If we look at users similar to Denise, we can see that Alice is similar in that she has also viewed Michael Kors.

Since Alice and Denise are (possibly) similar, maybe we should use Alice’s viewing history to influence what Denise is suggested. Alice viewed Burberry, so we can assume that Burberry might be a better brand to suggest to Denise than Nike or Adidas.
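To make the intuition concrete, here is a minimal sketch of the user-similarity idea over the brand-view table above, using cosine similarity from scikit-learn. The Y/N values are simply encoded as 1/0, and Denise’s unknown views are treated as 0 purely to keep the example simple.

```python
# A minimal sketch of the user-similarity idea from the brand-view table above.
# Y/N is encoded as 1/0; Denise's unknown views are treated as 0 for simplicity.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

brands = ["Burberry", "Michael Kors", "Nike", "Adidas"]
views = np.array([
    [1, 1, 0, 0],  # Alice
    [0, 0, 1, 1],  # Bob
    [1, 0, 1, 0],  # Chloe
    [0, 1, 0, 0],  # Denise (only Michael Kors is known)
])

# How similar is Denise (row 3) to each of the other users?
similarity = cosine_similarity(views[3].reshape(1, -1), views[:3])[0]
most_similar = int(np.argmax(similarity))  # index 0, i.e. Alice

# Suggest brands the most similar user has viewed that Denise has not.
suggestions = [brand for brand, theirs, hers in zip(brands, views[most_similar], views[3])
               if theirs and not hers]
print(suggestions)  # ['Burberry']
```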

Now, for our use case we will want to suggest products rather than brands, and we might want to use purchase history rather than viewing history.

Our hypothesis here is that by suggesting products which were bought by similar users, the user at hand might be enticed to buy these products, as they are personalised to them.

I won’t describe the actual implementation steps for producing this here; the purpose of this article is to explain the overall process.

Step 3: Obtain Data and Understand the Data Sources

Most machine learning problems can be bucketed into one of two categories: supervised learning and unsupervised learning. A few more exist, but for simplicity I will explain just these two.

Supervised Learning:

This is where you are trying to predict something based on learning from a set of labelled data.

A classic example of supervised learning is email spam detection, whereby you show the machine a set of messages which are marked as spam (labelled data) and another set which is marked as not spam (also labelled data). The machine tries to learn what makes an email spam or not, and should then be able to predict, given a new email, whether it is spam or not.
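As a rough illustration, a spam classifier along these lines could be sketched with scikit-learn. The emails and labels below are made up, and a real system would need far more data.

```python
# A toy supervised spam classifier: learn from labelled emails, then predict.
# The emails and labels below are made up; a real system would need far more data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

emails = [
    "Hi Jon, thanks for signing up.",
    "Good Fortune, you have won $5 million dollars",
    "Want free pain medication",
    "Hey Jon, looking forward to your birthday",
]
labels = ["not spam", "spam", "spam", "not spam"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, labels)  # learn from the labelled examples

print(model.predict(["You have won free medication"]))  # most likely ['spam']
```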

If you’re looking for a really good, fairly non-technical explanation of supervised learning, I would recommend watching the following video:

Udacity — Georgia Tech — Machine Learning — Supervised Learning

Unsupervised Learning:

Unsupervised learning allows us to find patterns or structure in data where there is no labelled data.

An example of unsupervised learning is clustering. Say we want to understand the users of a social network; we could apply an unsupervised approach to group the different users of our social network. For example, we might find a cluster of users who talk about technology, and another cluster who talk about movies.
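As a rough sketch, clustering users by what they talk about might look something like the following. The bios and the choice of two clusters are purely illustrative assumptions.

```python
# A toy unsupervised example: cluster short user bios into two groups with k-means.
# The bios and the choice of two clusters are illustrative assumptions.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

bios = [
    "python code and machine learning experiments",
    "I write python and talk about machine learning",
    "new movie trailers and movie reviews",
    "movie news, trailers and film reviews",
]

features = TfidfVectorizer().fit_transform(bios)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
print(clusters)  # e.g. [0 0 1 1] — a 'technology' cluster and a 'movies' cluster
```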

If you’re looking for a really good fairly non-technical explanation of unsupervised learning I would recommend watching the following video:

Udacity — Georgia Tech — Machine Learning — Unsupervised Learning

Now that we have understood the task and researched possible solutions (in this case collaborative filtering), we need to obtain some labelled data (as this is a supervised learning task) and check that we fully understand the data we’ve collected.

This is probably the first major point at which a machine learning product can fail.

If you are trying to use a supervised learning algorithm and you don’t have enough data to train it with, or the data you do have is unreliable, then this is where you should possibly just cut your losses.

In terms of how much data you need to be able to train a model, it depends on the complexity of the task, though as a general rule, the more quality data the better.

I would recommend you and your data scientists watch the following video regarding problems you can encounter when trying to make machine learning products.

Ben Hamner from Kaggle gives words of warning, about common mistakes encountered when doing machine learning.

It is important you understand what data you are feeding into your model, as you don’t want it to latch onto a feature that leaks the answer, or to learn a pattern that is an artefact of how the data was collected.

Once your data scientist has gathered a dataset and understood all of the data in it, they will split the data into three sets:

  • Training set
  • Test set
  • Validation set

These data sets serve different purposes, though ultimately they are what the machine will learn from and what we will judge its performance against.

The format of these files will be similar, in that they will contain some features and a label. For instance, in the spam example you may have a file which looks like the following:

| Email Subject     | Email Message                 | Is it Spam? |
|-------------------|-------------------------------|-------------|
| Sign up | Hi Jon, thanks for signing up.| N |
| $5million dollars | Good Fortune, you have won | Y |
| PAIN Killer$ | Want free pain medication | Y |
| Re: Birthday | Hey Jon, Looking forward to | N |

The input features are the email subject and the email message; the label indicates whether it is spam or not.

Below is a brief explanation of the function of each of these datasets.

Training Set:

The training set is the dataset from which the machine initially learns.

Validation Set:

This data set is used to tune the model. It is used to understand whether you’re overfitting or underfitting (we will explain these concepts later in this article).

Test Set:

This data set is used for evaluation purposes. It is used to confirm the actual predictive power of the machine.

These data sets are normally split on something like a 60/20/20 basis. However, this ratio is not set in stone; normally the training set will simply have the largest share.
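As a sketch of how such a split might be done in practice: scikit-learn ships a single train_test_split helper, so a common pattern is to call it twice. The features and labels below are dummy placeholders for whatever data you have gathered.

```python
# Splitting a dataset roughly 60/20/20 into training, validation and test sets.
# scikit-learn's train_test_split only splits two ways, so a common pattern is
# to call it twice. The features/labels here are dummy placeholders.
import numpy as np
from sklearn.model_selection import train_test_split

features = np.arange(100).reshape(100, 1)   # stand-in for your real features
labels = np.random.randint(0, 2, 100)       # stand-in for your real labels

# First hold out 40% of the data, then split that 40% in half.
X_train, X_rest, y_train, y_rest = train_test_split(
    features, labels, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```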

The data we may use for the related products module could be something similar to the sample found below:

| User Id | Purchased Product Id |
|---------|----------------------|
| 1 | 45 |
| 1 | 49 |
| 2 | 49 |
| 3 | 88 |

Here the first column is a list of unique identifiers for users, and the second column is a list of unique product identifiers for purchased products.

For example, in the above data there are 3 users (1, 2, 3), who combined have bought 4 products (45, 49, 49, 88), 3 of which are unique (45, 49, 88). Product 49 was bought twice: once by user 1 and once by user 2.
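As a sketch, this purchase log could be turned into a user-by-product matrix with pandas; the column names below are assumptions made for illustration.

```python
# Turn the purchase log above into a user-by-product matrix, where each cell
# counts how many times a user bought a product (0 if never).
import pandas as pd

purchases = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "purchased_product_id": [45, 49, 49, 88],
})

user_item = pd.crosstab(purchases["user_id"], purchases["purchased_product_id"])
print(user_item)
# purchased_product_id  45  49  88
# user_id
# 1                      1   1   0
# 2                      0   1   0
# 3                      0   0   1
```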

Step 4: Feature Generation & Dimensionality Reduction

Feature generation and dimensionality reduction are optional steps, though could be very beneficial.

Feature generation is the process of taking unstructured data and defining features for potential use.

As an example, say you want to perform a classification task on text (say sentiment analysis); you may want to generate a cleaner corpus to use. To do this you might add some rules, e.g. removing stop words (common words in the language such as ‘the’, ‘is’, ‘a’, ‘an’, ‘of’, which tend to carry little value), or you might apply some other rules.
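A small illustrative sketch of this kind of cleaning rule follows; the stop-word list is deliberately tiny.

```python
# A tiny feature-generation step: lower-case the text, keep only word characters
# and drop a handful of stop words before building features from what remains.
import re

STOP_WORDS = {"the", "is", "a", "an", "of"}

def clean(text):
    tokens = re.findall(r"[a-z']+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]

print(clean("The delivery of an order is a big part of the experience."))
# ['delivery', 'order', 'big', 'part', 'experience']
```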

Dimensionality reduction is a way to find which of the features contribute most towards predicting the outcome value. It can be a useful process, as it can:

  1. Reduce the time and storage space required when training and running your models.
  2. Improve the performance of the model.
  3. Make the data easier to visualise when reduced to very low dimensions such as 2D or 3D.

An example of dimensionality reduction:

Say you are trying to predict house prices; you may find that the size, location, age and condition of a property are good indicators of its price.

However, we could also factor in the number of lights in the house. This might be a rough indicator of price, though we can probably estimate house prices with less error using the previous factors alone.
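As a hedged illustration of this idea, scikit-learn’s feature selection tools can score features by how much they help predict the target. The house data below is synthetic and the coefficients are made up.

```python
# A synthetic illustration: score house features by how much they help predict
# price, then keep only the strongest ones. The data and coefficients are made up;
# price is driven by size and age, while 'lights' is pure noise.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
size = rng.uniform(50, 250, 200)        # square metres
age = rng.uniform(0, 100, 200)          # years since construction
lights = rng.integers(5, 40, 200)       # number of light fittings
price = 3000 * size - 2000 * age + rng.normal(0, 20000, 200)

X = np.column_stack([size, age, lights])
selector = SelectKBest(score_func=f_regression, k=2).fit(X, price)
print(selector.get_support())  # likely [ True  True False ] — 'lights' adds little
```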

In this article we won’t cover feature extraction/selection or dimensionality reduction any further, so as to keep our related products module as simple as possible.

Step 5: Train & Test your Model

This is where you take your datasets (Training & Validation) as well as your ML algorithm (we stated we would use Collaborative Filtering in step 2), and combine them to make the machine learn.

The term ML model refers to the model artifact that is created by the training process.

The point of this step is for the algorithm to find patterns in the training data that we provided it with, which will allow the machine to find correlations with the target attribute that we want to predict.

The correlations and weightings for these factors are the output: an ML model that captures these patterns.
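For our related products example, a minimal sketch of what that model artifact could be is an item-to-item similarity matrix computed from the user-by-product matrix built in Step 3; the data and helper function below are illustrative, not the only way to do this.

```python
# "Training" for our simple collaborative filter: compute item-to-item similarities
# from the user-by-product purchase matrix built in Step 3. The resulting similarity
# matrix is the model artifact we would keep. Data and helper are illustrative.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

product_ids = [45, 49, 88]
user_item = np.array([   # rows: users 1-3, columns: products 45, 49, 88
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
])

item_similarity = cosine_similarity(user_item.T)  # shape (3, 3)

def related_products(product_id, top_n=2):
    """Return the products most similar to the given product."""
    idx = product_ids.index(product_id)
    ranked = np.argsort(item_similarity[idx])[::-1]
    return [product_ids[i] for i in ranked if i != idx][:top_n]

print(related_products(45))  # [49, 88] — 49 shares a buyer with 45, 88 does not
```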

This is where you have to be careful that your training and validation datasets don’t allow for over- or underfitting. Think of this as a Goldilocks and the three bears kind of situation.

A Basic Introduction to Underfitting and Overfitting

Let’s say we are trying to predict house prices, and to do this we are going to create a “line of best fit”.

Say we have the following data:

Graph showing House Price vs. House Size

As you can see, the value of a property goes up as the house size increases. If we were to draw a straight line we could capture this trend, and it might act as an OK predictor of house prices. We can see an example of this below:

Graph showing House Price vs. House Size, with a line of best fit.

As we can see, the line shows that as house size goes up, so does the price. However, if we look at the accuracy of the line, it is not that good, as it is not complex enough to capture all the prices and sizes accurately. This is underfitting: you have built a predictor which is too generalised for the task at hand.

There is another case you want to look out for, and this is called overfitting. This is when you build a model which creates a function that hits the training data exactly every single time.

The problem with this is that if we were to put a new point on this graph, the model could have a very large amount of error. A visual example of overfitting can be seen below:

A graph showing House Price vs. House Size, with an overfitted predictor.

What you want to achieve is a solution whereby, if you were to introduce new data, the predictor would perform well and would cause minimal error. Therefore a function like the one below, which is simple and nearly hits all of the training data, is ideal, as it is a representative predictor.

A representative predictor for House Price vs. House Size
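To make the three situations concrete, here is a small sketch on synthetic house data, comparing an underfitting model, an overfitting model and a “just right” one by their error on the training data vs. on new data. The data and the choice of models are illustrative assumptions.

```python
# Under- and overfitting on synthetic house data: a depth-1 "stump" underfits,
# an unrestricted decision tree memorises the training data (overfits), and a
# straight line (linear regression) captures the real, roughly linear trend.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
size = rng.uniform(50, 250, (200, 1))                  # square metres
price = 3000 * size[:, 0] + rng.normal(0, 25000, 200)  # linear trend plus noise

train_X, train_y = size[:150], price[:150]
new_X, new_y = size[150:], price[150:]                 # held-out "new" houses

models = {
    "underfit (stump)": DecisionTreeRegressor(max_depth=1),
    "overfit (deep tree)": DecisionTreeRegressor(),
    "just right (straight line)": LinearRegression(),
}
for name, model in models.items():
    model.fit(train_X, train_y)
    train_err = mean_absolute_error(train_y, model.predict(train_X))
    new_err = mean_absolute_error(new_y, model.predict(new_X))
    print(f"{name}: training error {train_err:.0f}, error on new data {new_err:.0f}")
# The deep tree scores ~0 on the data it has already seen, but does worse than the
# straight line on new houses; the stump does poorly on both.
```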

Step 6: Evaluate your Model

Once you have trained your model and made sure that it is neither under- nor overfitting the data, you will want to know how well it performs.

The evaluation step allows you to understand the predictive accuracy of your model. This is where you have to decide whether it is good enough for you to productionise, or whether you want to take another stab. You can do this by changing the model (Step 2), changing the input features (Step 4) or providing more data (Step 3). Alternatively, you may just want to quit, because you have exhausted your options or the effort required has surpassed any impact it will have.

There are many metrics and visualisations that a data scientist will use to communicate the effectiveness of the model. The choice of metric is dependent on the task and the ML model that you’re evaluating.

This step is called offline testing. It uses the data in your test set to evaluate how your model would perform in the real world, without ever showing it to a user.

Online testing is when you use the output of your model with real users, i.e. running an A/B experiment comparing results generated by the ML based product vs. results picked by hand or by some business logic.

Offline testing may not correlate exactly with online testing, as humans are irrational and spontaneous: we don’t always follow a set of instructions, we get distracted, we like new and interesting things, but we can get bored easily. Offline testing gives you an indication of how something should perform. However, a 15x increase in a metric offline won’t necessarily mean you will get a 15x increase when you test it online (it could be worse or better).

In terms of metrics, I would highly recommend you look at the Wikipedia article on confusion matrices.

Confusion Matrix taken from Wikipedia.org

The confusion matrix offers a variety of metrics to judge your model against. It is particularly useful for classification tasks, as it allows you to compare what your machine learning model should have predicted vs. what it actually predicted.

You might also want to learn about precision (Positive Predictive Value) and recall (True Positive Rate / Sensitivity) in more detail (see the Wikipedia article).

For binary classification tasks (i.e. does the image have a dog in it, yes or no), ROC (Receiver Operating Characteristic) curves and AUC (Area Under the Curve) may pop up. They are a way of visualising results, rather than just reporting numbers.
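As a sketch of how these metrics might be computed with scikit-learn; the labels, predictions and scores below are made up for illustration.

```python
# Computing a confusion matrix, precision, recall and ROC AUC for a toy binary
# task; the true labels, hard predictions and scores below are made up.
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, roc_auc_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = spam, 0 = not spam
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # the model's hard predictions
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3]  # predicted probabilities

print(confusion_matrix(y_true, y_pred))  # rows are actual, columns are predicted
print(precision_score(y_true, y_pred))   # 0.75 — 3 of the 4 'spam' calls were right
print(recall_score(y_true, y_pred))      # 0.75 — 3 of the 4 actual spam were caught
print(roc_auc_score(y_true, y_score))    # the closer to 1.0, the better the ranking
```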

If you have decided that the model performs well enough, you now need to productionise it. We will cover this in the next article. I will finish this article with one more section to summarise the steps, and the possible paths a data scientist may take.

Summary

Below you can see a simple flow that a data scientist may take. This flow captures the six steps mentioned throughout this article; as you can see, there are many paths which lead to not producing a model, for a number of reasons.

A simple workflow for a data scientist to follow.

If you have any thoughts or views on this flow, or the ideas in this post please feel free to leave a comment.

In Part III I will describe some of the steps an engineer or data scientist may take, once you have created a model, to get the product in front of a user.

