#2: What You Need to Know About Machine Learning Algorithms and Why You Should Care
This is part 2 of the 6-part tutorial, The Step-By-Step PM Guide to Building Machine Learning Based Products.
We previously discussed the type of business impact ML can have. Now, let’s review all the technical terms you need to know to effectively work with a data science team and help them generate the greatest impact for your business (or at least sound like you know what they’re talking about).
Algorithms, Models and Data
At a conceptual level, we’re building a machine that, given a certain set of inputs, will produce a desired output by finding patterns in data and learning from them.
A very common case is for a machine to start by looking at a given set of inputs and a set of outputs that correspond to those inputs. It identifies patterns between them and creates a set of complex rules that it can then apply to new inputs it hasn’t seen before and produce the desired output. For example, given the square footage, address and number of rooms (the input) we’re looking to predict a home’s sale price (the output). Let’s say we have data on the square footage, address and number of rooms of 10,000 houses, as well as their sales price. The machine will “train” itself on the data — i.e. identify patterns that determine how square footage, address and number of rooms impact a home’s price, so that if we give it those 3 inputs for a house it hasn’t seen before, it can predict that house’s price.
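To make “training” concrete, here’s a toy sketch in Python: fitting a straight line from square footage to price with ordinary least squares. The data and numbers are invented for illustration; a real model would use many features and far more examples.

```python
# A toy version of "training": fit a straight line that maps square
# footage to price using ordinary least squares. The data is invented
# for illustration; real models use many features and far more rows.
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope and intercept that minimize squared prediction error.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
            / sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x
    return slope, intercept

# "Training set": square footage -> sale price (made-up numbers).
sqft   = [1000, 1500, 2000, 2500]
prices = [200_000, 300_000, 400_000, 500_000]

slope, intercept = fit_line(sqft, prices)

# Predict the price of a house the model has never seen.
predicted = slope * 1800 + intercept
print(round(predicted))  # 360000 for this perfectly linear toy data
```

The “pattern” the machine finds here is just a slope and an intercept; real algorithms find far more complex rules, but the idea is the same.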
The data scientist’s role is to find the optimal machine to use given the inputs and the expected output. She has multiple templates — called algorithms — for machines. The machines she produces from those templates to solve a specific problem are called models. Templates have different options and settings that she can tweak to produce different models from the same template. She can use different templates and/or tweak the settings for the same template to generate many models that she can test to see which gives the best results.
Note that a model’s output is correct or useful for decision making only with some degree of probability. Models are not 100% correct; they make “best guesses” given the amount of data they have seen. The more data a model has seen, the more likely it is to give useful output.
The set of known inputs and outputs the data scientist uses to “train” the machine — i.e. let the model identify patterns in the data and create rules — is the “training set”. This data is used with one or more “templates” to create one or more models that the data scientist thinks could work to solve the problem. Remember that even if she used only one “template” (algorithm), she can tweak its options and settings to generate multiple models from the same template, so she likely ends up with several models.
After she has a few of these “trained” models, she has to check how well they work and which one works best. She does that using a fresh set of data called the “validation set”. She runs the models on the validation set inputs to see which one gives results that are closest to the validation set outputs. In our example — which model will predict a home price that is closest to the actual price the home was sold for. She needs a fresh set of data at this stage because the models were created based on their performance with the training set, so they are biased to work well on that set and won’t give a true read.
Once she has validated which model performs best and picked the winner, our data scientist needs to determine the actual performance of that model, i.e. how good the best model she could produce really is at solving the problem. Again, she needs a fresh data set because the model clearly performs well on the training and validation sets — that’s how it was picked! This final data set is called the “test set”. In our example she’ll check how close the home prices predicted for the test set inputs are to the actual prices in the test set. We will discuss measuring performance in more detail later.
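The train / validate / test workflow can be sketched like this. The data and candidate “models” are invented: the candidates are just price-per-square-foot guesses standing in for models produced from different templates and settings.

```python
import random

# Sketch of the train / validation / test workflow: split the data,
# pick a winner on the validation set, report its score on the test set.
random.seed(0)
data = [(sqft, 200 * sqft + random.uniform(-20_000, 20_000))
        for sqft in range(800, 3000, 25)]
random.shuffle(data)

n = len(data)
train      = data[: int(0.6 * n)]               # used to build the models
validation = data[int(0.6 * n): int(0.8 * n)]   # used to pick a winner
test       = data[int(0.8 * n):]                # used once, to report performance

# Three hypothetical "models": flat price-per-square-foot guesses.
candidates = {"model_a": 180, "model_b": 200, "model_c": 220}

def avg_error(rate, rows):
    return sum(abs(rate * s - p) for s, p in rows) / len(rows)

# Pick the candidate with the lowest error on the validation set...
best = min(candidates, key=lambda m: avg_error(candidates[m], validation))
# ...then report its performance on the untouched test set.
print(best, round(avg_error(candidates[best], test)))
```

The key point the sketch shows: the winner is chosen on the validation set, and its reported performance comes from data it never influenced.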
Types of “Learning”
The type of algorithm you can apply to solve a machine learning problem very much depends on the data you have. A key classification of learning algorithms is based on the data required to build models that use them: Whether the data needs to include both inputs and outputs or just inputs, how many data points are required and when the data is collected. It includes 4 main categories: Supervised learning, unsupervised learning, semi-supervised learning and reinforcement learning.
The case we discussed in detail in the previous section is what we call “supervised learning”. This is a type of learning where an algorithm needs to see a lot of labeled data examples — data comprising both inputs and the corresponding outputs — in order to work. The “labeled” part refers to tagging the inputs with the outcome the model is trying to predict, in our example home prices.
Supervised learning algorithms see the labeled data (aka “ground truth” data), learn from it and make predictions based on those examples. They require a lot of labeled data upfront: While the number depends on the use case, hundreds of data points are the bare minimum to get anything remotely useful.
Two classic problems solved through supervised learning are:
- Regression. Inferring the value of an unknown variable based on other pieces of data that it stands to reason would have an effect on that variable. Two common uses are point in time predictions — e.g. our previous example of predicting the value of a home based on variables such as location and square footage, and forecasting future values — e.g. forecasting home values a year from now based on historical and current home value data. Regression is a statistical method that determines the relationship between the independent variables (the data you already have) and the dependent variable (the value you’re looking to predict).
- Classification. Identifying which category an entity belongs to out of a given set of categories. This could be a binary classification — e.g. determining whether a post will go viral (yes / no) — or a multi-class categorization — e.g. labeling product photos with the appropriate category the product belongs to (out of possibly hundreds of categories).
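As a toy illustration of binary classification, here’s a sketch that learns a single threshold on one invented feature (“shares in the first hour”) to predict whether a post goes viral. Real classifiers use many features and far more data.

```python
# A minimal binary classifier: learn a single threshold on one feature
# ("shares in the first hour") that best separates viral from non-viral
# posts in the labeled training data. The data below is invented.
train = [(5, 0), (12, 0), (20, 0), (35, 1), (50, 1), (80, 1)]  # (shares, viral?)

def best_threshold(rows):
    # Try each observed value as a cutoff; keep the one with fewest mistakes.
    candidates = sorted(x for x, _ in rows)
    def mistakes(t):
        return sum((x >= t) != bool(y) for x, y in rows)
    return min(candidates, key=mistakes)

t = best_threshold(train)
predict = lambda shares: int(shares >= t)
print(t, predict(40), predict(10))  # threshold 35; 40 shares -> viral, 10 -> not
```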
In unsupervised learning the algorithm tries to identify patterns in the data without the need to tag the data set with the desired outcome. The data is “unlabeled” — it just “is”, without any meaningful label attached to it. A few classic problems solved through unsupervised learning methods are:
- Clustering. Given a certain similarity criteria, find which items are more similar to one another. One area where clustering is used is text — consider search results that return many documents that are very similar. Clustering can be used to group them together and make it easier for the user to identify the most distinct documents.
- Association. Categorize objects into buckets based on some relationship, so that the presence of one object in a bucket predicts the presence of another. For example, the “people who bought… also bought…” recommendation problem: If analysing a large number of shopping carts reveals that the presence of product X in a shopping cart is likely to indicate that product Y will also be in the shopping cart, you can immediately recommend product Y to anyone who put product X in their cart.
- Anomaly detection. Identifying unexpected patterns in data that need to be flagged and handled. Standard applications are fraud detection and health monitoring for complex systems. (Note: There are supervised anomaly detection techniques, but the use of unsupervised techniques is common since by definition it is quite difficult to obtain labeled data for anomalies, and that is a prerequisite for using supervised techniques.)
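The association bullet above lends itself to a small sketch: count which product pairs co-occur in shopping carts, then recommend the most frequent partner of a given product. The carts and products are made up for illustration.

```python
from collections import Counter
from itertools import combinations

# Sketch of the "people who bought X also bought Y" idea: count how often
# product pairs appear in the same cart, then recommend the most frequent
# partner of a given product. Cart data is invented for illustration.
carts = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "cereal"},
]

pair_counts = Counter()
for cart in carts:
    for a, b in combinations(sorted(cart), 2):
        pair_counts[(a, b)] += 1

def recommend(product):
    # Most frequent co-purchase partner of `product`.
    partners = Counter()
    for (a, b), n in pair_counts.items():
        if a == product:
            partners[b] += n
        elif b == product:
            partners[a] += n
    return partners.most_common(1)[0][0]

print(recommend("bread"))  # "butter": it co-occurs with bread most often
```

Note that no labels were needed: the structure (which pairs co-occur) was found in the raw data itself, which is what makes this unsupervised.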
Semi-Supervised Learning
This is a hybrid between supervised and unsupervised learning, where the algorithm requires some training data, but a lot less than in the case of supervised learning (possibly an order of magnitude less). The algorithms could be extensions of methods used in either supervised or unsupervised learning — classification, regression, clustering, anomaly detection etc.
Reinforcement Learning
Here the algorithm starts with a limited set of data and learns as it gets more feedback about its predictions over time.
As you can see, in addition to the type of problem you’re trying to solve, the amount of data you have will impact the types of learning methods you can use. This also applies the other way — the learning method you need to use may require you to get more data than you have in order to effectively solve your problem. We’ll discuss that more later.
Other Common “Buzzwords” Worth Knowing
There are a few other terms you’ll often encounter as you work more in the space. It’s important to understand their relationship (or lack thereof) to the categories we discussed.
Deep learning is orthogonal to the above definitions. It is simply the application of a specific type of system to solve learning problems — the solution could be supervised, unsupervised etc.
An Artificial Neural Network (ANN) is a learning system which tries to simulate the way our brain works — through a network of “neurons” that are organized in layers. A neural network has at a minimum an input layer — the set of neurons through which data is ingested into the network, an output layer — the neurons through which results are communicated out, and one or more layers in between, called “hidden layers”, which are the layers that do the computational work. Deep learning is simply the use of neural networks with more than one hidden layer to accomplish a learning task. If you ever use such networks — congratulations, you can legitimately throw around the buzzword too!
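Here is a minimal forward pass through such a network: two input neurons, one hidden layer of two neurons, one output neuron. The weights are hand-picked for illustration; the actual learning step (adjusting the weights from data) is omitted.

```python
import math

# A tiny neural network forward pass: one input layer (2 neurons), one
# hidden layer (2 neurons), one output neuron. Weights are hand-picked
# for illustration; "learning" would adjust them, which is omitted here.
def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def forward(inputs, hidden_weights, output_weights):
    # Each hidden neuron: weighted sum of the inputs, passed through sigmoid.
    hidden = [sigmoid(sum(w * x for w, x in zip(ws, inputs)))
              for ws in hidden_weights]
    # Output neuron: weighted sum of the hidden activations.
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)))

out = forward([1.0, 0.5],
              hidden_weights=[[0.4, -0.6], [0.3, 0.8]],
              output_weights=[1.0, -1.0])
print(round(out, 3))  # a number between 0 and 1, e.g. a probability
```

With only one hidden layer this is a “shallow” network; stack two or more hidden layers and you’re doing deep learning.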
Ensemble methods or ensemble learning is the use of multiple models to get a result that is better than what each model could achieve individually. The models could be based on different algorithms or on the same algorithm with different parameters. The idea is that instead of having one model that takes input and generates output — say a prediction of some kind, you have a set of models that each generate a prediction, and some process to weigh the different results and decide what the output of the combined group should be. Ensemble methods are frequently used in supervised learning (they’re very useful in prediction problems) but can also apply in unsupervised learning. Your data science team will likely test such methods and apply them when appropriate.
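A minimal equal-weight ensemble might look like this; the three “models” below are hypothetical stand-ins for trained models:

```python
# Sketch of a simple ensemble: three hypothetical models each predict a
# home price, and the ensemble averages them. Real ensembles (bagging,
# boosting, stacking) weigh and combine models in more sophisticated ways.
def model_a(sqft):  # stand-in for a trained model
    return 195 * sqft
def model_b(sqft):
    return 205 * sqft + 2_000
def model_c(sqft):
    return 200 * sqft - 1_000

def ensemble(sqft):
    preds = [m(sqft) for m in (model_a, model_b, model_c)]
    return sum(preds) / len(preds)  # equal-weight average of the predictions

print(ensemble(1800))
```

The averaging step is the simplest possible combination rule; in practice the weighting process is itself often learned from data.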
Natural language processing (NLP) is the field of computer science dealing with understanding language by machines. Not all types of NLP use machine learning. For example, if we generate a “tag cloud” — a visual representation of the number of times a word appears in a text — there is no learning involved. More sophisticated analysis and understanding of language and text often requires ML. Some examples:
- Keyword generation. Understanding the topic of a body of text and automatically creating keywords for it
- Language disambiguation. Determining the relevant meaning out of multiple possible interpretations of a word or a sentence
- Sentiment analysis. Understanding where on the scale of negative to positive the sentiment expressed in a text lies
- Named entity extraction. Identifying companies, people, places, brands etc. in a text; this is particularly difficult when the names are not distinctive (e.g. the company “Microsoft” is easier to identify than the company “Target”, which is also a word in the English language)
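The tag-cloud case mentioned above really is just counting, with no learning involved; a sketch:

```python
import re
from collections import Counter

# A "tag cloud" needs no machine learning: it is simply a count of how
# often each word appears in the text.
text = "the cat sat on the mat and the cat slept"
words = re.findall(r"[a-z']+", text.lower())
counts = Counter(words)
print(counts.most_common(2))  # [('the', 3), ('cat', 2)]
```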
NLP is not only used for language-oriented applications of ML such as chatbots. It is also used extensively to prepare and pre-process data before it can be a useful input into many ML models. More on that later.
Please note: The definitions above are meant to convey the main ideas and be practical; for a detailed scientific definition please refer to other sources.
How the Problem Affects the Solution (And Some More Key ML Concepts)
The strategic goal you’re trying to achieve with ML will dictate many downstream decisions. It’s important to understand some basic ML concepts and their impact on your business goals in order to make sure your data science team can produce the right solution for your business.
A small change in the problem definition could mean a completely different algorithm is required to solve it, or at a minimum a different model will be built with different data inputs. A dating site looking to identify types of photos that work well for users may use unsupervised learning techniques like clustering to identify common themes that work, whereas if the problem is to recommend potential dates to a specific person the site may use supervised learning based on inputs specific to the individual user, such as photos they’ve already looked at.
ML models identify patterns in data. The data you feed into the models is organized into features (also called variables or attributes): These are relevant, largely independent pieces of data that describe some aspect of the phenomenon you’re trying to predict or identify.
Take the previous example of a company looking to prioritize outreach to loan applicants. If we define the problem as “prioritize customers based on their likelihood to convert”, we will include features such as the response rate of similar customers to the company’s various types of outreach. If we define the problem as “prioritize the customers most likely to repay their loans”, we may not include those features because they are irrelevant to evaluating the customer’s likelihood to pay.
Objective Function Selection
The objective function is the goal you’re optimizing for or the outcome the model is trying to predict. For example, if you’re trying to suggest products a user may be interested in, the output of a model could be the probability that a user will click on the product if they saw it. It may also be the probability that the user will buy the product. The choice of objective function depends primarily on your business goal — in this example, are you more interested in user engagement, in which case your objective function may be clicks or dwell time, or in direct revenue, in which case your objective function will be purchases? The other key consideration is data availability: For the algorithm to learn, you’ll have to feed it many data points that are “labeled” as positive (the products a user saw and clicked on) or negative (the products a user saw and didn’t click on). You’re likely to have an order of magnitude more data points of products that were clicked (or not clicked) on vs. products that were purchased.
Explainability and Interpretability
The output of ML models is often a number — a probability, a prediction of the likelihood something will happen or is true. In the product recommendations example, products on the site can be assigned a probability that an individual user will click on them, and the products with the highest probability will be shown to the user. But how do you know it works? In this case it’s relatively easy to verify that the algorithm works — you can probably run a short test and see. But what if the entities you’re ranking are potential employees and your model tests the likelihood of them to be good candidates for a company? Will a user (say, a hiring manager) just take your word for it, or will they have to understand why the algorithm ranked person A before person B?
In many cases you’ll have some explaining to do. However, many ML algorithms are a black box: You feed in many features and get a model that is difficult or impossible to explain. The patterns the machine finds in the data are often so convoluted that a human could not grasp them even if they could be put into words.
In subsequent sections we’ll see that the need for explainability — to what degree the end user needs to be able to understand how the result was achieved, and interpretability — to what degree the user needs to draw certain conclusions based on the results, is a critical consideration in your approach to modeling, selecting features and presenting results.
Modeling and Performance Measurement Pitfalls PMs Should Watch Out For
Your data scientists will deal with some common issues with data processing and modeling, but in order to have productive conversations with them it is useful for PMs to understand a few common pitfalls. It’s not an exhaustive list, but includes some of the more common issues that come up.
Overfitting
A model is said to be “overfitted” when it follows the data so closely that it ends up describing too much of the noise rather than the true underlying relationship within the data. Broadly speaking, if the accuracy of the model on the data you train it with (the data the model “learns from”) is significantly better than its accuracy on the data with which you validate and test it, you may have a case of overfitting.
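A sketch of how this shows up in practice: a “model” that simply memorizes its training data scores perfectly on it, but worse than a sensible model on fresh validation data. All data below is synthetic.

```python
import random

# Sketch of spotting overfitting: a model that memorizes the training
# data scores perfectly on it but poorly on fresh validation data.
random.seed(1)
train = [(x, 2 * x + random.uniform(-5, 5)) for x in range(200)]
valid = [(x, 2 * x + random.uniform(-5, 5)) for x in range(200)]

memorized = dict(train)  # "overfit" model: a lookup table of training rows
def overfit_predict(x):
    return memorized[x]

def simple_predict(x):   # a sensible model that captures the real pattern
    return 2 * x

def avg_error(predict, rows):
    return sum(abs(predict(x) - y) for x, y in rows) / len(rows)

print(avg_error(overfit_predict, train))  # 0.0: perfect on the training data
# ...but on fresh data, the memorizer does worse than the simple model:
print(avg_error(overfit_predict, valid) > avg_error(simple_predict, valid))
```

The large gap between training error (zero) and validation error is exactly the warning sign described above.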
Precision, Recall and the Tradeoff Between Them
There are two terms that are very confusing the first time you hear them, but are important to fully understand since they have clear business implications.
The accuracy of classification (and other commonly used ML techniques such as document retrieval), is often measured by two key metrics: Precision and recall. Precision measures the share of true positive predictions out of all the positive predictions the algorithm generated, i.e. the % of positive predictions that are correct. If the precision is X%, X% of the algorithm’s positive predictions are true positives and (100-X)% are false positives. In other words, the higher the precision the fewer false positives you’ll have.
Recall is the share of positive predictions out of all the true positives in the data — i.e. what % of the true positives in the data your algorithm managed to identify as positives. If the recall is X%, X% of the true positives in the data were identified by the algorithm as positives, while (100-X)% were identified as (false) negatives. In other words, the higher the recall the fewer false negatives you’ll have.
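Both metrics fall out of simple counts of true positives, false positives and false negatives. A sketch on made-up labels:

```python
# Precision and recall from raw predictions, as defined above. Labels and
# predictions are made up: 1 = positive, 0 = negative.
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))  # true positives
fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))  # false positives
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))  # false negatives

precision = tp / (tp + fp)  # share of positive predictions that are correct
recall    = tp / (tp + fn)  # share of actual positives the model found
print(precision, recall)  # 0.75 0.75
```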
There is always a tradeoff between precision and recall. If you don’t want any false positives — i.e. you need higher precision, the algorithm will have more false negatives, i.e. lower recall, because it would “prefer” to label something as a negative than to wrongly label it as a positive, and vice versa. This tradeoff is a business decision. Take the loan application example: Would you rather play it safe and only accept applicants you’re very sure deserve to be accepted, thereby increasing the chances of rejecting some good customers (higher precision, lower recall = fewer false positives, more false negatives), or accept more loan applicants that should be rejected but not risk missing out on good customers (higher recall but lower precision = fewer false negatives, more false positives)? While you can simplistically say this is an optimization problem, there are often factors to consider that are not easily quantifiable such as customer sentiment (e.g. unjustly rejected customers will be angry and vocal), brand risk (e.g. your reputation as an underwriter depends on a low loan default rate), legal obligations etc., making this very much a business, not a data science, decision.
The Often Misleading Model Accuracy Metric
Model accuracy alone is not a good measure for any model. Imagine a disease with an incidence rate of 0.1% in the population. A model that says no patient has the disease regardless of the input is 99.9% accurate, but completely useless. It’s important to always consider both precision and recall and balance them according to business needs. Accuracy is a good metric when the distribution of possible outcomes is quite uniform and the importance of false positives and false negatives is also about equal, which is rarely the case.
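The disease example in code:

```python
# The accuracy trap: a "model" that always predicts "no disease" is
# 99.9% accurate on a population with a 0.1% incidence rate, yet it
# never finds a single sick patient.
actual    = [1] * 1 + [0] * 999   # 1 sick patient in 1,000
predicted = [0] * 1000            # model: nobody is sick

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
sick_found = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))

print(accuracy)    # 0.999
print(sick_found)  # 0: recall on the sick patients is zero
```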
Averaging Metrics and Imbalanced Training Data
When you’re dealing with multiple segments a model has to address, you need to look at performance metrics for every segment (or at a minimum the important ones) separately. Take for example a classification model that classifies photos into one of a set of categories by the type of animal in the photo. The overall precision / recall numbers of the model may not reflect a situation where some categories have great precision, while others have very low precision. This usually happens when you have an imbalance in your training data — say you have 1,000 labeled photos of cats and dogs and only 10 photos of bears. Your overall precision may be very high since most of the cats and dogs photos will be classified correctly, while all the bears are misidentified because the model has little to no data that is associated with them. If those less frequent categories are important to your business, you may need a concerted effort to get training data for them to make your model work well across the board.
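A sketch of the animal-photo example with invented data: overall accuracy looks strong while recall on the rare “bear” class is zero.

```python
# Per-class metrics for the animal-photo example: overall numbers look
# fine while the rare "bear" class fails completely. Data is invented.
rows = [("cat", "cat")] * 95 + [("dog", "dog")] * 95 \
     + [("bear", "dog")] * 10   # (actual, predicted): every bear is missed

overall = sum(a == p for a, p in rows) / len(rows)

def recall_for(label):
    # Share of photos of `label` that the model identified correctly.
    relevant = [(a, p) for a, p in rows if a == label]
    return sum(a == p for a, p in relevant) / len(relevant)

print(round(overall, 2))   # 0.95: looks like a strong model overall
print(recall_for("bear"))  # 0.0: useless on the rare class
```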
Ok — that was pretty long, but hopefully you now have a good understanding of all the technical basics. Next, we’ll go through the detailed, step-by-step process of developing a model from ideation to launch in production.
If you found this post interesting, would you please click on the green heart below to let me know, or share with someone else who may find it useful? That would totally make my day!