An Honest Guide to Machine Learning: Part Two
The Core of Machine Learning
The Honest Guide to Machine Learning provides a deep dive into machine learning technology — no math necessary.
In Part 2 of our guide, we expand on the lessons from Part 1 so that you have everything you need to know about Machine Learning.
In Part 3, we’ll talk about Natural Language Processing (NLP).
Let’s start with the basic organizational structure of machine learning (ML). ML is organized into buckets called tasks. Each task has three parts: an input, a model, and a desired output. Each task also has two phases: training and using (decoding). Remember, it’s called machine learning for a reason: before a model can be used to actually solve a problem, it has to be trained. That’s referred to as training the model. Once the model is trained, it’s ready to use for decoding.
It All Starts With Input
When you begin to train a model, you start with input (a set of data points). Your input is related to the problem you want to solve; for every input data point, you have a list of features. Features are informative measures about each data point inside the input. For example, the problem you want to solve might be identifying whether an animal is a cat or a dog. In that case, pictures of cats and dogs would be your input. For a cat, you would define a list of features that might include whiskers, triangular ears, and a small pink nose. There are two kinds of features: binary features (Is the animal grey, yes or no?) and categorical features (What is the colour of the animal: grey, tan, brown, or black?). Some features don’t fit either kind at first: for instance, features that are numeric values, such as the weight of the animal. Values can be broken into groups, which then become categorical (What is the weight of the animal: 0–2 pounds, 2–4 pounds, or 4–6 pounds?); and categorical features can be further broken down into binary features (Is the weight more than 4 pounds, yes or no?). The more features you have, the harder it is to train the model, and the more data you need.
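To make the idea of breaking a value down into categorical and then binary features a little more concrete, here is a minimal Python sketch. The bucket boundaries follow the weight ranges above; the function names, the catch-all "6+ lb" group, and the choice to use one yes/no feature per bucket are assumptions made purely for illustration.

```python
# Hypothetical weight buckets in pounds, mirroring the ranges in the text.
BUCKETS = [(0, 2), (2, 4), (4, 6)]
ALL_CATEGORIES = ["0-2 lb", "2-4 lb", "4-6 lb", "6+ lb"]


def weight_to_category(weight_lb):
    """Turn a numeric weight into a categorical feature."""
    if weight_lb < 0:
        raise ValueError("weight cannot be negative")
    for low, high in BUCKETS:
        if low <= weight_lb < high:
            return f"{low}-{high} lb"
    return "6+ lb"  # assumption: everything heavier shares one group


def category_to_binary(category):
    """Turn one categorical feature into several yes/no (1/0) features."""
    return [1 if category == c else 0 for c in ALL_CATEGORIES]


def encode_animal(is_grey, weight_lb):
    """Build the feature list: one binary colour feature plus the weight buckets."""
    return [1 if is_grey else 0] + category_to_binary(weight_to_category(weight_lb))


features = encode_animal(is_grey=True, weight_lb=3.5)
```

Every extra bucket adds another feature, which is one way to see why more features mean more data is needed to train the model.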
From the Beginning to the End
Machine learning tasks are often classified based on output type. We’ve broken it down into six of the most common output types, plus a final bucket to capture all of the myriad smaller potential outputs. One of the interesting challenges in machine learning is that there is no “accepted terminology.” That can make it very difficult to, say, find a paper where someone is working on the same avenue as you are, because other projects will be using different terms.
1. Classification. This is the most common type of machine learning output. In this instance, your output is one of a set of predefined labels. The previous example of cat vs dog identification is a perfect example: if you have access to a large number of labelled photos of each, you would probably aim for classification output. Two classes (cat vs dog) is binary classification; with three or more classes it is called multiclass classification. Example Project: Gmail uses classification to detect whether or not an email is spam. The input is a list of keywords or phrases plus some info from the header of your email, and the output is spam or not spam (ham).
2. Clustering. Let’s say you have a bunch of photos of animals, but they aren’t labelled, so you don’t know what kinds of animals they are. You want to categorize them, but you can’t use classification because you don’t have labels for your input. In this case, you would use clustering to find the similarities and group those images. Sometimes, you might not even know how many possible outputs there might be: perhaps you have photos from a trip to the zoo, but you don’t know how many different animals the zoo keeps. Hierarchical clustering is one approach that handles this, because it doesn’t require you to fix the number of groups in advance. Example Project: Medical machine learning uses clustering to identify mutations. The input is gene sequences, and the output is types of mutations.
3. Regression. This works similarly to classification, but instead of categories (labels), you have numbers. Based on input, you have to predict one number at the end. A famous example of regression is trying to predict how much a house will sell for. The input is features for previously sold houses, such as number of rooms and square footage; your output is a single number, the price the house will sell for. Example Project: This is used today to predict stock prices. The input is recent news and tweets about a company, and the output is a prediction of what the stock will be worth.
4. Dimensionality reduction. Having enough data is always a tricky part of machine learning. If you don’t have enough data for the number of features you have, you have to reduce your features somehow. Dimensionality reduction lets you map the features you have into smaller groups, or select only the features there is enough data for. Sometimes the model will try to find a better combination of previous features; for example, it’ll combine three features into one. The best-known technique for this is PCA (principal component analysis). Example Project: You have thousands of sensors to detect minerals in the soil; that’s too many features to work with, so you narrow it down to only a few sensors.
5. Anomaly detection. If you have the input, and you want to understand which part of the input is not in harmony with the others, you use anomaly detection. The output in this case is always an isolated part of the input. Example Project: Credit card companies use this form of machine learning to detect strange purchases. For instance, if a client makes six purchases in a row from the same gas station, machine learning flags that as an anomaly in normal purchase patterns, where generally only one purchase is made per store.
6. Association Rule Learning/Detecting. For this output type, you have a list of items and you want to find out how they are connected to each other. Your output is the set of associations between those items. Example Project: Walmart and other grocery chains use this to analyze shopping carts. Thanks to machine learning they can tell that most people who buy tortilla chips also buy dip, so they know to put those items close together on the shelves.
7. There are an almost unlimited number of potential outputs. While those six are the ones most commonly used, we didn’t want to imply that there aren’t others out there that get the job done too! For instance, collaborative filtering and density estimation are both fairly popular. You could fall into a Wikipedia hole trying to learn about them all.
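To see how the output type changes the shape of the problem, here is a rough Python sketch of three of the types from the list above: classification (output is a label), regression (output is a number), and anomaly detection (output is the part of the input that doesn’t fit). The toy data, the function names, and the specific techniques (nearest average, a least-squares line, and a standard-deviation cutoff) are assumptions chosen to keep the example short, not the methods the example projects actually use.

```python
from statistics import mean, pstdev


# Classification: the output is one of a fixed set of labels.
def train_centroids(examples):
    """examples is a list of (value, label); returns the average value per label."""
    by_label = {}
    for value, label in examples:
        by_label.setdefault(label, []).append(value)
    return {label: mean(values) for label, values in by_label.items()}


def classify(centroids, value):
    """Pick the label whose average is closest to the new value."""
    return min(centroids, key=lambda label: abs(centroids[label] - value))


# Regression: the output is a single number.
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    return slope, y_bar - slope * x_bar


# Anomaly detection: the output is the part of the input that stands out.
def find_anomalies(values, threshold=2.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Made-up data: animal weights in pounds, house sizes and prices, purchases per store.
centroids = train_centroids([(8, "cat"), (9, "cat"), (10, "cat"),
                             (40, "dog"), (50, "dog"), (60, "dog")])
slope, intercept = fit_line([1000, 1500, 2000], [200000, 300000, 400000])
purchases = [1, 1, 1, 1, 1, 1, 1, 1, 1, 6]

label = classify(centroids, 12)                # a label
predicted_price = slope * 1800 + intercept     # a number
odd_purchases = find_anomalies(purchases)      # a slice of the input
```

The inputs look similar in all three cases (lists of numbers), but what comes out the other end is a label, a price, and a flagged purchase count respectively.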
The Meat of the Matter — the Model
In contrast to the epic number of potential outputs, there are only two large categories of model: generative and discriminative.
Generative models, as per their name, generate output. Their goal is to model the world. For example, if you feed a generative model images of cats and dogs and teach it to distinguish between them, it will eventually learn to generate its own picture of a cat or a dog.
Discriminative models don’t try to generate, but rather to discriminate between two (or more) things. They aren’t as powerful, but they’re much easier to train. A discriminative model wouldn’t be able to create its own dog, but it can learn to tell the difference between dogs and cats. If you don’t have much data, discriminative models are the ones to choose.
Based on the input and output, you end up with a mathematical formula you want to optimize. The model is what gets you from input to output. Say your input is pixels: you want to turn that input into an output which is either cat or dog. This is where the math comes in, but don’t worry, the model handles the math for you. Each model has a formulation, which you can imagine as having knobs and control factors. Training a model is just setting and tuning those factors. Imagine you have a microphone (input) and a speaker (output). Your model is the amp: you know it will amplify your voice, but you have to tune the bass and treble to get the sound you want. The “bass” and “treble” are called parameters. When you train a model you optimize those parameters, which is what we’re talking about when we say machine learning is looking for the optimum. And just like you don’t need to know how an amp works to use it, you often don’t need to know exactly how the formulation works. You just need to know how to tune those knobs and controls.
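If you’re curious what “tuning the knobs” can look like in practice, here is a small Python sketch. It assumes the simplest possible formulation, a straight line with two parameters (called weight and bias here), and tunes them with gradient descent, which is one common way of optimizing but by no means the only one. The data, names, learning rate, and step count are all made up for illustration.

```python
def predict(x, weight, bias):
    """The formulation: a straight line with two knobs, weight and bias."""
    return weight * x + bias


def train(xs, ys, learning_rate=0.05, steps=2000):
    """Nudge both knobs a little at a time to shrink the average squared error."""
    weight, bias = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [predict(x, weight, bias) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to each knob.
        grad_weight = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_bias = 2 * sum(errors) / n
        weight -= learning_rate * grad_weight
        bias -= learning_rate * grad_bias
    return weight, bias


# Toy data generated from y = 2x + 1, so the "right" knob settings are known.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
weight, bias = train(xs, ys)
```

After training, weight ends up very close to 2 and bias very close to 1, even though the loop was never told those values directly. It only ever saw inputs and desired outputs.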
So, what should this formulation be? You only have two model types, but there are many formulations you can use within that binary. Some popular formulations include Support Vector Machines (SVM), Decision Trees, Conditional Random Fields, Neural Networks (this is where we get deep learning, which you’ve no doubt heard of; DL is neural networks rebranded), and Log Linear Models, but these are only some of many. A machine learning engineer will take a problem, categorize it based on its features, and select the right model, whose parameters then get optimized. They will then define a success metric. That ensures the model isn’t simply memorizing the data, and that it’s generalized enough to predict a future outcome without being surprised.
But It All Starts Somewhere
Now that you understand the different kinds of output, and you understand the models through which those outputs are generated, there’s one last thing to go over: input. There are several different kinds of input, which can affect both your model and your output.
If you have input data, and you have labels for that data, the process by which you train the algorithm is called supervised learning. Humans had to put the labels on that data, which is a time-consuming process, and then feed it to the model as the training set. Classification is a good example of an output which can be supervised, though it can also be semi-supervised (see below).
If you don’t have any labels for your data, you can’t feed the algorithm in the same way. This is called unsupervised learning, because the machine will have to come up with its own groupings. Clustering, which we described in detail above, is the classic example of unsupervised learning.
There are times when you will have some labelled data, but a limited amount, meaning the model will begin to train by example, but continue from there unsupervised. This can happen with classification, and is called semi-supervised learning.
Finally, there’s reinforcement learning. In this instance you don’t have labels up front, but you do have a constant influx of new information in the form of feedback. Each time the model makes a prediction or takes an action, the environment responds with a reward that encourages a certain kind of behaviour, such as getting the labels right.
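To make the unsupervised case described above a bit more tangible, here is a minimal Python sketch of clustering without labels. It assumes k-means, a well-known clustering technique, applied to a single feature (animal weight in pounds). The data, the function names, and the choice of starting centers are hypothetical and picked only to keep the example small.

```python
def assign(values, centers):
    """Put each value in the group of its nearest center."""
    groups = [[] for _ in centers]
    for v in values:
        nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
        groups[nearest].append(v)
    return groups


def k_means_1d(values, centers, iterations=10):
    """Alternate between grouping values and moving each center to its group's average."""
    for _ in range(iterations):
        groups = assign(values, centers)
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, assign(values, centers)


# Unlabelled weights: nothing here says which animal is which.
weights = [8, 40, 9, 50, 10, 60]

# Assumption: start with one center at the lightest and one at the heaviest animal.
centers, groups = k_means_1d(weights, [min(weights), max(weights)])
```

The algorithm never sees the words “cat” or “dog”. It only discovers that the weights fall into two groups, and it is up to a human to decide what those groups mean.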
It used to be that machine learning could handle only one task at once, but computational power is improving to the point where people are now moving from learning only one task to learning multiple ones, and to more complicated models. This can help improve generalization, and is certainly the direction of the future.
Welcome to Machine Learning
If you hire a machine learning engineer, their job is to break your problem into one or more of these tasks and then select the right models. They have to find the right training data, define the evaluation metric, and define the test set. Then for each task, based on the type of input and output you have, they select a model and train it to get the best evaluation score. After the training is done, they put the model into the decoding (in action) stage, where it will keep working as long as you keep feeding it input.
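The workflow above can be sketched end to end in a few lines of Python. This is a toy version under loud assumptions: the data is invented, the model is a single weight threshold between cats and dogs (assuming dogs are heavier, which real data would not guarantee), the evaluation metric is plain accuracy, and the test set is picked by taking every third example rather than a random sample.

```python
from statistics import mean


def split(examples, test_every=3):
    """Hold out every `test_every`-th example as the test set (deterministic for clarity)."""
    train_set = [ex for i, ex in enumerate(examples) if (i + 1) % test_every != 0]
    test_set = [ex for i, ex in enumerate(examples) if (i + 1) % test_every == 0]
    return train_set, test_set


def train_model(train_set):
    """Training: learn one parameter, the weight halfway between the class averages."""
    cat_avg = mean(w for w, label in train_set if label == "cat")
    dog_avg = mean(w for w, label in train_set if label == "dog")
    return (cat_avg + dog_avg) / 2


def decode(threshold, weight_lb):
    """Decoding: use the trained parameter on a new input."""
    return "dog" if weight_lb > threshold else "cat"


def accuracy(threshold, test_set):
    """Evaluation metric: share of held-out examples the model labels correctly."""
    correct = sum(decode(threshold, w) == label for w, label in test_set)
    return correct / len(test_set)


# Made-up labelled data: (weight in pounds, label).
examples = [(8, "cat"), (45, "dog"), (10, "cat"), (60, "dog"),
            (9, "cat"), (38, "dog"), (11, "cat"), (52, "dog")]

train_set, test_set = split(examples)
threshold = train_model(train_set)
score = accuracy(threshold, test_set)
new_label = decode(threshold, 20)
```

The key design choice is that the score is computed only on examples the model never saw during training, which is what lets you check that it has generalized rather than memorized.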
Congratulations! If you made it through, you officially understand (or at least have the tools to understand!) how machine learning works.