Given the attention machine learning has received and the sophisticated problems it is solving, it may seem like magic. But it is not. It’s built on a foundation of mathematics and statistics, developed over many decades.
One very practical way to think of machine learning is as a unique way of programming computers. Most programming that is not machine learning (including the practical programs humans have relied on most over the last 50 years) is procedural — it’s essentially a set of rules defined by a human. This ruleset is called an algorithm.
In machine learning, the underlying algorithm is selected or designed by a human. However, the algorithms learn from data, rather than direct human intervention, about the parameters that will shape a mathematical model for making predictions. Humans don’t know or set those parameters — the machine does. Put another way, a data set is used to train a mathematical model so that when it sees similar data in the future, it knows what to do with it. Models typically take data as an input and then output a prediction of something of interest.
Executives don’t need to be machine learning experts, but even a little knowledge can go a long way. If you can understand the basic kinds of things you can do with ML, you’ll have an idea of where to start and know what you should be digging into. And you won’t have to blindly tell your technical team to “go do some magic” and then hope they succeed. In this post, we’ll give you just enough knowledge to be dangerous. We’ll start with machine learning techniques you may have heard about, address a fundamental ML challenge, dive into deep learning, and discuss the physical, computational realities that make it all possible. All in all, we hope your conversations with data scientists and engineers are a bit more productive.
Machine Learning Techniques
Machines learn in different ways with varying amounts of “supervision”: supervised, unsupervised, and semi-supervised. Supervised learning is the most widely deployed form of ML and also the most straightforward. Unsupervised learning, however, doesn’t require labeled data and is potentially applicable to a much wider range of problems.
Machines often learn from sample data that has both an example input and an example output. For example, one data-sample pair may be input data about an individual’s credit history, and the associated output is the corresponding credit risk (either specified by a human or based on historical outcomes). Given enough of these input-output samples, the machine learns how to construct a model that is consistent with the samples it trained on.
From there, the model can be applied to new data that it has never seen before — in this case, the credit histories of new individuals. After learning from sample data, the model applies what it has learned to the real world.
This class of machine learning is called “supervised learning,” since the desired predicted outcome is given, and the model is “supervised” to learn the associated model parameters. Humans know the right answer, and they supervise the model as it learns how to find it. Since humans must label all of the data, supervised learning is a time-intensive process.
Supervised learning problems include:
The goal of a classification problem is to determine which group a given input belongs to. For instance, in a medical application, the possibilities might be disease present versus disease not present. Another classic example is categorizing animal pictures into a cat group and a dog group.
The machine is trained on data with many examples of inputs (like an image of an animal) along with corresponding outputs, often called labels (like “cat” or “dog”). Train the model with a million pictures of dogs and cats, and it should be able to classify a picture of a new dog that wasn’t in the training data.
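As an illustrative (and deliberately oversimplified) sketch, here is a tiny supervised classifier in Python. It is a nearest-centroid model on invented (weight, height) measurements rather than a real image classifier — far simpler than anything trained on a million pictures — but the structure is the same: labeled input-output pairs in, a model that labels new inputs out.

```python
# A deliberately tiny supervised classifier: nearest centroid on
# invented (weight kg, height cm) measurements. Real cat/dog image
# classifiers are far more elaborate; this only shows the shape of
# supervised learning: labeled examples in, a labeling model out.

def train(samples):
    """samples: list of (features, label) pairs -> {label: centroid}."""
    totals, counts = {}, {}
    for features, label in samples:
        acc = totals.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: tuple(v / counts[label] for v in acc)
            for label, acc in totals.items()}

def predict(centroids, features):
    """Label a new input with its closest centroid's label."""
    def dist2(center):
        return sum((f - c) ** 2 for f, c in zip(features, center))
    return min(centroids, key=lambda label: dist2(centroids[label]))

# Labeled training data: ((weight kg, height cm), species)
training = [((4, 25), "cat"), ((5, 28), "cat"), ((3, 23), "cat"),
            ((25, 60), "dog"), ((30, 65), "dog"), ((20, 55), "dog")]
model = train(training)
print(predict(model, (4.5, 26)))   # an animal the model has never seen
```

Note that the model never sees the new animal during training; it generalizes from the labeled examples it was given.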
Like classification, regression is also about inputs and corresponding outputs. But while outputs for classification are typically discrete categories (cat, dog), outputs for regression are continuous numbers. In other words, it’s not a 0 or 1, but a sliding scale. For example, given a radiological image, a model could predict how many more years the associated individual is likely to remain healthy.
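A minimal regression sketch, using invented data (hours of study versus exam score) and a textbook least-squares line fit. The point is simply that the output is a continuous number rather than a category:

```python
# Invented data: hours of study vs. exam score.
hours  = [1, 2, 3, 4, 5]
scores = [52, 58, 66, 71, 78]

def fit_line(xs, ys):
    """Least-squares fit of y = m*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    m = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return m, my - m * mx

m, b = fit_line(hours, scores)
predicted = m * 6 + b     # a continuous prediction for 6 hours: 84.5
```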
In unsupervised learning, the machine learns from data for which the outcomes are not known. It’s given input samples, but no output samples.
For instance, imagine you have a set of documents that you would like to organize. For example, some documents may be about sports, others about history, and still others about the arts. Given only the set of documents, the objective is to automatically learn how to cluster them into types.
For clustering, only the input (the data one seeks to organize) is provided in the sample data. No explicit output is provided. The model may cluster the sports documents in one group and the history documents in another, but it was never told explicitly what a sports or history document looked like, as it was never shown sample output data. In fact, once clustering is complete, the model still won’t know what a sports or history document is. All the model “knows” is that inputs in Group A are similar to each other, as are the inputs in Group B. It’s for humans to look at the clusters and decide whether and how they make sense.
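To make clustering concrete, here is a bare-bones k-means sketch on invented 2-D points. (Clustering real documents would first require turning each document into a numeric vector, such as word counts; that step is omitted here.) The algorithm is given no labels; it only discovers that some points sit near each other.

```python
import random

# Bare-bones k-means on invented 2-D points. No labels are given;
# the algorithm only discovers which points are near each other.

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # random initial centers
    for _ in range(iters):
        # 1. assign every point to its nearest center
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: (p[0] - centers[c][0]) ** 2 +
                                            (p[1] - centers[c][1]) ** 2)
            groups[j].append(p)
        # 2. move each center to the mean of its assigned points
        for j, g in enumerate(groups):
            if g:
                centers[j] = (sum(p[0] for p in g) / len(g),
                              sum(p[1] for p in g) / len(g))
    return centers, groups

points = [(1, 1), (1.5, 2), (2, 1.2),        # one natural cluster
          (8, 8), (8.5, 9), (9, 8.2)]        # another
centers, groups = kmeans(points, k=2)
```

A human still has to look at the resulting groups and decide what, if anything, they mean.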
Unsupervised learning is less common in practical business settings, but it is attractive: you don’t need labeled data and can avoid the human effort and cost of producing it. And since it isn’t narrowly restricted to applications with labeled data, unsupervised learning is potentially applicable in many more areas.
As we’ve seen, supervised and unsupervised tasks have different data requirements: supervised learning is typically demanding from the standpoint of acquiring data for learning, while unsupervised learning is relatively simple. In semi-supervised learning, data scientists combine the two: a model uses unlabeled data to gain a general sense of the data’s structure, then uses a small amount of labeled data to learn how to group and organize the data as a whole. This approach is sometimes referred to as “weak learning.”
The advantage of this approach is that often the quantity of labeled data needed for learning a good model may be reduced, as there is an opportunity to learn from the contextual information provided by unlabeled data.
Transfer learning is essentially transferring knowledge from one task to another. Humans are very good at this — if a child learns how to play baseball, aspects of what they learn will also help them play kickball.
An ML example is training a model to classify images of cats, dogs, tables, etc., perhaps using the famous ImageNet data set of millions of images. Once a model is trained to do that, a significant portion of the model could then be used for a completely different task like identifying tumors in an X-ray image.
In this case, one may reuse a substantial portion of the prior model and then specialize the remaining portion to the specific new task of interest. It turns out that features learned for one task, like classifying cats and dogs, can also be useful for, say, finding tumors. Fine-tuning is required, of course: adjusting the weights so that they’re more appropriate for the new task. But in transfer learning, starting from the solution to the first task yields a better, faster solution than starting from scratch.
Transfer learning can substantially reduce the amount of data needed for a new task, which is a potential business benefit. Tell an executive that you have an amazing image detector that requires training on a million images, and the executive may despair that they don’t have a million images, let alone the ability to label them all. The nice thing about transfer learning is that you don’t need a million images. You might only need thousands or tens of thousands instead. Starting from a pre-trained network, you can often get a better solution, faster, than by training on your data alone.
Another example of transfer learning comes from natural language processing, the field concerned with machines processing text. After an initial model learns the structure of a language (grammar, spelling, and so on), that learning can be transferred to tasks like sentiment analysis or document classification.
However, a word of caution: don’t assume that tasks for which humans could easily transfer their learning are necessarily ripe for machine transfer learning. It’s not always obvious. While it’s very easy as a human to anthropomorphize machines, machines are not people, and they learn in a very different way.
Recall that supervised learning requires input and output examples. Reinforcement learning is like unsupervised learning in the sense that example outputs are usually not given.
The central concept of reinforcement learning is based around an “agent” (a computer or robot) that is interacting with an “environment” (here defined as everything that is not the agent).
The agent performs actions on the environment (for instance, a robot takes a step forward). The environment then provides some sort of feedback to the agent, usually in the form of a “reward.”
By “reward”, we don’t mean we give the machine a jolt of electrons. We literally just add to the program’s reward counter. The agent’s goal is to maximize the number in that counter. Critically, however, no one is telling the agent how to maximize the reward or explaining why it gets a reward. That’s what the agent figures out for itself by taking actions and observing its environment.
In many forms of reinforcement learning, the agent does not know what the objective is because it does not have examples of success. All it knows is whether it receives the reward or not.
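A minimal sketch of these ideas: tabular Q-learning (one classic reinforcement learning algorithm, not the only one) in a tiny invented “corridor” environment. The agent is only ever told its reward; it discovers on its own that moving right pays off.

```python
import random

# Tabular Q-learning (one classic RL algorithm) in an invented
# 5-cell corridor. Cell 0 is the start; reaching cell 4 ends the
# episode and pays a reward of +1. Nothing ever tells the agent
# that "go right" is the objective.

N_STATES, ACTIONS = 5, ("left", "right")
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2  # learning rate, discount, exploration

def step(state, action):
    """The environment: returns (next_state, reward, done)."""
    nxt = max(0, state - 1) if action == "left" else min(N_STATES - 1, state + 1)
    return nxt, (1.0 if nxt == N_STATES - 1 else 0.0), nxt == N_STATES - 1

rng = random.Random(0)
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

for _ in range(200):                         # training episodes
    s, done = 0, False
    while not done:
        # explore occasionally, otherwise exploit the current estimate
        a = (rng.choice(ACTIONS) if rng.random() < EPSILON
             else max(ACTIONS, key=lambda x: Q[(s, x)]))
        s2, r, done = step(s, a)
        best_next = max(Q[(s2, x)] for x in ACTIONS)
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
        s = s2

# The learned policy: the best action in each non-terminal cell
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)}
```

After training, the policy prefers “right” in every cell — a strategy the agent assembled purely from reward feedback.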
Reinforcement learning has echoes of human psychology — the brain experiences something good and a dopamine rush makes a person want more of it. A bad experience, like touching a hot stove, causes pain that discourages the person from repeating the behavior. However, despite the parallels to human psychology, humanizing it too much is a mistake.
Overfitting: The Classic ML Error
Data scientists try hard to make training data as representative of the real world as possible. Otherwise, it’s possible to build a model that maximizes performance on the training data but is unsuitable for anything else. This is called overfitting.
To explain, let’s take an extreme example. A lazy data scientist could build a dog/cat image detector with a simple lookup table: given this specific input, produce this specific output. If the data scientist had a thousand labeled dog/cat images and put each one into the lookup table, they could create a “perfect” classification system with no machine learning required. When the system saw a precise combination of pixels, it would know what to label it.
Needless to say, this would be a terrible system — it would only work for the image data it already knows. Change just one pixel, and the system wouldn’t recognize a dog anymore. This is overfitting: the model perfectly fits the training data (the 1,000 labeled images) but not the real world (any other images of a dog).
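The lazy data scientist’s lookup table is easy to write down in code. In this sketch the “images” are invented four-pixel tuples; the table is perfect on its training data and helpless on everything else:

```python
# The lookup-table "classifier": perfect on training data,
# useless on anything else. Images here are invented four-pixel
# tuples rather than real photos.

training_images = {
    (0, 0, 255, 255): "cat",
    (255, 255, 0, 0): "dog",
}

def lookup_classifier(image):
    # exact match or nothing -- no generalization at all
    return training_images.get(image, "unknown")

print(lookup_classifier((0, 0, 255, 255)))   # seen in training: "cat"
print(lookup_classifier((0, 0, 255, 254)))   # one pixel off: "unknown"
```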
Fortunately, we don’t know any data scientists like this. Overfitting in practice isn’t quite so obvious, but it’s still a major concern.
Consider a graph with a house’s square footage on the X-axis and its price on the Y-axis. Imagine a few data points plotted on the graph. The relationship between square footage and price could be represented most simply as a straight line.
But that line wouldn’t be perfect — in fact, it might miss all of the actual data points. Think back to algebra class and you’ll remember that this linear function can be represented as y = mx + b. Here m and b are called the parameters — change them, and you change the nature of the line.
You could add more parameters to create a more complicated function that better fits the data — for example, the quadratic y = ax² + bx + c. With three parameters, the curve matches your data more tightly.
Given this trend, you could naturally keep on increasing the number of parameters until your line fits precisely to every point of your training data.
But common sense tells you that this line isn’t a true representation of housing prices in the real world. We’ve built a model that works only for our training data and doesn’t generalize. Real-world data isn’t going to fall neatly on a smooth curve. It may look like a polished solution, but it’s little better than the dog/cat lookup table.
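This progression is easy to demonstrate in code. The sketch below uses invented square-footage/price data: a five-parameter polynomial (built by Lagrange interpolation) passes exactly through every training point, yet extrapolates far worse than the humble two-parameter line.

```python
# Invented data: size (1,000s of sq ft) vs. price ($100k),
# scattered loosely around a straight line.
sqft  = [1, 2, 3, 4, 5]
price = [3.1, 4.9, 7.2, 8.8, 11.1]

def fit_line(xs, ys):
    """Two parameters: least-squares slope m and intercept b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    m = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return m, my - m * mx

def interpolate(xs, ys, x):
    """Five parameters: the unique degree-4 polynomial through all
    five points, evaluated via Lagrange interpolation."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

m, b = fit_line(sqft, price)

# The polynomial fits precisely to every training point...
assert all(abs(interpolate(sqft, price, x) - y) < 1e-9
           for x, y in zip(sqft, price))

# ...but look at each model's prediction for an unseen 6,000 sq ft house:
poly_pred = interpolate(sqft, price, 6)   # ~18.1: wildly off-trend
line_pred = m * 6 + b                     # ~13.0: continues the trend
```

The zero-training-error polynomial is the lookup table all over again, just wearing a mathematical disguise.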
Training, Validation, and Testing
To guard against overfitting, data scientists typically split data into three sets: training, validation, and test. They scramble data across these sets randomly, because they want them to have the same distribution.
The training set may have 70% of the data. This is the only data used as feedback for the model to learn from. Data scientists want to optimize the model by decreasing loss — that is, decreasing how poorly the model does at describing the training data. As optimization continues with training data, loss should decrease.
Validation data (about 10% of the data) is data the model isn’t trained on. Rather, it’s used to help select the right model, fine-tune parameters, and prevent overfitting. As training progresses, data scientists should see loss decrease on the validation set as well. But at some point, the model will start optimizing to the training data alone. Loss for training will continue to decrease, but loss for validation will start to creep back up. That’s how data scientists detect overfitting, or realize they’ve picked the wrong model: if the model works on the training data but not the validation data, it isn’t properly generalizing.
But there’s still a problem. If a model fits the training data set and the validation set, there’s still a chance it just got “lucky” that the validation set happened to be similar to the training set. Even though the data sets are random, there’s still a chance the model just happens to do well for your training and validation data, but still wouldn’t perform well in the real world.
So how do you determine how good a model actually is? Enter the test data set — the remaining 20% of the data. In theory, testing should be done only once, though machine learning in practice may bend this rule. The reason to do it just once is that, ideally, once a model is finished, a never-before-used test set gives a good estimate of how the model will perform in the real world.
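A minimal sketch of such a split, with hypothetical fractions matching the 70/10/20 example above:

```python
import random

# Sketch of a 70/10/20 train/validation/test split. Shuffling first
# gives all three sets (approximately) the same distribution --
# without it, e.g. the newest records could all land in the test set.

def split(records, train_frac=0.7, val_frac=0.1, seed=42):
    shuffled = records[:]                  # don't disturb the original
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

data = list(range(1000))                   # stand-in for 1,000 records
train, val, test = split(data)             # 700 / 100 / 200 records
```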
In practice, of course, real-world data may not match training, testing, or validation data. To the extent that that’s true (either because the initial data was bad or the real world is changing), the model may not work as well once it’s deployed. It’s possible for the data a model encounters in deployment to feed back into the model and improve it, but this can require human oversight and retraining. Left to its own devices, a model may not perform well with real-world data and drift away from its ideal performance.
The Physical Realities of Machine Learning
Machine learning doesn’t just happen in the ether. All that computation has to take place somewhere. Whether you do your calculations on-site or in the cloud, machine learning is a physical reality as much as a mathematical one. Here are some key concepts to know as you talk to data scientists and engineers about your ML capabilities. As you might guess, your machine learning needs will vary based on factors like how quickly you need to respond, how many predictions you’ll be doing, and what those requests look like.
Compute / Processing
These are the processors that train and serve your machine learning model. A popular option here is the Graphics Processing Unit (GPU), which was originally built for video gaming. Experts then discovered that GPUs could train deep learning networks much faster than central processing units (CPUs). While a technologist might object to this characterization, you can think of a GPU as a really fast CPU.
So if GPUs are faster, they must always be better, right?
Not in every case. For one thing, GPUs are more expensive than CPUs. For another, you need specialized software that can actually use them. GPUs excel at performing many simple calculations in parallel, while CPUs are better suited for complex, sequential tasks. And GPUs don’t always make a big difference. If you’re training an algorithm on 10 million records, being able to move 100 times faster could be important. But if, after training has occurred, you’re only making one prediction at a time, then going from 100 milliseconds to 1 millisecond will be an imperceptible difference to the end user — and thus, a waste of money. Before you make a knee-jerk decision to buy or rent GPUs, make sure you have a lot of data and a clear need.
GPU machines are often more useful during the training portion of the ML process, when they are needed to process large amounts of data and set the weights and parameters of a given machine learning model. The compute resources required to run a model in production that’s already been trained are typically less powerful.
Hypothetical Requirements: Training vs. Production
Training:
- 64 GB RAM
- 500 GB Disk
- 2x GPU

Production:
- 2 GB RAM
- 500 GB Disk
- 2x CPU
RAM
This is computer memory. RAM generally holds less than a long-term storage hard drive, but it’s faster. RAM constrains the number of things that can be processed at the same time. It’s also referred to as volatile memory, because its contents are lost when the computer reboots.
A good analogy is an airline tray. If you put your laptop on it but are served a drink, you’ll need to close your laptop and move it off if you want space for your beverage. When you turn the computer off, RAM gets cleared out — just like your tray at the end of your flight.
More RAM is useful in cases where you could potentially field large volumes of simultaneous requests. If you don’t need to do that, you won’t need as much RAM.
Long-Term Storage/Disk Space
This is the hard drive where data permanently resides. Disk space is where you store things for the long term, like a file cabinet. You load things from it onto your “airline tray” when you’re ready to look at them. In the machine learning world, storage typically holds your machine learning model and your data set.
If all you need is something that works off the shelf, you don’t necessarily need to worry about the underlying infrastructure for your ML project at all. Like video on demand, machine learning can be delivered to you as a service. Just as you don’t need to know what kind of processors are underlying your latest streaming binge-watch, you don’t need to worry about whether GPUs or CPUs are behind your ML tool. Just upload your data, select some parameters, and let it run.
So how do you know whether you need custom development or can rely on an existing framework? It depends on the difficulty of the problem you are trying to solve and the knowledge you have about how to solve it. Off-the-shelf frameworks are simple to use, but they are also best suited for relatively simple problems.
But much machine learning, especially deep learning, is not plug-and-play. Just tuning a hyperparameter like the learning rate, often in ways that defy intuition, can make a big difference. Of course, these frameworks will improve over time and become increasingly useful for complex problems. But for now, off-the-shelf frameworks remain best suited for the simplest, most well-defined, most easily structured problems. These are, by definition, problems someone else has already solved. The more your specific challenge requires unique customization, the less you’ll be able to rely on something from the shelf.
Again, the key phrase here is “for now.” These frameworks continue to improve, so even if you rightfully dismiss them now, you might want to revisit their capabilities periodically to see if their advancements can address your needs.
Are You Optimizing the Right Things?
You don’t necessarily need the world’s most limit-pushing technology. For instance, it likely never matters to you that your car doesn’t go above 200 miles per hour. Other factors, like safety or the number of seats, may be more of a daily concern.
There is frequently a benefit to having a faster GPU, faster memory, faster storage, or a faster network. But that doesn’t mean it will be worth it in the context of your underlying business problem.
You can waste time and money optimizing the wrong thing. For instance, optimizing how fast a timesheet application adds up hours won’t fix the fact that it takes a person 20 minutes to fill the timesheet out. It’s possible to make part of a process much faster while the overall result doesn’t change.
On-Prem vs. the Cloud
Eventually, you’ll have to decide whether to house your infrastructure yourself (having it “on-premises” or “on-prem”) or look to existing options in the cloud (a service like AWS or Azure).
With traditional software development, the cloud is typically a reasonable choice. It doesn’t make sense for most companies to host their own data centers or to buy servers to keep at their office. AWS claims clients from NASA to Netflix.
Machine learning, however, has changed this calculation. While the aforementioned off-the-shelf options will likely be delivered via the cloud, other types of machine learning might actually make sense on-prem. You’ll want to look at the total cost of ownership, comparing the monthly cost of the cloud provider to your internal costs for IT staff, power, space, and so on. While that framework may eventually indicate that everyone is better off in the cloud, that’s not the case today. In the fall of 2018, AI entrepreneur Jeff Chen wrote that “building your own Deep Learning Computer is 10x cheaper than AWS.”
Of course, this is all about deciding what is right for your business, in collaboration with your technical team. If you’re only training once a year, the cloud may be a better option than maintaining the hardware yourself. You get to take advantage of the decrease in prices as they occur. If Amazon drops their prices and you’re already on Amazon, then you get lower prices.
With the cloud, you avoid maintenance and security costs and hardware refreshes — you automatically take advantage of the latest machines, and you aren’t stuck trying to sell off GPUs if the project ends.
Plus, entities like AWS and Azure have built a reputation as “sensible” options that may be easier for management to approve. And it might be nice to have an outside entity to point to if things don’t work. In fact, when Chen’s aforementioned article on AWS costs was posted to Y Combinator’s news aggregation site Hacker News, the top comment was telling: “You’re forgetting the cost of fighting IT in a bureaucratic corporation to get them to let you buy/run non-standard hardware.” Then again, in some cases, data sensitivity may steer projects toward staying on-prem, even with cloud vendors making security assurances.
We’ve covered a lot of ground in this post: machine learning techniques, overfitting, and infrastructure issues. You may have also begun to understand some of the complexity that data scientists and engineers are dealing with as they build systems that use machine learning, especially in organizations that have never done it before. Data science isn’t magic — and now that you’ve seen behind the curtain, you’re better prepared to help put it into practice.
Robbie Allen is a Senior Advisor to Infinia ML, a team of data scientists, engineers, and business experts putting machine learning to work.