A quick tour of machine learning
This article dives into machine learning: how it works and what it can be used for.
Check out the author’s YouTube channel Serrano.Academy for lots of machine learning videos!
Machine learning is common sense for a computer. Machine learning roughly mimics the process by which humans make decisions based on experience, by making decisions based on previous data. Naturally, programming computers to mimic the human thinking process is challenging, because computers are engineered to store and process numbers, not make decisions. This is the task that machine learning aims to tackle. Machine learning is divided into several branches, depending on the type of decision to be made. In this article, we overview some of the most important among these branches.
Machine learning has applications in many fields, such as the following:
· Predicting house prices based on the house’s size, number of rooms, and location
· Predicting today’s stock market prices based on yesterday’s prices and other factors of the market
· Detecting spam and non-spam emails based on the words in the e-mail and the sender
· Recognizing images as faces or animals, based on the pixels in the image
· Processing long text documents and outputting a summary
· Recommending videos or movies to a user (e.g., on YouTube or Netflix)
· Building chatbots that interact with humans and answer questions
· Training self-driving cars to navigate a city by themselves
· Diagnosing patients as sick or healthy
· Segmenting the market into similar groups based on location, acquisitive power, and interests
· Playing games like chess or Go
Try to imagine how we could use machine learning in each of these fields. Notice that some of these applications are different but can be solved in a similar way. For example, predicting housing prices and predicting stock prices can be done using similar techniques. Likewise, predicting whether an email is spam and predicting whether a credit card transaction is legitimate or fraudulent can also be done using similar techniques. What about grouping users of an app based on their similarity? That sounds different from predicting housing prices, but it could be done similarly to grouping newspaper articles by topic. And what about playing chess? That sounds different from all the other previous applications, but it could be like playing Go.
Machine learning models are grouped into different types, according to the way they operate. The three main families of machine learning models are
· supervised learning,
· unsupervised learning, and
· reinforcement learning.
In this article, we focus only on supervised learning, because it is the most natural one to start with and arguably the most widely used. I encourage you to look up the other types in the literature and learn about them, too, because they are all interesting and useful! In the resources, you can find some interesting links, including several videos of mine. In particular, this video has an overview of machine learning that you may find very useful!
What is the difference between labeled and unlabeled data?
What is data?
Before we go any further, let’s establish a clear definition of what we mean by data in this article. Data is simply information. Any time we have a table with information, we have data. Normally, each row in our table is a data point. Say, for example, that we have a dataset of pets. In this case, each row represents a different pet. Each pet in the table is described by certain features of that pet.
And what are features?
If our data is in a table, the features are the columns of the table. In our pet example, the features may be size, name, type, or weight. Features could even be the colors of the pixels in an image of the pet. This is what describes our data. Some features are special, though, and we call them labels.
And what are labels?
This one is a bit less straightforward, because it depends on the context of the problem we are trying to solve. Normally, if we are trying to predict a particular feature based on the other ones, that feature is the label. If we are trying to predict the type of pet (e.g., cat or dog) based on information on that pet, then the label is the type of pet (cat/dog). If we are trying to predict if the pet is sick or healthy based on symptoms and other information, then the label is the state of the pet (sick/healthy). If we are trying to predict the age of the pet, then the label is the age (a number).
We have been using the concept of making predictions freely, but let’s now pin it down. The goal of a predictive machine learning model is to guess the labels in the data. The guess that the model makes is called a prediction.
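To make these terms concrete, here is a tiny pets table written as plain Python dictionaries. The pet names, feature names, and values are all invented for illustration; the point is only to show which columns are features and which one is the label.

```python
# A toy dataset of pets. Each dictionary (row) is one data point.
pets = [
    {"name": "Rex",   "size": "large",  "weight": 40, "type": "dog"},
    {"name": "Mimi",  "size": "small",  "weight": 4,  "type": "cat"},
    {"name": "Bobby", "size": "medium", "weight": 20, "type": "dog"},
]

# The features are the columns that describe each pet...
features = ["name", "size", "weight"]

# ...and the label is the column we want to predict. Here we predict
# the type of pet. If we predicted the weight instead, "weight" would
# be the label and "type" would become a feature.
label = "type"

labels = [row[label] for row in pets]
print(labels)  # ['dog', 'cat', 'dog']
```

Notice that nothing in the data itself marks a column as the label; that choice comes from the problem we want to solve.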
Now that we know what labels are, we can see that there are two main types of data: labeled and unlabeled data.
Labeled and unlabeled data
Labeled data is data that comes with labels. Unlabeled data is data that comes with no labels. An example of labeled data is a dataset of emails that comes with a column that records whether the emails are spam or ham, or a column that records whether the email is work related. An example of unlabeled data is a dataset of emails that has no column we are interested in predicting.
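In code, the only difference between the two kinds of dataset is the presence of the label column. A minimal sketch, with invented rows, is dropping that column from every data point:

```python
# A labeled dataset: each row carries the label we want to predict ("type").
labeled = [
    {"weight": 40, "type": "dog"},
    {"weight": 4,  "type": "cat"},
]

# Removing the label column from every row leaves an unlabeled dataset:
# the same pets, but nothing we are trying to predict.
unlabeled = [{k: v for k, v in row.items() if k != "type"} for row in labeled]
print(unlabeled)  # [{'weight': 40}, {'weight': 4}]
```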
In figure 1, we see three datasets containing images of pets. The first dataset has a column recording the type of pet, and the second dataset has a column specifying the weight of the pet. These two are examples of labeled data. The third dataset consists only of images, with no label, making it unlabeled data.
Figure 1. Labeled data comes with a tag, and unlabeled data comes with no tag. The dataset on the left is labeled, and the label is the type of pet (dog/cat). The dataset in the middle is also labeled, and the label is the weight of the pet (in pounds). The dataset on the right is unlabeled.
Of course, this definition contains some ambiguity, because depending on the problem, we decide whether a particular feature qualifies as a label. Thus, whether data counts as labeled or unlabeled often depends on the problem we are trying to solve.
Labeled and unlabeled data yield two different branches of machine learning called supervised and unsupervised learning. Next I define supervised learning.
Supervised learning: The branch of machine learning that works with labeled data
We can find supervised learning in some of the most common applications of machine learning today, including image recognition, various forms of text processing, and recommendation systems. Supervised learning is a type of machine learning that uses labeled data. In short, the goal of a supervised learning model is to predict (guess) the labels.
In the example in figure 1, the dataset on the left contains images of dogs and cats, and the labels are “dog” and “cat.” For this dataset, the machine learning model would use previous data to predict the label of new data points. This means, if we bring in a new image without a label, the model will guess whether the image is of a dog or a cat, thus predicting the label of the data point.
One framework for making a decision is remember-formulate-predict. This is precisely how supervised learning works. The model first remembers the dataset of dogs and cats. Then it formulates a model, or a rule, for what it believes constitutes a dog and a cat. Finally, when a new image comes in, the model makes a prediction about what it thinks the label of the image is, namely, a dog or a cat.
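One of the simplest models that fits the remember-formulate-predict framework is a nearest-neighbor classifier, sketched below in plain Python. This is only an illustration of the framework, not the specific model behind the figure, and the feature values (weight in pounds, ear length in inches) are invented.

```python
# Remember: store the labeled dataset as (features, label) pairs.
remembered = [
    ((40.0, 3.0), "dog"),
    ((35.0, 2.5), "dog"),
    ((4.0, 1.0), "cat"),
    ((5.0, 1.2), "cat"),
]

def distance(a, b):
    # Formulate: the "rule" is squared Euclidean distance between
    # feature vectors -- similar animals have similar measurements.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def predict(point):
    # Predict: a new point gets the label of its closest remembered point.
    _, label = min(remembered, key=lambda pair: distance(pair[0], point))
    return label

print(predict((38.0, 2.8)))  # dog
print(predict((4.5, 1.1)))   # cat
```

Real models formulate much richer rules than "copy the closest neighbor," but the three-step structure is the same.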
Now, notice that in figure 1, we have two types of labeled datasets. In the dataset in the middle, each data point is labeled with the weight of the animal. In this dataset, the labels are numbers. In the dataset on the left, each data point is labeled with the type of animal (dog or cat). In this dataset, the labels are states. Numbers and states are the two types of data that we’ll encounter in supervised learning models. We call the first type numerical data and the second type categorical data.
- Numerical data is any type of data that uses numbers such as 4, 2.35, or –199. Examples of numerical data are prices, sizes, or weights.
- Categorical data is any type of data that uses categories, or states, such as male/female or cat/dog/bird. For this type of data, we have a finite set of categories to associate to each of the data points.
This gives rise to the following two types of supervised learning models:
- Regression models are the types of models that predict numerical data. The output of a regression model is a number, such as the weight of the animal.
- Classification models are the types of models that predict categorical data. The output of a classification model is a category, or a state, such as the type of animal (cat or dog).
Let’s look at two examples of supervised learning models, one regression and one classification.
Model 1: Housing prices model (regression). In this model, each data point is a house. The label of each house is its price. Our goal is that when a new house (data point) comes on the market, we would like to predict its label, namely, its price.
Model 2: Email spam–detection model (classification). In this model, each data point is an email. The label of each email is either spam or ham. Our goal is that when a new email (data point) comes into our inbox, we would like to predict its label, namely, whether it is spam or ham.
Notice the difference between models 1 and 2.
· The housing prices model is a model that can return a number from many possibilities, such as $100, $250,000, or $3,125,672.33. Thus, it is a regression model.
· The spam detection model, on the other hand, can return only two things: spam or ham. Thus, it is a classification model.
Next, I elaborate some more on regression and classification.
Regression models predict numbers
As I mentioned previously, regression models are those in which the label we want to predict is a number. This number is predicted based on the features. In the housing example, the features can be anything that describes a house, such as the size, the number of rooms, the distance to the closest school, or the crime rate in the neighborhood.
Other places where one can use regression models follow:
· Stock market: predicting the price of a certain stock based on other stock prices and other market signals
· Medicine: predicting the expected life span of a patient or the expected recovery time, based on symptoms and the medical history of the patient
· Sales: predicting the expected amount of money a customer will spend, based on the client’s demographics and past purchase behavior
· Video recommendations: predicting the expected amount of time a user will watch a video, based on the user’s demographics and other videos they have watched
The most common method used for regression is linear regression, which uses linear functions (lines or similar objects) to make predictions based on the features. Other popular methods for regression are decision tree regression and several ensemble methods, such as random forests, AdaBoost, gradient boosted trees, and XGBoost.
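For one feature, linear regression has a closed-form least-squares solution, which the sketch below implements from scratch. The house sizes and prices are invented numbers chosen so the fit is exact; a real dataset would have noise around the line.

```python
# Simple (one-feature) linear regression fit by the least-squares
# closed form: slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x).
sizes  = [50.0, 70.0, 90.0, 110.0]    # house size (invented units)
prices = [150.0, 190.0, 230.0, 270.0]  # price; here exactly 2 * size + 50

n = len(sizes)
mean_x = sum(sizes) / n
mean_y = sum(prices) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, prices)) \
        / sum((x - mean_x) ** 2 for x in sizes)
intercept = mean_y - slope * mean_x

def predict_price(size):
    # The prediction is a number, which is what makes this a regression model.
    return slope * size + intercept

print(predict_price(100.0))  # 250.0
```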
Classification models predict a state
Classification models are those in which the label we want to predict is a state belonging to a finite set of states. The most common classification models predict a “yes” or a “no,” but many other models use a larger set of states. The example we saw in figure 1 is an example of classification, because it predicts the type of the pet, namely, “cat” or “dog.”
In the email spam recognition example, the model predicts the state of the email (namely, spam or ham) from the features of the email. In this case, the features of the email can be the words in it, the number of spelling mistakes, the sender, or anything else that describes the email.
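A toy classifier in the spirit of this example is sketched below: it scores an email by counting words from an invented "spammy" vocabulary and predicts "spam" when the score crosses a threshold. Real spam filters learn the word weights and the threshold from labeled data; here both are hand-picked purely for illustration.

```python
# Invented vocabulary and threshold, chosen by hand for this sketch.
SPAMMY_WORDS = {"lottery", "winner", "free", "prize"}
THRESHOLD = 2

def classify(email_text):
    words = email_text.lower().split()
    score = sum(1 for w in words if w in SPAMMY_WORDS)
    # The prediction is one of a finite set of states: "spam" or "ham",
    # which is what makes this a classification model.
    return "spam" if score >= THRESHOLD else "ham"

print(classify("You are the lottery winner of a free prize"))  # spam
print(classify("Meeting moved to Thursday at noon"))           # ham
```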
Another common application of classification is image recognition. The most popular image recognition models take as input the pixels in the image, and they output a prediction of what the image depicts. Two of the most famous datasets for image recognition are MNIST and CIFAR-10. MNIST contains 70,000 28-by-28-pixel black-and-white images of handwritten digits, labeled 0–9, split into 60,000 training images and 10,000 test images. These images come from a combination of sources, including the U.S. Census Bureau and a repository of digits handwritten by American high school students. The MNIST dataset can be found at the following link: http://yann.lecun.com/exdb/mnist/. The CIFAR-10 dataset contains 60,000 32-by-32-pixel colored images of different things. These images are labeled with 10 different classes (thus the 10 in its name), namely airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks. This database is maintained by the Canadian Institute for Advanced Research (CIFAR), and it can be found at the following link: https://www.cs.toronto.edu/~kriz/cifar.html.
Some additional powerful applications of classification models follow:
· Sentiment analysis: predicting whether a movie review is positive or negative, based on the words in the review
· Website traffic: predicting whether a user will click a link or not, based on the user’s demographics and past interaction with the site
· Social media: predicting whether a user will befriend or interact with another user, based on their demographics, history, and friends in common
· Video recommendations: predicting whether a user will watch a video, based on the user’s demographics and other videos they have watched
That’s all for now. If you want to learn more, check out the book on Manning’s liveBook platform here.