A Tour of Machine Learning
Some notes from Peter Flach’s book Machine Learning: The Art and Science of Algorithms that Make Sense of Data
Machine learning is an interesting area of study. Its purpose is for a computer to “learn” from data: the machine generalizes from the data it has seen so that future decisions can be better optimized based on previous experience.
Features and tasks are a major component of machine learning. It’s about gathering data and using the right features so that models can achieve the right tasks. Peter Flach writes in his Machine Learning book that
In essence, features define a ‘language’ in which we describe the relevant objects in our domain, be they e-mails or complex organic molecules. We should not normally have to go back to the domain objects themselves once we have a suitable feature representation, which is why features play such an important role in machine learning.
A task is an abstract representation of a problem we want to solve regarding those domain objects: the most common form of these is classifying them into two or more classes, but we shall encounter other tasks throughout the book. Many of these tasks can be represented as a mapping from data points to outputs. This mapping or model is itself produced as the output of a machine learning algorithm applied to training data; there is a wide variety of models to choose from.
No matter what variety of machine learning models you may encounter, you will find that they are designed to solve one of only a small number of tasks and use only a few different types of features. One could say that the models lend the machine learning field diversity, but the tasks and features give it unity.
Spam filtering is a common task. It is a binary classification: spam or not-spam (called ‘ham’). A binary classifier has a decision boundary, which makes it clear, based on an estimated probability, how to classify each email. When dealing with multi-class classification, however, the notion of a decision boundary is less obvious and more difficult to define.
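To make the decision boundary concrete, here is a minimal sketch of turning a spam probability into a binary decision. The function name, threshold, and probabilities are all illustrative assumptions, not from Flach’s book:

```python
# Hypothetical sketch: a probability threshold acts as the decision boundary
# for a binary spam/ham classifier. The 0.5 threshold is an assumption.

def classify(p_spam, threshold=0.5):
    """Label an email 'spam' if its estimated spam probability crosses the threshold."""
    return "spam" if p_spam >= threshold else "ham"

print(classify(0.92))  # a high spam probability falls on the spam side
print(classify(0.10))  # a low one falls on the ham side
```

Moving the threshold shifts the decision boundary: a cautious filter might only label an email spam above 0.95.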
In some cases you abandon the notion of classes altogether and instead want to predict a real number. Perhaps you want to determine the level of urgency of an incoming email. This task is called regression, and essentially “involves learning a real-valued function from training examples labelled with true function values. In a regression task, the notion of a decision boundary has no meaning, and so we have to find other ways to express a model’s confidence in its real-valued predictions.”
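A tiny regression sketch: fitting a real-valued function y ≈ a·x + b by ordinary least squares. The toy data stands in for something like (email length, urgency score); the numbers are invented for illustration:

```python
# Minimal ordinary-least-squares fit of a line to toy data. All data
# and names are made up; this is a sketch, not a library implementation.

def fit_line(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # slope: covariance of x and y divided by variance of x
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.0, 8.1]   # roughly y = 2x
a, b = fit_line(xs, ys)
print(round(a, 2), round(b, 2))  # slope close to 2, intercept close to 0
```

Note there is no decision boundary here: the model outputs a number, not a class.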
Both classification and regression assume there is training data. This brings us to supervised learning, which uses labelled training data, and unsupervised learning, which does not. Clustering is heavily used in unsupervised learning: a typical clustering algorithm run on documents would put similar instances together in one cluster and dissimilar instances in different clusters.
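To make “similar instances end up in the same cluster” concrete, here is a bare-bones one-dimensional k-means sketch. k-means is just one clustering algorithm among many, and this toy version (points, starting centres, iteration count) is entirely illustrative:

```python
# Toy k-means in one dimension: alternate between assigning each point to
# its nearest centre and moving each centre to the mean of its cluster.

def kmeans_1d(points, centres, iters=10):
    for _ in range(iters):
        # assignment step: each point joins its nearest centre
        clusters = [[] for _ in centres]
        for p in points:
            j = min(range(len(centres)), key=lambda j: abs(p - centres[j]))
            clusters[j].append(p)
        # update step: each centre moves to the mean of its cluster
        centres = [sum(c) / len(c) if c else centres[j]
                   for j, c in enumerate(clusters)]
    return centres, clusters

points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
centres, clusters = kmeans_1d(points, centres=[0.0, 10.0])
print(centres)  # one centre settles near 1, the other near 9
```

No labels were needed: the grouping emerges from the data alone, which is what makes this unsupervised.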
There are four different machine learning settings, arising from crossing supervised versus unsupervised learning with predictive versus descriptive models: supervised learning of a predictive model (classification, regression), supervised learning of a descriptive model (subgroup discovery), unsupervised learning of a predictive model (predictive clustering), and unsupervised learning of a descriptive model (descriptive clustering, association discovery).
Performance on a task
It is important to know that there is no single ‘correct’ answer in machine learning. We have to think of machine learning problems in terms of the classifier’s accuracy on a test set and on real-world data. When you’ve trained a model on training data and then find its accuracy on the test data is very low, that tells you overfitting occurred: the model was only good for the training data and nothing else, which makes it a useless model.
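The train/test accuracy gap described above can be sketched in a few lines. The predictions here are fabricated purely to show the pattern of an overfit model: near-perfect on training data, poor on held-out data:

```python
# Sketch: detecting overfitting by comparing training and test accuracy.
# All labels and predictions are invented to illustrate the pattern.

def accuracy(preds, truth):
    return sum(p == t for p, t in zip(preds, truth)) / len(truth)

train_truth = [1, 0, 1, 1, 0, 1, 0, 0]
train_preds = [1, 0, 1, 1, 0, 1, 0, 0]   # the model memorized the training set
test_truth  = [1, 0, 1, 0]
test_preds  = [0, 0, 0, 1]               # but it generalizes badly

print(accuracy(train_preds, train_truth))  # 1.0
print(accuracy(test_preds, test_truth))    # 0.25
```

A large gap like this (100% versus 25%) is the classic signature of a model that learned the training data rather than the underlying pattern.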
Models: The Output of Machine Learning
Models are the central component of machine learning. They are what is learned from the data in order to solve a given task. There is a range of machine learning models to choose from, given the range of tasks that machine learning aims to solve: classification, regression, clustering, association discovery, etc.
Peter Flach groups machine learning models into three categories to make them easier to organize: geometric models, probabilistic models, and logical models.
The instance space is the set of all possible or describable instances, whether they are present in our data set or not. Usually this set has some geometric structure. For instance, if all features are numerical, then we can use each feature as a coordinate in a Cartesian coordinate system. A geometric model is constructed directly in instance space, using geometric concepts such as lines, planes, and distances. One main advantage of geometric classifiers is that they are easy to visualize, as long as we keep to two or three dimensions. It is important to keep in mind, though, that a Cartesian instance space has as many coordinates as there are features, which can be tens, hundreds, thousands, or even more. Such high-dimensional spaces are hard to imagine, but are nevertheless very common in machine learning. Geometric concepts that potentially apply to high-dimensional spaces are usually prefixed with ‘hyper-’: for instance, a decision boundary in an unspecified number of dimensions is called a hyperplane.
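A minimal sketch of a geometric classifier: a hyperplane w·x + b = 0 splits the instance space, and the sign of w·x + b decides the class. The weights below are hand-picked for illustration, not learned from data:

```python
# Sketch of a linear (hyperplane) classifier. The same code works in any
# number of dimensions: only the length of w and x changes.

def linear_classify(x, w, b):
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return "positive" if score >= 0 else "negative"

w, b = [1.0, -1.0], 0.0                   # the boundary is the line x1 = x2
print(linear_classify([3.0, 1.0], w, b))  # lies on the positive side
print(linear_classify([1.0, 3.0], w, b))  # lies on the negative side
```

In two dimensions the boundary is a line; with thousands of features the identical formula describes a hyperplane, which is exactly why the ‘hyper-’ prefix earns its keep.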
Probabilistic models are relatively simple. Let X denote the variables we know about, like an instance’s feature values, and let Y denote the target variables. The question is: given X, what is the probability of each value of Y? This applies to spam filtering. We train our model on email data already labelled spam and ham (not spam), and for a new email we ask: what is the probability that it is ham? If that probability is higher than, say, 95%, and that is the decision boundary, we classify the email as ham. The naive Bayes algorithm is used heavily in probabilistic models.
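Here is a bare-bones naive Bayes sketch for spam filtering: estimate word probabilities per class from labelled training emails, then score a new email by multiplying per-word probabilities with the class prior. The training emails are invented, and the add-one (Laplace) smoothing is a standard trick to keep unseen words from zeroing out a class:

```python
# Toy naive Bayes spam filter. Training data and word choices are
# fabricated for illustration; this is a sketch, not Flach's own code.

from collections import Counter

train = [("spam", "cheap viagra now"), ("spam", "cheap pills now"),
         ("ham", "meeting agenda attached"), ("ham", "lunch meeting today")]

counts = {"spam": Counter(), "ham": Counter()}
for label, text in train:
    counts[label].update(text.split())

vocab = set(w for c in counts.values() for w in c)

def score(text, label):
    prior = sum(1 for l, _ in train if l == label) / len(train)
    p = prior
    total = sum(counts[label].values())
    for w in text.split():
        # add-one smoothing: unseen words get a small, nonzero probability
        p *= (counts[label][w] + 1) / (total + len(vocab))
    return p

email = "cheap pills"
label = max(("spam", "ham"), key=lambda l: score(email, l))
print(label)  # "cheap" and "pills" appear only in the spam training emails
```

The “naive” part is the assumption that words are independent given the class, which is false for real language but works surprisingly well in practice.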
Logical models are, well, more logical. If Viagra == 1, then Class Y = Spam. Such rules can be organized into a feature tree.
The idea of such a tree is that features are used to iteratively partition the instance space. The leaves of the tree therefore correspond to rectangular areas in the instance space which we call instance space segments, or segments for short. Feature trees whose leaves are labelled with classes are commonly called decision trees.
Tree-learning algorithms typically work in a top-down fashion. The first task is to find a good feature to split on at the top of the tree. The aim here is to find splits that result in improved purity of the nodes on the next level, where the purity of a node refers to the degree to which the training examples belonging to that node are of the same class.
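A sketch of how a tree learner might score a candidate split: compute the impurity of each child node (here the Gini index, one common purity measure) and weight by node size. Lower weighted impurity means purer children and hence a better split. The toy labels are invented:

```python
# Scoring candidate splits by weighted Gini impurity. Illustrative only;
# real tree learners also consider entropy, many features, and stopping rules.

def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def weighted_impurity(children):
    total = sum(len(c) for c in children)
    return sum(len(c) / total * gini(c) for c in children)

# split A separates the classes perfectly; split B does not
split_a = [["spam", "spam"], ["ham", "ham"]]
split_b = [["spam", "ham"], ["spam", "ham"]]
print(weighted_impurity(split_a))  # 0.0: perfectly pure children
print(weighted_impurity(split_b))  # 0.5: no better than before the split
```

A top-down learner would pick the feature behind split A, then recurse on each child node until the leaves are pure enough.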
Features: The Workhorses of Machine Learning
A model is only as good as its features. Features can be thought of as measurements that can be performed on any instance. In spam filtering, for instance, the presence of a particular word in an email is a feature that helps estimate the probability of spam. Feature construction is a critical component of machine learning.
Machine learning is about building the right models using the right features to achieve a given task. The task can range from binary or multi-class classification to regression, clustering, etc. Models can be learned in a supervised fashion, i.e. with training data, or unsupervised, with no training data. In unsupervised learning, “to evaluate a particular partition of data into clusters, one can calculate the average distance from the cluster centre. Other forms of unsupervised learning include learning associations and identifying hidden variables such as film genres. Overfitting is also a concern in unsupervised learning.”
On the output side of the model, one can distinguish between predictive models and descriptive models. Predictive models have a target variable and descriptive models identify interesting structures in data. Predictive is used in supervised learning while descriptive is used in unsupervised learning.
Flach divides machine learning models into three groups: geometric, probabilistic, and logical. Each is used differently to achieve a given task.