How Classification Works

Colin Jemmott
Seismic Innovation Labs
Jul 11, 2018 · 16 min read

This is a write-up of a talk I gave in February 2008, trying to convince a group of physicists that they might want to check out this machine learning stuff. I ran across it recently, and while there are a ton of really good introductions to machine learning out there now, this one spent more time on motivating the problem and building intuition than most other tutorials that I have seen, so it might still be of interest today. The original title was “A Really Brief Introduction to Pattern Classification”.

Classification

Classification, in this context, means using math to apply a categorical label to raw data. Examples might include:

  • Who is speaking?
  • Was there an earthquake? In which grid space?
  • Is that a bad sensor?
  • Is the recording a marine mammal or a cargo ship?

These problems go by a bunch of different labels, depending on the field, the algorithm, and the type of result. The current trend is to call everything “machine learning” or even “artificial intelligence”, but I have also heard “hypothesis testing”, “decision theory”, “pattern classification”, “detection theory”, and many more. For now, let’s just say we are building a classifier whenever we do this: take in raw data, extract some features from it, and use those features to assign the data to a class.

There are many techniques that have been developed to perform these tasks, including likelihood ratio tests, nearest neighbor techniques, support vector machines, decision trees and neural networks.

It is worth noting that humans are generally excellent at doing this sort of classification. People give lots of reasons to automate the decision, including speed, reliability, and privacy, but really the reason it is done most of the time is cost. For many problems it is prohibitively expensive to have trained humans hand-review every example.

Our Example

Today we will attempt to build an algorithm that decides if an audio recording is of a marine mammal or a cargo ship. For this problem, the three steps are:

  • Receive Data — digitized acoustic time series (audio file)
  • Extract Features — Properties of the audio. For example, formant frequencies, standard deviation, modulation rate, etc.
  • Assign to a class — Based on the values of the features, pick the most likely category.

So let’s take a wild guess for the feature selection part, and say that ships are often louder than whales. This is a supervised classification problem, meaning that we have some already labeled samples that we can use to design our classification algorithm. We can plot the magnitude (loudness) of 200 different labeled recordings of the marine mammal (blue) and cargo ship (red).

Magnitude (loudness) of 200 recordings of whales (blue) and ships (red).

It does seem like the cargo ships generally have a higher magnitude than the marine mammals, but a histogram will give us a clearer summary.

Histogram of the values of the previous plot.

Yup, cargo ships are usually louder! (in the sense that their average magnitude is higher).

A very simple classifier

Remember, the goal here is to design a mathematical method for deciding if a new (unlabeled) recording is of a whale or a ship. If we were to approach this problem totally naively, we might just pick a threshold and decide that any future, unlabeled recording with a magnitude above that threshold is a ship, and any below that threshold a whale.

Looking at the histogram, we might pick a threshold of 0.8, which gives us the black line as a decision boundary.

Histogram with a decision boundary (black).

We can write this down with simple math, where x[n] is our feature (the magnitude of the recording), and H is the hypothesis that we are testing,
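Writing H_1 for the marine mammal hypothesis and H_2 for the cargo ship hypothesis (the numbering used in the rest of this post), the rule comes out to roughly

choose H_2 (ship) if x[n] > 0.8, otherwise choose H_1 (whale)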

And that is it! We have a very simple classifier — choose marine mammal if the magnitude is below the threshold, and cargo ship if it is above.
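In code, the whole classifier is a single comparison. A minimal sketch in Python (the 0.8 comes straight from eyeballing the histogram):

```python
THRESHOLD = 0.8  # picked by eye from the histogram above

def classify(magnitude):
    """Label a recording 'ship' if its magnitude is above the threshold, else 'whale'."""
    return "ship" if magnitude > THRESHOLD else "whale"

print(classify(0.3))  # -> whale
print(classify(1.2))  # -> ship
```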

A more principled approach

Of course, there are some problems. One is that histograms are very sensitive to bin width and placement. Different histogram bins with the same data can cause the threshold to be in a different place:

Oops, the same data with modified histogram bins gives a different decision threshold.

One way to fix this is to be more rigorous about hypothesis testing. The conditional probability notation here is pretty easy to read, but might be unfamiliar:
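P(H_1 | x) is read as “the probability that hypothesis 1 is true, given the measurement x”; the vertical bar is read “given”.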

What we want to know is which hypothesis is more likely, which in probability terms can be written as
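choose H_1 (marine mammal) if P(H_1 | x) > P(H_2 | x), and choose H_2 (cargo ship) if P(H_2 | x) > P(H_1 | x)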

That is pretty wordy, so a shorthand was invented to describe the hypothesis test:
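P(H_1 \mid x) \;\overset{H_1}{\underset{H_2}{\gtrless}}\; P(H_2 \mid x)

The stacked inequality is read: choose H_1 if the left-hand side is larger, and H_2 if the right-hand side is larger.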

The problem is that we don’t actually know those probabilities! So we are going to start making some simplifying assumptions.

First, we assume that both hypotheses are equally likely (at least, before we know our measurement x). This might be a reasonable assumption when we have some historical data that suggests it, or when we have no idea about the relative frequency.

Second, we assume that we want to minimize the total number of errors.

Given these assumptions (and some math), the hypothesis test can be written as
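p(x \mid H_1) \;\overset{H_1}{\underset{H_2}{\gtrless}}\; p(x \mid H_2)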

Notice that the terms are reversed from the hypothesis test above. Now we are using the probability of the measurement given that the hypothesis is true. This is much easier!

It is worth spending a minute on this. We want to know how likely it is that a specific sound came from a ship or a whale. What we are saying here is that if we can figure out how likely the measurement is given each hypothesis, that is good enough. In other words, if we know the distribution of loudness for each hypothesis, we can build our test!

Because we have labeled samples from each class, we can estimate the distribution of loudness for each hypothesis. In fact, that is exactly what we did with our histogram decision boundary — so that wasn’t just a good guess for an algorithm, but it was also pretty close to an error minimizing classifier.

To write down the classifier in closed form, we need one more assumption — that the probability distributions are Gaussian (also called normal distribution or bell curve). This is true surprisingly often because of a really elegant result called the central limit theorem.

To estimate the Gaussian distribution given some data, you just calculate the mean and standard deviation under each hypothesis. In our example, that looks like this:

Gaussian distributions for whales (blue) and ships (red) with the decision boundary (black).

One nice thing about this classifier is that there is a simple, closed-form expression for it (which I am leaving out to minimize the math — I think the pictures are more clear).
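If you would rather see it as code than as an equation, a minimal sketch of the idea looks like this (the arrays of labeled magnitudes are stand-ins for real training data):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
# Stand-in labeled training data (magnitude feature only).
whale_mags = rng.normal(loc=0.5, scale=0.3, size=100)
ship_mags = rng.normal(loc=1.1, scale=0.3, size=100)

# "Training" is just estimating a mean and standard deviation per class.
whale_mu, whale_sigma = whale_mags.mean(), whale_mags.std()
ship_mu, ship_sigma = ship_mags.mean(), ship_mags.std()

def classify(magnitude):
    """Pick whichever class's fitted Gaussian gives the new sample higher likelihood."""
    p_whale = norm.pdf(magnitude, whale_mu, whale_sigma)
    p_ship = norm.pdf(magnitude, ship_mu, ship_sigma)
    return "whale" if p_whale > p_ship else "ship"

print(classify(0.4), classify(1.3))  # -> whale ship
```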

Measuring classifier performance

So how well does this classifier do? There are four cases to consider. First, the classifier correctly classifies marine mammals (which we have been calling hypothesis 1) in the blue shaded area because these recordings have a magnitude (loudness) that is below our threshold.

Blue shaded area represents the probability of correctly classifying marine mammals.

Similarly, the shaded red area below represents the correct classification of ships (hypothesis 2) because they are above the threshold.

Red shaded area represents the probability of correctly classifying ships.

Our classifier isn’t perfect — some of the marine mammals are louder (higher magnitude recording) than our threshold, so are incorrectly classified as ships.

Blue shaded area represents the probability of incorrectly classifying whales as ships.

Similarly, some of the ship recordings are below the threshold, and so will be incorrectly classified as whales.

Red shaded area represents the probability of incorrectly classifying ships as whales.

These errors sometimes have odd names like “Type I Error” or “Missed Detection”, but they are all fundamentally the same thing — examples where the classifier we designed gets the label wrong.

Relaxing our assumptions

So we made three assumptions in getting here:

  • Hypotheses are equally likely.
  • We want to minimize errors.
  • The feature (magnitude) has a Gaussian distribution under both hypotheses.

The good news is that each of these assumptions can be relaxed. First, let’s imagine that we expect to classify twice as many recordings of ships as of whales. We can just scale each hypothesis’s probability curve in proportion to how likely that hypothesis is!

Decision boundary when ships are twice as likely.

If the goal is still to minimize the total number of errors, the decision boundary should shift to the new crossing point. This is intuitive, because we are still labeling each signal with the most likely hypothesis.
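In symbols, this just means weighting each likelihood by its prior probability before comparing, with P(H_2) = 2 P(H_1) in this example:

P(H_1)\, p(x \mid H_1) \;\overset{H_1}{\underset{H_2}{\gtrless}}\; P(H_2)\, p(x \mid H_2)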

Unfortunately, in this example, it looks like we are only going to label half of the whales correctly. Sometimes it turns out that the errors don’t have the same importance, meaning that incorrectly labeling a whale as a ship might cause more damage than incorrectly calling a ship a whale. If that is the case, there is again a closed form solution, and again it is just moving the threshold, this time to increase the total number of errors, but decrease the number of misclassified whales. This is eliminating our second assumption.

The decision boundary has moved, increasing the total error rate, but decreasing the number of incorrectly classified whales.
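One common way to write down this cost-weighted version is to also multiply each side by the cost of making that kind of mistake; call C_1 the cost of mislabeling a whale as a ship and C_2 the cost of mislabeling a ship as a whale (names chosen here just for illustration):

C_1\, P(H_1)\, p(x \mid H_1) \;\overset{H_1}{\underset{H_2}{\gtrless}}\; C_2\, P(H_2)\, p(x \mid H_2)

Making C_1 larger shifts the threshold so that fewer whales get labeled as ships, at the price of more total errors.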

Let’s go back to including those assumptions, so things look like this again:

Our classifier!

How well does it really work?

Now, let’s run this on some data! Often when designing classifiers we will use some of the data to train the classifier (in this case find the means and standard deviations to determine our threshold), and then use the remainder to make sure the classifier is working. This is called cross validation, and is important because it is hard to tell how well a classifier is working when we used the exact same data to design the classifier!
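As a sketch of that idea with scikit-learn’s train/test split helper (stand-in data again; “training” the Gaussian classifier is just estimating a mean and standard deviation per class from the training split):

```python
import numpy as np
from scipy.stats import norm
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
# Stand-in labeled data: magnitudes plus labels (0 = whale, 1 = ship).
mags = np.concatenate([rng.normal(0.5, 0.3, 100), rng.normal(1.1, 0.3, 100)])
labels = np.concatenate([np.zeros(100), np.ones(100)])

# Hold out a third of the labeled data purely for checking the classifier.
train_x, test_x, train_y, test_y = train_test_split(
    mags, labels, test_size=0.33, random_state=0)

# "Train": estimate a Gaussian for each class from the training split only.
params = {c: (train_x[train_y == c].mean(), train_x[train_y == c].std())
          for c in (0, 1)}

def classify(magnitude):
    return max(params, key=lambda c: norm.pdf(magnitude, *params[c]))

accuracy = np.mean([classify(m) == y for m, y in zip(test_x, test_y)])
print(f"held-out accuracy: {accuracy:.2f}")
```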

In this case, it turns out the error rate is higher than we expected given the model above.

Cross validation gives higher error than we expected. Area shaded red is ships that were incorrectly labeled as whales, and the blue shaded area is whales that were incorrectly labeled as ships.

What could be causing this? Well, in general it could be a lot of things including incorrect estimations of the Gaussian parameters, data that changes over time (non-stationarity), or a bad model. In this case, it turns out that our third assumption, that the distribution of signal magnitude is Gaussian, was wrong. I can tell you that because this is a synthetic example, and the actual distribution looks like this:

Actual distribution of magnitudes for whales (blue) and ships (red).

So why does this cause our error rates to be different than we expected? Well, because we put the threshold in the wrong place. The green line is where we incorrectly placed the threshold based on the Gaussian assumption, and the black line is the optimal decision boundary to minimize the total number of errors.

Black line shows the correct decision boundary for this case, while the green line is what we estimated when we incorrectly assumed both signals were Gaussian.

So how do we know what the right distribution is? Well, sometimes we have a guess from the physics of the situation. Gaussian distributions, for example, often arise when many independent factors contribute to the total measurement in an additive way (this is called the Central Limit Theorem).

Another good way to tell is just to look at the data. Scroll up to that histogram above. The red line doesn’t look like a bell curve to me!

Estimating probability density functions

So what do you do when you don’t have a good guess about the distribution? I am fond of a non-parametric approach called “kernel density estimates” (of course, there are lots of other methods). Here is how they work:

You take the first data point (the blue circle on the x-axis), and you draw a Gaussian centered on it like this:

Then you do the same thing with all the other data points:

Finally you sum them up:

That’s it! The width (variance / standard deviation) of the individual Gaussians can be found with a simple formula, and depends on the number of data points and how spread out they are. So what do we get for our problem?

The dashed lines are the true, underlying probability distributions, and the solid lines are the kernel density estimates with a small data set. I like the kernel density estimation approach because it supports multi-modal distributions and has the property that it approaches the true distribution given infinite data.
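If you want to try it yourself, scipy has an implementation that picks the kernel width for you; a minimal sketch (with stand-in data for one class) might look like:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(2)
# Stand-in magnitudes for one class; each point contributes one Gaussian "bump".
samples = rng.exponential(scale=0.5, size=50)

kde = gaussian_kde(samples)      # kernel width chosen automatically (Scott's rule)
grid = np.linspace(0, 3, 200)
density = kde(grid)              # the summed Gaussians, evaluated on a grid

print(grid[np.argmax(density)])  # location of the estimated peak
```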

In this example we do get a decision boundary that is closer to optimal than when we assumed everything was Gaussian, but it still has a high error rate — lots of ships and whales are incorrectly classified.

So is the problem hopeless? Well, obviously not. For one thing, we can listen to the recordings and immediately tell the difference. And part of the reason is that we listen to more than just how loud the signals are! Finding the right features to put into a classifier (or other machine learning algorithm) is often more important than the actual algorithm that is used.

Adding features

When we listen to our recordings of whales and ships, one difference that jumps out is the frequency content. There are many ways to summarize this as a feature, but for a first pass we will just take a Fourier transform of the signal and pick the peak, which is the strongest single component. Now our data looks like this:

Scatter plot of labeled whale (blue) and ship (red) data with two features, the magnitude and frequency.
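As a concrete sketch of that feature extraction step, here is one reasonable way to compute the two features with numpy (using the RMS of the samples as the loudness measure is a choice, not something dictated by the problem):

```python
import numpy as np

def extract_features(recording, sample_rate):
    """Return (magnitude, peak_frequency) for a 1-D audio signal."""
    magnitude = np.sqrt(np.mean(recording ** 2))             # RMS loudness
    spectrum = np.abs(np.fft.rfft(recording))
    freqs = np.fft.rfftfreq(len(recording), d=1.0 / sample_rate)
    peak_frequency = freqs[np.argmax(spectrum)]              # strongest single component
    return magnitude, peak_frequency

# Quick check with a synthetic 100 Hz tone sampled at 8 kHz.
t = np.arange(0, 1, 1 / 8000)
print(extract_features(np.sin(2 * np.pi * 100 * t), 8000))  # peak should be ~100 Hz
```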

If we were to collapse these data to just magnitude and make a distribution of that, we would see the same plots as above. This process is called marginalizing. Here it is for magnitude with the true distributions shown.

Our scatter plot with marginal distributions for magnitude overlaid. The axis label for the probability distributions is not shown.

And here is the feature scatter plot with the marginal probability distributions for frequency shown:

Our scatter plot with marginal distributions for frequency overlaid. The axis label for the probability distributions is not shown.

We could design two classifiers, one for each feature, and combine them somehow to make a classifier that uses them both. A small wrinkle is that with our frequency feature the Gaussians have the same mean, so we would end up with two decision boundaries — choose whale (blue) when the frequency is high or low, and choose ship when it is in between.

But we can do better! Notice that for the whales (blue) our two features are highly correlated (not independent). When the frequency is high the magnitude is also high, and when the frequency is low the magnitude is also low. To be explicit, the black circles in the plot below show what the scatter plot would look like, with the same marginal feature distributions, if the two features were in fact uncorrelated.

Whale samples (blue) are correlated. For comparison, the black circles have the same distribution in magnitude and frequency, but are uncorrelated.

We can use the correlation between the features to our advantage, and build a better classifier by using the features together!

Maximum likelihood classification

Jumping ahead, if we know the distributions exactly and make the assumptions about the classes being equally likely and wanting to minimize total errors (misclassifications), we can draw the optimal classifier boundary (called a likelihood ratio test):

Optimal decision boundary to minimize total errors is shown in black.

This decision boundary is the best we can do given these features — it is optimal in the sense that it will minimize the total number of errors in classifying future unlabeled samples.
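Here is a sketch of how such a likelihood ratio test might be implemented when the class densities have to be estimated from data, using a two-dimensional kernel density estimate per class (the numbers below are made-up stand-ins, with the whale features deliberately correlated):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(3)
# Stand-in 2-D features, one array per class, shaped (feature, sample):
# row 0 is magnitude, row 1 is peak frequency.
whales = rng.multivariate_normal([0.5, 200], [[0.05, 1.5], [1.5, 900]], 200).T
ships = rng.multivariate_normal([1.1, 200], [[0.05, 0.0], [0.0, 400]], 200).T

whale_density = gaussian_kde(whales)  # estimate of p(x | whale)
ship_density = gaussian_kde(ships)    # estimate of p(x | ship)

def classify(magnitude, frequency):
    """Likelihood ratio test with equal priors: pick the class with the higher density."""
    point = np.array([[magnitude], [frequency]])
    return "whale" if whale_density(point)[0] > ship_density(point)[0] else "ship"

print(classify(0.6, 230), classify(1.2, 180))
```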

Of course, we usually can’t find the optimal decision boundaries because we don’t know the underlying distributions and how they are correlated with each other. The relationships can even be nonlinear, and when we use many features simultaneously, estimating them all accurately is hopeless.

So what do we do? How can we guess, given this data, what the decision boundaries might be? This is the heart of classification, and is where many of the machine learning algorithms you have probably heard of like neural networks, support vector machines, deep learning and more come in.

Linear classification

The first thing we might try is finding the best straight line to draw — the one that minimizes the number of labeled training samples that are on the wrong side of the line. This is what very simple neural networks do, and modern classifiers like support vector machines build on this general idea. When there are more feature dimensions we fit what is called a hyper-plane, which is just a perfectly straight decision boundary in more dimensions.
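With scikit-learn, fitting a linear decision boundary takes only a few lines. Here is a sketch with stand-in feature arrays, using a linear support vector machine as the line-fitter (one reasonable choice among several):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(4)
# Stand-in features: columns are (magnitude, peak frequency); labels 0 = whale, 1 = ship.
X = np.vstack([rng.normal([0.5, 150], [0.3, 60], (100, 2)),
               rng.normal([1.1, 150], [0.3, 30], (100, 2))])
y = np.array([0] * 100 + [1] * 100)

# Scale the features, then fit a straight-line (hyper-plane) decision boundary.
linear_clf = make_pipeline(StandardScaler(), LinearSVC())
linear_clf.fit(X, y)
print(linear_clf.predict([[0.4, 120], [1.3, 160]]))  # predicted labels for two new recordings
```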

You can see the result for the optimal linear classifier below. While it gives quite a different shape than the true optimal decision boundary shown above, it is clear it does a pretty good job of separating the classes — most of the whales are on one side of the line and most of the ships on the other.

Optimal linear classifier.

Nearest Neighbor

Let’s try something totally different — a surprisingly simple and powerful approach called “nearest neighbor classification”. Given a new sample to label, we just look through the labeled samples we already have, find the one that is closest, and copy its label. That’s it! Nearest neighbor classification has the surprising property that, given infinite data, it converges to an error rate no worse than twice that of the optimal likelihood ratio test. In practice, it often works shockingly well given how simple the approach is.
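In scikit-learn this is a two-line classifier. A sketch with a tiny stand-in training set (in a real problem you would also want to scale the features so one axis does not dominate the distance):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Tiny stand-in training set: (magnitude, peak frequency) pairs and their labels.
X = np.array([[0.4, 100], [0.6, 220], [0.5, 160],    # whales
              [1.1, 150], [1.3, 170], [0.9, 140]])   # ships
y = np.array(["whale"] * 3 + ["ship"] * 3)

nn = KNeighborsClassifier(n_neighbors=1)  # "nearest neighbor" is just k = 1
nn.fit(X, y)
print(nn.predict([[0.45, 105], [1.25, 168]]))
```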

Here is a plot of the decision boundaries we end up with for nearest neighbor classification:

Decision boundaries for nearest neighbor classification. Simply assign the class from the labeled data point that is closest to the one you are trying to classify!

Decision Trees

Let’s look at one more way to draw the decision boundaries for this problem: decision trees. Glossing over the math, decision trees build up a series of decision boundaries that each operate on a single feature at a time. In our case that means decision trees can only draw vertical and horizontal lines, but because of the tree structure those lines can be nested.

Here is the top node in our decision tree, which assigns whale if the magnitude is less than ~-0.2, and ship if it is above. Obviously, this isn’t a good classifier!

Top node in our decision tree classifier.

The beauty of the decision tree is that we now take each half of our region and split it along the other feature, like this:

Decision tree with two levels.

We can then continue to nest additional decisions and end up with a more complex decision boundary:

Our complete decision tree classifier.

Each rectangular area is assigned the class matching the majority of the training data in that rectangle. While this decision boundary looks kind of odd, if we overlay it on our optimal (maximum likelihood) classifier from before, we see that it does a surprisingly good job of matching it, given the limitation of only using perpendicular lines.

Decision tree class boundaries (blue) overlaid on the optimal maximum likelihood decision boundaries (black).
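Fitting such a tree with scikit-learn is again just a few lines. A sketch with stand-in arrays, limiting the depth so the rectangles stay simple:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(5)
# Stand-in features (magnitude, peak frequency) and labels (0 = whale, 1 = ship).
X = np.vstack([rng.normal([0.5, 150], [0.3, 60], (100, 2)),
               rng.normal([1.1, 150], [0.3, 30], (100, 2))])
y = np.array([0] * 100 + [1] * 100)

# Each split thresholds a single feature, so the boundaries are axis-aligned lines.
tree = DecisionTreeClassifier(max_depth=3)
tree.fit(X, y)
print(tree.predict([[0.4, 120], [1.3, 160]]))  # predicted labels for two new recordings
```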

Generalization

We have now seen four different ways of dividing up our feature space — maximum likelihood, linear decision boundary, nearest neighbor and decision tree. There are tons of other classifiers, but a big secret of supervised classification is that they all do this same basic job — dividing up our feature space so new data can be automatically labeled.

Some of you might be wondering if we can just draw an arbitrarily complex decision boundary and get 100% accuracy, like this:

A “perfect” classifier that will almost certainly perform badly in the real world.

The problem is that this classifier is likely to perform really badly on new data — it is overfitted. There are many techniques to ensure that our classifiers will generalize well to new data, and the most common is cross validation, where some of our labeled data is not used in training the classifier, but rather in checking how well it actually works.

Classification is easy

And that’s it! At least conceptually, that is the basis of using machine learning to leverage a set of labeled data to guess labels on new data.

Of course, there are a ton of challenges. All too often data scientists focus on choosing the right classifier for the problem, handling the complexities of larger feature spaces, computational complexity, etc. As we saw above, choosing the right features is often more important, which is part of why data access and cleanliness are such a big part of data science today.

There are also engineering challenges — machine learning is often hard to ship and maintain in a production environment because it has more complex data dependencies and other needs than typical code. Finally, there are the business challenges, namely finding a problem that will benefit from classification (often with imperfect results) and figuring out how to get labeled data to start with. Frankly, I have seen business and engineering problems kill many more machine learning projects than accuracy and algorithms have.

But in the end, it really is as simple as picking some features and drawing a line. The math and jargon should not scare you off — the libraries and tools to do classification are getting more accessible all the time.

Next Steps

If you want to see all of the math behind these methods and develop your intuition, I highly recommend the book “Pattern Classification” by Duda, Hart and Stork. It is extremely clear, develops intuition well, and introduces the more unusual math that is needed gently. Ignore the fact that the most recent edition is almost 20 years old — linear regression hasn’t changed one bit, and I haven’t seen a better book on the basics yet!

If you want to jump in and start using these methods, I highly recommend Jupyter + Python 3 + scikit-learn. Other technology stacks will work too, but I really like that one.

Good luck!

I am a data scientist at Seismic Software and lecturer in the Halıcıoğlu Data Science Institute at UC San Diego. http://www.cjemmott.com/