PAC Learning Theory for the Everyman

An uncomplicated introduction to the theory behind supervised machine learning

Allison Kelly
The Startup
4 min read · Apr 20, 2020


If you asked someone exactly how sure they were of their answer to a question you posed, and they responded with “My answer is probably approximately correct,” you might not have much confidence that they know what they are talking about. But PAC Learning Theory, or Probably Approximately Correct Learning Theory, is the foundation on which the learning part of machine learning is built.

First described in 1984 by world-renowned computer scientist Leslie Valiant, PAC learning theory explains how machine learning algorithms arrive at their results in much the same way that children learn the differences between cats and dogs.

The natural phenomenon of young children learning is extraordinary. A spectacular facet of this learning is that, beyond remembering individual experiences, children will also generalize from those experiences, and very quickly. After seeing a few examples of apples or chairs, they know how to categorize new examples. Different children see different examples, yet their notions become similar. When asked to categorize examples they have not seen before, their rate of agreement will be remarkably high, at least within any one culture. Young children can sort apples from balls even when both are round and red. — Leslie Valiant, Probably Approximately Correct: Nature’s Algorithms for Learning and Prospering in a Complex World

But just as chairs, apples, dogs, and cats differ in myriad and often complex ways, the input data can vary wildly when you fit a machine learning model to make predictions. The algorithm tries to reach the true conclusion, but making perfect predictions is unattainable. The predictions are therefore approximately correct.

The number of datapoints fed into your model also affects the accuracy of its predictions. Your model may correctly predict that the round, orange fruit grown in Florida is an orange, but incorrectly predict that the round, mostly green fruit grown in California is an apple, when in truth oranges can be green when unripe and California produces more citrus than any state outside of Florida. The more instances of apples and oranges you feed into your model, the more trustworthy its results become. With fewer instances your model is more susceptible to randomness and noise in your data, while a large sample size narrows the confidence interval around your population mean. A confidence interval in statistics is a range of probable values within which the true population mean can be estimated.
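To see the sample-size effect concretely, here is a minimal Python sketch (the population and sample sizes are made up purely for illustration) that estimates a mean from progressively larger samples and prints how the roughly 95% confidence interval narrows:

```python
import random
import statistics

random.seed(0)

# Hypothetical "population": fruit weights in grams, centered near 150.
population = [random.gauss(150, 20) for _ in range(1_000_000)]

for n in (10, 100, 10_000):
    sample = random.sample(population, n)
    mean = statistics.mean(sample)
    sem = statistics.stdev(sample) / n ** 0.5   # standard error of the mean
    half_width = 1.96 * sem                     # ~95% confidence interval
    print(f"n={n:>6}  mean={mean:7.2f}  "
          f"95% CI ≈ ({mean - half_width:.2f}, {mean + half_width:.2f})")
```

With ten points the interval is several grams wide; with ten thousand it pins the mean down to a fraction of a gram.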

The larger the dataset, the more confident we can be in our estimate of the true value of whatever we’re modeling, but we can never be 100% confident. You may train your model on 100 million datapoints, and new data may still contain instances that never occurred in the training data and therefore cannot be predicted. That’s why the results are probably approximately correct!
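The article keeps things informal, but PAC theory makes both words precise. For a finite set of candidate hypotheses (in the idealized case where one of them is exactly right), the standard textbook bound says that roughly (1/ε)(ln|H| + ln(1/δ)) training examples are enough so that, with probability at least 1 − δ (the “probably”), the learned hypothesis has true error at most ε (the “approximately correct”). A small, purely illustrative sketch of that bound:

```python
from math import ceil, log

def pac_sample_size(hypothesis_count: int, epsilon: float, delta: float) -> int:
    """Classic PAC bound for a finite hypothesis class (realizable case):
    m >= (1/epsilon) * (ln|H| + ln(1/delta)) examples suffice so that, with
    probability at least 1 - delta, the learned hypothesis has error <= epsilon."""
    return ceil((log(hypothesis_count) + log(1 / delta)) / epsilon)

# Example: a million candidate hypotheses, 5% error, 99% confidence.
print(pac_sample_size(1_000_000, epsilon=0.05, delta=0.01))  # -> 369
```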

Unfortunately, we can’t completely ignore the mathematical portion of PAC learning theory, but I’ll relay it in very simple terms.

We know that f(x) = y, and from examples of x and y we can estimate a function h(x), which is as close as we can get to knowing the true function f(x). Our hypothesis h(x) may be successful or unsuccessful, where success means the error between f(x) and h(x) falls within the range deemed acceptable by the predefined confidence bounds. The values x used to train your model each have a certain probability of occurring, described by a probability distribution D whose total probability equals 1. The error with which h approximates the true function f is measured with respect to D: it is the probability that an x drawn from D is one where h and f disagree. The mathematical notation is as follows:
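error(h) = \Pr_{x \sim D}\left[\, h(x) \neq f(x) \,\right]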

In plain English, the formula states that the error of our hypothesis function h is the probability that an x drawn from the probability distribution D is one where our hypothesis function h does not equal the true function f. To learn more about the mathematical theory behind PAC learning, check this out.
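To ground that definition, here is a toy Python sketch (f, h, and D are all made up for illustration) that estimates the error of a hypothesis by sampling inputs from D and counting how often h disagrees with f:

```python
import random

random.seed(1)

# Toy setup, all made up for illustration: D is uniform on [0, 10],
# f is the true labeling rule, and h is an imperfect learned hypothesis.
def f(x):
    return x > 5.0      # "true" rule

def h(x):
    return x > 5.5      # learned rule with a slightly wrong threshold

# Estimate error(h) = Pr_{x ~ D}[h(x) != f(x)] by sampling from D.
samples = [random.uniform(0, 10) for _ in range(100_000)]
error = sum(h(x) != f(x) for x in samples) / len(samples)
print(f"estimated error(h) ≈ {error:.3f}")   # close to the true value 0.05
```

The two rules disagree only on the slice between the thresholds, so the estimate should land near 0.05.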

So there you have it — a somewhat distilled version of a very complex subject, PAC Learning Theory. Keep in mind that while the foundations of computer science and machine learning can be quite daunting, being able to understand the rudimentary aspects can help those with varying levels of education be successful in the field.

Allison Kelly
The Startup

Growth Strategy @ Botmock :: Combining the logic of data science with the fluidity of marketing