No free lunch theorem, lookup tables, and smoothness assumption

Joseph Gardi
May 28, 2022

Please install the Math Anywhere Chrome plugin to see the equations rendered, or better yet, read the version on my website: https://jg-blog.surge.sh/blog/no-free-lunch

This is part of a series but can be read on its own.

A theory that explains everything, explains nothing — Karl Popper¹

An example of smoothing. From https://i.stack.imgur.com/0sdj2.png

The No Free Lunch Theorem was first proved by David Wolpert and William Macready in 1997. In simple terms, it states that no single learning algorithm can generalize successfully for all datasets. The implication is that it is crucial to choose the right learning algorithm for your dataset.

Generalizing means making predictions from past data points about data points that you haven't seen before. It is easy to see that it is impossible to generalize without making assumptions about the data. If I tell you that f(0) = -1.2, f(1) = -2, f(2) = -1, f(2.2) = -2, f(3) = 0, can you mathematically prove what the values of f(2.5) or f(4) are? You clearly can't unless you make some assumption about the function f. One of the most commonly applied assumptions in machine learning is the smoothness assumption: similar inputs have similar outputs. A useful interpretation of smoothness is that the function's derivative tends to be small. By applying the smoothness assumption to f, we can infer that f(2.5) should be somewhere between f(2.2) (which we observed is -2) and f(3) (which we observed is 0). So we expect f(2.5) to be about -1. In machine learning, we call these assumptions for generalizing the inductive biases of the learning algorithm.
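To make the estimate concrete, here is a minimal sketch (my own illustration, assuming NumPy; linear interpolation is just one simple way to encode the smoothness assumption):

```python
import numpy as np

# The observed values of the unknown function f from the example above.
xs = np.array([0.0, 1.0, 2.0, 2.2, 3.0])
ys = np.array([-1.2, -2.0, -1.0, -2.0, 0.0])

# Under the smoothness assumption, nearby inputs should have nearby outputs,
# so one simple guess for f(2.5) is to interpolate between its neighbors.
print(np.interp(2.5, xs, ys))  # -1.25, between f(2.2) = -2 and f(3) = 0
```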

If the set of possible inputs is discrete and bounded so that there is a finite number, n, of possible input values, then we only need to know the value of f at those n values and store it in a lookup table. That's called memorizing the data, and no generalization is required in that case. We can solve many problems with ease this way, but it doesn't work for continuous values or when n is unreasonably large. Since ML happens on computers, all the supposedly continuous values will be floating-point numbers. These floating-point numbers are most often 32-bit, so there are at most n = 2³² possible values. However, 2³² is unreasonably large for a lookup table: it would require billions of data points. Therefore, it is often necessary to generalize.
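For a small, discrete input space, memorizing is as simple as filling in a dictionary. The sketch below is my own illustration rather than code from this post:

```python
# Memorize a function over a small, discrete input space: no generalization needed.
lookup_table = {}

def memorize(pairs):
    # Store every observed (input, output) pair verbatim.
    for x, y in pairs:
        lookup_table[x] = y

def predict(x):
    # Perfect for inputs we have already seen; fails (KeyError) for anything unseen.
    return lookup_table[x]

memorize([(0, -1.2), (1, -2), (2, -1), (3, 0)])
print(predict(2))  # -1
```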

The smoothness assumption matches physics well. While many things (perhaps all things) are discrete at the lowest level, the classical physics that describes the macro-level world we're all used to has mostly smooth equations. Moreover, Einstein showed that velocities are bounded by the speed of light. This means that given a small $\epsilon > 0$, we can show that for any object's position along any axis $x$, $|x(t + \epsilon) - x(t)| < c \cdot \epsilon$ where $c$ is the speed of light. Equivalently, $|x'(t)| < c$. In practice, most of the velocities we deal with are much smaller than the speed of light. So position is smooth with respect to time.

One of the simplest and most general inference algorithms using the smoothness assumption is quantization plus a lookup table. It can approximate any smooth function with arbitrarily high precision given sufficient data; it is a universal approximator of smooth functions. The quantization discretizes the inputs by multiplying the input numbers by some scale $s$ and then rounding to the nearest integer. If our inputs are all within some bounded range from $a$ to $b$, then the total number of possible values for the quantized inputs is $s(b-a)$. A lookup table applied to the quantized data therefore only has to store $s(b-a)$ possible values. So we can save memory at the expense of accuracy by shrinking the scale $s$. Meanwhile, a larger scale results in more accurate predictions but requires a bigger lookup table. With enough training data and a sufficiently large lookup table, any smooth function can be modeled with whatever precision is needed.
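A quantized lookup table might look roughly like the sketch below. The function names are my own, and the -10 default is chosen to match the behavior of the plots described next:

```python
def quantize(x, scale):
    # Map x to an integer bin index; inputs within roughly 1/scale of each
    # other end up sharing a single lookup-table entry.
    return int(round(x * scale))

def fit(xs, ys, scale):
    # With inputs bounded between a and b, the table holds about scale * (b - a)
    # keys: a smaller scale saves memory at the cost of accuracy.
    return {quantize(x, scale): y for x, y in zip(xs, ys)}

def predict(table, x, scale, default=-10):
    # Fall back to an arbitrary default when a quantized input was never seen.
    return table.get(quantize(x, scale), default)
```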

To illustrate, we model the function f(x) = x * sin(x). The point is to make predictions at points we have not already seen, so I take a sample of 500 points as training data from which our learning algorithm learns. The code for making this plot is below the plot.

The blue curve is the true function we want to model. The orange is the output of the lookup-table predictor with quantization scales of 1, 2, and 50, from left to right. The flat line at -10 in the plot on the right comes from inputs that were not found in the lookup table.
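A minimal sketch of the experiment (the input range, random seed, and variable names are my own assumptions; it uses NumPy and Matplotlib):

```python
import numpy as np
import matplotlib.pyplot as plt

def f(x):
    return x * np.sin(x)

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 20, size=500)  # 500 training points
y_train = f(x_train)
x_test = np.linspace(0, 20, 2000)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, scale in zip(axes, [1, 2, 50]):
    # Build the lookup table on quantized training inputs.
    table = {int(round(x * scale)): y for x, y in zip(x_train, y_train)}
    # Predict, falling back to -10 when a quantized input was never seen.
    y_pred = [table.get(int(round(x * scale)), -10) for x in x_test]
    ax.plot(x_test, f(x_test), label="true f(x) = x sin(x)")
    ax.plot(x_test, y_pred, label=f"lookup table, scale={scale}")
    ax.legend()
plt.show()
```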

Definition: no free lunch theorem

No learning algorithm can generalize on every dataset, because a learning algorithm can only generalize on datasets that fit its inductive biases. Generalizing means making accurate predictions for data it hasn't seen before.

Bias-variance tradeoff

A more intuitive name for our purposes could be specificity-generality rather than bias-variance tradeoff. The bias side of the tradeoff corresponds to learning algorithms with stronger inductive biases. The drawback to learning algorithms with strong inductive biases is that they only generalize if the inductive biases are true for the data you apply them to. Inductive biases are the learning algorithm's assumptions, and if your assumptions are wrong, then the inferences that follow will be wrong. Learning algorithms on the variance side of the bias-variance tradeoff require a large amount of data to learn but are more general in that they can generalize well on a wider variety of datasets when given enough data. The lookup table is the most extreme case of a learning algorithm on the variance side of the tradeoff. It has no inductive biases and works on everything, but it requires a data sample for each possible input. Combining the lookup table with quantization brings it closer to the bias side of the tradeoff, and shrinking the quantization scale (making the bins bigger) brings it further towards the bias side. Also, see the detailed definitions of bias and variance from classical statistics, but those definitions are not needed for the rest of this series.

Bayes' theorem helps us understand this tradeoff more precisely. The probability of some hypothesis $H$ (a hypothesis represents our inductive biases) given some data $D$ is

eq. 1: $P(H|D) = \frac{P(D|H)P(H)}{P(D)}$

Bayesians call $P(D|H)$ the evidence for the hypothesis because Bayes' theorem shows that the probability of the hypothesis given the data is proportional to $P(D|H)$. We can't fully calculate $P(H|D)$ because we don't know what $P(H)$ or $P(D)$ are. In fact, fully calculating $P(H|D)$ would violate the no free lunch theorem. However, we do know that the sum of $P(D'|H)$ over all possible datasets $D'$ must equal 1: $\sum_{D'} P(D'|H) = 1$. Therefore, $P(D|H) = 1 - \sum_{D' \neq D} P(D'|H)$, where $\sum_{D' \neq D}$ represents a sum over all possible datasets other than $D$. So in order for the evidence to be high, the probability of datasets other than $D$ must be low. Therefore, no hypothesis can fit the given dataset well while also fitting all other datasets well.
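As a toy numerical illustration of this constraint (entirely my own construction, with a small made-up set of four possible datasets):

```python
import numpy as np

# Each hypothesis assigns a likelihood P(D'|H) to every possible dataset,
# and those likelihoods must sum to 1 over all datasets D'.
specific_H = np.array([0.85, 0.05, 0.05, 0.05])  # strong inductive bias toward dataset 1
general_H = np.array([0.25, 0.25, 0.25, 0.25])   # spreads its evidence evenly

assert np.isclose(specific_H.sum(), 1.0)
assert np.isclose(general_H.sum(), 1.0)

# The specific hypothesis buys high evidence on the first dataset only by giving
# up evidence on the others; the general hypothesis cannot strongly favor any.
```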

See some visual examples of the bias-variance tradeoff in the video below.

Understanding the mathematical background of the bias-variance tradeoff is advised but not crucial for this series. I would start with the bias-variance decomposition of the error.

Remark 1. We become more certain that $H$ is true as we collect more data, as long as $H$ fits that data well.

If $D$ is a matrix of $n$ independently sampled points, then we can expand eq. 1 as

$P(H|D) = P(H)\prod_{i=1}^n \frac{P(D^i|H)}{P(D^i)} = P(H) \cdot M^n$

where $M$ is the geometric mean of the ratios $P(D^i|H)/P(D^i)$ for each $i$. If $P(D|H) > P(D)$, then the geometric mean $M$ must be greater than 1. This happens when the evidence is relatively high. In that case, $P(H|D)$ will increase exponentially with $n$, assuming that the geometric mean $M$ stays constant as $n$ increases. Increasing $n$ means collecting more data, and $M$ remaining constant means that the hypothesis $H$ is generalizing to the newly collected data. The upshot is that we become more confident that $H$ is true as we collect more data and see it continue to be accurate on the new data.
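A quick numerical sketch with made-up numbers shows the exponential effect:

```python
prior = 0.01  # P(H): an initially unlikely hypothesis
M = 1.2       # geometric mean of P(D^i|H) / P(D^i); greater than 1 means good evidence

for n in [1, 5, 10, 20, 25]:
    # P(H|D) = P(H) * M^n grows exponentially in the number of points n,
    # as long as H keeps fitting the newly collected data.
    print(n, round(prior * M ** n, 3))  # climbs from about 1% to about 95%
```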

Definition: regularize

Regularizing is like the verb form of the bias-variance tradeoff: it means modifying a learning algorithm to move it closer to the bias end of the tradeoff. For example, quantizing regularizes the lookup table.

Repeat Our Experiment with Less Training Data

Quantization divides the set of possible inputs into bins, and a larger quantization scale means smaller bins. The drawback is that a larger quantization scale requires a bigger lookup table and more data. If there is no data for one of the quantized values, then the predictor does not know what to do; in this implementation, it just arbitrarily returns -10. So in the plot on the right, where the quantization scale is 50, you see a lot of values at -10 because there was no corresponding value in the lookup table. There will be more details on this problem in a later post on the curse of dimensionality. The plot in the middle, with a quantization scale of 2, looks like a better option than the one on the right. If we repeat the same experiment with 50 data points rather than 500, then the one on the right misses almost everything:

Notice that the predictor fits less well in areas where the derivative is large because the smoothness assumption (which means the derivative should be small) is “less true” there.
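Reproducing that rerun with the sketch from earlier only takes a small change to the training set (again assuming the hypothetical setup from my previous code block):

```python
# Same experiment as before, but with only 50 training points.
x_train = rng.uniform(0, 20, size=50)
y_train = f(x_train)
# At scale = 50 the table needs on the order of 1,000 bins to cover the input
# range, so with only 50 samples most test inputs fall back to the -10 default.
```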

The no free lunch theorem raises interesting questions about the philosophy of science, which I address next in "How is Science Possible?".

I hope you found this helpful. Any feedback or suggestions for this would be appreciated! 🙏

¹Although most sources I checked attribute this quote to Popper, there is evidently some dispute about its origins. This citation itself comes from Gregory Bond. A search for this quote on Google Scholar gives many results; it is often used to summarize Popper's arguments for falsifiability.
