The Profound Possibilities of Knowledge Compression
Of the many intuitive ways to think about machine learning, viewing it as a way of compressing information is one of the most useful. We often get so lost in the world-changing abilities science fiction promises that we lose sight of this simple fact.
The idea is that machine learning is a compressor of knowledge. To take a literal example, we could, potentially, build a table of all possible inputs and all possible outputs and simply look up the answer. Reinforcement learning has actually used this as a premise with simple toy problems. But as the number of possible inputs and outputs grows, a lookup table becomes unreasonable to use. So we build functions that compress this knowledge, giving us better compute efficiency and smaller models. We can actually see this play out in real examples.
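To make the contrast concrete, here's a toy sketch (the mapping `y = 3x + 1` is just a stand-in for any input-to-output relationship): the lookup table stores one entry per input and grows with the domain, while the fitted function compresses the same mapping into two parameters.

```python
# A lookup table memorizes every input/output pair exactly, but its size
# grows with the domain. A fitted function "compresses" the same mapping
# into a handful of parameters. (Toy mapping: y = 3x + 1.)

def build_table(xs):
    return {x: 3 * x + 1 for x in xs}  # one stored entry per input

table = build_table(range(1000))       # 1000 entries for 1000 inputs

def fitted(x):
    return 3 * x + 1                   # two parameters, works on any input

assert table[42] == fitted(42) == 127
assert fitted(10**9) == 3 * 10**9 + 1  # the table has no entry for this
```

The table can only answer questions it has seen before; the function generalizes to any input, which is exactly the trade the rest of this post is about.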
Take one of the most basic models: the decision tree. These work like questionnaires, with branches chosen based on the answers given. The model walks through the questions and returns a value based on where it lands. They end up being a lot like piecewise functions, a topic American schoolchildren learn in the 7th grade.
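A tiny tree makes the questionnaire analogy obvious (the feature names and thresholds below are hypothetical, chosen only to show the structure):

```python
# A small decision tree is literally a piecewise function: each branch
# asks a question about a feature, each leaf returns an answer.
# (Hypothetical thresholds, for illustration only.)

def tree_predict(temperature, humidity):
    if temperature > 25.0:
        if humidity > 0.7:
            return "stay inside"
        return "go swimming"
    return "wear a jacket"

assert tree_predict(30.0, 0.8) == "stay inside"
assert tree_predict(30.0, 0.3) == "go swimming"
assert tree_predict(10.0, 0.5) == "wear a jacket"
```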
We can then look at more complex functions like support vector machines, also called SVMs. Yann LeCun has chided these as “glorified line fitting”, and in a way he’s right. SVMs work by finding the line that best separates two known classes of points. Usually this plays out in high dimensions, with each feature forming its own dimension. New data falls on one side of this line or the other, and it’s then categorized as belonging to that particular group.
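Once the separating line is found, classification is just checking which side a point lands on. In the sketch below the line is hard-coded rather than actually fit, since the point is the compression: the entire decision boundary is stored as a handful of numbers.

```python
# "Glorified line fitting": after training, an SVM's knowledge is just
# the separating line. Here w and b are assumed, not actually trained.

w = (1.0, 1.0)   # normal vector of the separating line x + y = 5
b = -5.0         # offset

def classify(x, y):
    side = w[0] * x + w[1] * y + b   # sign says which side of the line
    return "group A" if side > 0 else "group B"

assert classify(4.0, 4.0) == "group A"   # above the line
assert classify(1.0, 1.0) == "group B"   # below the line
```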
More complex models still work this way, with ensembles like random forests fitting many different functions and then picking the result where the majority agrees. The simpler functions could be decision trees: ten decision trees are trained, and if six of them say new data falls under some label, then the new data is predicted as belonging to that label.
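The voting step itself is tiny. Here stand-in functions play the role of trained trees, mirroring the six-out-of-ten example above:

```python
# Majority voting over an ensemble: each "tree" casts a vote and the
# most common label wins. Stand-in lambdas play the role of real trees.
from collections import Counter

def vote(trees, x):
    labels = [t(x) for t in trees]
    return Counter(labels).most_common(1)[0][0]

# Ten toy "trees": six say "cat" for positive x, four always say "dog".
trees = [lambda x: "cat" if x > 0 else "dog"] * 6 + [lambda x: "dog"] * 4

assert vote(trees, 1.0) == "cat"   # 6 of 10 agree
assert vote(trees, -1.0) == "dog"  # all 10 agree
```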
But then we get the mack daddy of all machine learning algorithms: the neural network.
You see, they have the truly miraculous ability to recreate any function given enough neurons and data. This is called the Universal Approximation Theorem, and it’s one of the few things we know for sure about how they work.
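To get a flavor of this, here's the smallest example I can think of (a deliberately simple case, not the theorem's general construction): two ReLU neurons in one hidden layer reproduce the absolute-value function exactly, since |x| = relu(x) + relu(-x).

```python
# A taste of universal approximation: a one-hidden-layer network with
# just two ReLU units recreates abs(x) exactly.

def relu(z):
    return max(0.0, z)

def tiny_network(x):
    # hidden layer: weights +1 and -1; output layer sums the two units
    return relu(1.0 * x) + relu(-1.0 * x)

for x in (-3.0, -0.5, 0.0, 2.0):
    assert tiny_network(x) == abs(x)
```

More complicated functions need more neurons, but the recipe is the same: enough simple bends, added together, can trace out any shape.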
This known and provable fact about neural networks forms the backbone of another powerful concept called model compression. The idea is that very complex ensembles of functions, think some 500 different ones, can be modeled by a much smaller and simpler neural network. Large ensembles can be very accurate but also very slow to run, as the 500 models need to be individually computed. They can also be bulky, as each model needs to be stored in memory or on disk. By replacing this with a neural network trained specifically to replicate the ensemble, we can get huge performance increases.
You can see this idea play out all over the place. I recently wrote about using neural networks to replicate LIGO wave detection as just one example: matching the many, many templates against a particular wave is slow to compute, but it can be sped up if we train a network to mimic the process.
This simple idea is so powerful that I’m honestly surprised it’s not used more often.
The process is simple: take a lot of inputs, pass them through a known function, and get a lot of outputs. You use the inputs as features and the outputs as target values. You then fit a neural network to this and purposely overtrain it. It will start to replicate the function, plus or minus some noise. This noise is usually so small that it’s not worth worrying about.
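The recipe fits in a few lines. As a minimal sketch I've swapped the neural network for a linear "student" (the target function, sample counts, and learning rate are all made up for illustration); the shape of the loop is the same either way:

```python
# Sketch of the recipe above: sample inputs, run them through the known
# function, then fit a model to the (input, output) pairs until it
# replicates the function. A linear student keeps the sketch short; in
# practice the student would be a neural network.
import random

def slow_function(x):            # stand-in for the expensive model/ensemble
    return 2.0 * x + 3.0

xs = [random.uniform(-5, 5) for _ in range(200)]
ys = [slow_function(x) for x in xs]   # targets come from the known function

a, b = 0.0, 0.0                       # student parameters
lr = 0.01
for _ in range(2000):                 # deliberately train to convergence
    for x, y in zip(xs, ys):
        err = (a * x + b) - y
        a -= lr * err * x             # gradient step on each parameter
        b -= lr * err

# The student now mimics the function, plus or minus tiny noise.
assert abs(slow_function(1.7) - (a * 1.7 + b)) < 1e-3
```

The cheap student can now stand in for the slow function at prediction time, which is the whole point of the trick.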
When functions can be evaluated more quickly, a Moore’s-Law-like runaway effect occurs. The bottleneck moves from the computation of the function to somewhere else on the assembly line of knowledge. This allows effort to be better spent elsewhere, improving overall efficiency and productivity for everybody involved.
In practice, there are details we don’t understand. We don’t know why some activation functions perform better than others. It’s also difficult to say why multiple layers sometimes help, though I have seen some people theorize on this. The specifics of network architecture in general aren’t well understood either. But the idea is still a powerful addition to the mental heuristics for thinking about machine learning.
It’s a tired but honest cliché that the hype is overblown, but there’s often a kernel of truth buried at the center. In machine learning, it’s unfortunate that we’ll never have superhuman assistants or perfect, fits-like-a-glove home automation. At least I don’t think so. But we can use some of the core properties, like the ability to mimic any function, to make a dent in the universe and move the human race forward.
Note: I can’t find the video anymore, but if anybody can point me in its direction, I’d be very grateful to add it!
Note: It’s also distinctly different from training a neural network on the data itself. This is a result of the overtraining tendency of neural networks, which is complicated enough to deserve its own post.