Kernels, Polysemy and AI

Musings on how human language affects artificial intelligence

Same-padding convolution of a tapered cylinder with elliptical kernels. Stride = 0.
If language and thought are closely related, it could be that human language hinders thinking: our thoughts might be forever trapped by faulty language. Nor is it obvious how to evade this trap, for the only escape would seem to involve discovering and repairing the defects of our language, and that’s something we could do only from “inside” the language itself.
- Andrew Brook, Knowledge and Mind

In AI, there are two related uses of the word “kernel” that are polysemic: they are not homonyms (unrelated words that happen to share a form), nor are they merely ambiguous or vague; rather, they carry interrelated meanings within a single semantic field.

The first use of “kernel”, which I will call the α kernel, refers to a small window that is slid over an n-dimensional surface, such as a 2D image, in order to detect features. The second, which I will call the β kernel, is an elemental computational building block from which larger Machine Learning models are composed. For example, the TensorFlow source code contains hundreds of files in the directory core/kernels that implement the low-level operations at the heart of TensorFlow’s design.
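To make the α sense concrete, here is a minimal NumPy sketch of my own (not code from TensorFlow or any other library) of sliding a small kernel over a 2D image and computing a weighted sum at each position:

# A minimal NumPy sketch of what an α kernel does: slide a small window over a
# 2D image and compute a weighted sum at each position.
import numpy as np

def convolve2d(image, kernel):
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # The α kernel "covers" this patch of the surface.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.rand(28, 28)
edge_kernel = np.array([[1.0, 0.0, -1.0]] * 3)  # a crude vertical-edge detector
features = convolve2d(image, edge_kernel)       # shape (26, 26)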

While both versions of the word “kernel” are borrowed from the domain of image processing, they have evolved into distinctly different concepts. More interestingly still, the two meanings of “kernel” in AI (convolution and core operations) repurpose the word’s original polysemy in nature. A kernel is both a core (apricots, cherries, peaches: the β kernel) and a small unit that covers a surface (corn: the α kernel). In both cases, the kernel represents some sort of germ or seed, but our intuition about its relationship to the plant and to our diet is completely different.

At its heart, a polyseme is always an affirmation of human poiesis. It is our way of creating new meaning from old language. However, polysemes can also be dangerous for precisely this reason: they can reinforce dominant conceptual frameworks that limit our creativity when thinking about a problem. It is ironic that the word “kernel”, which suggests an immutable or axiomatic unit, has fallen victim to this sort of polysemic “fallacy” within the field of AI. Its multiple related meanings are ubiquitous in online documentation and scholarly articles, influencing our tools, our libraries, our curricula and our thought, often without our realizing it.

In this article, I will talk about how the α and β uses of the word “kernel”, while useful, are hampering innovation within the AI community. I will end by suggesting several paths forward, including one that we are using at Meeshkan Machine Learning.

AlexNet, α Kernels and Parallelization

AlexNet, which famously won the 2012 ImageNet Large Scale Visual Recognition Challenge with record accuracy, has its architecture described in the paper ImageNet Classification with Deep Convolutional Neural Networks. The authors write: “The parallelization scheme that we employ essentially puts half of the kernels (or neurons) on each GPU, with one additional trick: the GPUs communicate only in certain layers.” The emphasis added is my own.

Two parallel tracks for convolution in AlexNet.
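To make the scheme concrete, here is a rough sketch of my own, written in modern TensorFlow terms rather than taken from the original 2012 code, of what putting half of the kernels on each GPU might look like; the device strings "/GPU:0" and "/GPU:1" assume a machine with two visible GPUs:

import tensorflow as tf

def half_of_the_kernels(x, kernels):
    # Each GPU applies only the kernels that live on it.
    return tf.nn.relu(tf.nn.conv2d(x, kernels, strides=1, padding="SAME"))

x = tf.random.normal((8, 224, 224, 3))      # a small batch of images
with tf.device("/GPU:0"):
    k0 = tf.random.normal((5, 5, 3, 48))    # first half of the kernels
    y0 = half_of_the_kernels(x, k0)
with tf.device("/GPU:1"):
    k1 = tf.random.normal((5, 5, 3, 48))    # second half of the kernels
    y1 = half_of_the_kernels(x, k1)

# The "additional trick": the GPUs only communicate at the layers where the two
# streams are explicitly merged, such as this concatenation.
y = tf.concat([y0, y1], axis=-1)            # shape (8, 224, 224, 96)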

Thanks to this paper, convolution became a driving force in AI, from image recognition to sentence classification to speech recognition. These days, all major ML libraries ship convolution out of the box, and several of them have heavily optimized convolution (β) kernels. But interestingly, AlexNet’s parallelization strategy was never seriously evangelized in the AI community, most likely because most practitioners do not have multiple NVIDIA GPUs at their disposal.

What this meant concretely for industrial AI is that, instead of getting creative about parallelization, people articulated their ideas through kernels and convolution. And it worked: convolution has led to major breakthroughs in autonomous driving and face recognition. But the problem at the heart of AlexNet that GPU parallelization tried to address, namely that it is tough to get hyperparameters like kernel size and stride right, has plagued convolution to the point where papers like Going Deeper with Convolutions offer the hilarious (and effective!) solution of packing kernels of several different sizes, side by side, into the same feedforward network. A more nuanced approach, such as starting out with many kernels running in parallel and keeping the most promising ones or modifying them over time, is well outside the scope of today’s AI libraries and falls in the domain of esoteric research. But it is precisely this type of creativity that has the potential to be truly disruptive on an industrial scale, because it is adaptive and thus well suited to evolving data and predictions.
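As a concrete illustration of that Inception-style workaround, here is a minimal tf.keras sketch of my own (the filter counts and input shape are arbitrary) that packs three kernel sizes into the same layer by running them in parallel and concatenating the results:

import tensorflow as tf

def parallel_kernel_block(inputs, filters=32):
    # Three branches, each hedging a different bet on kernel size.
    branch_1x1 = tf.keras.layers.Conv2D(filters, 1, padding="same", activation="relu")(inputs)
    branch_3x3 = tf.keras.layers.Conv2D(filters, 3, padding="same", activation="relu")(inputs)
    branch_5x5 = tf.keras.layers.Conv2D(filters, 5, padding="same", activation="relu")(inputs)
    # Concatenate along the channel axis so that later layers can weigh the branches.
    return tf.keras.layers.Concatenate(axis=-1)([branch_1x1, branch_3x3, branch_5x5])

inputs = tf.keras.Input(shape=(224, 224, 3))
outputs = parallel_kernel_block(inputs)
model = tf.keras.Model(inputs, outputs)     # 96 output channels per spatial position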

TensorFlow, β Kernels and the XLA Compiler

Nestled deep in TensorFlow’s sprawling documentation is information about an incubating feature called the XLA Compiler. The maintainers of XLA, short for Accelerated Linear Algebra, write that “The benefit of this over the standard TensorFlow implementation is that XLA can fuse multiple operators (kernel fusion) into a small number of compiled kernels.” The emphasis added is my own.

A visualization of addition and multiplication operations in a five-layer fully connected neural network produced by the Meeshkan graphviz backend and rendered by Gephi.

While the XLA project offers a compelling narrative about its theoretical impetus, in practice very few projects have reported significantly faster results using this compiler, and I have not seen any speedup in my own projects either. This is no fault of the compiler; rather, the compiler is best suited for a class of ML models that people are simply not writing. If at some point your fully-connected neural net splits into several separate paths and then joins back together, it makes sense to clump cohorts of similar operations and “fuse” kernels, especially when one uses tools like Apache Spark and OpenCL that take advantage of distributed and accelerated resources. In practice, however, kernel fusion is largely ineffective precisely because the dominant ML tools and paradigms neither facilitate nor encourage the writing of algorithms where data flows through parallel circuits before being recombined at a confluence. The main operations that do happen in parallel, like the multiplications in a dot product or in a convolution, are already executed by heavily optimized C code and need no additional fusing. In other words, out-of-the-box kernels in AI are so efficient that we settle for them without digging deeper, making novel tools like the XLA compiler seem less useful than they actually are.
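For readers who want to give kernel fusion a chance anyway, here is a minimal sketch, assuming TensorFlow 2.x, where tf.function accepts a jit_compile flag that requests XLA compilation of the traced graph; the function and tensor names are my own illustration:

import tensorflow as tf

@tf.function(jit_compile=True)  # ask XLA to compile the traced graph
def dense_layer(x, w, b):
    # Three logically separate operators (matmul, add, tanh) that XLA is free to
    # fuse into fewer compiled kernels instead of launching one kernel per op.
    return tf.tanh(tf.matmul(x, w) + b)

x = tf.random.normal((64, 128))
w = tf.random.normal((128, 32))
b = tf.zeros((32,))
y = dense_layer(x, w, b)  # compiled (and potentially fused) on the first call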

Life After Kernels

Language hinders intuition, it hinders open-minded experience.
- Tina Röck, Process Ontology: A Speculative Philosophy of the Concrete

Even if you agree that the word “kernel”, used in two different contexts, has led to a suboptimal way of thinking about certain AI problems, I have not yet related the two uses of the word, nor have I described what the real problem is. If anything, the two uses of the word float around in different circles: even the most intrepid CNN builder doesn’t need to know what a TensorFlow kernel is and, vice versa, someone optimizing a TensorFlow kernel may never work with convolution. So what’s the problem?

If a polyseme implies a semantic continuum, then “kernel” exists on a spectrum of “fixedness” or “stasis”. By comparison, the polyseme “church” exists on a continuum of inclusiveness, space, and place: the immaterial Lutheran church is different from the concrete Temppeliaukio Church in Helsinki, but both uses of the word “church” help to circumscribe a conceptual unit, at once a physical space and a spiritual community, to which someone belongs. Similarly, a convolutional α kernel is a geometrically invariant unit that moves along a surface and a dot-product β kernel is a highly optimized linear algebra operation, but both reinforce a notion of elements that are “fixed” in Machine Learning. This fixedness is born from a host of practical and theoretical considerations in ML: GPUs are great at repeating the same operation many times in parallel (practical), and the neurological underpinnings of optical recognition are loosely related to convolution as implemented in AlexNet (theoretical). But the end result is that Machine Learning tends to fix its kernels early on and then tweak how they are used after running several epochs against a training set.

Helsinki’s Temppeliaukio Church

The more exciting prospect, of course, is an AI that does the tweaking itself, but this sort of advancement can only happen once we modify the dominant “kernel” paradigm that motivates our benchmarks, our courses, our tools and, most importantly (and most dangerously), our language. It is trite to simply say that “things need to be more open” or “processes need to be more dynamic,” but in this case we need to fundamentally rethink our architectures so that the basic unit of design is not the kernel but the flow of information. In other words, going back to AlexNet, I am imagining a world where the real takeaway is not that the authors used kernels, but that they creatively combined two GPUs.

This type of design decision is easier than you might think. Just this month, a fantastic paper that gets at this issue came out: IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures. As the authors note, “IMPALA’s high data throughput and data efficiency allow us to train not only on one task but on multiple tasks in parallel with only a minimal change to the training setup.” But the main benefit, in my opinion, is that “IMPALA can make efficient use of available compute at small and large scale. This directly translates to very quick turnaround for investigating new ideas and opens up unexplored opportunities.”

Meeshkan and parallel AI

Meeshkan Machine Learning strives to do the same thing. By creating a network of thousands of devices that work together to run AI models, it can explore sets of problems, network effects and scales that are simply impossible on XPUs (CPUs, GPUs, TPUs), precisely because those machines are tied to the kernel paradigm. While we are all spectators to the Cloud Wars, in which industry giants compete to funnel terabytes of data into supercomputers with highly optimized kernels to make predictions, Meeshkan offers an alternative model for exploring how novel architectures can surprise us and challenge us to be creative.

Meeshkan is currently nurturing these ideas with different partners (artists, businesses, non-profits) that share our vision. If you are looking to reach beyond the kernel paradigm, then please do not hesitate to try our service or to reach out to us! Even if our paths never cross in the world of Machine Learning, I encourage all of you to reflect on the way that language can both hinder and facilitate innovation in your own practice. Sometimes we forget that humans create AI, and by being critical of how something as basic as language affects us, our artificial intelligence is boosted by the most important and fundamental tool at our disposal: human intelligence!