Making data meaningless so AI can map its meaning
By David Weinberger
AI Outside In is a column by PAIR’s writer-in-residence, David Weinberger, who offers his outsider perspective on key ideas in machine learning. His opinions are his own and do not necessarily reflect those of Google.
Suppose you want a machine learning system to suggest paint names based on any color you specify. This has been done hilariously by Janelle Shane — “burf pink,” “navel tan” — but let’s say we want to do it more seriously (and without any reference to how Shane actually did it).
Machine learning, at least of the common sort called “supervised learning”, learns from the data you give it, so you first want to gather a large set of colors to which humans have applied various labels. You might start by pulling in paint colors and their names from the online catalogs of all the paint suppliers you can find. You do this until you have thousands of named colors.
Now you have your machine learning system see what it can discover about the relationships among the words and colors without any guidance from you. In this imaginary case, all the system will know is that the word “party” has been applied to 150 colors, most of which are pinkish, that “dusk” has been applied to 200 colors, most of which are subdued, that “royal” has been applied to 300 colors that don’t seem to have a lot to do with one another, that “happy” has never been applied to any shade of gray, and so forth.
Colors into Numbers
Of course that characterization wasn’t exactly accurate. Machine learning systems don’t know what colors or words are. All they know are numbers. The colors are easy to turn into numbers because they can be expressed as mixes of quantifiable levels of red, green, and blue, as in the RGB standard, which assigns each of those three colors a number between 0 and 255. For example, a teal-ish swatch might be 66 parts red, 244 parts green, and 209 parts blue. (RGB colors represent light, not paint, so they don’t mix the way we expect.)
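To make that concrete, here is a minimal Python sketch of a color as nothing more than three numbers. The helper names `to_unit_range` and `to_hex` are made up for this illustration; the teal-ish values are the ones from the example above. Scaling to the 0-to-1 range is a common preprocessing choice, not something every system requires.

```python
# A color is just three integers, one per channel, each from 0 to 255.
teal_ish = (66, 244, 209)  # (red, green, blue) from the example above


def to_unit_range(rgb):
    """Scale each channel from 0-255 down to 0.0-1.0."""
    return tuple(channel / 255 for channel in rgb)


def to_hex(rgb):
    """Format the same three numbers as a web-style hex string."""
    return "#{:02x}{:02x}{:02x}".format(*rgb)


print(to_unit_range(teal_ish))
print(to_hex(teal_ish))  # #42f4d1
```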
But how do you assign numbers to words?
You don’t. You let the machine learning do it via a process called “embedding.” It is rather awesome.
How embedding works
First you have to turn the words into their building blocks so the system will be able to recognize possibly meaningful connections. For example, you want it to suspect that “sea”, “seabreeze”, and “seascape” all have something to do with one another. So, the system will break those words up into tokens that represent “sea”, “breeze”, and “scape” and will look for correlations among them. Some of the correlations may be quite weak. For example, while the words “horseplay”, “horserace”, and “horseradish” all contain the same token “horse”, a machine learning system is likely to figure out from usage and context that the words are only loosely related. (Note that Janelle Shane’s color-naming machine learning system seems not to have tokenized words, but treated them as simple strings of letters. That’s why the system came up with so many hilarious non-words and near-words.)
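Here is a rough sketch of that splitting step in Python. It assumes a tiny hand-written vocabulary (`VOCAB`) and a simple greedy longest-match rule; both are illustrative assumptions. Real tokenizers typically learn their vocabularies from data and use more sophisticated rules.

```python
# Hypothetical token vocabulary; a real system would learn one from data.
VOCAB = {"sea", "breeze", "scape", "horse", "play", "race", "radish"}


def tokenize(word, vocab=VOCAB):
    """Greedy longest-match split of a word into known tokens.

    Characters that match no token are emitted one at a time.
    """
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, shrinking until one matches.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # unknown character becomes its own token
            i += 1
    return tokens


for w in ["sea", "seabreeze", "seascape", "horseradish"]:
    print(w, "->", tokenize(w))
```

Notice that the splitter happily finds “horse” in “horseradish.” Deciding that the connection is weak is left to the learning that comes afterward.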
Now you’re going to set your machine learning loose on the colors and their tokenized labels. It’s going to notice simple relationships among the colors and names, such as that the label “pink” seems to be applied to colors that have very high reds, very low greens, and fairly high blues. The system may also notice that many colors with high blues, and greens that range from mid to high, create a sky blue color and have names that often include the words “sky,” “sunny,” “day,” and “above.” And it notes that “sunny” also shows up in colors that have very high reds and greens, because those two make yellow.
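One very simple way to “notice” such relationships is to average the colors that share a token. The sketch below does only that, on four invented swatches (the names and RGB values in `SWATCHES` are made up, not real catalog data). A real machine learning system learns far subtler patterns than an average, so treat this as an illustration of the idea rather than the method.

```python
from collections import defaultdict

# Invented (name, RGB) pairs for illustration; not real catalog data.
SWATCHES = [
    ("sky day", (120, 190, 250)),
    ("sunny sky", (100, 170, 255)),
    ("sunny morning", (250, 230, 90)),
    ("party pink", (255, 60, 170)),
]


def mean_color_per_token(swatches):
    """Average the RGB values of every swatch whose name contains each token."""
    grouped = defaultdict(list)
    for name, rgb in swatches:
        for token in name.split():
            grouped[token].append(rgb)
    return {
        token: tuple(sum(channel) / len(colors) for channel in zip(*colors))
        for token, colors in grouped.items()
    }


means = mean_color_per_token(SWATCHES)
print(means["sky"])    # average of the two sky-named swatches
print(means["sunny"])  # pulled between a sky blue and a yellow
```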
As the machine learning system notices those similarities, it assigns each word a point in a three-dimensional space. Except three dimensions aren’t enough to position it with regard to all the other words, so imagine a thousand-dimensional space. (Let me know if you succeed at imagining that, and be sure to include a picture :) “Sky,” “sunny,” “day,” and “above” are likely to be positioned close to one another because of the similarities of the colors they name (among other things). “Sunny” will also be close to the yellow-ish colors. How close depends on how often it’s used for similar colors. The closeness indicates the relationships discovered by the machine learning system.
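“Closeness” can be measured in more than one way. The sketch below uses cosine similarity, one common choice, on hand-written three-dimensional positions. The numbers in `EMBEDDINGS` are invented for illustration; a real system would learn them, and there would be many more per word.

```python
import math

# Made-up 3-dimensional positions; real embeddings have far more dimensions
# and are learned from data rather than written by hand.
EMBEDDINGS = {
    "sky": (0.9, 0.1, 0.3),
    "day": (0.8, 0.2, 0.4),
    "sunny": (0.6, 0.7, 0.3),
    "dusk": (-0.5, 0.1, -0.8),
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


for other in ["day", "sunny", "dusk"]:
    print("sky vs", other, cosine_similarity(EMBEDDINGS["sky"], EMBEDDINGS[other]))
```

With these invented numbers, “sky” comes out closest to “day,” somewhat close to “sunny,” and far from “dusk.”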
The machine learning may go past simple color associations. It may notice that “pale” and “wash” are both used for lighter colors, no matter what the hues are. It may notice that “mediterranean,” “pastel,” and “summer” are often used when two of the three constituent RGB colors are high but the third one is in the middle. All of this can affect the position assigned to each word, for each bit of information can and should affect where the word stands in relation to all others.
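Lightness is an example of a property that cuts across hues. The sketch below computes it with Python’s standard `colorsys` module; the two sample colors are invented, and a learning system would not be handed this formula but might discover something that behaves like it.

```python
import colorsys


def lightness(rgb):
    """Return the HLS lightness (0.0 = black, 1.0 = white) of a 0-255 RGB color."""
    r, g, b = (channel / 255 for channel in rgb)
    _hue, light, _saturation = colorsys.rgb_to_hls(r, g, b)
    return light


pale_blue = (230, 240, 250)  # invented example of a "pale" color
navy = (20, 60, 120)         # invented example of a dark color

print(lightness(pale_blue))
print(lightness(navy))
```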
In the language of computer science, each of these sorts of relationships (the relative strength of each of the three constituent hues, the sharing of labels, etc.) constitutes a “dimension,” and each word will be given a number for each dimension, indicating its relationship to the other items in that dimension. A machine learning system might unearth hundreds or even thousands of dimensions, that is, ways in which data are related, resulting in just as many different numbers being assigned to each word. For example, “summer” might be numerically close to yellow-ish colors when looking at hues, but might have a different number that brings it close to “pastel” in terms of color intensities.
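The “summer” example can be sketched with two dimensions and three words. Everything here is an assumption made for illustration: the words’ numbers are invented, and the dimension names (`yellowness`, `intensity`) are labels of convenience. Learned dimensions usually don’t come with human-readable names at all.

```python
# Invented numbers: each word gets one value per named "dimension".
# Learned dimensions usually have no human-readable names; these are assumptions.
WORDS = {
    "summer": {"yellowness": 0.8, "intensity": 0.3},
    "lemon": {"yellowness": 0.9, "intensity": 0.9},
    "pastel": {"yellowness": 0.2, "intensity": 0.25},
}


def gap(word_a, word_b, dimension):
    """Absolute difference between two words along a single dimension."""
    return abs(WORDS[word_a][dimension] - WORDS[word_b][dimension])


print(gap("summer", "lemon", "yellowness"))   # small: close in hue
print(gap("summer", "pastel", "intensity"))   # small: close in intensity
print(gap("summer", "pastel", "yellowness"))  # large: far apart in hue
```

The same word can be a near neighbor of one word along one dimension and of a quite different word along another.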
[Video: a friendly graphical illustration of the “high dimensional space” that embedding creates.]
Having churned away discovering relationships among words and colors, words and words, and colors and colors, the system will now be ready for a user to input a color and get back the words the system is most confident are associated with that color, even if it’s a color the system has never seen. Or input a couple of words and it could perhaps make up a color it thinks represents them.
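A toy version of both directions might look like the sketch below. It assumes a small hand-made table of one “typical” color per word (`WORD_COLORS`, invented here) and uses plain nearest-neighbor lookup and averaging. A trained system would generalize from what it learned rather than consult a table, so this only shows the shape of the input and output.

```python
import math

# Hypothetical "typical" color per word; all values are invented for illustration.
WORD_COLORS = {
    "sky": (110, 180, 250),
    "sunny": (250, 225, 90),
    "party": (255, 90, 170),
    "dusk": (90, 80, 110),
}


def suggest_words(rgb, k=2):
    """Return the k words whose typical color is nearest to rgb (Euclidean distance)."""
    return sorted(WORD_COLORS, key=lambda w: math.dist(WORD_COLORS[w], rgb))[:k]


def blend_color(words):
    """Make up a color for several words by averaging their typical colors."""
    colors = [WORD_COLORS[w] for w in words]
    return tuple(round(sum(channel) / len(colors)) for channel in zip(*colors))


print(suggest_words((66, 244, 209)))  # the teal-ish swatch from earlier
print(blend_color(["sky", "dusk"]))
```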
The system can make these wondrous — and sometimes ridiculous — connections between meanings and colors only because it replaced meaningful words with meaningless numbers and tokens. Only then could the system find meanings that surprise us, amuse us, and may even feel exactly right.