You’ve played Spot The Difference at some point, right? (Of course you did. We all did 😀). Think about what you actually did when you were trying to Spot The Difference — in fact, just do it again with SpongeBob juggling above. You’re probably doing something like
- Quick glance to the left.
- Quick glance to the right.
- “Spatula on the left. Wonder if there is one on the right?”
- “Yup, there is. Sriracha on the right. Is that on the left?”
- “Nope! It’s a ketchup bottle on the left! And there is nothing on the plate below it!”
To summarize, what you’re basically doing is looking at each image alternately, identifying objects in each image, and trying to find the same object in the other image.
As it turns out, what most Deep Learning systems do is something quite different. They basically do something like this
- Figure out everything in the first image (“ketchup bottle at 1 o’clock, empty plate at 3 o’clock, spatula at 10 o’clock, …”)
- Figure out everything in the second image (“sriracha bottle at 1 o’clock, plate with star at 3 o’clock, spatula at 10 o’clock, …”)
- Compare the two lists, and see if they are identical.
(ok, they technically compare the extracted feature vectors of each image, but that isn’t relevant. It’s, kinda, sorta, like what I say above 😑) (•)
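That “list everything, then compare the lists” pipeline can be sketched in a few lines. Note the assumptions: `extract_features` here is a hypothetical stand-in for a trained encoder (a real system, like the Siamese network in (•), learns this from data), and the cosine similarity plus threshold are illustrative choices, not anyone’s actual method.

```python
import numpy as np

def extract_features(image: np.ndarray) -> np.ndarray:
    # Hypothetical stand-in for a trained encoder. In a real system this
    # would be a learned network shared by both inputs; here we just
    # flatten and L2-normalize so the sketch runs end to end.
    v = image.astype(float).ravel()
    return v / (np.linalg.norm(v) + 1e-9)

def same_scene(img_a: np.ndarray, img_b: np.ndarray,
               threshold: float = 0.995) -> bool:
    # "Figure out everything" in each image independently...
    feats_a = extract_features(img_a)
    feats_b = extract_features(img_b)
    # ...then compare the two feature vectors (cosine similarity here).
    similarity = float(feats_a @ feats_b)
    return similarity >= threshold

left = np.ones((8, 8))
right = left.copy()
right[0, 0] = 0.0  # one small "spot the difference" change

print(same_scene(left, left))   # → True
print(same_scene(left, right))  # → False
```

The key point is that each image is summarized *independently*, and only the summaries ever meet. That is exactly why the encoder has to already know what matters for the kind of images it was trained on.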
Why does this matter? Well, when you’re doing something specific — e.g. comparing signatures — then you can train your DL system to “Spot the Difference” for the universe of signatures, and, well, it JustWorks™.
Basically, you train your SignatureAnalysis™ system to come up with a list of everything there is in a given signature. That way, you can give it two signatures, and it can — easily! — come up with a “list” of everything in each signature, and thus compare them.
OTOH, if you gave the same SignatureAnalysis™ system the SpongeBob images from above, it would be totally lost, right? And that’s because, well, that’s pretty much just how DL works — it knows how to find unique things in signatures, but only signatures! You would no more expect SignatureAnalysis™ to work with SpongeBob than you would expect AlphaGo to be able to win at DOTA 2!
Unless, of course, you could get Deep Learning systems to be more, well, human. What I mean by this is that the system would learn how to compare, but it would not be gated on the specific types of things it is comparing. So, it would learn how to compare, not how to compare things. Meta-Learning, if you will. Easier said than done, though — it turns out that this is really, really hard to do, and very brittle when done at all.
And that gets us, finally, to Attentive Recurrent Comparators, also called ARCs(••). These are a class of Deep Learning models that work lazily. Broadly speaking, they figure out their representation space (basically, how the thing you’re looking at appears in NeuralNetworkWorld) at inference time (when you’re actually playing “Spot the Difference”).
Or, to put it differently, if you toss in the images of fruit above, the system doesn’t go “🤯🤯WTF THESE AREN’T SIGNATURES 🤯🤯”. Instead, it basically sez, “meh, doesn’t look like signatures, but whatever, I’m going to play Spot the Difference anyhow!”
To be very, very precise, what they do is lazily develop a representation space that is conditioned on the test sample, and only at inference time. Until then, every sample presented to the model is simply stored as-is in a repository. When the test sample arrives, it is compared against each stored sample using an ARC, producing a relative representation of every sample (the representation being the final hidden state of the recurrent controller). A trained classifier then works in this relative representation space (relative, that is, to the test sample) to identify the most similar sample pair, given the entire context of that space. The space is dynamic: it changes with every new test sample.
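Roughly, the lazy scheme looks like this. To be clear about assumptions: `LazyComparator` and its names are illustrative, and the `compare` method here is a toy stand-in (negative Euclidean distance) for the paper’s actual attentive recurrent comparator, whose similarity comes from an RNN controller attending back and forth between the two images.

```python
import numpy as np

class LazyComparator:
    """Sketch of ARC-style lazy comparison (names are illustrative,
    not the paper's API)."""

    def __init__(self):
        self.repository = []  # (label, sample) pairs, stored as-is

    def add(self, label, sample):
        # No representation is computed here: samples just go in the repository.
        self.repository.append((label, np.asarray(sample, dtype=float)))

    def compare(self, a, b):
        # Toy stand-in for the recurrent comparator; the real ARC's
        # "score" is derived from its controller's final hidden state.
        return -np.linalg.norm(a - b)

    def classify(self, test_sample):
        test_sample = np.asarray(test_sample, dtype=float)
        # Representations are formed only NOW, relative to this test sample.
        scores = [(self.compare(test_sample, s), lbl)
                  for lbl, s in self.repository]
        return max(scores)[1]  # label of the most similar stored sample

clf = LazyComparator()
clf.add("spatula", [1.0, 0.0, 0.0])
clf.add("sriracha", [0.0, 1.0, 0.0])
print(clf.classify([0.9, 0.1, 0.0]))  # → spatula
```

Notice that nothing about `LazyComparator` is specific to spatulas or sriracha: all the work happens at `classify` time, relative to whatever test sample shows up. That is the whole trick — and also the source of the compute cost discussed next.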
If you look at the above and say “Hey! This is basically One-Shot Learning!”, you’re absolutely right! ARCs allow you to take a system trained on one set of features (“Are these two cars the same?”) and use them to work with something completely different (“Are these two cats the same?”).
Mind you, as with most such things, There Ain’t No Such Thing As A Free Lunch. Doing things this way means that when the system is playing Spot the Difference (the inference part of DL), it needs a lot more compute power. Basically, you can either do the calculations when you’re learning to play (by becoming really, really good at only Signatures) or when you’re actually playing (when you do Fruit, or SpongeBob, or whatever).
And using the compute power when you’re playing means, well, it’s hard to do on phones with their limited CPU/battery, it takes a lot more time, and so on.
Still, it’s a pretty nifty thing, no?
(•) “Signature Verification Using a ‘Siamese’ Time Delay Neural Network” — by Bromley et al.
(••) “Attentive Recurrent Comparators” — by Shyam et al.