Computer Vision: Learning a common language to describe how visual concepts can change

Angus Russell · Published in Syntropy · Jun 5, 2018

This is the third of a series of articles explaining the work we’re doing at Syntropy, and tracking our progress as we make ground through some of the unsolved (or unsatisfactorily solved) problems in machine learning. These articles are split into technical (for Machine Learning professionals) and non-technical (for a more general audience). This article is non-technical.

To briefly recap the ground we’ve covered so far, in part one we explained that the visual world is made up of a hierarchy of parts. Bikes are made up of handlebars, wheels, pedals, etc; wheels are made up of tyres, spokes, a hub, etc; and at the lowest levels, everything is colours, edges, shapes and textures. At each layer of this hierarchy, our brains are invariant to some degree of change. At the lower levels of the visual hierarchy, invariance allows you to recognise a rectangle or line even when it is skewed, rotated or scaled; and at the higher levels it allows you to recognise people and objects regardless of viewing angle, lighting conditions, or context.

We can see that these two pedals are the same, even though the shapes and lines that compose them are completely different.

But that’s not enough. In part two we showed that not only are our brains invariant to changes in visual concepts, they can also describe those changes. Each visual concept can change in a variety of different ways (lighting, rotation, skew, position, etc), and our brains can distinguish between them.

At the end of part two, we showed how a single concept can be ‘disentangled’ into a set of equivariant dimensions of change. The demo (gif version below) shows a simplified visual concept — a heart that changes along x, y and rotation dimensions. Each possible version of the heart is shown as a point in the 3D graph. Initially the points are distributed randomly, but with training they find their true order, and once training is complete, the sliders each represent a single dimension of change.

Live CodePen — https://codepen.io/mattway/full/ayaXpx/

You’ll find, though, that if you move the sliders after it’s trained, take note of which slider represents which transformation (horizontal movement, vertical movement and rotation), then hit ‘Reset’ and train again from a random starting position, the sliders will probably represent different axes than they did the first time. That’s because the data doesn’t contain any information about which axis is which — in fact it doesn’t contain any information about the axes at all. It’s simply governed by the rule “an input contains two versions of the heart that belong next to each other in the manifold”. From this we can disentangle the dimensions of change, but we can’t know which dimension is which unless we manually label it (E.g. “horizontal movement”) after it has been disentangled.

So now let’s imagine scaling up our demo to include multiple visual concepts. Each concept can disentangle itself into a set of dimensions of change, but there’s nothing linking those dimensions together across concepts, so they’re really not very useful.

But if we had every dimension of change labelled (E.g. horizontal movement, vertical movement, rotation, lighting, etc), we’d be able to use that information to do some really cool things. For example, we could take a single example of an item, made up of known visual concepts, but in an arrangement we’ve never seen before, and imagine how it looks from other angles, in different lighting, upside down, etc. This would simply involve taking all the concepts that make up the item and moving them equally along the same transformational axes.
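
To make that a little more concrete, here’s a toy sketch in Python. The representation (a dictionary of concepts, each with named, already-labelled axes) is ours, purely for illustration; it assumes the dimensions have already been matched up and labelled across concepts, which is exactly the problem the rest of this article is about.

```python
# A toy sketch of 'imagining' an item under a new transformation. The
# representation is illustrative only: it assumes every concept's dimensions
# are already matched to shared, labelled axes.

def imagine(item, axis, amount):
    """item: {concept_id: {axis_name: value}}.
    Move every concept that makes up the item by the same amount along the
    same axis, e.g. rotate the whole item by 30 degrees."""
    return {
        concept: {a: (v + amount if a == axis else v) for a, v in dims.items()}
        for concept, dims in item.items()
    }

# E.g. an item made of known concepts in a new arrangement, seen rotated.
bike = {"wheel": {"x": 0, "y": 0, "rotation": 0},
        "handlebars": {"x": 2, "y": 3, "rotation": 0}}
rotated_bike = imagine(bike, "rotation", 30)
```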

We could also use the transformational relationships between concepts to accurately classify new inputs using only one (or very few) labelled exemplars. This is called one-shot learning, and is considered an open problem in computer vision. In this case the process would be:

  1. See a new input.
  2. Determine which visual concepts are present (see part 1).
  3. Find any labelled examples that have significant concept overlap.
  4. Choose the example that most closely matches the transformational relationships between concepts (a rough sketch of this matching step follows below).
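
Here’s a rough sketch of what steps 2 to 4 could look like in code. It’s ours, not an actual implementation, and it assumes each input has already been reduced to a set of concepts plus each concept’s position along a shared set of transformational dimensions (the ‘common language’ the rest of this article is about).

```python
# A rough sketch of steps 2-4, not an actual implementation. Assumes each
# input has already been reduced to a dict mapping every detected concept
# to its position along a shared set of transformational dimensions.

from itertools import combinations
import numpy as np

def classify_one_shot(new_input, labelled_exemplars):
    """new_input:          {concept_id: np.array of dimension values}
    labelled_exemplars:    {label: {concept_id: np.array of dimension values}}
    Returns the label of the exemplar that best matches the new input."""
    best_label, best_score = None, -np.inf
    for label, exemplar in labelled_exemplars.items():
        shared = set(new_input) & set(exemplar)  # step 3: concept overlap
        if not shared:
            continue
        # Step 4: compare relationships *between* concepts, not absolute
        # positions, so a shifted or rotated copy of the item still matches.
        score = float(len(shared))
        for a, b in combinations(sorted(shared), 2):
            rel_new = new_input[a] - new_input[b]
            rel_ex = exemplar[a] - exemplar[b]
            score -= np.linalg.norm(rel_new - rel_ex)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```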

To be able to do these things, we need to be able to identify the relationships between the transformational dimensions across all the concepts in our system. In other words, we need to be able to say “the 2nd dimension in this concept represents horizontal movement, and the 3rd dimension in this other concept also represents horizontal movement”. So how can we achieve this?

We could just move a slider for each dimension, see visually what it represents, and label it manually. Technically, this would give us a global language for describing change across concepts. That might be feasible, even if we had to do it for tens of dimensions multiplied by hundreds or thousands of concepts (let’s say 20 dimensions × 1,000 concepts = 20,000 labels), but it seems like there must be a better way. Well, it turns out there’s a pretty simple but effective method we can use.

If you think about how the world changes as you watch it (E.g. someone walking towards you, a leaf moving in the wind, etc), the changes occurring at every level of the hierarchy of visual concepts are the same set of changes. As that leaf moves to the left, all of the patterns and textures and shapes that make up that leaf are moving to the left. While this data doesn’t include the label “left”, it does tell us that the dimensions that each concept is moving along probably represent the same thing. This won’t always be the case, but it will be accurate more often than not, so each time we see two concepts changing at the same time, we can add a ‘vote’ to associate whichever unlabelled dimensions within them are changing at a similar rate. Given enough time and examples, we can accurately associate every dimension within a concept with a particular dimension within every other concept. At that point we have an unlabelled but global language for how concepts can change, and if we want to label the dimensions, we now just need to add a single label per dimension (20 labels rather than 20,000).
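
One simple way this voting could be implemented is sketched below, under our own assumptions (the names and the hard change threshold are illustrative, not our actual code): keep a matrix of co-occurrence counts between the dimensions of a pair of concepts, and once enough episodes have been seen, read the strongest association off each row. A real version would presumably weight votes by how closely the rates of change match, rather than using a hard threshold.

```python
# A minimal sketch of the voting idea, under simplified assumptions: after
# every episode, each concept reports how far it moved along each of its
# (still unlabelled) dimensions, and co-occurring changes earn a vote.

import numpy as np

N_DIMS = 20  # dimensions per concept, purely illustrative

# votes[i, j]: how often dimension i of concept A changed in the same
# episode as dimension j of concept B.
votes = np.zeros((N_DIMS, N_DIMS))

def record_episode(delta_a, delta_b, threshold=0.1):
    """delta_a, delta_b: arrays of length N_DIMS giving how much each concept
    moved along each of its dimensions between the two frames of an episode."""
    changed_a = np.abs(delta_a) > threshold
    changed_b = np.abs(delta_b) > threshold
    votes[...] += np.outer(changed_a, changed_b)  # vote for every co-changing pair

def dimension_mapping():
    """After enough episodes, pair each dimension of concept A with the
    dimension of concept B it most often changed alongside."""
    return {i: int(np.argmax(votes[i])) for i in range(N_DIMS)}
```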

In the demo below, we show two concepts: a vertical and a horizontal line, each of which can change along the x and y axes and has some degree of rotation. These dimensions of change are initially entangled, but disentangle themselves during training. For training data, each concept is given two input examples at a time. We refer to the dual inputs as an episode with two frames, and stipulate that the two frames of an episode contain versions of the concept that are similar (I.e. they change by a maximum of one step along each axis of change). We also ensure that the two concepts see episodes that change in the same ways (E.g. both move left, or both rotate clockwise). This setup mimics how we see things moving in the real world.

Mobile readers are better off viewing directly on CodePen

As the concepts are training, we’re counting the number of times that each dimension in the first concept is seen changing across an episode at the same time as a particular dimension from the second concept. By the time the concepts have fully disentangled themselves, the system has discovered the correct relationships between dimensions. Drag the first set of sliders to move both concepts through the dimensional space using the discovered relationships. Drag the second set of sliders to move the vertical line independently and create different arrangements.
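
For readers who prefer code to sliders, here’s roughly how episodes like the ones described above could be generated; the names and representation are ours, not the demo’s actual code. The per-axis changes of each episode are exactly the kind of co-occurring change the vote counting above is fed.

```python
# A sketch of the episode setup described above, with illustrative names.
# Both concepts (the vertical and the horizontal line) see the same small
# change per episode: at most one step along each of x, y and rotation.

import random

AXES = ("x", "y", "rotation")

def make_episode(state_a, state_b):
    """Return two frames for each concept, changed in exactly the same way."""
    change = {axis: random.choice([-1, 0, 1]) for axis in AXES}  # max one step per axis
    frame2_a = {axis: state_a[axis] + change[axis] for axis in AXES}
    frame2_b = {axis: state_b[axis] + change[axis] for axis in AXES}
    return (state_a, frame2_a), (state_b, frame2_b)

# Example: both lines shift and/or rotate together by at most one step.
vertical = {"x": 3, "y": 5, "rotation": 0}
horizontal = {"x": 7, "y": 2, "rotation": 0}
episode_v, episode_h = make_episode(vertical, horizontal)
```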
