Semantic correspondence via PowerNet expansion

Wah Loon Keng
4 min read · Aug 9, 2018


  • Start with the cartpole network: 4 inputs, 2 outputs. Train the network.
  • Then move to 3D cartpole. We can manually map the 2D network’s 4 inputs and 2 outputs, then append the third-dimension inputs and outputs. This can be seen as a module extension, using PowerNet-style sideways connections. When training, the base module gets a lower learning rate and the new module a higher one. This of course assumes the goal is still similar (e.g. from “balance in 2D” to “balance in 3D”). Training should take less time than training from scratch in 3D.
  • We can call the above a sideways extension, defined so that no input of the new module is connected to an input of the old module, and likewise for outputs.
  • If we know some inputs/outputs sit at different hierarchy levels, we can do the PowerNet extension vertically, i.e. the resultant network becomes taller, although not uniformly.
  • By swapping the two modules and the sequence of training, we can perform the inverse of extension, which is interpolation. For example, if we train on something harder but wish to extract a smaller module, we can interpolate it out for use elsewhere.
  • The above describes extension and its inverse. Next we can also do combination: take two pretrained modules and use one to extend the other, similar to above. The difference is that when training, the pretrained weights of both modules get the lower learning rate, while the newly formed connections in the combination get the higher one. Combination can be done sideways symmetrically (bipartite connections), sideways asymmetrically, and vertically, using extension or its inverse.
  • Now, how does this relate to PowerNet, multitask, correspondence, grounding, and modularization?
  • PowerNet is the obvious one, as it serves as the enabler for network modularization and hierarchy differentiation (drawing clear distinctions between task hierarchies). It also helps ensure completeness of the operations above, i.e. we won’t miss crucial connections in any form of extension or combination.
  • Correspondence and grounding: pretrained input and output units, with their corresponding weights, already carry the first grounding, say left and right. The new module’s connections, adjusted with the higher learning rate, serve to map new inputs onto categories already known. E.g. say original unit 1 means left and unit 2 means right, and the new module gives us new units 5 and 6, where 5 corresponds to left and 6 to right. Then we should use vertical extension and adjust the weights so that unit 5 triggers unit 1 (or fires like unit 1 into the hidden layers), and similarly 6 for 2. With more learned inputs and outputs, if categories can correspond, the network learns to understand a category across more channels, e.g. the word vector for “red” with the vision RGB value for red.
  • If new units do not correspond, the connection weights simply go to 0 — no big deal.
  • The correspondence could also run from input to output, e.g. inputs from a new module need to be controlled with outputs from the old module. This is usually the case in games, where controls are fixed and there is always correspondence in the output, e.g. the buttons for left, right, up, down. In fact, if this is the case, we can avoid doing output-to-output correspondence between modules at all, unless the task is to map between different controls.
  • The weights are adjusted using the loss determined by the task objective. That is, if the correspondence adjustment helps the agent reuse the old component and produce successful outputs, it is reinforced, and vice versa. It is up to the curriculum design to ensure that correspondence produces a valid grounding and category expansion. One crucial thing to beware of is the task objective: if the old module’s task is to move left but the new module’s task is to move right, the correspondence will be flipped, becoming new-right-to-old-left. The objective is therefore an external factor that controls the grounding into new channels.
  • In contrast, for the same environment with grounded semantics, if the objective suddenly changes, e.g. from “move left” to “move right”, the module should not be relearned; that would be a huge change. Instead, the objective should be fed in as part of the module’s input. E.g. there should be a context vector, and in training, if the vector says “move left”, the reward is given when the proper action is taken as per the objective statement. In training, cycle through different objectives. In fact, this is a crucial way to differentiate categories for grounding, while also providing a bridge between our semantics and the agent’s: we know that when we input “move left” (which is in English and means what we imply), the agent will agree with our semantics and move left, as trained by a curriculum we design properly as well. This shall be the main mechanism by which we transfer humans’ grounded semantics to machines, so we can understand one another without translating from their arbitrary dictionary.
  • Multitask: the design above already incorporates the correspondence principle (hence the name). E.g. train to balance a ball in 3D, and it must be able to balance in 2D without training. Moreover, it can handle semantic multitask, where each subtask corresponds to a category, e.g. “left”, “right”. The vertical possibility of PowerNet means it can learn meta-control, e.g. controlling the timescale at which it acts, or having a hierarchical policy. The entire goal of semantic multitask is for the agent to “learn symbols, not signals”. Training efficiency is bad now because conventional algorithms always learn one-off weights from scratch, when they should be learning reusable components instead. If I have already learned how to balance a thing by tilting left or right, then when given a new balancing task, I should only need to figure out what corresponds to left and right (e.g. if the video is now rotated, so the new left and right are up and down), and apply what I have already learned.
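To make the sideways-extension idea concrete, here is a minimal NumPy sketch. All sizes and names are illustrative assumptions (e.g. 2 new inputs, 8 new hidden units); the pretrained weights are stand-ins, and the key points are the lateral connection from the new hidden units into the old ones and the two different learning rates.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Pretrained" 2D cartpole module (weights would come from prior training;
# random values stand in here): 4 old inputs -> 16 hidden -> 2 old outputs.
W_old_hidden = rng.standard_normal((16, 4)) * 0.1
W_old_out = rng.standard_normal((2, 16)) * 0.1

# New module for the extra dimension (assumed sizes), plus a sideways
# connection from the new hidden units into the old hidden layer.
W_new_hidden = rng.standard_normal((8, 2)) * 0.1   # 2 new inputs -> 8 hidden
W_lateral = np.zeros((16, 8))                      # sideways: new -> old hidden
W_new_out = rng.standard_normal((2, 8)) * 0.1      # 8 hidden -> 2 new outputs

# Base module trains slowly, new module (and lateral weights) quickly.
LR = {"old": 1e-5, "new": 1e-3}

def forward(x_old, x_new):
    h_new = np.maximum(0, W_new_hidden @ x_new)
    # Sideways injection: new hidden units feed into the old hidden layer.
    h_old = np.maximum(0, W_old_hidden @ x_old + W_lateral @ h_new)
    return np.concatenate([W_old_out @ h_old, W_new_out @ h_new])

out = forward(rng.standard_normal(4), rng.standard_normal(2))
print(out.shape)  # (4,) — 2 old outputs followed by 2 new ones
```

In a framework like PyTorch, the same two-speed training falls out naturally from optimizer parameter groups with different learning rates.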
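The unit-5-to-unit-1 correspondence described above can be pictured as a learned map from new input units onto old ones. In this sketch the correspondence matrix is set by hand to the target it should converge to under training (with a third, spurious new unit whose column stays at 0, per the "no big deal" case); in practice it would be learned with the higher learning rate.

```python
import numpy as np

# Correspondence map C: rows index old input units, columns index new units.
C = np.zeros((4, 3))   # 4 old input units; 3 new units (5, 6, and one spurious)
C[0, 0] = 1.0          # new unit 5 -> old unit 1 ("left")
C[1, 1] = 1.0          # new unit 6 -> old unit 2 ("right")
# The third new unit corresponds to nothing, so its column stays at 0.

x_new = np.array([0.9, 0.1, 0.5])   # new channel mostly signals "left"
x_old_view = C @ x_new              # how the old module sees the new inputs
print(x_old_view)                   # [0.9 0.1 0.  0. ]
```

Feeding `C @ x_new` into the pretrained module is exactly what lets the new channel trigger the old grounded categories.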
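The objective-as-context-vector idea can also be sketched end to end. Assumptions here: a tiny linear policy, a two-objective curriculum, and the reward collapsed into a supervised target that follows the stated objective. One set of weights ends up following whichever objective the context vector states, rather than baking a single goal into the weights.

```python
import numpy as np

rng = np.random.default_rng(0)
OBJ = {"move left": np.array([1.0, 0.0]), "move right": np.array([0.0, 1.0])}
W = np.zeros((4, 2))   # 2 state dims + 2 context dims -> 2 action logits

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Cycle through objectives during training; the correct action is defined
# by the objective statement, so the context vector must carry the goal.
for step in range(500):
    objective = "move left" if step % 2 == 0 else "move right"
    target = 0 if objective == "move left" else 1
    x = np.concatenate([rng.standard_normal(2), OBJ[objective]])
    p = softmax(W.T @ x)
    W -= 0.1 * np.outer(x, p - np.eye(2)[target])   # cross-entropy gradient step

# The same weights now follow whichever objective is stated:
for objective, target in [("move left", 0), ("move right", 1)]:
    x = np.concatenate([np.zeros(2), OBJ[objective]])
    assert int(np.argmax(W.T @ x)) == target
```

Swapping the objective only swaps the context input, so nothing is relearned — which is the point of the bullet above.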

credits: ideas are mine and Laura’s

Read part 2 here:

