Exploration, Exploitation and Imperfect Representation in Deep Learning
The algorithms of learning can be coarsely abstracted as a balance of exploration and exploitation. A balanced strategy is followed in pursuit of a fitter representation. This representation can either be one that improves a model being learned, or it can sit at the meta-level, where it improves the algorithm that learns better models.
In exploitation, an automaton greedily pursues a path of learning that provides immediate rewards. In exploration, however, an automaton must decide to forego an immediate reward and instead search without a fixed direction, with the intent of discovering a greater reward elsewhere. The strategy for selecting one over the other is sometimes referred to as “regret minimization”. As an aside, Jeff Bezos has a very human interpretation of this strategy:
“‘Okay, now I’m looking back on my life. I want to have minimized the number of regrets I have.’ I knew that when I was 80 I was not going to regret having tried this. I was not going to regret trying to participate in this thing called the Internet that I thought was going to be a really big deal. I knew that if I failed I wouldn’t regret that, but I knew the one thing I might regret is not ever having tried. I knew that that would haunt me every day, and so, when I thought about it that way it was an incredibly easy decision.”
It is also related to the idea of Counterfactual Regret Minimization (CFR), the method used by Libratus, a poker-playing machine that has bested professional players. CFR is applicable in domains with imperfect information. In short, the strategy of selecting exploration over exploitation is relevant to domains with imperfect information.
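The building block of CFR is regret matching: play each action with probability proportional to its accumulated positive regret. Here is a minimal, self-contained sketch of regret matching against a fixed, biased rock-paper-scissors opponent; the payoff table and all names are illustrative, not Libratus's actual machinery.

```python
import random

random.seed(0)
ACTIONS = ["rock", "paper", "scissors"]
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

def payoff(a, b):
    # +1 if a beats b, -1 if b beats a, 0 on a tie
    return 1 if BEATS[a] == b else (-1 if BEATS[b] == a else 0)

regret = {a: 0.0 for a in ACTIONS}

def strategy():
    # play in proportion to accumulated positive regret
    pos = {a: max(r, 0.0) for a, r in regret.items()}
    total = sum(pos.values())
    if total == 0:
        return {a: 1 / 3 for a in ACTIONS}
    return {a: p / total for a, p in pos.items()}

for _ in range(5000):
    s = strategy()
    my = random.choices(ACTIONS, weights=[s[a] for a in ACTIONS])[0]
    opp = random.choices(ACTIONS, weights=[0.6, 0.2, 0.2])[0]  # biased toward rock
    for a in ACTIONS:
        # regret: how much better action a would have done than what I played
        regret[a] += payoff(a, opp) - payoff(my, opp)

final = strategy()
```

Against this rock-heavy opponent the accumulated regret concentrates on paper, so the learned strategy drifts toward the best response without ever modeling the opponent explicitly.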
Stochastic Gradient Descent (SGD), the workhorse learning algorithm of Deep Learning, employs exploitation as its fundamental motivation. SGD works only for networks composed of differentiable layers. Convergence happens because there are regimes in the parameter space that guarantee convergence of iterative affine transformations. This is well known in other fields such as Control Theory (the Method of Adjoints) and Chaos Theory (Iterated Function Systems).
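The exploitative character of SGD is visible even in a toy version: every step greedily follows the local gradient. A minimal sketch, fitting a one-parameter linear model y = w·x (the function and variable names are illustrative):

```python
import random

def sgd_fit(data, lr=0.05, epochs=100):
    """Fit y = w * x by stochastic gradient descent on squared error."""
    w = 0.0
    for _ in range(epochs):
        random.shuffle(data)            # randomness in example order
        for x, y in data:
            grad = 2 * (w * x - y) * x  # gradient of (w*x - y)^2 w.r.t. w
            w -= lr * grad              # greedy (exploitative) step
    return w

# Noiseless data generated from the true relationship y = 3x.
points = [(x, 3.0 * x) for x in [-2.0, -1.0, 1.0, 2.0]]
w_hat = sgd_fit(points)
```

Each update is purely local: the step that reduces the error right now is always taken, which is exactly the exploitation the text describes.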
Exploration features, however, are shoe-horned into classic gradient descent through different kinds of randomness. Examples include the randomness in how training examples are presented, noise terms in the gradient, dropout and batch normalization. When we examine the two phases of gradient descent, we realize that the first phase is dominated by exploitation behavior. This is where we see a high signal-to-noise ratio, and convergence is rapid. In this phase, second-order methods that exploit the Natural Gradient (see: Fisher Information Matrix) converge much faster. A recent method known as Kronecker-Factored Approximate Curvature (K-FAC), which approximates the FIM, has been shown to require 20–30 times fewer iterations than traditional first-order methods.
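One of the exploration devices mentioned above, noise terms in the gradient, can be sketched in a few lines. The annealing schedule and names below are illustrative assumptions, not a specific published recipe:

```python
import random

random.seed(0)

def noisy_sgd_step(w, grad, lr, t, eta=0.3):
    """One SGD step with annealed Gaussian noise added to the gradient."""
    sigma = eta / (1 + t) ** 0.55       # noise shrinks as training proceeds
    noise = random.gauss(0.0, sigma)
    return w - lr * (grad + noise)

w, lr = 0.0, 0.1
for t in range(2000):
    x = random.choice([-2.0, -1.0, 1.0, 2.0])
    grad = 2 * (w * x - 3.0 * x) * x    # squared error against target y = 3x
    w = noisy_sgd_step(w, grad, lr, t)
```

Early on, the large noise lets the iterate wander (exploration); as sigma decays, the update reduces to plain exploitative gradient descent.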
In the compression phase, exploration dominates, and randomization methods facilitate it. In this regime the gradients carry negligible information and convergence is extremely slow. This is where representation compression occurs. The elusive goal of generalization is achieved through the compression of representation. We can explore many interpretations of what generalization actually means, but ultimately it boils down to the shortest expression that can accurately capture the behavior of an observed environment.
Evolutionary algorithms (a.k.a. genetic algorithms) occupy the exploration end of the spectrum. In Deep Learning, evolutionary algorithms are usually employed in searching for architectures. This is a more sophisticated version of hyper-parameter optimization: instead of juggling constants like learning rates, the search algorithm juggles the composition of each layer of a network. It is used as the outer loop of the learning algorithm. The catch with evolutionary algorithms is that serendipitous discovery is fundamental. In short, they work only when you are lucky.
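The mutate-and-select outer loop can be sketched in a toy form. Here the "architecture" is just a list of layer widths and the fitness function is a stand-in; in a real search, fitness would be validation accuracy after training each candidate (all names below are illustrative):

```python
import random

random.seed(0)

def fitness(layers):
    # Hypothetical stand-in: prefer networks with ~100 total units in 3 layers.
    return -abs(sum(layers) - 100) - abs(len(layers) - 3) * 10

def mutate(layers):
    child = list(layers)
    op = random.choice(["grow", "shrink", "widen"])
    if op == "grow":
        child.append(random.randint(8, 64))          # add a layer
    elif op == "shrink" and len(child) > 1:
        child.pop(random.randrange(len(child)))      # remove a layer
    else:
        i = random.randrange(len(child))
        child[i] = max(1, child[i] + random.randint(-16, 16))  # resize a layer
    return child

population = [[random.randint(8, 64)] for _ in range(10)]
for _ in range(200):                        # the outer loop of learning
    population.sort(key=fitness, reverse=True)
    population = population[:5]             # selection (exploitation)
    population += [mutate(random.choice(population)) for _ in range(5)]  # exploration

best = max(population, key=fitness)
```

Note that the mutations are undirected: the search has no gradient, only the serendipity the text mentions, filtered by selection.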
Either method, or a combination of both, can lead to a fitter Representation. Let’s deconstruct the idea of Representations. In a previous post, I discussed 3 different dimensions of intelligence that are being developed (computational, adaptive and social). The claim is that these are different kinds of intelligence. What is apparent is that the domains in which they operate differ from each other, so form will have to follow function. The methods and architectures developed for each kind of intelligence are going to be different from each other. One dimension of Representation, then, is obviously the domain in which it is applicable.
There is, of course, the question of whether we should explore different kinds of Representations. I mean this at a more general level. In Deep Learning there are all kinds of neural embeddings. These embeddings are vector representations of semantics, learned over the course of training, and they are supposed to represent an invariant form of the actual concept. One major difficulty of Deep Learning systems is that these representations are extremely entangled. This entanglement can explain why Deep Learning systems have zero conceptual understanding of what they predict. Understanding requires the creation of concepts; if concepts cannot be factored out, then what does that imply for understanding? It is important to realize that there are many cases where understanding is not needed for competence.
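To make "vector representations of semantics" concrete: relatedness between embeddings is usually read off as cosine similarity. The 4-d vectors below are made up for illustration, not learned embeddings, and note that no single coordinate corresponds to a clean concept, which is the entanglement problem in miniature:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical embeddings: related concepts point in similar directions.
king  = [0.9, 0.8, 0.1, 0.3]
queen = [0.9, 0.2, 0.8, 0.3]
apple = [0.1, 0.1, 0.2, 0.9]

assert cosine(king, queen) > cosine(king, apple)
```

The geometry captures relatedness, but the individual dimensions mix many factors at once; nothing in the vector can be factored out as "royalty" or "gender" on its own.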
I will argue that AlphaGo doesn’t understand Go in the same way humans do. Humans understand Go by creating their own concepts behind the strategies they employ. AlphaGo has no understanding of these concepts; rather, it has the memory and statistics of billions of moves and their consequences.
The concepts that exist as part of a Representation may exist in 3 forms: a generalization, a prototype or an exemplar. Deep Learning focuses on creating generalizations through the capture of an invariant representation. This is why data augmentation is a best-practice approach: when working with images, the images are rotated, cropped, de-saturated, etc. This trains the network to ignore these variations. In addition, Convolution Networks are designed to ignore image translations (i.e. differences in location). The reason DL systems require large training sets is that they need to “see” enough variations to learn what to ignore and what to keep relevant. Perhaps, however, the requirement for invariances is too high and we should seek something less demanding in the form of equivariances.
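The augmentation practice described above can be sketched as generating label-preserving variants of an image so the network learns to ignore them. Here `image` is a stand-in 4x4 grayscale array and the set of transforms is a simplified assumption:

```python
import numpy as np

image = np.arange(16, dtype=float).reshape(4, 4)  # toy "image"

def augment(img, rng):
    """Return one random label-preserving variant of img."""
    variants = [
        img,                          # original
        np.fliplr(img),               # horizontal flip
        np.rot90(img),                # 90-degree rotation
        np.clip(img * 0.8, 0, 255),   # crude brightness / saturation change
    ]
    return variants[rng.integers(len(variants))]

rng = np.random.default_rng(0)
batch = [augment(image, rng) for _ in range(8)]
```

Every variant carries the same label as the original, so over many batches the network is pushed toward a representation that is invariant to these transformations.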
In the realm of few-shot or one-shot learning, where an automaton must learn something by seeing it only once or a few times, there is little opportunity to discover invariances. An automaton only has a few examples from which to create a prototype that is representative of the entire class. So there needs to be some prior model that is capable of performing the appropriate similarity calculation: the system must know how to determine whether an example is similar to a prototype.
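The prototype idea can be sketched as a nearest-mean classifier: each class is summarized by the mean of its few examples, and a query takes the label of the closest prototype. The 2-d feature vectors and class names below are illustrative; in real few-shot learning the features would come from a pretrained embedding, which plays the role of the prior model just described:

```python
import math

def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def dist(u, v):
    """Euclidean distance: the assumed similarity calculation."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

support = {
    "cat": [[0.9, 0.1], [0.8, 0.2]],   # a few examples per class
    "dog": [[0.1, 0.9], [0.2, 0.8]],
}
prototypes = {label: mean(vs) for label, vs in support.items()}

def classify(query):
    return min(prototypes, key=lambda label: dist(prototypes[label], query))
```

All the generalization burden falls on the distance function: with only two examples per class, the prototype is only as good as the space in which distances are measured.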
Even worse, if it’s just one example, then there isn’t really a class, and the system has to deduce a generalization. The implication of the latter is that an automaton requires an internal model that exists prior to any deduction. So we have three kinds of models: a model-free representation (learned through induction), a representation for a similarity algorithm, and a rich representation that can drive deduction (or abduction).
It is interesting that the human mind is able to recall unusual configurations yet be unable to recall finer details. Here is an example of people’s attempts to draw the Apple logo from memory:
Meanwhile, DL networks that have zero understanding of an image are much better at recreating images than the average human:
Not many people have the ability to visualize an image in their head and recreate it properly in a drawing. I’m unsure whether this is a weakness in their articulation skills or whether the details of an object aren’t actually captured by the brain. In fact, humans are typically blind in many contexts:
Another illustration of “inattentional blindness”:
The human mind simply does not have the same kind of photographic memory that a machine has. Its capacity is limited, and it must throw away a lot of information to cope with information overload.
In a previous post, we explored how model-free and model-based cognition can be interleaved in the process of learning. Exploration and exploitation can also be interleaved in learning. However, the two distinctions are orthogonal. As in SGD, you can have a model-free algorithm that uses both exploration and exploitation. You can also have model-based algorithms that explore or exploit. There are, then, at least three dimensions described here: one axis runs from exploration to exploitation; the second is whether the learning process is driven by an explicit model or not; the third is the nature of the learned Representation itself.
Bridging the semantic gap is still an extreme challenge.