Representation learning — The core of machine learning
Representation learning is a key concept in machine learning that has become tightly connected to deep neural networks in recent years. In machine learning, representations transform the input data into a mathematical form that can then be used effectively by a learning algorithm (e.g., a classifier). The performance of a machine learning algorithm relies heavily on the representation, and a good representation can help the learning algorithm discover structure in the input data. In this blog post we look at the basics of representation learning and try to get a grasp of recent trends within the field.
Why should we learn features?
Representation learning aims to learn features from raw data that can be used in further processing steps of the machine learning task. In contrast to more traditional feature engineering, which uses domain knowledge to extract features from the raw, often high-dimensional input data, in representation learning these features are discovered automatically by the system. To understand why it is necessary to learn these features instead of relying on hand-engineering, we have to consider the transferability of representations across different tasks. Hand-engineered features are applicable to a small subset of problems, and these solutions do not generalize well to small changes in the task conditions or in the input data. If we want features that are more generally applicable and less sensitive to task conditions, a better approach is to let the algorithm itself discover the task-relevant features from the input data. Deep learning methods are based on multi-layer, neural-network-based feature extraction. Deep neural networks can learn highly abstract representations from input data, which has proven to facilitate learning and has achieved unprecedented performance in many fields (computer vision, natural language processing, robotics).
The HOG (histogram of oriented gradients) feature descriptor is an example of traditional feature engineering in computer vision, used for object detection. The domain knowledge used here is that high intensity gradients encode edges, and images can be described by the distribution of edge directions (intensity gradients). The input image is therefore represented by counting the occurrences of edge directions in localized portions of the image, and objects can be detected based on these edge distributions. In deep learning algorithms, by contrast, such edge features are extracted automatically through hierarchical representation learning.
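For the curious, here is a minimal sketch of computing HOG features with scikit-image; the sample image and parameter values are illustrative choices only.

```python
# A minimal sketch of HOG feature extraction with scikit-image.
from skimage import data
from skimage.color import rgb2gray
from skimage.feature import hog

image = rgb2gray(data.astronaut())  # sample image shipped with scikit-image

# Each 8x8 cell accumulates a 9-bin histogram of gradient orientations;
# the block-normalized histograms, concatenated, form the descriptor.
features = hog(
    image,
    orientations=9,
    pixels_per_cell=(8, 8),
    cells_per_block=(2, 2),
)
print(features.shape)  # one fixed-length edge-distribution feature vector
```

Every design decision here (cell size, number of orientation bins, normalization blocks) is hand-engineered domain knowledge, which is exactly what representation learning aims to replace.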
What is a good representation?
In simple terms, we could define a good representation as one that makes the subsequent learning task easier. However, this raises the question of how task-specific the learned representation should be. For example, in the case of visual representations, if we want the representation to be more generic, it should be insensitive (invariant) to translations, rotations and occlusions. Invariance allows us to build features and representations that are sensitive to the aspects of the data we care about and insensitive to other factors. Furthermore, the learned representation should also separate the factors of variation in the high-dimensional space of the raw input data (disentanglement). Following Bengio (2013), a disentangled representation is one in which a change along one dimension corresponds to a change in one factor of variation. This means that instead of learning a simple compression of the data, the axes of the learned representation align with the generative factors of the data. These factors are grounded in physical laws, so they are shared across different datasets, and uncovering them enables quicker generalization to new tasks. In contrast to feature engineering, the algorithm discovers the underlying factors behind the data, so humans are not required to specify the meaning of those factors. As a result of disentangling the factors of variation, the subsequent machine learning tasks become easier. Since we do not know the downstream tasks beforehand, it is important not to throw away important factors, but rather to disentangle them and make it easy for a classifier to distinguish between the factors of interest.
These requirements of invariance and disentanglement for useful representations are not new; e.g., Gibson (1979), writing on visual perception, noted:
Four kinds of invariants have been postulated: those that underlie change of illumination, those that underlie change of the point of observation, those that underlie overlapping samples, and those that underlie a local disturbance of structure.
However, these ideas have only recently been successfully implemented in algorithms, thanks to deep neural networks. Deep neural networks can learn highly abstract representations from input data while also fulfilling the previously mentioned requirements for a good representation. As a simple example of disentangled representation learning, consider a deep neural network that generates faces: by learning the factors of variation from the input data (shape of face, hair color), we obtain controllable generation of new data, as sketched below. In recent years many subfields have emerged in deep-neural-network-based representation learning. Many of these approaches use some "clever tricks" to achieve desirable properties in the learned representation. This can be thought of as high-level feature engineering, where the domain knowledge is often inserted through the architecture of the neural network. We briefly explore some of these ideas in the next section.
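To make this concrete, below is a hypothetical latent-traversal sketch, assuming PyTorch. The untrained generator and the choice of a "hair color" dimension are purely illustrative assumptions, not a real trained model; with a disentangled representation, varying one latent coordinate would change exactly one generative factor in the output.

```python
# A hypothetical latent-traversal sketch (PyTorch assumed).
import torch
import torch.nn as nn

latent_dim = 64
generator = nn.Sequential(            # stand-in for a trained face generator
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, 3 * 32 * 32),      # flattened 32x32 RGB "image"
)

z = torch.randn(1, latent_dim)        # sample a base latent code
hair_color_dim = 3                    # hypothetical disentangled axis

images = []
for value in torch.linspace(-3.0, 3.0, steps=7):
    z_edit = z.clone()
    z_edit[0, hair_color_dim] = value  # move along a single latent axis
    images.append(generator(z_edit))   # all other factors stay fixed
```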
From Input to Representations Using State-of-the-Art Techniques
With this prelude, we now dive into different state-of-the-art techniques and concepts for capturing important features from the input data: manifold learning, inductive biases, generative adversarial networks and semi-supervised learning.
Manifold learning takes a geometric point of view on representation learning. The manifold hypothesis states that real-world high-dimensional input data lies on a low-dimensional manifold. This means that most of the probability mass is concentrated in a small region of a lower-dimensional space, called the manifold. One can imagine that for any signal, including speech or images, the set of data sequences corresponding to a particular utterance or image occupies a very small volume (the latent space) in the space of all possible sequences. Image vectors representing 3D objects under varying conditions, and phonemes in speech signals, are examples of lower-dimensional manifolds embedded in higher-dimensional spaces. Latent variable models are typical examples of manifold learning: an encoder maps the complex input to a flattened, lower-dimensional code, which we refer to as the "representation", while a decoder does the opposite. In the high-level representation of the latent space, some directions correspond to meaningful attributes. In general, this lower-dimensional structure arises from constraints imposed by the physical laws of the world. We can therefore relate manifold learning to the ideas of disentangled representation learning mentioned in the previous section.
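A minimal autoencoder is perhaps the simplest latent variable model of this kind. Below is a sketch, assuming PyTorch, with arbitrary layer sizes chosen for illustration: the encoder maps high-dimensional inputs to low-dimensional latent coordinates on the learned manifold, and the decoder maps them back.

```python
# A minimal autoencoder sketch (PyTorch assumed; sizes are illustrative).
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),      # the learned representation
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)                  # point on the learned manifold
        return self.decoder(z), z

model = AutoEncoder()
x = torch.randn(16, 784)                     # stand-in batch of flattened images
recon, z = model(x)
loss = nn.functional.mse_loss(recon, x)      # reconstruction objective
```

Minimizing the reconstruction loss through the narrow latent bottleneck forces the network to find the low-dimensional coordinates along which the data actually varies.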
Inductive biases in machine learning are assumptions that give the model a priori preferences for certain generalizations over others in the given task. They encode our assumptions about the learning problem into the model. In deep learning, inductive biases are usually implemented via the network architecture design. In this way they relate back to traditional feature engineering, but here the domain knowledge about the problem at hand is inserted into the learning algorithm through the neural network architecture. Choosing the right inductive bias can facilitate the given machine learning task, so it is an important design choice. The best-known example is the convolutional neural network, which encodes a relational inductive bias for locality and yields spatial translation invariance. Another example is the recurrent neural network, whose architecture encodes an inductive bias for sequentiality and gives temporal invariance.
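A small experiment makes the convolutional bias tangible: because the same local filter is shared across all positions, shifting the input (away from border effects) simply shifts the output. A sketch, assuming PyTorch:

```python
# Demonstrating translation equivariance of a convolution (PyTorch assumed).
import torch
import torch.nn as nn

conv = nn.Conv2d(1, 4, kernel_size=3, padding=1, bias=False)

x = torch.randn(1, 1, 16, 16)
x_shifted = torch.roll(x, shifts=2, dims=-1)   # shift input 2 pixels right

y = conv(x)
y_shifted = conv(x_shifted)

# Away from the borders, the response to the shifted input equals the
# shifted response to the original input.
print(torch.equal(torch.roll(y, shifts=2, dims=-1)[..., 4:-4],
                  y_shifted[..., 4:-4]))       # True
```

No fully connected layer would satisfy this property without learning it from data; the convolution gets it for free from its architecture.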
Generative Adversarial Networks (GANs), on the other hand, set up a game-theoretic scenario in which two neural networks work against each other to model the data distribution concentrated near the manifold described above. In this game, the data distribution is represented through a generative process in which samples from, typically, a Gaussian distribution are transformed nonlinearly into data space. The generator's objective is paired with a discriminator, which judges whether a generated data point is close to the available data or easily distinguishable as an outlier. This setup is a min-max game.
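To make the min-max game concrete, here is a hedged sketch of a GAN training loop on toy 2-D data, assuming PyTorch; the network sizes, learning rates and the "real" data distribution are arbitrary stand-ins.

```python
# A minimal GAN training loop sketch (PyTorch assumed, toy 2-D data).
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))   # generator
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))   # discriminator

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(1000):
    real = torch.randn(64, 2) * 0.5 + 2.0       # stand-in "real" data
    fake = G(torch.randn(64, 8))                # Gaussian noise -> data space

    # Discriminator step: tell real samples apart from generated ones.
    d_loss = bce(D(real), torch.ones(64, 1)) + \
             bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step: fool the discriminator (the adversarial objective).
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```

The generator never sees the data directly; it improves only through the discriminator's gradient signal, which is what makes the setup adversarial.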
Similarly, in semi-supervised learning, for any signal we can perform unsupervised learning by predicting one part of the data from another, unseen part in order to capture the underlying semantics. Example tasks include recovering the original frame order of a video, finding patches of an image that belong together (context prediction), or solving a jigsaw puzzle of image patches. These types of models allow us to predict high-level representations from other sets of high-level representations that share mutual similarities. Contrastive learning is an interesting subset of this family that has recently improved the baselines significantly for learning representations from unstructured and unlabelled datasets.
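As a concrete instance, here is a sketch of an InfoNCE-style contrastive objective, assuming PyTorch; the linear encoder and the additive-noise "augmentations" are stand-ins for the deep networks and strong augmentations used in practice.

```python
# A contrastive (InfoNCE-style) objective sketch (PyTorch assumed).
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    # Two views of the same input should be similar (diagonal entries),
    # views of different inputs dissimilar (off-diagonal entries).
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature       # pairwise cosine similarities
    targets = torch.arange(z1.size(0))       # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

encoder = torch.nn.Linear(784, 64)           # stand-in encoder
x = torch.randn(32, 784)
view1 = x + 0.1 * torch.randn_like(x)        # stand-in "augmentations"
view2 = x + 0.1 * torch.randn_like(x)
loss = info_nce(encoder(view1), encoder(view2))
```

The supervision signal here is manufactured from the data itself: no labels are needed, only the assumption that augmentations preserve the content we care about.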
In future blog posts we will explore some of these concepts more in depth.
Authors:
Anna Deichler, PhD candidate in machine learning, KTH Sweden
Kiran Chhatre, PhD candidate in machine learning, KTH Sweden