An IISc Lecture: Deep Learning Research – Representation Learning
In his hour-long lecture on cutting-edge research in deep learning (DL), Professor Sargur Narasimhamurthy Srihari touched upon various topics, including the significance of representation, representation learning methods, transfer learning, disentangling variables and explainable artificial intelligence (AI). Prof. Srihari is a SUNY Distinguished Professor in the Department of Computer Science and Engineering at the University at Buffalo. A pioneering researcher, his efforts have resulted in the first large-scale handwritten address interpretation systems in the world (deployed by the IRS and USPS), post-Daubert court acceptance of handwriting testimony based on handwriting individuality assessment, a software system in use by forensic document examiners worldwide, the statistical characterization of uncertainty in impression evidence, and the first characterization of document image analysis as a sub-field of pattern recognition. In this article, I strive to highlight the important points of the session.
Note: Most of the images are taken from Prof. Srihari’s lecture slides and are available on his website.
The Importance of Representations
Analogous to having different metrics for measuring distance (like miles and kilometres), having a good representation of a problem is vital for enhanced and effective learning. You might be wondering how representations matter to a computer science student. Consider inserting a number into a sorted list versus inserting it into the same data represented as a red-black tree: the complexities differ significantly. The former takes O(n) time, whereas the latter takes O(log n).
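As a minimal illustration of this point (using Python’s standard `bisect` module; the specific numbers are arbitrary):

```python
import bisect

# Representation: a sorted Python list. bisect locates the insertion
# point in O(log n) comparisons, but list.insert still has to shift
# every later element, so the overall insertion costs O(n).
sorted_list = [1, 3, 5, 7, 9]
bisect.insort(sorted_list, 4)
print(sorted_list)  # [1, 3, 4, 5, 7, 9]
```

A balanced search tree (such as the red-black tree mentioned above) stores exactly the same numbers yet brings insertion down to O(log n); Python has no built-in red-black tree, which is why only the list version is shown here.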
But what has this got to do with machine learning (ML)? Is there an ML-centric example that depicts the importance of representations? Fig.2 depicts how changing the coordinate system from Cartesian (x, y) to polar (r, θ) can influence the ease with which classification is performed; in polar coordinates, a linear classifier can be used effectively. Representations should be disentangled, noise-invariant, informative and easy to work with. Representations do matter: we aim at learning representations, after all!
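The (x, y) → (r, θ) idea can be sketched with synthetic data (the ring radii, noise level and threshold below are illustrative assumptions, not values from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two classes that are NOT linearly separable in Cartesian (x, y):
# an inner ring (class 0) and an outer ring (class 1).
theta = rng.uniform(0, 2 * np.pi, 200)
r = np.concatenate([np.full(100, 1.0), np.full(100, 3.0)])
r = r + rng.normal(0, 0.1, 200)        # small radial noise
x, y = r * np.cos(theta), r * np.sin(theta)
labels = np.concatenate([np.zeros(100), np.ones(100)])

# Change of representation: (x, y) -> (r, theta).
r_feat = np.hypot(x, y)

# In polar coordinates a single threshold on r (a linear classifier
# in the new space) separates the classes almost perfectly.
predictions = (r_feat > 2.0).astype(float)
accuracy = (predictions == labels).mean()
print(accuracy)
```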
Role Played by Deep Learning When it Comes to Representations
To cut things short, deep learning finds representations. The shaded regions in the figure indicate the components that can learn from data. Extending the concept of representation learning to multiple layers gives rise to deep learning frameworks, as shown below (Fig.3).
If you’re a deep learning enthusiast, it is plausible that you’re au fait with the next few topics. But that’s not why I’ve penned down my learnings; this article doesn’t focus on brushing up your basics or going into the nitty-gritty. It’s more about facts, clearing up ambiguities and discussing current research work.
Representation learning methods comprise the following:
- Convolutional Neural Networks (CNNs)
- Recurrent Neural Networks (RNNs)
- Autoencoders
Autoencoders are unsupervised learning algorithms. The following holds for deterministic autoencoders: encoders are designed to obtain compressed yet highly informative representations of the input, and these compressed forms retain the information and data properties relevant to the task. Decoders expand the compressed representation to reconstruct the input. Sparse autoencoders penalize the compressed representations (in the hidden layer) via a sparsity penalty.
“The model is forced to prioritize which aspects of input are copied”.
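A minimal numerical sketch of the encode–compress–decode pipeline and the sparsity penalty (the dimensions and random weights are illustrative placeholders; a real autoencoder learns its weights by minimizing the loss below):

```python
import numpy as np

rng = np.random.default_rng(0)

# Forward pass of a tiny (untrained) autoencoder: 784-dim input,
# 32-dim code.
x = rng.random(784)                  # e.g. a flattened 28x28 image
W_enc = rng.normal(0, 0.01, (32, 784))
W_dec = rng.normal(0, 0.01, (784, 32))

code = np.tanh(W_enc @ x)            # encoder: compress to 32 dims
x_hat = W_dec @ code                 # decoder: reconstruct 784 dims

reconstruction_loss = np.mean((x - x_hat) ** 2)

# Sparse autoencoder: an L1 penalty on the code keeps only a few
# hidden units active, forcing the model to prioritize which
# aspects of the input are copied through the bottleneck.
sparsity_penalty = 0.001 * np.abs(code).sum()
loss = reconstruction_loss + sparsity_penalty
print(code.shape, x_hat.shape)
```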
Another notable fact is that autoencoders learn manifolds. In layman’s terms, imagine a manifold as a subset of the vector space (ambient space) of the original data, i.e. a low-dimensional space associated with the compressed representation. A manifold can have several dimensions. Interestingly, atypical behaviour can be observed when the input does not lie on the manifold. Essentially, a manifold can be perceived as a projection onto a lower-dimensional space (which is why Prof. Srihari quoted PCA as an example).
“Autoencoders learn a representation function that maps any point in ambient space to its embedding”.
An embedding is a low dimensional vector or a point on the manifold. Fig.6 depicts embeddings on a manifold of familiar classes.
How is this compression different from JPEG or MP3 compression, and how do we harness the power of these representations?
Autoencoders are lossy and learnt. Furthermore, the intermediate compressed representations can be modified in the encoded space to obtain outputs that have never been encountered before. DeepStyle is an example of this.
“Using one input image and changing values along different dimensions of feature space you can see how the generated image changes (patterning, color texture) in style space.”
Another instance where compression plays an important role is the analysis of tweets: a researcher is known to have analyzed 198 billion tweets, with 2.8 billion words in the training set, where a 30,000-dimensional one-hot vector can be represented by a compressed vector of 300 dimensions (rich with information). A recent paper titled ‘Embedding Historical Linguistics Data’ makes use of words shared across various languages to obtain a hyperbolic geometry (in the compressed form), as shown in Fig.9. This is an ICML ’18 proceeding.
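The one-hot-to-dense compression can be sketched as follows (the embedding matrix here is random for illustration; in practice it is learned, e.g. by word2vec-style training):

```python
import numpy as np

rng = np.random.default_rng(0)

# Vocabulary of 30,000 words; each word is a 30,000-dim one-hot
# vector. An embedding matrix maps it to a dense 300-dim vector.
vocab_size, embed_dim = 30_000, 300
E = rng.normal(0, 0.1, (vocab_size, embed_dim))

word_id = 1234
one_hot = np.zeros(vocab_size)
one_hot[word_id] = 1.0

# Multiplying by a one-hot vector just selects row `word_id`,
# which is why embedding layers are implemented as table lookups.
dense = one_hot @ E
assert np.allclose(dense, E[word_id])
print(dense.shape)  # (300,)
```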
The Underlying Concept Behind CNNs
Understanding Sparse Connectivity
Sparse interactions are a consequence of the kernel being smaller than the input image. For simplicity’s sake, imagine a hidden unit in layer L as an input to the hidden units in layer L+1. In a traditional feed-forward neural network, the current hidden unit serves as an input to ALL the hidden units of layer L+1 (unless you apply dropout regularization). The crux of designing CNNs lies in the fact that the current hidden unit affects only k hidden units of layer L+1, where k is the kernel size. This is an implication of sparse connectedness.
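A quick way to see sparse connectivity numerically: perturb a single input unit and count how many outputs of a size-k convolution change (the kernel values are arbitrary):

```python
import numpy as np

# Sparse connectivity: with a kernel of size k = 3, each input unit
# feeds only the k outputs whose windows cover it, not every output.
kernel = np.array([1.0, 2.0, 1.0])          # k = 3
x = np.arange(10, dtype=float)

out = np.convolve(x, kernel, mode="valid")  # 8 output units

# Perturb one interior input unit and see which outputs change.
x2 = x.copy()
x2[5] += 1.0
out2 = np.convolve(x2, kernel, mode="valid")
changed = np.flatnonzero(out != out2)
print(changed)  # exactly k = 3 outputs are affected
```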
CNNs and Matrix Multiplication
Operations associated with CNNs are not in one-to-one correspondence with mathematical convolution, owing to sparsity: unlike dense matrix multiplication, a convolutional layer ignores most of the interactions between one layer and the next. Fig.10 contrasts CNNs with traditional matrix multiplication; the shaded regions s_x correspond to the units that are impacted by x_3.
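The matrix-multiplication view can be made concrete: a ‘valid’ 1-D convolution equals multiplication by a banded matrix whose rows all share the same k weights (the values here are arbitrary):

```python
import numpy as np

# Convolution as matrix multiplication: the equivalent matrix is
# sparse and banded, with the same kernel weights repeated in each
# row, unlike a dense fully connected layer.
kernel = np.array([1.0, 2.0, 3.0])
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

n_out = len(x) - len(kernel) + 1            # 3 outputs
C = np.zeros((n_out, len(x)))
for i in range(n_out):
    C[i, i:i + len(kernel)] = kernel[::-1]  # np.convolve flips the kernel

assert np.allclose(C @ x, np.convolve(x, kernel, mode="valid"))
print(C)   # each row has only k = 3 nonzero, shared weights
```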
Computational Graphs for RNNs
An RNN can be perceived as a computational graph unfolded over various time instances. Fig.11 shows how a single computational graph is replicated over several time steps; cycles signify how the current value of a variable impacts its future value. Suppose a person discusses Kalman filters, moves on to what she ate for breakfast, and then returns to the topic of Kalman filters. Although she has connected her current remarks to something she said earlier, how does a system correlate the current context with prior information? Contextual information is pivotal in such a scenario, and the concept of ‘larger memory over a longer time’ is sacrosanct in such NLP-related tasks. This gave rise to LSTMs, where the cell state (or ‘conveyor belt’) is crucial for carrying forward previous information.
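The unfolding idea can be sketched as a loop that applies the same weights at every time step (the dimensions and random weights are illustrative placeholders, not a trained model):

```python
import numpy as np

rng = np.random.default_rng(0)

# An RNN is one computational graph replicated at each time step:
# the hidden state h carries context forward, so the current input
# is interpreted in light of everything seen before.
input_dim, hidden_dim, steps = 4, 8, 5
W_xh = rng.normal(0, 0.1, (hidden_dim, input_dim))
W_hh = rng.normal(0, 0.1, (hidden_dim, hidden_dim))

h = np.zeros(hidden_dim)
states = []
for t in range(steps):                  # unfold the cycle over time
    x_t = rng.random(input_dim)         # input at time t
    h = np.tanh(W_xh @ x_t + W_hh @ h)  # same weights at every step
    states.append(h)

print(len(states), states[-1].shape)    # 5 (8,)
```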
Transfer Learning
If you’re a newbie to DL, or if you don’t have the hardware required to train deep networks, this is an ideal first assignment! We use transfer learning when:
- Two or more tasks have to be performed.
- We can safely assume that the factors explaining the variations in task X are relevant to the variations that must be considered for learning task Y.
For example, a DL framework designed to recognize handwritten English numerals can be reused for Indic-script numeral recognition. Transfer learning is characterised by shared representations and is known to improve the performance of DNNs.
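A hedged sketch of the recipe on synthetic data: a ‘pretrained’ feature extractor (a random matrix standing in for layers learned on task X) stays frozen while only a new linear head is trained for task Y. Everything here (dimensions, data, labels) is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

W_frozen = rng.normal(0, 0.1, (16, 64))   # shared representation, never updated

X = rng.random((100, 64))                 # toy "task Y" data
y = (X.sum(axis=1) > 32).astype(float)    # toy binary labels

features = np.tanh(X @ W_frozen.T)        # frozen forward pass

# Train only the new head with a few steps of gradient descent on
# the logistic loss; the feature extractor is left untouched.
w = np.zeros(16)
for _ in range(200):
    p = 1 / (1 + np.exp(-np.clip(features @ w, -30, 30)))
    w -= 0.5 * features.T @ (p - y) / len(y)

accuracy = (((features @ w) > 0) == (y == 1)).mean()
print(accuracy)
```

In a real setting the frozen matrix would be the early layers of a network trained on the source task, and only the head (plus perhaps a few top layers) would be fine-tuned.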
Understanding an Optimal Representation for ML
A good representation will make the classification task easier as we progress through successive layers. It is plausible that classes which are not linearly separable in the input features become linearly separable in the final layer. The entire task can be viewed as learning a single function f(x), which is a composition of simpler functions f^(1), f^(2) and f^(3). Ideally, at each stage of computing f(x), the classification becomes easier.
f(x) = f^(3)(f^(2)(f^(1)(x)))
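The composition above can be written down directly (the three functions are toy stand-ins, not actual network layers):

```python
# The layer-wise view of f(x) = f^(3)(f^(2)(f^(1)(x))): each stage
# transforms the previous representation into an easier one.
def f1(x):            # e.g. first layer: extract edges
    return x + 1

def f2(x):            # e.g. second layer: corners and contours
    return x * 2

def f3(x):            # final layer: task-ready representation
    return x - 3

def f(x):
    return f3(f2(f1(x)))

print(f(5))  # ((5 + 1) * 2) - 3 = 9
```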
“An ideal representation is one that disentangles the underlying causal factors of variation that generated the data, especially those that are relevant to our application. Most strategies for representation learning are based on introducing clues that help the learning to find these underlying factors of variation.”
There are a few more concepts that I will be covering in subsequent blog articles. Here are a few ongoing areas of research in DL, that you could explore.
- Disentangling or Untangling Variables.
- Semantic Manifolds and Manifold Learning: Prof. Srihari used an example of handwritten text to obtain images generated from a manifold. The core idea was to travel along one dimension in the latent (hidden) space while ensuring that the other dimensions were unaltered. His PhD students are working on deep learning for handwriting comparison; unfortunately, its details are unavailable as it’s an ongoing project using variational autoencoders for forensic comparison. In case you’re interested in handwriting recognition, give this a read.
- Capsule Networks: It’s rather ironic that Geoffrey Hinton expresses his dismay with CNNs, stating that the pooling operation is a blunder and that the fact it works so well is a disaster for the DL community. He recently proposed CapsNets to overcome the drawbacks of CNNs. Here’s a concise blog article on the topic. (I was introduced to this by a friend).
- Explainable Artificial Intelligence: Ever wondered why the answer given by a DNN is what it is? The algorithm alone doesn’t explain it, and it’s high time we contemplate something that sits between the input and the output. Researchers are working on incorporating an explanation interface (as shown in Fig.1) into the learning system. This will help answer some of the following questions:
- Why was X performed?
- Why wasn’t Y or any other method performed?
- What are the scenarios where this will be correct (or incorrect)?
- How can mistakes pertaining to X be corrected?
- When can we trust this? (If you’re detecting malignancy in medical images or deciding whether a convict should be executed, errors are unacceptable).
- Will this cheat in any scenario? (For example, in a recent experiment on recognizing horses, a classifier achieved seemingly optimal results for the wrong reason: each of the input images of horses carried a copyright symbol, and the classifier used the copyright symbol as evidence of a horse :P).
“Most of AI is a black-box when it comes to learning”.
- Extending the concepts of Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs) to explainable AI.
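The latent-space traversal mentioned in the manifold-learning bullet above can be sketched as follows (the decoder is a random linear map standing in for a trained VAE decoder; dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Walk along one latent dimension while holding the others fixed,
# decoding each point along the way.
latent_dim, image_dim = 10, 784
W_dec = rng.normal(0, 0.1, (image_dim, latent_dim))

z = np.zeros(latent_dim)             # a base point in latent space
traversal = []
for v in np.linspace(-3, 3, 7):      # vary only dimension 0
    z_step = z.copy()
    z_step[0] = v
    traversal.append(W_dec @ z_step)

images = np.stack(traversal)         # 7 decoded "images"
print(images.shape)  # (7, 784)
```

With a trained decoder, each step would yield a plausible image whose style varies smoothly along the chosen dimension, which is exactly the handwriting-generation demonstration described above.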
Open-ended Questions by the Audience
- Does explainable AI aim at improving existing frameworks or is it a new and unconventional approach altogether?
- As representation learning advances, feature representations will change. Can the current DL models handle this? (Apparently Kalman filters and LSTMs can handle this).
- Will there be a point where DL beats classical ML in all scenarios?
- Can transfer learning be conceived as a regularizer?
Note: This is an overview of the seminar. I shall post a few more articles in detail. None of the above-mentioned facts are my findings; I have merely jotted down points that were discussed in the lecture. Feel free to correct me or add a few more ongoing research areas in DL (there are a plethora of them, by now) :)