How Deep Learning Neural Networks Extract Features

Lei Cao
8 min read · Dec 20, 2018

Table of Contents

  1. The History of Feature Extraction
  2. How Popular Neural Networks Extract and Understand Features
  3. Conclusion

The History of Feature Extraction

First, we need to understand how the idea of a feature space came into existence.

Researchers have long known artificial neural networks to be universal function approximators, and from the very beginning it was understood that stacking multiple nonlinear transformations can smooth out the non-convexity of the mapping being learned. The problem was how to train very deep networks, especially when the dimension of the input space is very high. This is where the concept of a feature space came in: researchers manually reduced the number of input dimensions and smoothed out some of the non-convexity through hand-crafted features.

But over the last decade, thanks to the availability of large-scale data, computational resources, regularization schemes, and special network architectures, training deep networks has become feasible. This is what we call deep learning, and it is now affecting every aspect of our lives.

Finally, it was observed that the representations of the data in the intermediate layers of a trained network follow a pattern from simple to complex, which is similar to the feature space discussed above. This observation applies to a generic feed-forward neural network.

For a Convolutional Neural Network (CNN), you can think about it from a different angle: the architecture of the network provides a way to reduce the dimension of the input space while retaining the important features so that they can be learned easily. The concept of filters in a convolutional network has a direct similarity with a feature space.

While CNNs assume data points are independent of each other, Recurrent Neural Networks (RNNs) address the sequential relationships among data points. They aggregate information along the sequence in different ways, using techniques such as the Gated Recurrent Unit, LSTM, Attention, etc.

How Popular Neural Networks Extract and Understand Features

FNN

The feedforward neural network was the first and simplest type of artificial neural network devised. In this network, information moves in only one direction, forward, from the input nodes, through the hidden nodes (if any), to the output nodes. There are no cycles or loops in the network. It simply works directly on all the given features of the data.
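As a minimal illustration (not taken from this article), such a feedforward network could be sketched in PyTorch as follows; the layer sizes here are arbitrary assumptions:

```python
import torch
import torch.nn as nn

# A minimal feedforward network: information flows strictly forward,
# input -> hidden -> output, with no cycles or loops.
model = nn.Sequential(
    nn.Linear(10, 32),  # input layer -> hidden layer (sizes are arbitrary)
    nn.ReLU(),          # nonlinearity applied to the hidden representation
    nn.Linear(32, 1),   # hidden layer -> output layer
)

x = torch.randn(4, 10)  # a batch of 4 examples, each with 10 given features
y = model(x)            # one forward pass over all the given features
print(y.shape)          # torch.Size([4, 1])
```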

CNN

CNNs use convolutional layers to extract features and pooling (max or average) layers to generalize them. The various filters used in the convolutional layers extract different sets of features. The shallower a layer is, the more detailed and low-level the extracted features are; the deeper a layer is, the more general and abstract they become. Let's try to understand this in more detail:

The first few layers extract detailed, low-level features like edges, corners, and curves that don't yet amount to any meaningful object. The last few layers extract more general features, the dominant objects in an image, like faces, body figures, and whole objects. This process mimics how humans recognize and differentiate objects, building up from every little detail we can see.
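To make the convolution-then-pooling idea concrete, here is a hedged PyTorch sketch; the channel counts, kernel sizes, and image size are illustrative assumptions rather than values from the article:

```python
import torch
import torch.nn as nn

# Shallow convolutional layers respond to low-level details (edges, corners);
# deeper layers combine them into more general patterns. Pooling layers
# downsample the feature maps, which generalizes the extracted features.
features = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # shallow layer: low-level features
    nn.ReLU(),
    nn.MaxPool2d(2),                              # generalize / downsample
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # deeper layer: more general features
    nn.ReLU(),
    nn.MaxPool2d(2),
)

img = torch.randn(1, 3, 32, 32)  # one 3-channel (RGB) image of size 32x32
fmap = features(img)
print(fmap.shape)                # torch.Size([1, 32, 8, 8])
```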


RNN

Why do we need sequence models when we already have feedforward networks and CNNs? The problem with those models is that they perform poorly when given a sequence of data. An example of sequence data is an audio clip, which contains a sequence of spoken words; another is an English sentence, which contains a sequence of words. Feedforward networks and CNNs take fixed-length inputs, but sentences are not all of the same length. You could overcome this by padding all the inputs to a fixed size, yet these models would still perform worse than an RNN because they do not understand the context of the given input.

This is where the major difference between sequence models and feedforward models lies. Given a sentence, when looking at a word, a sequence model tries to derive relations from the previous words in the same sentence. This is similar to how humans think: when we read a sentence, we don't start from scratch every time we encounter a new word. We process each word based on our understanding of the words we have already read.

Let’s take a look at the structure of a Basic RNN:

Each node at a time step takes an input from the previous node, which can be represented using a feedback loop. We can unfold this feedback loop and represent it as shown on the right. At each time step, we take the input X_t and V (the output of the previous node), perform a computation on them, and produce an output O_t.

The inputs (features) from previous steps are retained. They are combined with the features of the current step, and a prediction is made at every step. Each prediction is based on all the information fed to the network up to that step.
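As a rough sketch of this recurrence (written from the standard RNN equations, with toy dimensions chosen only for illustration, not taken from the article), one step and its unfolding over a sequence could look like:

```python
import torch

# Toy dimensions, chosen only for illustration.
input_size, hidden_size = 8, 16
W_x = torch.randn(hidden_size, input_size) * 0.1   # input-to-hidden weights
W_h = torch.randn(hidden_size, hidden_size) * 0.1  # hidden-to-hidden weights (the feedback loop)
b = torch.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    # The new hidden state mixes the current input with the information
    # retained from all previous steps.
    return torch.tanh(x_t @ W_x.T + h_prev @ W_h.T + b)

h = torch.zeros(hidden_size)            # no history before the first step
for x_t in torch.randn(5, input_size):  # a sequence of 5 inputs
    h = rnn_step(x_t, h)                # unfolding the feedback loop over time
# h now summarizes the whole sequence; a prediction at each step would be
# computed from the corresponding hidden state.
```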

This sequential retention of features is how an RNN extracts features. But there is a disadvantage to the basic RNN: due to the vanishing gradient problem, the current step cannot access features retained from a long time ago, so the feature retention is not very effective. For example: “I grew up in France ……………... I speak fluent ( French )”. To predict that you speak French, the network has to look back a long way, so this kind of prediction can be very hard for a basic RNN.

GRU

A Gated Recurrent Unit (GRU) uses an update gate and a reset gate. The update gate decides how much information from the past should be let through, and the reset gate decides how much information from the past should be discarded.

In the above figure, z_t represents the update gate operation: using a sigmoid function, we decide which values from the past to let through. r_t drives the reset gate operation: the values carried over from the previous time step are multiplied by r_t before being combined with the current time step, which determines the values we would like to discard from the previous time steps.

So a GRU uses the update gate and reset gate to decide which features should be retained, which helps it overcome the vanishing gradient problem: the model does not wash out the new input every single time, but keeps the relevant information and passes it down to the next time steps of the network.
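As a hedged sketch (following the standard GRU formulation rather than any code from this article), one GRU step could be written like this; the weight names and sizes are assumptions:

```python
import torch
import torch.nn as nn

# One GRU step following the standard gate equations; sizes are assumptions.
input_size, hidden_size = 8, 16
W_z = nn.Linear(input_size + hidden_size, hidden_size)  # update gate
W_r = nn.Linear(input_size + hidden_size, hidden_size)  # reset gate
W_h = nn.Linear(input_size + hidden_size, hidden_size)  # candidate state

def gru_step(x_t, h_prev):
    xh = torch.cat([x_t, h_prev], dim=-1)
    z_t = torch.sigmoid(W_z(xh))   # how much past information to let through
    r_t = torch.sigmoid(W_r(xh))   # how much past information to discard
    h_cand = torch.tanh(W_h(torch.cat([x_t, r_t * h_prev], dim=-1)))
    return (1 - z_t) * h_prev + z_t * h_cand  # blend old state with the candidate

h = gru_step(torch.randn(input_size), torch.zeros(hidden_size))
```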

This is how a GRU network extracts features.

LSTM


The overall structure of an LSTM network is the same as that of an RNN, but the repeating module performs more operations. Enhancing the repeating module is what enables the LSTM network to remember long-term dependencies. Let's break down each operation that helps the network remember better.

1. Forget gate operation

Forget Gate Operation

We take the input from the current time step and the learned representation from the previous time step and concatenate them. We pass the concatenated value through a sigmoid function, which outputs a value f_t between 0 and 1. We then perform an element-wise multiplication between f_t and c_t-1. If a value in f_t is 0, the corresponding entry is eliminated from c_t-1; if it is 1, it is completely let through. This is why the operation is called the "forget gate operation".
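A small sketch of this operation, assuming the standard LSTM formulation and toy sizes (nothing here comes from the article itself):

```python
import torch
import torch.nn as nn

# Forget gate, following the standard LSTM equations; sizes are assumptions.
input_size, hidden_size = 8, 16
W_f = nn.Linear(input_size + hidden_size, hidden_size)

x_t = torch.randn(input_size)      # input at the current time step
h_prev = torch.zeros(hidden_size)  # learned representation from the previous step
c_prev = torch.randn(hidden_size)  # previous cell state c_{t-1}

f_t = torch.sigmoid(W_f(torch.cat([x_t, h_prev], dim=-1)))  # values between 0 and 1
c_kept = f_t * c_prev  # element-wise: 0 erases an entry of c_{t-1}, 1 lets it through
```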

2. Update gate operation

Update operation

The above figure represents the "update gate operation". We concatenate the values from the current time step and the learned representation from the previous time step. By passing the concatenated values through a tanh function we generate candidate values, and by passing them through a sigmoid function we choose which of the candidates to select. The chosen candidate values are then added to c_t-1.
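Again as a hedged sketch under the same standard formulation and assumed sizes:

```python
import torch
import torch.nn as nn

# Update gate, following the standard LSTM equations; sizes are assumptions.
input_size, hidden_size = 8, 16
W_i = nn.Linear(input_size + hidden_size, hidden_size)  # selects candidates
W_c = nn.Linear(input_size + hidden_size, hidden_size)  # generates candidates

x_t = torch.randn(input_size)      # input at the current time step
h_prev = torch.zeros(hidden_size)  # learned representation from the previous step

xh = torch.cat([x_t, h_prev], dim=-1)
c_cand = torch.tanh(W_c(xh))   # candidate values, between -1 and 1
i_t = torch.sigmoid(W_i(xh))   # which candidate values to select
c_update = i_t * c_cand        # the contribution to be added to c_{t-1}
```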

3. Output gate operation

Updating the values
Output operation

We concatenate the values from the current time step and the learned representation from the previous time step and pass them through a sigmoid function to choose which values we are going to output. We then take the cell state, apply a tanh function, and perform an element-wise multiplication, which lets through only the selected outputs.
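Putting the three gate operations together, a single LSTM step could be sketched as follows; this is a standard-formulation sketch with assumed names and sizes, not the article's own code:

```python
import torch
import torch.nn as nn

# One full LSTM step combining the forget, update, and output gate
# operations described above; sizes and names are assumptions.
input_size, hidden_size = 8, 16
W_f = nn.Linear(input_size + hidden_size, hidden_size)  # forget gate
W_i = nn.Linear(input_size + hidden_size, hidden_size)  # update (input) gate
W_c = nn.Linear(input_size + hidden_size, hidden_size)  # candidate values
W_o = nn.Linear(input_size + hidden_size, hidden_size)  # output gate

def lstm_step(x_t, h_prev, c_prev):
    xh = torch.cat([x_t, h_prev], dim=-1)
    f_t = torch.sigmoid(W_f(xh))                    # forget: what to drop from c_{t-1}
    i_t = torch.sigmoid(W_i(xh))                    # update: which candidates to add
    c_t = f_t * c_prev + i_t * torch.tanh(W_c(xh))  # new long-term memory (cell state)
    o_t = torch.sigmoid(W_o(xh))                    # output: what to expose
    h_t = o_t * torch.tanh(c_t)                     # new short-term memory / output
    return h_t, c_t

h, c = lstm_step(torch.randn(input_size), torch.zeros(hidden_size), torch.zeros(hidden_size))
```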

Now, this is a lot of operations to be done in a single cell. With a bigger network, the training time increases significantly compared to a basic RNN. A GRU is an alternative to the LSTM if you want to reduce your training time while still using a network that remembers long-term dependencies.

So, an LSTM extracts features by maintaining a pair of memories, one long-term and one short-term. In each LSTM cell, the model computes what should be carried forward in the long-term memory and what should be kept in the short-term memory.

Conclusion

The mainstream Deep Learning Neural Networks each have their own advantages as well as disadvantages. When we have a dataset, we first need to determine what kind of problem we need to solve and what kind of data the dataset contains. If the data points are independent of each other, chances are that a CNN model is more suitable; otherwise, we need an RNN to address the relationships between the data points.

But the active deep learning research community is not satisfied: people are running lots of experiments that mix CNNs and RNNs, or other combinations of their subcategories, trying to leverage the advantages of each model and achieve a new generation of Deep Learning Neural Networks.
