Deep Learning Notes Part II
Feb 25, 2017 · 14 min read
Restricted Boltzmann Machine (RBM)
- Part of what allowed researchers to overcome the vanishing gradient problem.
- This is a method that can automatically find patterns in our data by reconstructing the input.
- It was the brainchild of Geoff Hinton at the University of Toronto, referred to as one of the fathers of deep learning.
- An RBM is a shallow, two-layer net: the first layer is known as the visible layer and the second layer is known as the hidden layer.
- Each node in the visible layer is connected to every node in the hidden layer.
- An RBM is considered restricted because no two nodes in the same layer share a connection.
- An RBM is the mathematical equivalent of a two way translator. In the forward pass an RBM takes the inputs and translates them into a set of numbers that encode the inputs. In the backwards pass, it takes this set of numbers and translates them back to form the reconstructed inputs.
- A well trained net will be able to perform the backwards translation with a high degree of accuracy.
- In both steps, the weights and biases have a very important role. They allow the RBM to decipher the inter-relationships among the input features and they also allow the RBM to determine which input features are the most important when detecting patterns.
- Through several forwards and backwards passes, an RBM is trained to reconstruct the input data.
- Three steps are repeated over and over through the training process.
- Step 1: With a forward pass, every input is combined with an individual weight and one overall bias, and the result is passed to the hidden layer, which may or may not activate.
- Step 2: Next in a backwards pass, each activation is combined with an individual weight and an overall bias and the result is passed to the visible layer for reconstruction.
- Step 3: At the visible layer, the reconstruction is compared against the original input to assess the quality of the result.
- RBMs use a measure called KL divergence to compare the original input to the reconstruction.
- Steps 1–3 are repeated with varying weights and biases until the input and the reconstruction are as close as possible (a minimal sketch of this loop appears at the end of this section).
- An interesting aspect of an RBM is that the data does not need to be labeled.
- This turns out to be very important for real world data sets such as photos, videos, voices and census data, all of which tend to be unlabeled. Rather than have people manually label the data and introduce errors, an RBM automatically sorts through the data, and by properly adjusting the weights and biases an RBM is able to extract the important features and reconstruct the input.
- An important note is an RBM is actually making decisions about which input features are important and how they should be combined to form patterns.
- In other words, an RBM is part of a family of feature extractor neural nets, which are all designed to recognize inherent patterns in data. These nets are also called auto encoders because they have to encode the data's structure on their own.
- So how does an RBM's ability to extract features help with the vanishing gradient? That brings us to the deep belief net.
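To make the three training steps concrete, here is a minimal NumPy sketch of the reconstruction loop, in the spirit of contrastive divergence. The layer sizes, learning rate and iteration count are illustrative assumptions rather than values from these notes.

```python
import numpy as np

rng = np.random.default_rng(0)

n_visible, n_hidden = 6, 3                        # illustrative sizes
W = rng.normal(0.0, 0.1, size=(n_visible, n_hidden))
b_vis = np.zeros(n_visible)                       # visible-layer bias
b_hid = np.zeros(n_hidden)                        # hidden-layer bias
lr = 0.1

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

v0 = rng.integers(0, 2, n_visible).astype(float)  # a toy binary input

for _ in range(200):
    # Step 1: forward pass - inputs combined with weights and the hidden bias;
    # each hidden node may or may not activate.
    h_prob = sigmoid(v0 @ W + b_hid)
    h_sample = (rng.random(n_hidden) < h_prob).astype(float)

    # Step 2: backward pass - activations combined with the same weights and
    # the visible bias to form the reconstruction.
    v_recon = sigmoid(h_sample @ W.T + b_vis)

    # Step 3: compare the reconstruction with the original input and nudge the
    # weights and biases (in practice the gap is often tracked with KL
    # divergence or reconstruction error).
    h_recon = sigmoid(v_recon @ W + b_hid)
    W += lr * (np.outer(v0, h_prob) - np.outer(v_recon, h_recon))
    b_vis += lr * (v0 - v_recon)
    b_hid += lr * (h_prob - h_recon)

print("input:         ", v0)
print("reconstruction:", np.round(v_recon, 2))
```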
Deep Belief Nets
- By combining RBMs together and introducing a clever training method, we obtain a powerful new model that finally solves our problem of the vanishing gradient: the deep belief network (DBN).
- Also the brainchild of Geoff Hinton at the University of Toronto, conceived as an alternative to back prop.
- In terms of network structure a DBN is identical to an MLP (multi-layered perceptron) but when it comes to training they are entirely different.
- The difference in training methods is the key factor that enables DBNs to outperform their shallow counterparts.
- A DBN can be viewed as a stack of RBMs (restricted boltzmann machines) where the hidden layer of one RBM is the visible layer of the one above it.
- A DBN is trained as follows: The first RBM is trained to reconstruct its input as accurately as possible. The hidden layer of the first RBM is treated as the visible layer for the second, and the second RBM is trained using the outputs of the first. This process is repeated until every layer in the network is trained (see the sketch at the end of this section).
- An important note about a DBN is that each RBM layer learns the entire input. In other kinds of models, like convolutional nets, early layers detect simple patterns and later layers recombine them (for example, in facial recognition, early layers detect edges and later layers use those results to form facial features).
- A DBN on the other hand works globally by fine tuning the entire input in succession as the model slowly improves. (Example: a camera lens slowly focusing on a picture).
- The reason a DBN works so well is that a stack of RBMs outperforms a single RBM, just as a multi-layered perceptron was able to outperform a single perceptron working alone.
- After this initial training the RBMs have created a model that can detect inherent patterns in the data. But we don’t know exactly what the patterns are called.
- To finish training we need to introduce labels to the patterns and fine tune the net with supervised learning. To do this you need a very small set of labeled samples so that the features and patterns can be associated with a name. The weights and biases are altered slightly resulting in a small change in a net’s perception of the patterns and often a small increase in the total accuracy. Fortunately the set of labeled data can be small relative to the original data set which is extremely helpful in real world applications.
- So to recap the benefits to a deep belief network: A DBN only needs a small labeled data set, which is important for real life applications. The training process can also be completed in a reasonable amount of time through the use of GPUs. And best of all the resulting net will be very accurate compared to a shallow net.
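A rough sketch of the greedy, layer-wise training described above. The `train_rbm` helper, layer sizes and hyperparameters are made up for illustration, and the supervised fine-tuning stage is only indicated in a comment.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=50, lr=0.1):
    """One RBM trained to reconstruct its input (same three steps as the RBM
    sketch above, using probabilities rather than samples for brevity)."""
    n_visible = data.shape[1]
    W = rng.normal(0.0, 0.1, (n_visible, n_hidden))
    b_vis, b_hid = np.zeros(n_visible), np.zeros(n_hidden)
    for _ in range(epochs):
        for v0 in data:
            h = sigmoid(v0 @ W + b_hid)
            v_recon = sigmoid(h @ W.T + b_vis)
            h_recon = sigmoid(v_recon @ W + b_hid)
            W += lr * (np.outer(v0, h) - np.outer(v_recon, h_recon))
            b_vis += lr * (v0 - v_recon)
            b_hid += lr * (h - h_recon)
    return W, b_hid

# Greedy layer-wise pretraining: the hidden layer of one RBM becomes the
# visible layer of the next, and each layer learns to reconstruct its input.
data = rng.integers(0, 2, (20, 12)).astype(float)   # toy unlabeled data
layer_sizes = [8, 5, 3]                             # illustrative stack
stack, layer_input = [], data
for n_hidden in layer_sizes:
    W, b_hid = train_rbm(layer_input, n_hidden)
    stack.append((W, b_hid))
    layer_input = sigmoid(layer_input @ W + b_hid)  # feeds the next RBM

# A small labeled set would now be used to fine-tune the whole stack with
# supervised backprop (e.g. by putting a classifier on top); omitted here.
print([W.shape for W, _ in stack])
```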
Convolutional Neural Net (CNN)
- Dominated the machine vision space in recent years.
- They’re so influential it’s made deep learning one of the hottest topics in AI.
- But they can be tricky to understand.
- Pioneered by Yann LeCun at NYU.
- A convolutional net has been the go-to solution for machine vision projects in the last few years.
- There are many component layers to CNNs.
- The first component is the convolutional layer.
- Example: Imagine you have a wall that represents a digital image, and imagine that you have a series of flashlights shining at the wall, creating a group of overlapping circles. There are 8 flashlights in each row and 6 rows in total. The purpose of these flashlights is to seek out certain patterns in the image, like an edge or a color contrast. Each flashlight looks for the exact same pattern as all the others, but each searches a different section of the image, defined by the fixed region created by its circle of light. When combined, the flashlights form what is called a filter, which is able to determine whether the given pattern occurs in the image and in which regions. Taking it one step further: in practice, flashlights from multiple different filters all shine at the same spots in parallel, simultaneously detecting a wide array of patterns. In this example we have four filters shining at the wall, each looking for a different pattern, so this particular convolutional layer is an 8x6x4 3D grid of these flashlights.
- Now let’s connect the dots of our explanation: why is it called a convolutional net? The net uses the technical operation of convolution to search for a particular pattern. Think of it as the process of filtering through the image for a specific pattern. One important note is that the weights and biases of this layer affect how this operation is performed; tweaking these numbers impacts the effectiveness of the filtering process.
- Each flashlight represents a neuron in the CNN. Typically neurons in a layer activate or fire, but in a convolutional layer, neurons perform this convolution operation. Unlike the nets we’ve seen thus far, where every neuron in a layer is connected to every neuron in an adjacent layer, a CNN has the flashlight structure: each neuron is only connected to the input neurons it shines upon.
- The neurons in a given filter share the same weight and bias parameters. This means that wherever it sits in the filter, a given neuron is connected to the same number of input neurons and uses the same weights and biases. This is what allows the filter to look for the same pattern in different sections of the image.
- By arranging these neurons in the same structure as the flashlight grid, we ensure that the entire image is scanned.
- The next two layers that follow are rectified linear units (RELU) and Pooling, both of which help to build up the simple patterns discovered by the convolutional layer.
- Each node in the convolutional layer is connected to a node that fires like in other nets; the activation function used is the rectified linear unit (RELU).
- CNNs are trained using back prop so once again the vanishing gradient can potentially still be an issue.
- For reasons that depend on the mathematical definition of RELU (its gradient is constant for positive inputs, so it does not shrink as it is passed backwards), the gradient is held more or less constant at every layer of the net. So the RELU activation allows the net to be properly trained without harmful slowdowns in the crucial early layers.
- The Pooling layer is used for dimensionality reduction. CNNs tile multiple instances of convolutional layers and RELU layers together in a sequence in order to build more and more complex patterns. The problem with this is that the number of possible patterns becomes exceedingly large. By introducing pooling layers, we ensure that the net focuses on only the most relevant patterns discovered by convolution and RELU. This helps limit both the memory and processing requirements for running a CNN.
- Together these three layers can discover a host of complex patterns, but the net has no understanding of what these patterns mean. So a fully connected layer is attached to the end of the net in order to equip the net with the ability to classify data samples.
- So to recap: a typical deep CNN has three kinds of layers, a convolutional layer, a RELU layer and a pooling layer, all of which are repeated several times. These layers are followed by a few fully connected layers in order to support classification (see the sketch at the end of this section). Since CNNs are such deep nets, they most likely need to be trained using server resources with GPUs.
- Despite the power of CNNs these nets have one drawback. Since they are a supervised learning method they require a large set of labeled data for training which can be challenging to obtain in a real world application.
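A minimal sketch of the layer stack from the recap, written here with PyTorch as an assumed framework; the filter counts, kernel sizes, input size and 10-class output are illustrative choices, not values from these notes.

```python
import torch
import torch.nn as nn

# Convolution -> RELU -> pooling blocks repeated, then fully connected layers
# for classification.
cnn = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),   # 8 filters, each scanning the whole image
    nn.ReLU(),                                   # keeps gradients from shrinking layer to layer
    nn.MaxPool2d(2),                             # pooling: dimensionality reduction
    nn.Conv2d(8, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(16 * 7 * 7, 64),                   # fully connected layers for classification
    nn.ReLU(),
    nn.Linear(64, 10),                           # 10 class scores
)

x = torch.randn(1, 1, 28, 28)                    # one toy 28x28 grayscale image
print(cnn(x).shape)                              # torch.Size([1, 10])
```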
Recurrent Neural Networks (RNN)
- If the patterns in your data change over time, your best bet is to use a recurrent neural network.
- This model has a simple structure with a built in feedback loop allowing it to act as a forecasting engine.
- RNNs are the brainchild of Jurgen Schmidhuber and Sepp Hochreiter. Their applications are extremely versatile ranging from speech recognition to driverless cars.
- All the nets covered so far have been feed forward neural networks. In a feed forward neural network signals only flow in one direction from input to output, one layer at a time.
- In a recurrent neural net, the output of a layer is added to the next input and fed back into the same layer which is typically the only layer in the entire network.
- Think of this process as a passage through time. Imagine four time steps: time = 0, 1, 2, 3. Start at time = 0; at time = 1 the net takes the output from time = 0 and feeds it back into the net along with the next input. The net repeats this for time = 2, time = 3, and so on (see the sketch at the end of this section).
- Unlike feed forward nets, a recurrent net can receive a sequence of values as inputs and it can also produce a sequence of values as outputs.
- The ability to operate with sequences opens these nets up to a wide variety of applications.
- When the input is singular and the output is a sequence a potential application is image captioning.
- A sequence of inputs with a single output can be used for document classification.
- When both the input and output are sequences these nets can classify videos frame by frame.
- If a time delay is introduced, the net can statistically forecast demand, which is useful for supply chain planning.
- Like we’ve seen with our previous deep learning models, by stacking RNNs on top of each other, you can form a net capable of more complex output than a single RNN working alone.
- Typically an RNN is an extremely difficult net to train. Since these nets use back propagation, we once again run into the problem of the vanishing gradient. Unfortunately the vanishing gradient is exponentially worse for an RNN. The reason for this is that each time step is the equivalent of an entire layer in a feed forward network.
- So training an RNN for a hundred time steps is like a hundred layer feed forward net.
- This leads to exponentially small gradients and a decay of information through time.
- There are several ways to address this problem. The most popular of which is Gating. Gating is a technique which helps the net decide when to forget the current input and when to remember it for future time steps. The most popular Gating types today are gated recurrent unit (GRU) and long short-term memory (LSTM).
- Besides Gating there are a few other techniques, such as gradient clipping, steeper gates and better optimizers.
- When it comes to training a recurrent net, GPUs are an obvious choice over ordinary CPUs. GPUs trained RNNs up to 250 times faster in tests.
- So under what circumstances would you use an RNN over a feed forward net? A feed forward net outputs one value, which in many cases is a class or a prediction. A recurrent net is suited to time series data, where the output can be the next value or the next several values in a sequence. So the answer depends on whether the application calls for classification/regression (feed forward) or forecasting (RNN).
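A minimal NumPy sketch of the feedback loop described above: a vanilla recurrent layer unrolled over four time steps, with illustrative sizes and weight names. Gated cells such as the LSTM or GRU replace the simple tanh update with gates, but the loop over time is the same.

```python
import numpy as np

rng = np.random.default_rng(2)

n_input, n_hidden, n_output = 4, 8, 1            # illustrative sizes
W_xh = rng.normal(0, 0.1, (n_input, n_hidden))   # input  -> hidden
W_hh = rng.normal(0, 0.1, (n_hidden, n_hidden))  # hidden -> hidden (the feedback loop)
W_hy = rng.normal(0, 0.1, (n_hidden, n_output))  # hidden -> output

sequence = rng.normal(size=(4, n_input))         # four time steps: t = 0, 1, 2, 3
h = np.zeros(n_hidden)                           # hidden state carried through time

outputs = []
for x_t in sequence:
    # The previous output of the layer is combined with the next input and fed
    # back into the same layer. Unrolled, each time step behaves like one layer
    # of a feed forward net, which is why gradients can vanish over long sequences.
    h = np.tanh(x_t @ W_xh + h @ W_hh)
    outputs.append(h @ W_hy)

print(np.array(outputs).ravel())                 # one output value per time step
```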
Auto Encoders
- Auto encoders are incredibly useful when trying to figure out the underlying structure of a data set, since having access to the most important data features gives you a lot of flexibility when you start applying labels.
- A Restricted boltzmann machine (RBM) is a very popular example of an auto encoder. There are other types of auto encoders such as de-noising and contractive.
- An auto encoder is a neural net that takes a set of typically unlabeled inputs and after encoding them, tries to reconstruct them as accurately as possible.
- As a result of this, the net must decide which of the data features are the most important, essentially acting as a feature extraction engine.
- Auto encoders are typically very shallow and usually consist of an input layer, a hidden layer and an output layer. An RBM is an example of an auto encoder with only two layers.
- In terms of the forward pass, there are two steps: encoding and decoding. Typically the same weights that are used to encode a feature in the hidden layer are used to reconstruct the input in the output layer.
- Auto encoders are trained with back propagation using a metric called loss. As opposed to cost, loss measures the amount of information that was lost when the net tried to reconstruct the input. A net with a small loss value produces reconstructions that look very similar to the originals.
- Not all of these nets are shallow, in fact deep auto encoders are extremely useful tools for dimensionality reduction.
- Example: Consider an image containing a 28 x 28 grid of pixels. A neural net would need to process 784 input values for just one image, and doing this across millions of images would waste significant amounts of memory and processing time. A deep auto encoder could encode this image into an impressive 30 numbers and still maintain information about the key image features. When decoding the output, the net acts like a two way translator: a well trained net can translate these 30 encoded numbers into a reconstruction that looks similar to the original image (see the sketch at the end of this section). Certain types of nets also introduce random noise to the encoding/decoding process, which has been shown to improve the robustness of the resulting patterns.
- Deep auto encoders perform better at dimensionality reduction than their predecessor, principal component analysis (PCA).
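A minimal sketch of the encode/decode round trip with tied weights, assuming a single hidden layer of 30 units and a random stand-in for an image; a deep auto encoder would stack several encoding layers before reaching the 30-number code.

```python
import numpy as np

rng = np.random.default_rng(3)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

n_input, n_code = 28 * 28, 30          # 784 pixel values squeezed into 30 numbers
W = rng.normal(0.0, 0.05, (n_input, n_code))
b_code, b_out = np.zeros(n_code), np.zeros(n_input)

image = rng.random(n_input)            # stand-in for a flattened 28x28 image

# Encoding: the hidden layer compresses the input down to 30 numbers.
code = sigmoid(image @ W + b_code)

# Decoding: the same (tied) weights, transposed, reconstruct the input.
reconstruction = sigmoid(code @ W.T + b_out)

# Loss: how much information was lost in the round trip; training with backprop
# would adjust W and the biases to drive this down.
loss = np.mean((image - reconstruction) ** 2)
print(code.shape, round(float(loss), 4))
```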
Recursive Neural Tensor Nets (RNTN)
- Useful when trying to discover the hierarchical structure of a set of data, such as the parse trees of a group of sentences. RNTNs perform better than feed forward and recurrent nets on this kind of task.
- They are the brainchild of Richard Socher at MetaMind.
- The purpose of these nets is to analyze data that has a hierarchical structure. They were originally designed for sentiment analysis, where the sentiment of a sentence depends not just on its component words but on the order in which they’re syntactically grouped.
- The structure is as follows: An RNTN has three basic components: a parent group, called the root, and child groups, which we’ll call the leaves. Each group is simply a collection of neurons, where the number of neurons depends on the complexity of the input data. The root is connected to both leaf groups, but the leaves are not connected to each other. Technically speaking, the three components form a binary tree. In general the leaf groups receive input and the root group uses a classifier to fire out a class and a score.
- The RNTN structure may seem simple but just like the recurrent net, the complexity comes from the way in which data moves throughout the network. In the case of a RNTN, this process is recursive.
- Example of recursion: Take the sentence "The car is fast". At step one you feed the first two words into leaf groups one and two respectively. The leaf groups don’t actually receive the words but a vector (ordered set of numbers) representation of the words. These nets work best with very specific vector representations; particularly good results are achieved when the numbers in the two vectors encode the similarities between the two words compared to other words in the vocabulary. The two vectors (representing "The" and "car") move across the net to the root, which fires out two values: the class and the score. The score represents the quality of the current parse and the class represents an encoding of a structure in the current parse. This is the point where the net starts the recursion. At the next step, the first leaf group receives the current parse rather than a single word, and the second leaf receives the next word in the sentence. At this point the root group would output the score of a parse that is three words long ("The car is"). This process continues until all the inputs are used up and the net has a parse tree with every single word included (a simplified sketch of the composition step appears at the end of this section). This is a simplified example of an RNTN that illustrates the main idea, but in a practical application we typically encounter more complex recursive processes. Rather than use the next word in the sentence for the second leaf group, an RNTN would try all of the next words and eventually vectors that represent entire sub-parses. By doing this at every step of the recursive process, the net is able to analyze and score every possible syntactic parse.
- There can be different candidate structures for how to parse the words of a sentence. To pick the best one, the net relies on the score value produced by the root group. By using this score to select the best substructure at each step of the recursive process, the net will produce the highest scoring parse as its final output.
- Once the net has the final structure, it backtracks through the parse tree in order to figure out the right grammatical label for each part of the sentence. In our example, “The car” is labeled a noun phrase and “is fast” is labeled as a verb phrase. It then works its way up and adds a special label that signifies the beginning of the parse structure.
- RNTNs are trained with back propagation by comparing the predicted sentence structure with the proper sentence structure obtained from a set of labeled training data. Once trained, the net will give a higher score to structures that are more similar to the parse trees it saw in training.
- RNTNs are used in natural language processing for both syntactic parsing and sentiment analysis.
- RNTN is also used to parse images typically when an image contains a scene with many different components.
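A simplified NumPy sketch of the recursive composition step described above. The vector dimension, parameter names and random toy word vectors are illustrative assumptions, and the search over all candidate sub-parses and the classifier at the root are omitted.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 5                                        # illustrative word-vector dimension

# Parameters shared at every step of the recursion.
V = rng.normal(0, 0.1, (d, 2 * d, 2 * d))    # the "tensor" that gives the net its name
W = rng.normal(0, 0.1, (d, 2 * d))           # ordinary weight matrix
u = rng.normal(0, 0.1, d)                    # scoring vector used by the root

def compose(left, right):
    """Merge two child vectors (words or sub-parses) into one parent vector."""
    child = np.concatenate([left, right])             # the two leaf groups
    tensor_term = np.array([child @ V[k] @ child for k in range(d)])
    parent = np.tanh(tensor_term + W @ child)
    score = float(u @ parent)                         # quality of this parse step
    return parent, score

# Toy word vectors for "The car is fast"; a real system would use learned embeddings.
words = {w: rng.normal(size=d) for w in ["The", "car", "is", "fast"]}

parse, score = compose(words["The"], words["car"])    # first two words
for w in ["is", "fast"]:                              # feed the parse back in with the next word
    parse, score = compose(parse, words[w])
print(round(score, 3))
```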
