Deep Learning

Intro & Supervised Learning

The aim of this article is to discuss deep learning: its types, its techniques, its applications, and recent achievements.

At present, machine learning techniques have become popular “tech tools” of modern society. They are mostly applied in web-based applications such as search engines, e-commerce, social media, government and administration, health and welfare, medicine and genomics, consumer electronics, and industrial automation, and in fields such as computer vision, pattern and object recognition, probabilistic estimation, decision making, speech recognition and text generation, natural language processing, and customer-product optimization.

Conventional machine learning techniques are limited in their ability to process data in raw form (image pixels, for example) and require expert knowledge to design feature extractors that transform the raw data into more abstract levels of representation. A basic deep learning architecture is instead constructed as a stack of multiple layers of learnable modules that learn to map inputs to outputs, with each layer transforming the representation at one level into a representation at a slightly more abstract level. The complexity of what can be learned depends on these transformations, but they are not explicitly hand-engineered: they are learned from training data using a general-purpose learning procedure such as backpropagation. At present, deep learning methods have reached the level of discovering intricate structure in high-dimensional data, and they can therefore be applied successfully across many domains of science, commerce, and consumer products.

Supervised learning is the most common and popular form of machine learning. The idea is to train a neural network using previously labeled data: after feeding a training example to the network, it computes the error between the estimated output and the desired output, and modifies its adjustable internal parameters (often called weights) so as to reduce that error. These adjustable parameters and the learning procedure can be thought of simply as a set of knobs that define the input-output function of the machine. In the beginning only binary linear classifiers were introduced, but as the technology advanced, tasks required classification over many complex classes, learning of complex features that remain insensitive to irrelevant variations of the input, and good generalization. This implies the need for good feature extractors and powerful training procedures.
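The knob-turning idea above can be sketched in a few lines. This is a minimal, illustrative example (not from the article): a one-weight linear model whose two parameters are nudged in the direction that reduces the error between estimated and desired outputs.

```python
import numpy as np

# Toy supervised learning: adjust the "knobs" (w, b) to reduce the error
# between the model's estimated output and the desired output.
rng = np.random.default_rng(0)

# Training data: desired outputs follow y = 2*x + 1, plus a little noise
X = rng.uniform(-1, 1, size=100)
y = 2.0 * X + 1.0 + 0.01 * rng.normal(size=100)

w, b = 0.0, 0.0   # adjustable internal parameters (weights)
lr = 0.1          # learning rate: how far each knob is turned per step

for _ in range(500):
    pred = w * X + b
    err = pred - y              # error between estimated and desired output
    w -= lr * np.mean(err * X)  # gradient of half the mean squared error w.r.t. w
    b -= lr * np.mean(err)      # ... and w.r.t. b

print(round(w, 1), round(b, 1))  # close to the true values 2.0 and 1.0
```

After training, the knobs settle near the values that generated the data, which is exactly the "reduce the error by adjusting internal parameters" loop described above.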

Backpropagation to train multi-layer architectures

The introduction of stochastic gradient descent to train multilayer architectures has successfully replaced hand-engineered features. The backpropagation algorithm is a fast and successful method to compute the gradients of the objective function with respect to the inputs and weights of each module. Backpropagation works significantly faster than earlier approaches to learning, and it has become the workhorse of learning in neural networks today.

In the basic neural network learning procedure, to transform data from the input layer to the output layer, each unit computes a weighted sum of its inputs from the previous layer and passes it through a non-linear function to the next layer; the most commonly used such function is the rectified linear unit (ReLU). Backpropagation is a method to calculate the gradient of the cost function with respect to the weights in the network, where the cost function measures the error between the estimated and actual outputs. The goal is to minimize the cost function (minimize the error) by tweaking the weights (and biases) and thereby optimize the network; this is known as the gradient descent algorithm. Even though backpropagation is fast and good at learning, gradient descent has drawbacks: it can converge to a local minimum at which no small change of the weights reduces the average error, or end up at a saddle point of the cost landscape where the gradient becomes zero.
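The chain-rule mechanics described above can be made concrete with a tiny two-layer network. This is an illustrative sketch (all sizes and names are my own, not from the article): a ReLU hidden layer, a squared-error cost, a hand-written backward pass, and a finite-difference check that the backpropagated gradient is correct.

```python
import numpy as np

# Minimal backpropagation sketch: forward pass computes weighted sums and
# a ReLU non-linearity; backward pass propagates the error derivative from
# the output back to the weights via the chain rule.
rng = np.random.default_rng(1)
x = rng.normal(size=3)          # input vector
t = np.array([1.0])             # desired output
W1 = rng.normal(size=(4, 3))    # first-layer weights
W2 = rng.normal(size=(1, 4))    # second-layer weights

def forward(W1, W2, x):
    z1 = W1 @ x                  # weighted sum of inputs
    h = np.maximum(0.0, z1)      # ReLU
    y = W2 @ h                   # linear output layer
    cost = 0.5 * np.sum((y - t) ** 2)
    return z1, h, y, cost

z1, h, y, cost = forward(W1, W2, x)

# Backward pass (backpropagation)
dy = y - t                       # dCost/dy
dW2 = np.outer(dy, h)            # dCost/dW2
dh = W2.T @ dy                   # error sent back through W2
dz1 = dh * (z1 > 0)              # ReLU passes gradient only where it was active
dW1 = np.outer(dz1, x)           # dCost/dW1

# Verify one weight's gradient against a numerical finite difference
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
num = (forward(W1p, W2, x)[3] - cost) / eps
print(abs(num - dW1[0, 0]) < 1e-4)  # True: backprop matches the finite difference
```

A gradient-descent step would then be `W1 -= lr * dW1; W2 -= lr * dW2`, repeated over many training examples.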

In 2006, researchers brought together by CIFAR (the Canadian Institute for Advanced Research) introduced an unsupervised learning procedure that could create layers of feature detectors without requiring labeled data for training. These feature detectors learn to model or reconstruct the outputs of the previous layer, which makes it possible to train very deep neural networks by pre-training one hidden layer at a time with the unsupervised learning procedure for restricted Boltzmann machines. Handwritten-digit recognition and speech recognition successfully applied these methods with the help of the fast computational power of GPUs. Speech recognition tasks started with simple, small vocabularies and progressed to large-vocabulary tasks, and the resulting systems were deployed on Android smartphone platforms for general use. With small datasets, such unsupervised pre-training leads to better generalization.

Convolutional Neural Networks and Image Understanding

Convolutional neural networks (ConvNets) are a type of artificial neural network that processes data in the form of multiple arrays and performs recognition or classification tasks on inputs such as images (most commonly), audio spectrograms, and video data. In this article the ConvNet architecture is briefly explained in terms of its structure and constituent layers. A convolutional neural network is basically composed of four types of layers: convolutional, pooling, rectified linear unit (ReLU), and fully connected layers. The most important layers, and the ones discussed here, are the convolutional layer and the pooling layer. Through the alternately stacked convolutional and pooling layers and the final fully connected layers, the network transforms the original input (an image, for example) layer by layer from the raw data values to the final class scores.

A convolutional layer is organized as a set of feature maps, which produce activations as their filters slide over the input array (an array of image pixels, for example) or over the input volume from the previous layer, responding wherever a filter detects its feature. In an image, for example, a filter will activate when it detects a visual feature such as an edge in some orientation or a blotch of some color. In each feature map, the local connections to the input volume share a set of weights that are tweaked and optimized during the learning process. These learned weights are used to compute a weighted sum of the inputs, which is fed to the next layer through a non-linear activation function.
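The sliding-filter behaviour can be shown directly. Below is an illustrative sketch (filter values and image are my own choices, not from the article): a single 3x3 vertical-edge filter slides over a small image whose left half is dark and right half is bright, and the feature map responds where the edge lies. Note how the same weights are shared at every position.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide one filter over a 2-D input and return the feature map."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Weighted sum of a local patch, using the shared weights
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# A vertical-edge detector: responds to dark-to-bright transitions
edge_filter = np.array([[-1., 0., 1.],
                        [-1., 0., 1.],
                        [-1., 0., 1.]])

# Image: dark left half, bright right half -> a vertical edge
img = np.zeros((5, 5))
img[:, 3:] = 1.0

fmap = conv2d(img, edge_filter)
print(fmap)  # zero where the image is flat, strong response around the edge
```

In a real ConvNet the feature map would then pass through a ReLU, and many such filters would run in parallel, each producing its own map.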

The role of the pooling layer is to shrink a high-dimensional feature representation while preserving its most important information. Typically, a pooling unit computes the maximum of a selected local patch, reducing the spatial dimensionality of the output that forms the input volume to the next layer (usually another convolutional layer). This is important for reducing the number of parameters and helps manage the computational load.
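Max pooling is simple enough to write out in full. The sketch below (illustrative values) halves each spatial dimension of a 4x4 feature map while keeping the strongest response in every 2x2 patch.

```python
import numpy as np

def max_pool(fmap, size=2):
    """Each pooling unit keeps the maximum of its local patch."""
    H, W = fmap.shape
    out = np.zeros((H // size, W // size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = fmap[i*size:(i+1)*size, j*size:(j+1)*size].max()
    return out

fmap = np.array([[1., 3., 2., 0.],
                 [4., 2., 1., 1.],
                 [0., 0., 5., 6.],
                 [1., 2., 7., 8.]])

pooled = max_pool(fmap)
print(pooled)  # [[4. 2.]
               #  [2. 8.]]
```

The 4x4 map shrinks to 2x2: a 4x reduction in the data passed to the next layer, while the largest activation in each region survives.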

Training such a network is done using the backpropagation algorithm, which adjusts the weights in the filter banks of the convolutional layers. Convolutional neural networks perform classification tasks essentially by learning higher-level features as compositions of lower-level features through the stack of layers. In an image, for example, starting from edges (at the pixel level) in certain orientations, local combinations of edges form motifs, and motifs combine into more complicated patterns, up to high-level representations such as human faces. ConvNet applications date back to the 1990s, with systems such as handwriting recognition and face detection and recognition.

Following the spectacular achievement of ConvNets at ImageNet 2012, today they approach human performance at some tasks. One breakthrough is accurate image captioning and description by a combination of convolutional and recurrent neural networks. This success was achieved thanks to increased computational power (GPUs in particular), regularization techniques such as dropout, and new methods for generating more training examples from existing ones (such as distortion and changes of orientation and position). Major technology companies such as Google, Facebook, and Microsoft, as well as a number of startups, have begun research and development based on the outstanding recent performance of convolutional neural networks. The development of specialized hardware (so-called ConvNet chips) also enables ConvNets to perform well in applications such as real-time vision and understanding on smartphones, robots, cameras, and autonomous vehicles.

Recurrent Neural Networks and Language Processing

Turning to language processing and understanding with deep learning: predicting the next word in a sequence from a previously given context of words can be performed by a trained multilayer neural network, in which the first layer creates word vectors and the following layers convert the input into an output that gives, for every word in the vocabulary, the probability of appearing as the next word. The network learns word vectors that contain many active components, each of which can be interpreted as a separate feature of the word.
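The two-part structure (a word-vector layer, then layers producing a probability for every vocabulary word) can be sketched as follows. All names, sizes, and the toy vocabulary are illustrative assumptions; in a real system the tables `E` and `W` would be learned by backpropagation rather than left random.

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]
V, D = len(vocab), 8                       # vocabulary size, word-vector size

rng = np.random.default_rng(0)
E = rng.normal(scale=0.1, size=(V, D))     # first layer: table of word vectors
W = rng.normal(scale=0.1, size=(V, 2 * D)) # output weights over a 2-word context

def softmax(z):
    z = z - z.max()                        # numerical stability
    e = np.exp(z)
    return e / e.sum()

def next_word_probs(w1, w2):
    # Look up and concatenate the context word vectors, then score
    # every word in the vocabulary and normalize into probabilities.
    context = np.concatenate([E[vocab.index(w1)], E[vocab.index(w2)]])
    return softmax(W @ context)

p = next_word_probs("the", "cat")
print(p.shape)   # one probability per vocabulary word
```

The output is a full distribution over the vocabulary, so "the most probable next word" is simply `vocab[p.argmax()]`.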

Recurrent neural networks are another major type of neural network, one that is successful at processing sequential data. A recurrent network processes an input sequence one element at a time, while loops among its hidden units allow information to be stored and carried across time steps as the input is read. Each unit's ability to maintain information about previous inputs in its internal state makes it possible to keep a kind of index of the history of all the past elements of the sequence (like a human reading a sentence). Advances in architecture and in ways of training have made recurrent networks very good at predicting the next character in a text, and at more complex tasks such as language translation and understanding: the network is trained to encode the words of a sentence in one language (processing them one at a time) into a representation that expresses its meaning, and a jointly trained decoder network then decodes that representation into the words of a sentence in another language. Google Translate is a popular application of this approach.
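The "one element at a time, same weights at every step" loop is the essence of a simple recurrent network. The sketch below is illustrative (sizes and weight values are my own): the hidden state `h` is the network's memory of everything it has read so far.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 3, 4                           # input size, hidden-state size
Wxh = rng.normal(scale=0.5, size=(H, D))  # input-to-hidden weights
Whh = rng.normal(scale=0.5, size=(H, H))  # hidden-to-hidden (recurrent) weights

def run_rnn(sequence):
    h = np.zeros(H)                   # hidden state: the network's memory
    for x in sequence:                # process one element at a time
        # The SAME weights are reused at every time step; the previous
        # state h carries information about all past elements forward.
        h = np.tanh(Wxh @ x + Whh @ h)
    return h

seq = [rng.normal(size=D) for _ in range(5)]
h_final = run_rnn(seq)
print(h_final.shape)  # (4,)
```

A prediction layer (e.g. the softmax from the previous sketch) would read `h` at each step to predict the next element; unrolling this loop in time is what makes the weight sharing across layers visible.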

Beyond translating one language into another, the same idea can translate the meaning of an image into a readable language description, also known as image captioning. In this application a convolutional neural network serves as the encoder network that reads and understands the image, and a recurrent neural network serves as the decoder network for translation and language modelling.

In a recurrent neural network all the layers share the same weights, which becomes apparent when a simple network is unfolded in time. Even though a main purpose of recurrent networks is learning long-term dependencies, it is difficult for them to learn to store information for long durations. One remedy is augmenting the network with an explicit memory, as in the long short-term memory (LSTM) architecture. These memory networks have proven more effective than conventional recurrent networks; they use special hidden units containing a memory cell that acts as an accumulator, connected to itself at the next time step with a weight of one. Over time, recurrent networks have been advanced in many ways, from augmenting RNNs with memory modules, as in neural Turing machines, to networks that can express the meaning of a set of sentences (a paragraph), or answer questions about it, after reading and learning from the text.
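A single LSTM step makes the "self-connected accumulator with a weight of one" concrete. The sketch below is illustrative (gate layout and sizes are my own assumptions following the standard LSTM formulation): the cell state `c` is carried forward unchanged except for what the gates explicitly forget or write.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step: gates decide what to keep, write, and expose."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    f = sigmoid(z[:H])        # forget gate: how much old memory to keep
    i = sigmoid(z[H:2*H])     # input gate: how much new content to write
    o = sigmoid(z[2*H:3*H])   # output gate: how much memory to expose
    g = np.tanh(z[3*H:])      # candidate values to write
    c = f * c_prev + i * g    # the accumulator: self-connection with weight one
    h = o * np.tanh(c)        # hidden state read out from the memory cell
    return h, c

rng = np.random.default_rng(0)
D, H = 3, 4
W = rng.normal(scale=0.1, size=(4 * H, D + H))
b = np.zeros(4 * H)

h, c = np.zeros(H), np.zeros(H)
for x in [rng.normal(size=D) for _ in range(5)]:
    h, c = lstm_step(x, h, c, W, b)
print(h.shape, c.shape)  # (4,) (4,)
```

Because `c` is updated additively (`f * c_prev + i * g`) rather than squashed through a non-linearity at every step, gradients can flow back through many time steps, which is what lets the network store information for longer durations.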


Even though supervised learning has achieved great success in the field of deep learning over time, unsupervised learning is the basic learning technique of living creatures, and it is mostly expected to be developed in the longer term. Computer vision and recognition with convolutional neural networks and their advances were discussed, as were recurrent neural networks for natural language processing and translation; recurrent networks are expected to develop further, reading and understanding whole texts with better strategies and extended memory capabilities. Furthermore, combined systems of convolutional and recurrent networks that use reinforcement learning are expected to outperform existing technologies. Finally, systems that combine deep learning with reinforcement learning, which are still in their infancy, are also expected to improve.

Supervised learning has been developed over decades, but it is now very important to develop unsupervised learning and reinforcement learning. Today major companies such as Google and Facebook, along with a number of startups, are working on reinforcement learning and unsupervised learning. Deep learning has achieved great success with the help of large amounts of data and high computational power, such as GPUs. Nevertheless, more developments are expected in the field of deep learning, especially in unsupervised learning, reinforcement learning, and transfer learning.