A Beginner’s Guide to Deep Learning

Kumar Shridhar
May 26, 2017 · 11 min read


Machine-learning technology powers many aspects of modern society, from
web searches to content filtering on social networks to recommendations on e-commerce websites, and it is increasingly present in consumer products such as cameras and smartphones. Machine-learning systems are used to identify objects in images, transcribe speech into text, match news items, posts or products with users’ interests, and select relevant search results.
Increasingly, these applications make use of a class of techniques called deep learning.

Deep learning (also known as deep structured learning, hierarchical learning or deep machine learning) is a branch of machine learning based on a set of algorithms that attempt to model high-level abstractions in data. In a simple case, you could have two sets of neurons: ones that receive an input signal and ones that send an output signal. When the input layer receives an input, it passes a modified version of that input on to the next layer. In a deep network, there are many layers between the input and output (the layers are not literally made of neurons, but it can help to think of them that way), allowing the algorithm to use multiple processing layers, composed of multiple linear and non-linear transformations.[1][2][3][4][5][6][7][8][9]

Deep learning has recently revolutionized machine learning, with a great deal of notable work being done in the field. These methods have dramatically
improved the state of the art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics. The term “Deep Learning” itself is older: it was first introduced to Machine Learning by Dechter (1986)[10], and to Artificial Neural Networks (NNs) by Aizenberg et al. (2000)[11]. It was further popularized by ‘AlexNet’, a convolutional network architecture developed by Alex Krizhevsky that won the ImageNet competition in 2012, outperforming the competing image processing methods and paving the way for deep learning architectures in image processing.[12]


Deep learning architectures can be broadly divided into three classes:

  1. Generative deep architectures, which are intended to characterize the high-order correlation properties of the observed or visible data for pattern analysis or synthesis purposes, and/or characterize the joint statistical distributions of the visible data and their associated classes. In the latter case, the use of Bayes’ rule can turn this type of architecture into a discriminative one.
  2. Discriminative deep architectures, which are intended to directly provide discriminative power for pattern classification, often by characterizing the posterior distributions of classes conditioned on the visible data; and
  3. Hybrid deep architectures, where the goal is discrimination but is assisted (often in a significant way) by the outcomes of generative architectures via better optimization and/or regularization, or where discriminative criteria are used to learn the parameters of any of the deep generative models in category 1 above. [13]

Despite this complex categorization of deep learning architectures, the
ones most used in practice are deep feed-forward networks, convolutional
networks and recurrent networks.


Deep feed-forward networks, also often called feed-forward neural networks, or multilayer perceptrons (MLPs), are the quintessential deep learning models.

The goal of a feed-forward network is to approximate some function f∗. For
example, for a classifier, y = f∗(x) maps an input x to a category y. A feed-forward network defines a mapping y = f(x; θ) and learns the value of the
parameters θ that result in the best function approximation.[1]

In simple terms, the network consists of input, hidden and output nodes: data comes in through the input nodes, is processed in the hidden nodes, and the result is produced at the output nodes. Information flows through the function being evaluated from x, through the intermediate computations used to define f, and finally to the output y. There are no feedback connections in which outputs of the model are fed back into itself, which is why the model is called a feed-forward network. The model is shown in Figure [1].

Figure [1]: Feed-forward neural network[14]
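The flow from x through the hidden nodes to y can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the parameter values in theta, the ReLU non-linearity and the helper names (dense, relu, feed_forward) are hypothetical choices made for the example, and the learning of θ is not shown.

```python
def dense(inputs, weights, biases):
    """One fully connected layer: a weighted sum of the inputs plus a bias per node."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    """Element-wise non-linearity applied to the hidden nodes (an assumed choice)."""
    return [max(0.0, v) for v in values]


def feed_forward(x, theta):
    """Evaluate y = f(x; theta) for a network with a single hidden layer."""
    hidden = relu(dense(x, theta["W1"], theta["b1"]))  # input nodes -> hidden nodes
    return dense(hidden, theta["W2"], theta["b2"])     # hidden nodes -> output nodes


# Hypothetical hand-picked parameters: 2 inputs, 2 hidden nodes, 1 output.
# With these values the network happens to compute |x1 - x2|.
theta = {
    "W1": [[1.0, -1.0], [-1.0, 1.0]],
    "b1": [0.0, 0.0],
    "W2": [[1.0, 1.0]],
    "b2": [0.0],
}

y = feed_forward([3.0, 1.0], theta)
print(y)  # [2.0]
```

Nothing computed in feed_forward is ever fed back into an earlier layer, which is the property that gives the model its name.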



In machine learning, a convolutional neural network (CNN, or ConvNet) is a type of feed-forward artificial neural network in which the connectivity pattern between its neurons is inspired by the organization of the animal visual cortex.

Individual cortical neurons respond to stimuli in a restricted region of space
known as the receptive field. The receptive fields of different neurons partially overlap such that they tile the visual field.

The response of an individual neuron to stimuli within its receptive field can be approximated mathematically by a convolution operation.[15] Convolutional networks were inspired by biological processes[16] and are variations of multilayer perceptrons designed to use minimal amounts of preprocessing.[17]

They have wide applications in image and video recognition, recommender
systems[18] and natural language processing.[19]

LeNet was one of the very first convolutional neural networks which helped
propel the field of Deep Learning. This pioneering work by Yann LeCun was
named LeNet5 after many previous successful iterations since the year 1988.
At that time the LeNet architecture was used mainly for character recognition tasks such as reading zip codes, digits, etc.

Figure [2]: A simple convolutional neural network model

There are four main components in the ConvNet shown in Figure [2] above:
1. Convolutional Layer
2. Activation Function
3. Pooling Layer
4. Fully Connected Layer

Convolutional layer

The convolutional layer is named after ‘convolution’, a
mathematical operation performed on two functions f and g (written f*g) to produce a third
function. It is similar to cross-correlation. The input to a convolutional layer is an m x m x r image, where m is the height and width of the image and r is the number of channels, e.g. an RGB image has r=3. The convolutional layer has k filters (or kernels) of size n x n x q, where n is smaller than the
dimension of the image and q can either be the same as the number of
channels r or smaller, and may vary for each kernel. The size of the filters gives rise to the locally connected structure; each filter is convolved with the image to produce k feature maps of size m−n+1. [20]
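To make the m−n+1 output size concrete, here is a minimal sketch for a single-channel image (r=1) and a single filter (k=1), with no padding and a stride of 1. The function name conv2d_valid and the example values are assumptions made for illustration. Note that the kernel is slid over the image without being flipped, which is strictly the cross-correlation form mentioned above; a true convolution would flip the kernel first.

```python
def conv2d_valid(image, kernel):
    """Slide an n x n kernel over an m x m image (no padding, stride 1).

    The kernel is not flipped, so this is the cross-correlation form.
    """
    m, n = len(image), len(kernel)
    out_size = m - n + 1  # each side of the feature map has m - n + 1 entries
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(n) for b in range(n))
             for j in range(out_size)]
            for i in range(out_size)]


# Hypothetical 4 x 4 single-channel image and 2 x 2 filter.
image = [[1, 2, 3, 0],
         [4, 5, 6, 1],
         [7, 8, 9, 2],
         [0, 1, 2, 3]]
kernel = [[1, 0],
          [0, -1]]

feature_map = conv2d_valid(image, kernel)
print(feature_map)  # [[-4, -4, 2], [-4, -4, 4], [6, 6, 6]]
```

With m=4 and n=2 the feature map is 3 x 3, matching m−n+1 on each side.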

Activation Function

To implement complex mapping functions, non-linear activation functions are needed; they bring in the non-linearity that enables the network to approximate complex functions. Activation functions are also important for squashing the unbounded linearly weighted sum from neurons.
This is important to avoid large values accumulating high up the processing
hierarchy. Many activation functions are available; some of the most commonly used are sigmoid, tanh and ReLU.
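The three functions named above are short enough to write out directly. The sketch below uses only the Python standard library; the branch in sigmoid is an implementation detail added here to avoid overflow for large negative inputs.

```python
import math


def sigmoid(z):
    """Squashes any real number into the range 0 to 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # avoids overflow when z is very negative
    return e / (1.0 + e)


def tanh(z):
    """Squashes any real number into the range -1 to 1."""
    return math.tanh(z)


def relu(z):
    """Clips negative values to zero; positive values pass through unchanged."""
    return max(0.0, z)


for z in (-2.0, 0.0, 2.0):
    print(z, sigmoid(z), tanh(z), relu(z))
```

Sigmoid and tanh bound their output on both sides, whereas ReLU only removes the negative part of the weighted sum.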

Pooling Layer

Pooling is a sample-based discretization process. The objective is to down-sample an input representation (image, hidden-layer output matrix, etc.), reducing its dimensionality and allowing for assumptions to be made about features contained in the binned sub-regions.

This is done in part to help prevent over-fitting by providing an abstracted form of
the representation. It also reduces the computational cost by reducing the number of parameters to learn, and provides basic translation invariance to the internal representation.

Some of the most prominently used pooling techniques are Max-Pooling, Min-Pooling and Average-Pooling.

Figure [3]: Example of Max-Pooling with 2x2 filters
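As a rough sketch of the down-sampling described above, the function below applies non-overlapping 2x2 max-pooling to a small matrix. The name max_pool and the input values are made up for the example; swapping max for min, or for an average, would give the other pooling variants.

```python
def max_pool(matrix, size=2):
    """Non-overlapping size x size max-pooling (stride equal to the window size)."""
    rows, cols = len(matrix), len(matrix[0])
    return [[max(matrix[i + a][j + b]
                 for a in range(size) for b in range(size))
             for j in range(0, cols - size + 1, size)]
            for i in range(0, rows - size + 1, size)]


# Hypothetical 4 x 4 feature map.
feature_map = [[1, 3, 2, 4],
               [5, 6, 1, 2],
               [7, 2, 9, 0],
               [1, 8, 3, 4]]

pooled = max_pool(feature_map)
print(pooled)  # [[6, 4], [8, 9]]
```

Each 2x2 sub-region is replaced by its largest value, so the 4 x 4 input shrinks to 2 x 2.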

Fully Connected Layer

The term “Fully Connected” implies that every neuron in the previous layer is connected to every neuron in the next layer. The fully connected layer is a traditional multilayer perceptron that uses a softmax activation function, or any other similar function, in the output layer. [21]
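The softmax function mentioned above turns the raw scores of the output layer into values that are positive and sum to 1, so they can be read as class probabilities. A minimal sketch follows; the input scores are arbitrary example values, and subtracting the maximum score is a common numerical-stability step that does not change the result.

```python
import math


def softmax(scores):
    """Convert a list of raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print(probs)
```

The largest score receives the largest probability, and the ordering of the scores is preserved.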


In a traditional neural network we assume that all inputs (and outputs) are
independent of each other. But for many tasks that’s a very bad idea. If you
want to predict the next word in a sentence, you had better know which words came before it. Recurrent neural networks (RNNs) are called recurrent because they perform the same task for every element of a sequence, with the output depending on the previous computations. Another way to think about RNNs is that they have a “memory” which captures information about what has been calculated so far. [22]

An RNN has loops in it that allow information to be carried across neurons while reading in input. In Figure [4], x_t is some input, A is a part of the RNN and h_t is the output. Essentially, you can feed in words from a sentence, or even characters from a string, as x_t, and the RNN will produce an h_t for each. Some types of RNN are LSTMs, bidirectional RNNs, GRUs and more.

Figure [4]: Model of an RNN [30]
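The loop in Figure [4] can be sketched as a function that is applied once per input x_t and carries its previous output forward. The update rule used here, h_t = tanh(W_x·x_t + W_h·h_{t-1} + b), is one common basic form and is an assumption of this example, as are the weight values and the names rnn_step and run_rnn; LSTMs and GRUs replace this step with more elaborate ones.

```python
import math


def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One recurrence step: h_t = tanh(W_x * x_t + W_h * h_prev + b)."""
    return [math.tanh(sum(wx * x for wx, x in zip(W_x[k], x_t))
                      + sum(wh * h for wh, h in zip(W_h[k], h_prev))
                      + b[k])
            for k in range(len(b))]


def run_rnn(inputs, W_x, W_h, b):
    """Apply the same step to every element of the sequence, carrying h forward."""
    h = [0.0] * len(b)  # initial "memory" is all zeros
    outputs = []
    for x_t in inputs:
        h = rnn_step(x_t, h, W_x, W_h, b)
        outputs.append(h)
    return outputs


# Hypothetical weights: 1 input feature, 2 hidden units.
W_x = [[0.5], [-0.5]]
W_h = [[0.1, 0.0], [0.0, 0.1]]
b = [0.0, 0.0]

sequence = [[1.0], [0.0], [1.0]]
outputs = run_rnn(sequence, W_x, W_h, b)
print(outputs)
```

The first and third inputs are identical, yet their outputs differ, because the third step also sees what was computed before it. That carried-over state is the “memory” described above.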

RNNs can be used in NLP, machine translation, language modeling, computer vision, video analysis, image generation, image captioning, and so on, because an RNN can work with any number of inputs and outputs, making models from one-to-one through many-to-many possible. Some of the possible architectures are shown in Figure [5], with an explanation of each model.

Figure [5]: RNNs operating over sequences of vectors [23]


There has been a lot of research in the field of Deep Learning, and many
distinct problems have been solved using deep learning models. Some of the
best applications of deep learning are:

Colorization of Black and White Images

Deep learning can use the objects and their context within a
photograph to color the image, much like a human operator might approach
the problem. The approach involves the use of very large convolutional neural networks and supervised layers that recreate the image with the addition of color.[24][28]

Machine Translation

Text translation can be performed without any pre-processing of the sequence, allowing the algorithm to learn the dependencies between words and their mapping to a new language. Stacked networks of large LSTM recurrent neural networks are used to perform this translation.[25]

Object Classification and Detection in Photographs

This task requires the classification of objects within a photograph as one of a set of previously known objects.
State-of-the-art results have been achieved on benchmark examples of this
problem using very large convolutional neural networks. A breakthrough on this problem was the result achieved by Alex Krizhevsky et al. [12] on the ImageNet classification problem with a network called AlexNet. [28]

Automatic Handwriting Generation

This is a task where, given a corpus of handwriting examples, the model generates new
handwriting for a given word or phrase.
The handwriting is provided as a sequence of coordinates used by a pen when the handwriting samples were created. From this corpus, the relationship between the pen movement and the letters is learned, and new examples can be generated ad hoc.[26][28]

Automatic Game Playing

This is a task where a model learns how to play a computer game based only
on the pixels on the screen.
This very difficult task is the domain of deep reinforcement learning models, and it is the breakthrough that DeepMind (now part of Google) is renowned for achieving.[27][28]

Generative Model Chatbots

A sequence-to-sequence model was used to create a chatbot that learned to generate its own answers when trained on large datasets of real-life conversations. For more detail, see [29].


It can be concluded that deep learning models can be used for a wide variety of tasks, owing to an architecture loosely modeled on the human brain. A lot of research has been done in this area, and much more is likely in the near future. Although trust in these models is still an open issue at the moment, things should become clearer in the near future.


  1. Ian Goodfellow, Yoshua Bengio, and Aaron Courville (2016). Deep
    Learning. MIT Press. Online
  2. Deng, L.; Yu, D. (2014). “Deep Learning: Methods and Applications” (PDF). Foundations and Trends in Signal Processing. 7(3–4): 1–199. doi:10.1561/2000000039.
  3. Bengio, Yoshua (2009). “Learning Deep Architectures for AI” (PDF). Foundations and Trends in Machine Learning. 2(1): 1–
  4. Bengio, Y.; Courville, A.; Vincent, P. (2013). “Representation Learning: A Review and New Perspectives”. IEEE Transactions on Pattern Analysis and
  5. Schmidhuber, J. (2015). “Deep Learning in Neural Networks: An
  6. Bengio, Yoshua; LeCun, Yann; Hinton, Geoffrey (2015). “Deep Learning”. Nature. 521: 436–444. doi:10.1038/nature14539. PMID 26017442.
  7. Arel, Itamar; Rose, Derek C.; Karnowski, Thomas P. “Deep Machine Learning - A New Frontier in Artificial Intelligence Research”. IEEE Computational Intelligence Magazine, 2013.
  8. Schmidhuber, Jürgen (2015). “Deep Learning”. Scholarpedia. 10(11): 32832.d
  9. Carlos E. Perez. “A Pattern Language for Deep Learning”.
  10. R. Dechter (1986), University of California, Computer Science
    Department, Cognitive Systems Laboratory.
  11. I. Aizenberg, N.N. Aizenberg, and J. P.L. Vandewalle (2000). Multi-Valued and Universal Binary Neurons: Theory, Learning and Applications. Springer Science & Business Media
  12. Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. “Imagenet classification with deep convolutional neural networks.” Advances in neural information processing systems. 2012.
  13. Deng, Li. Three Classes of Deep Learning Architectures and Their Applications: A Tutorial Survey. Microsoft Research, Redmond, WA 98052, USA.
  14. Feed forward Neural Network
  15. “Convolutional Neural Networks (LeNet) — DeepLearning 0.1
    documentation”. DeepLearning 0.1. LISA Lab. Retrieved 31 August 2013.
  16. Matusugu, Masakazu; Katsuhiko Mori; Yusuke Mitari; Yuji Kaneda (2003). “Subject independent facial expression recognition with robust face detection using a convolutional neural network” (PDF). Neural Networks. (5): 555–559. doi:10.1016/S0893-6080(03)00115-1.
  17. LeNet-5, convolutional neural networks
  18. van den Oord, Aaron; Dieleman, Sander; Schrauwen, Benjamin (2013-01-01). Burges, C. J. C.; Bottou, L.; Welling, M.; Ghahramani, Z.; Weinberger, K. Q., eds. Deep content-based music recommendation (PDF). Curran Associates, Inc. pp. 2643–2651.
  19. Collobert, Ronan; Weston, Jason (2008-01-01). “A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning”. Proceedings of the 25th International Conference on Machine Learning. ICML ’08. New York, NY, USA: ACM: 160–167. doi:10.1145/1390156.1390177. ISBN 978-1-60558-205-4.
  20. CNN
  21. Intuitive Explanation ConvNets
  23. RNN effectiveness
  24. Cheng, Zezhou, Qingxiong Yang, and Bin Sheng. “Deep colorization.”
    Proceedings of the IEEE International Conference on Computer Vision.
  25. Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. “Sequence to Sequence
    Learning with Neural Networks.” arXiv preprint arXiv:1409.3215 (2014).
  26. Graves, Alex. “Generating sequences with recurrent neural networks.”
    arXiv preprint arXiv:1308.0850 (2013).
  27. Mnih, Volodymyr, et al. “Playing atari with deep reinforcement learning.” arXiv preprint arXiv:1312.5602 (2013).
  28. Application Deep Learning
  29. Generative Model Chatbots
  30. Understanding LSTMs

