Understanding Deep Learning: DNN, RNN, LSTM, CNN and R-CNN

SPRH LABS
Mar 21, 2019


Image credit: Pixabay

Deep Learning for Public Safety

It’s an unavoidable truth that violent crime and murder are increasing around the world at an alarming rate; in the United States, for example, the murder rate is 17% higher than it was five years ago. About 73% of US murders are committed with guns, a proportion that has increased in recent years.¹ World leaders are trying to clamp down on this situation through their law enforcement systems. Despite these efforts, things sometimes get out of control because help cannot arrive in time. In such cases, the technology industry can step in to help ensure public safety using Deep Learning.

This can be demonstrated through a simple model of an active-shooter scenario: an object detection system identifies a weapon, tracks the shooter, and deploys a depth-sensing localized drone that first tries to de-escalate with pepper spray, then escalates by dropping to three feet above the ground and deploying an electric shock weapon.

The figure shows how a simple model built with deep learning can be used to help ensure public safety.

Building this model requires Machine Learning. You may be wondering what Machine Learning and Deep Learning actually are; most people simply enjoy the benefits of the technology, and few know or care how it works under the hood. Here we give you a concise, lucid idea of these terms.

What is Machine Learning?

Machine Learning is a subset of Artificial Intelligence, and Deep Learning is an important part of its broader family, which includes deep neural networks, deep belief networks, and recurrent neural networks.² Within Deep Learning, there are three fundamental neural network architectures that perform well on different types of data: FFNNs, RNNs, and CNNs.

Deep Neural Networks (DNNs)

Deep Neural Networks (DNNs) are typically Feed Forward Networks (FFNNs), in which data flows from the input layer to the output layer without ever going backward:³ the links between layers run one way, in the forward direction, and never touch a node twice.

The outputs are obtained by supervised learning: the network is trained on labeled datasets of “what we want” via backpropagation. Think of going to a restaurant where the chef tells you the ingredients in your meal. An FFNN works much the same way: you taste those specific ingredients while eating, but the moment the meal is over you forget what you ate. Served the same dish again, you cannot recognize the ingredients and have to start from scratch, because you have no memory of them. The human brain, of course, does not work like that.
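The “one-way, no memory” behavior above can be sketched in a few lines of NumPy. This is a minimal, untrained feed-forward network with hypothetical layer sizes, not the article’s model: data flows input → hidden → output, and the same input always produces the same output because nothing is fed back.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 3 input features, 4 hidden units, 2 output classes.
W1 = rng.normal(size=(3, 4))
b1 = np.zeros(4)
W2 = rng.normal(size=(4, 2))
b2 = np.zeros(2)

def forward(x):
    h = np.maximum(0.0, x @ W1 + b1)      # ReLU hidden layer
    logits = h @ W2 + b2
    e = np.exp(logits - logits.max())     # softmax over the 2 classes
    return e / e.sum()

x = np.array([0.5, -1.0, 2.0])
p = forward(x)
print(p)         # class probabilities, summing to 1
```

Because the network is stateless, calling `forward(x)` twice gives identical results; that statelessness is exactly what the restaurant analogy describes.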

Recurrent Neural Network (RNN)

A Recurrent Neural Network (RNN) addresses this issue: it is an FFNN with a time twist. This network is not stateless; it has connections between passes, connections through time. RNNs are a class of artificial neural network in which connections between nodes form a directed graph along a sequence. Links feed from a layer back into previous layers, allowing information to flow back into earlier parts of the network, so each step depends on past events and information can persist.

In this way, RNNs can use their internal state (memory) to process sequences of inputs, which makes them applicable to tasks such as unsegmented, connected handwriting recognition or speech recognition. They work not only on the current input but also on related information from the past, so the order in which you feed and train the network matters: feeding “chicken” then “egg” may give a different output than “egg” then “chicken”. RNNs also suffer from the vanishing (or exploding) gradient problem, also called the long-term dependency problem, in which information rapidly gets lost over time: during training, the gradients that update the weights either shrink toward zero or blow up to enormous values, so the weights stop learning from distant inputs. Since it is the weights that store information from the past, earlier states then stop being informative.
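The “chicken then egg” point can be shown directly with a minimal Elman-style RNN cell (illustrative random weights, not a trained model): the hidden state carries information from earlier inputs, so the same tokens in a different order leave the network in a different state.

```python
import numpy as np

rng = np.random.default_rng(1)
Wx = rng.normal(size=(2, 3))   # input -> hidden
Wh = rng.normal(size=(3, 3))   # hidden -> hidden (the recurrent "memory" path)

def run(sequence):
    h = np.zeros(3)
    for x in sequence:
        h = np.tanh(x @ Wx + h @ Wh)   # new state depends on the old one
    return h

chicken = np.array([1.0, 0.0])   # toy one-hot encodings
egg     = np.array([0.0, 1.0])

h1 = run([chicken, egg])
h2 = run([egg, chicken])
print(np.allclose(h1, h2))   # False: order changes the final state
```

A feed-forward network given the same two tokens as an unordered bag could not make this distinction; the recurrent `h @ Wh` term is what encodes order.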

Long Short Term Memory (LSTM)

Thankfully, breakthroughs like Long Short Term Memory (LSTM) networks don’t have this problem! LSTMs are a special kind of RNN capable of learning long-term dependencies, which makes them good at remembering things that happened in the past and at finding patterns across time so that their next guesses make sense. LSTMs have broken records in Machine Translation, Language Modeling, and Multilingual Language Processing.
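What makes an LSTM different is its gating. A single step of a standard LSTM cell can be sketched as below (illustrative, untrained weights): the forget, input, and output gates decide what to erase, what to write, and what to expose, and the cell state is updated additively, which is what lets information survive over long spans.

```python
import numpy as np

rng = np.random.default_rng(2)
n_in, n_hid = 2, 3
W = rng.normal(scale=0.5, size=(n_in + n_hid, 4 * n_hid))
b = np.zeros(4 * n_hid)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c):
    z = np.concatenate([x, h]) @ W + b
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # forget/input/output gates
    g = np.tanh(g)                                 # candidate values
    c = f * c + i * g                              # additive cell-state update
    h = o * np.tanh(c)                             # exposed hidden state
    return h, c

h = c = np.zeros(n_hid)
for x in [np.array([1.0, 0.0]), np.array([0.0, 1.0])]:
    h, c = lstm_step(x, h, c)
print(h, c)
```

The additive update `c = f * c + i * g` is the key design choice: unlike the plain RNN’s repeated matrix multiplications, it gives gradients a path through time that does not inherently vanish.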

Convolutional Neural Network (CNN)

Next comes the Convolutional Neural Network (CNN, or ConvNet), a class of deep neural networks most commonly applied to analyzing visual imagery. Other applications include video understanding, speech recognition, and natural language processing. LSTMs combined with CNNs have also improved automatic image captioning, like the captions seen on Facebook. In short, RNNs help with processing sequential data and predicting the next step, whereas CNNs help with analyzing visuals.
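The core operation of a CNN is convolution: a small kernel slides over the whole image, so the network detects local patterns wherever they appear. Here is a minimal “valid-padding” 2-D convolution on a toy image with a bright right half, using a hand-made vertical-edge kernel (a sketch of the operation, not of any trained network).

```python
import numpy as np

def conv2d(image, kernel):
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Dot product of the kernel with one local patch.
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# Toy 5x5 image: dark left half, bright right half.
image = np.array([[0, 0, 0, 9, 9]] * 5, dtype=float)
edge_kernel = np.array([[-1.0, 1.0]])   # responds to left-to-right jumps

out = conv2d(image, edge_kernel)
print(out)   # each row is [0, 0, 9, 0]: strong response at the edge only
```

A real CNN learns many such kernels from data and stacks them in layers, but each layer is built from exactly this sliding-window operation.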

RNN or CNN: Which One is Better?

RNNs operate over sequences of vectors: sequences in the input, the output, or, in the most general case, both. CNNs, by contrast, have not only a more constrained Application Programming Interface (API) but also a fixed number of computational steps. Even so, CNNs are in a sense more powerful than RNNs today, mostly because RNNs suffer from vanishing and exploding gradients (beyond about three recurrent layers, performance may drop), whereas CNNs can be stacked into very deep models, which has proven quite effective.

But CNNs are not flawless either. A typical CNN can tell the type of an object but not its location. A plain CNN regresses one object at a time, so when multiple objects appear in the same visual field, bounding-box regression breaks down due to interference. For example, a CNN can detect the bird shown in the model below, but if two birds of different species appear in the same visual field, it cannot tell them apart.

An R-CNN (the R standing for region-based, for object detection) can force the CNN to focus on a single region at a time, so that a specific object dominates within each region. In an R-CNN, candidate regions are first found by a selective search algorithm, then resized to a uniform size before being fed into the CNN for classification and bounding-box regression. This is what lets it localize a particular object.
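The propose-resize-classify pipeline described above can be sketched as follows. Both `propose_regions` (standing in for selective search) and `classify` (standing in for the CNN head) are hypothetical placeholders; the point is the flow: every candidate region is cropped, resized to one fixed size, and classified independently.

```python
import numpy as np

def propose_regions(image):
    # Selective-search stand-in: fixed candidate boxes (x, y, w, h).
    return [(0, 0, 8, 8), (8, 8, 8, 8)]

def crop_and_resize(image, box, size=(4, 4)):
    x, y, w, h = box
    patch = image[y:y+h, x:x+w]
    # Nearest-neighbour resize so every region reaches the "CNN" at one size.
    ys = np.arange(size[0]) * h // size[0]
    xs = np.arange(size[1]) * w // size[1]
    return patch[np.ix_(ys, xs)]

def classify(patch):
    # CNN stand-in: call a patch "object" if it is bright enough.
    return "object" if patch.mean() > 0.5 else "background"

image = np.zeros((16, 16))
image[8:16, 8:16] = 1.0          # one bright "object" in the lower right

for box in propose_regions(image):
    patch = crop_and_resize(image, box)
    print(box, classify(patch))   # one verdict per region
```

Because each region is scored in isolation, two objects in the same image no longer interfere with each other, which is exactly the weakness of the plain CNN that R-CNN fixes.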

Are there techniques that go one step further and locate the exact pixels of each object instead of just bounding boxes? Yes. Image segmentation is what Kaiming He and a team of researchers at Facebook AI, including Girshick, explored using an architecture known as Mask R-CNN, which satisfies that intuition.

How Our Designed Model is Going to Work?

In the model described above, we combine an RNN and a CNN into an R-CNN that performs as a Mask R-CNN. It can identify object outlines at the pixel level by adding a branch to Faster R-CNN that outputs a binary mask saying whether or not a given pixel is part of an object (such as a gun). This helps with semantic and instance segmentation and with eliminating background movement. Our approach uses Augmented Reality to sense space, depth, dimensions, and angle, like a localized GPS, which may help us detect the body pose of a shooter and predict what may happen next by analyzing previous data. The drone provides mobility, discovery, and close-proximity response to save lives immediately.
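What the mask branch adds over a bounding box is easy to see with a toy example. Here the binary mask is hand-made (in Mask R-CNN it is predicted by the network): indexing the image with it picks out the exact object pixels rather than everything inside a rectangle.

```python
import numpy as np

# Toy 4x4 "image" of pixel intensities.
image = np.arange(16, dtype=float).reshape(4, 4)

# Per-pixel binary mask: True where a pixel belongs to the object.
mask = np.array([[0, 0, 0, 0],
                 [0, 1, 1, 0],
                 [0, 1, 1, 0],
                 [0, 0, 0, 0]], dtype=bool)

object_pixels = image[mask]   # exact pixels, not a bounding box
print(object_pixels)          # -> [ 5.  6.  9. 10.]
```

A bounding box around the same object would include background corners; the per-pixel mask is what makes instance segmentation, and background elimination, possible.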

We found the iPhone’s A12 Bionic chip to be a great decentralized edge neural-network engine: the latest iPhone XS Max packs 6.9 billion transistors, a 6-core CPU, and an 8-core Neural Engine on its Bionic SoC, and can perform 5 trillion operations per second, which is well suited to machine learning and AR depth sensing.

References:

1. “US violent crime and murder down after two years of increases, FBI data shows”, The Guardian, 24 September 2018.

2. The definition “without being explicitly programmed” is often attributed to Arthur Samuel, who coined the term “machine learning” in 1959, but the phrase is not found verbatim in this publication and may be a paraphrase that appeared later. Confer “Paraphrasing Arthur Samuel (1959), the question is: How can computers learn to solve problems without being explicitly programmed?” in Koza, John R.; Bennett, Forrest H.; Andre, David; Keane, Martin A. (1996). Automated Design of Both the Topology and Sizing of Analog Electrical Circuits Using Genetic Programming. Artificial Intelligence in Design ’96. Springer, Dordrecht. pp. 151–170.

3. Hof, Robert D. “Is Artificial Intelligence Finally Coming into Its Own?”. MIT Technology Review. Retrieved 2018-07-10.
