The Shortest Introduction To Deep Learning You Will Find On The Web

Christoph Ostertag
Published in Analytics Vidhya · 4 min read · Jun 25, 2019
Simple Artificial Neural Network with three hidden layers. Source: https://orbograph.com/wp-content/uploads/2019/01/DeepLearn.png

The most promising development in machine learning (ML) is deep learning (DL). In DL we build an artificial neural network (ANN): essentially stacked layers of linear regression, each followed by an activation function that brings non-linearity into the ANN. These activation functions were traditionally sigmoid-shaped, like the logistic function (often just called the sigmoid) or the hyperbolic tangent (tanh). More recently the rectified linear unit ( ReLU: f(x) = max(x, 0) ) has shown improved results in most cases, even though it was at first dismissed by mathematicians because it is not differentiable at x = 0. Kurt Hornik of WU Vienna showed in 1989 that such ANN architectures can approximate any continuous function, even with just one hidden layer; this is known as the universal approximation theorem (UAT).
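As a quick sketch, the three activation functions mentioned above can be written in a few lines of NumPy:

```python
import numpy as np

def sigmoid(x):
    # logistic function: squashes inputs to (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # hyperbolic tangent: squashes inputs to (-1, 1)
    return np.tanh(x)

def relu(x):
    # rectified linear unit: f(x) = max(x, 0)
    return np.maximum(x, 0.0)
```

Note how ReLU keeps positive inputs unchanged and zeros out the rest, which is exactly the kink at x = 0 that makes it non-differentiable there.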

Most common activation functions. Source: https://www.kdnuggets.com/wp-content/uploads/activation.png
This figure shows nicely how an activation function (here the sigmoid) is applied after every layer of matrix multiplication. Source: https://www.bogotobogo.com/python/scikit-learn/images/NeuralNetwork2-Forward-Propagation/NN-with-components-w11-etc.png

One interesting point about DL is that it can take data of almost any shape as input and output. This means data is data: it does not have to be special, domain-specific data. We can use the same techniques for stock price prediction, medical diagnosis, or customer churn prediction, with only minor practical adjustments.

A central problem in DL is optimization. Professor Hinton from the University of Toronto and his colleagues popularized the backpropagation algorithm in 1986. This algorithm computes the partial derivative of a cost function with respect to every parameter of the ANN, where the cost is derived from the difference between the model output Y and the target output T. Commonly used cost functions include mean squared error for regression and cross-entropy for classification tasks. With modern libraries like PyTorch or TensorFlow, computing these derivatives is handled automatically. We only have to define the forward pass, i.e. the ANN function itself. It consists of a matrix multiplication (linear regression) for each layer followed by an activation function that brings non-linearity into the ANN, as mentioned before.
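A minimal sketch of such a forward pass, using NumPy, made-up layer sizes (3 inputs, 4 hidden units, 1 output), a sigmoid hidden layer, and the mean-squared-error cost:

```python
import numpy as np

rng = np.random.default_rng(0)

# made-up layer sizes: 3 inputs -> 4 hidden units -> 1 output
W1 = rng.normal(size=(3, 4))
b1 = np.zeros(4)
W2 = rng.normal(size=(4, 1))
b2 = np.zeros(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(X):
    # layer 1: matrix multiplication (linear regression) + non-linearity
    H = sigmoid(X @ W1 + b1)
    # layer 2: linear output layer, as usual for regression
    return H @ W2 + b2

def mse(Y, T):
    # mean squared error between model output Y and target T
    return np.mean((Y - T) ** 2)

X = rng.normal(size=(5, 3))   # 5 samples with 3 features each
T = rng.normal(size=(5, 1))   # 5 target values
loss = mse(forward(X), T)
```

In PyTorch or TensorFlow this is all we would have to write; the gradients of `loss` with respect to `W1`, `b1`, `W2` and `b2` would then be derived automatically by backpropagation.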

Matrix Multiplication. Source: https://ml-cheatsheet.readthedocs.io/en/latest/_images/dynamic_resizing_neural_network_4_obs.png

In DL we had, and still have, a huge problem with variance: our models fit too much noise, which often leads to near-perfect in-sample fits but much worse out-of-sample performance. This is partly a consequence of the UAT: since an ANN can in theory fit any function perfectly, it often fits the in-sample data almost perfectly, yet it does not care about the real dependencies. It just minimizes the cost function by describing the in-sample distribution. This is known as the bias-variance trade-off; it has been a problem for quite some time and is still not fully resolved. However, concepts like simply using more training data, transfer learning, normalization (batch, weight, layer, group, etc.) and weight sharing have been shown to improve the generalization ability of ANNs. The problem with these concepts, however, is that they do not adjust for concept shift: the change in the relationship between input X and output Y over time.

(Deep) reinforcement learning (RL) may yield a solution to this problem. In RL an agent interacts with a virtual or real environment and tries to maximize reward, thereby generating new data it can train on. Most notable here is the development of AlphaZero by DeepMind, which learned to play Go, chess and shogi by playing against itself. Within hours it was better at all of these games than any human or software player before it. While RL is not widely exploited in business cases at the moment, further development could increase its scope and importance, and it may be our best shot at developing artificial general intelligence (AGI).
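The variance problem can be illustrated with a tiny experiment (a sketch using NumPy, with a high-degree polynomial as a stand-in for a very flexible model): the flexible model drives the in-sample error toward zero, while its out-of-sample error on fresh data from the same distribution is no better, and typically worse.

```python
import numpy as np

rng = np.random.default_rng(1)

def noisy_line(n):
    # the true relationship is linear; everything on top is noise
    x = rng.uniform(-1, 1, n)
    return x, 2.0 * x + rng.normal(scale=0.3, size=n)

x_train, y_train = noisy_line(15)   # small training set
x_test, y_test = noisy_line(200)    # fresh out-of-sample data

results = {}
for degree in (1, 12):
    # degree 1 matches the true model; degree 12 can fit the noise
    coefs = np.polyfit(x_train, y_train, degree)
    results[degree] = (
        np.mean((np.polyval(coefs, x_train) - y_train) ** 2),  # in-sample MSE
        np.mean((np.polyval(coefs, x_test) - y_test) ** 2),    # out-of-sample MSE
    )
```

The degree-12 fit always achieves an in-sample error at most that of the degree-1 fit, yet it describes the noise of the 15 training points rather than the underlying line.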

Reinforcement Learning. Source: https://www.kdnuggets.com/images/reinforcement-learning-fig1-700.jpg

If you have any remarks or questions please feel free to use the comment section.


Follow me on Medium to not miss out on any new posts on AI, machine learning, math and entrepreneurship! Christoph Ostertag


Co-founder of talentbase. We help data science students to land their first job. https://www.talentbase.tech