My Absolute Nonsense Theory
So I was thinking: in Korea we have this word, 이열치열. It means fight fire with fire, or force with force. My parents would tell me this during the summer and winter, when the weather got too hot or too cold. They would tell me to eat hot soup in the summer to make the heat go away, and ice cream in the winter to make the cold go away. Thinking back, the logic doesn’t make any sense, but keep this idea in mind while reading the next part.
One challenge for neural networks is knowing the ground truth representation of the given data in the hidden layers. Simply put, we don’t have ground truth values for the hidden layers, so naturally we don’t know how the data is supposed to be represented there.
But we need hidden layers for the network to perform complex tasks. To quote directly from “Learning internal representations by error propagation”, the paper published in 1986 by Dr. Hinton, Dr. Rumelhart and Dr. Williams:
This problem and many others like it cannot be performed by networks without hidden units with which to create their own internal representations of the input patterns.
A very good example is the classic XOR problem.
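To make the XOR point concrete, here is a minimal sketch of a network with one hidden layer that computes XOR. The weights are hand-picked for illustration (they are my assumption, not learned and not taken from the paper): one hidden unit acts as OR, the other as AND, and the output fires for "OR but not AND".

```python
import numpy as np

def step(z):
    # Heaviside step activation: 1 where z > 0, else 0
    return (z > 0).astype(int)

def xor_net(x):
    # Hidden layer (hand-picked weights): unit 0 acts as OR, unit 1 as AND
    w_hidden = np.array([[1.0, 1.0],
                         [1.0, 1.0]])
    b_hidden = np.array([-0.5, -1.5])
    h = step(x @ w_hidden + b_hidden)
    # Output unit: fires when OR is on and AND is off
    w_out = np.array([1.0, -1.0])
    b_out = -0.5
    return step(h @ w_out + b_out)

inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
print(xor_net(inputs))  # [0 1 1 0]
```

Without the hidden layer there is no single line that separates (0,1) and (1,0) from (0,0) and (1,1), which is why the hidden units are needed here.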
And this is my understanding of NNs as well. The model is trying to learn or recognize patterns in the given data. Once a certain pattern is learned from the training data, we can take advantage of it to classify images, patient health status, and so on.
However, the data could contain noise that makes it harder for the model to learn proper weights that recognize this pattern.
From here, I got to thinking: can we cancel out the noise in the data with noise?
Like how a sine wave and a copy of it shifted by half a period cancel each other, and like 이열치열, can we fight noise with noise? I wanted to test this idea.
I am not good at math, and this method does not have a solid proof of why it could be somewhat working. If any mathematician knows exactly why this happens (and possibly the theory behind it), please comment down below.
Data / Network Architecture / Forward Feed Operation
We are going to test the Noise Training idea on a simple classification problem. Given the MNIST data, we are going to perform classification only on the 0 and 1 images. As seen in the right image, our neural network is a simple 3-layer network. All of our networks share the exact same architecture and the same hyperparameters. Finally, we are using the Mean N2 cost function.
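As a rough sketch of what the feed forward operation of such a 3-layer network can look like in NumPy: the layer sizes, the sigmoid activation, and the random stand-in data below are all my assumptions for illustration, and I am reading the "Mean N2" cost as a mean squared error.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    # Logistic activation; an assumed choice, the post does not name one
    return 1.0 / (1.0 + np.exp(-z))

def init_params(rng, sizes=(784, 64, 32, 1)):
    # One (weights, bias) pair per layer; the sizes are hypothetical
    return [(rng.normal(0.0, 0.1, (n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def forward(x, params):
    # Feed the input through all three layers, keeping each activation
    activations = []
    a = x
    for w, b in params:
        a = sigmoid(a @ w + b)
        activations.append(a)
    return activations

def cost_fn(pred, y):
    # Mean squared error, used here as a stand-in for the "Mean N2" cost
    return 0.5 * np.mean((pred - y) ** 2)

# Random stand-in for 20 flattened 28x28 images with 0/1 labels (not MNIST)
x = rng.random((20, 784))
y = rng.integers(0, 2, size=(20, 1)).astype(float)

params = init_params(rng)
a1, a2, a3 = forward(x, params)
print(a3.shape, cost_fn(a3, y))
```

To use real data, `x` and `y` would be replaced by the MNIST images labelled 0 or 1, flattened to 784 values each.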
Proper Back Propagation
As seen above, we have our proper back propagation, where the chain rule is kept and we are not breaking any derivatives.
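For readers without the image, a minimal sketch of proper back propagation for a 3-layer network follows. It assumes sigmoid activations, a mean squared error cost, and toy data in place of MNIST; the gradient names (w1g, b1g, and so on) follow the post, everything else is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, p):
    w1, b1, w2, b2, w3, b3 = p
    a1 = sigmoid(x @ w1 + b1)
    a2 = sigmoid(a1 @ w2 + b2)
    a3 = sigmoid(a2 @ w3 + b3)
    return a1, a2, a3

def cost_fn(pred, y):
    # Mean squared error (assumed reading of the cost used in the post)
    return 0.5 * np.mean((pred - y) ** 2)

def backward(x, y, p):
    # Chain rule applied layer by layer, from the output back to the input.
    # Returns [w1g, b1g, w2g, b2g, w3g, b3g].
    w1, b1, w2, b2, w3, b3 = p
    a1, a2, a3 = forward(x, p)
    d3 = (a3 - y) / x.shape[0] * a3 * (1 - a3)   # dCost/dz3 (single output unit)
    d2 = (d3 @ w3.T) * a2 * (1 - a2)             # dCost/dz2
    d1 = (d2 @ w2.T) * a1 * (1 - a1)             # dCost/dz1
    return [x.T @ d1, d1.sum(0), a1.T @ d2, d2.sum(0), a2.T @ d3, d3.sum(0)]

# Toy two-class data (hypothetical stand-in for the 0/1 MNIST images)
x = rng.normal(size=(40, 4))
y = (x[:, :1] > 0).astype(float)

params = []
for n_in, n_out in [(4, 8), (8, 8), (8, 1)]:
    params += [rng.normal(0.0, 0.5, (n_in, n_out)), np.zeros(n_out)]

initial_cost = cost_fn(forward(x, params)[-1], y)
for _ in range(500):
    grads = backward(x, y, params)
    params = [p - 0.5 * g for p, g in zip(params, grads)]  # plain gradient descent
final_cost = cost_fn(forward(x, params)[-1], y)
print(initial_cost, final_cost)
```

Every gradient here is a true derivative of the cost, which is exactly the part that Noise Training throws away.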
Noise Training Back Propagation
Red Box → Cost function of our network. Again, we are using the Mean N2 cost.
Blue Box → This is the KEY DIFFERENCE in our network! We are generating RANDOM NOISE (in the above case, we are sampling from a Gumbel distribution). Why? Because we are going to use this noise as our gradient to update our weights.
Gray Box → Here we are going to update our weights. Please note the DIFFERENCE between (w3g,b3g), (w2g,b2g) and (w1g,b1g). If we let the ‘Cost’ be a signal to the network of how the network is doing, then we can easily assume this: as the network performs well, the signal decays, so the update on the weights becomes smaller.
For (w3g,b3g) we are multiplying 0.01 → To give small signal to weights
For (w2g,b2g) we are multiplying 0.1 → To give medium signal to weights
For (w1g,b1g) we are multiplying 1.0 → To give large signal to weights
Green Box → Step Wise Learning Rate Decay
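The four boxes can be sketched roughly as follows. This is only my reading of the description, not the exact code from the image: I assume the noise is multiplied by the cost, by the per-layer factor and by the learning rate, and then subtracted from the weights, and I assume sigmoid activations, a mean squared error cost, toy data, and a learning rate that is halved every 25 epochs.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, p):
    a1 = sigmoid(x @ p["w1"] + p["b1"])
    a2 = sigmoid(a1 @ p["w2"] + p["b2"])
    return sigmoid(a2 @ p["w3"] + p["b3"])

def noise_step(p, cost, lr, rng):
    # Blue box: random noise stands in for the gradient of every parameter.
    # Gray box: the noise is scaled by the cost and by a per-layer factor
    # (0.01 for layer 3, 0.1 for layer 2, 1.0 for layer 1).
    layer_scale = {"3": 0.01, "2": 0.1, "1": 1.0}
    new_p = {}
    for name, value in p.items():
        noise = rng.gumbel(size=value.shape)
        new_p[name] = value - lr * layer_scale[name[-1]] * cost * noise
    return new_p

rng = np.random.default_rng(0)
x = rng.normal(size=(40, 4))            # toy inputs, not MNIST
y = (x[:, :1] > 0).astype(float)        # toy 0/1 labels

shapes = {"w1": (4, 8), "b1": (8,), "w2": (8, 8), "b2": (8,),
          "w3": (8, 1), "b3": (1,)}
params = {k: rng.normal(0.0, 0.5, s) for k, s in shapes.items()}

lr = 0.01
costs = []
for epoch in range(100):
    cost = 0.5 * np.mean((forward(x, params) - y) ** 2)   # red box
    costs.append(cost)
    params = noise_step(params, cost, lr, rng)
    if (epoch + 1) % 25 == 0:                             # green box (assumed schedule)
        lr *= 0.5
print(costs[0], costs[-1])
```

Note that no derivative is computed anywhere. The only feedback the weights get is the size of the cost, so when the cost is zero the update is zero as well.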
List of used Random Distribution
Including back propagation, there are a total of 14 methods to train each network.
a → Gumbel Distribution
b → Gaussian Distribution
c → Standard Normal Distribution
d → Binomial Distribution
e → Beta Distribution
f → Poisson Distribution
g → Zipf Distribution
h → Pareto Distribution
i → Power Distribution
j → Rayleigh Distribution
k → Triangular Distribution
l → Weibull Distribution
m → Noncentral Chi-square Distribution
n → Our Favorite Back Propagation
A more detailed description of each distribution can be found in the Scipy docs; to view them, please follow this link. Also, please note the letter assigned to each distribution, since I will use it to refer to the corresponding network.
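All thirteen noise sources in the list are available as samplers in numpy.random. The sketch below maps each letter to one of them; the distribution parameters (scale, p, a, df and so on) are placeholder values I picked for illustration, not the ones used in the experiment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Letter -> sampler. All parameter values are illustrative assumptions.
samplers = {
    "a": lambda size: rng.gumbel(size=size),
    "b": lambda size: rng.normal(loc=0.0, scale=0.1, size=size),
    "c": lambda size: rng.standard_normal(size=size),
    "d": lambda size: rng.binomial(n=1, p=0.5, size=size),
    "e": lambda size: rng.beta(a=0.5, b=0.5, size=size),
    "f": lambda size: rng.poisson(lam=1.0, size=size),
    "g": lambda size: rng.zipf(a=2.0, size=size),
    "h": lambda size: rng.pareto(a=3.0, size=size),
    "i": lambda size: rng.power(a=2.0, size=size),
    "j": lambda size: rng.rayleigh(scale=1.0, size=size),
    "k": lambda size: rng.triangular(left=-1.0, mode=0.0, right=1.0, size=size),
    "l": lambda size: rng.weibull(a=1.5, size=size),
    "m": lambda size: rng.noncentral_chisquare(df=3.0, nonc=1.0, size=size),
}

# Draw a noise "gradient" with the shape of a small weight matrix
for key, draw in samplers.items():
    sample = draw((3, 2))
    print(key, sample.shape, sample.dtype)
```

Some of these (binomial, Poisson, Zipf) return integers and several are strictly non-negative, so the noise is far from zero-centered in many cases.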
Training Result — Number of Epoch 100
As seen above, n (back propagation) took the most time to train; for the other networks, it was around 77ish to 101ish seconds. This is expected, since back propagation needs to compute a lot of derivatives before updating the weights.
Training Result — # of Misclassified Images among 20 Test Images
Now, I know the test set only contains 20 images, but I was still very amused by the fact that the a (Gumbel Distribution) network and the e (Beta Distribution) network performed quite well, with only 2 misclassified images. However, of course, neither of them was even close to the accuracy of the back-propagated model n, which had 100% accuracy.
So I guess we made some kind of trade-off. Training a neural network with back propagation took the longest time but had the highest accuracy, while training with noise had a shorter training time with so-so accuracy.
Warning! As seen in the Red Box, the total run time of this whole program is around 1349 seconds, so grab a coffee while you run it. The program will generate images after running all of the networks; you can access all of them in the tab shown in the Green Box.
To access the interactive code, please click this link.
Warning 2. I just noticed that if you are not a subscriber on Repl (the interactive code website), you can’t fully run the code. Another way to run this code is to download all of the files from the web and run them locally.
Update: I moved to Google Colab for interactive code! You will need a Google account to view the code, and since you can’t run read-only scripts in Google Colab, make a copy in your own playground. Happy coding! However, after moving I noticed something very different: since Tensorflow’s MNIST dataset is arranged in a different order, the training results differ from what is shown above. Changing the random seed value for Numpy either improved or worsened the results for me. Finally, I will never ask for permission to access your files on Google Drive, just FYI.
Two things I wish to make note of. First, I think back propagation is one of the best training methods for deep neural networks, and I strongly believe it will continue to be one of the best in the future as well.
Second, this whole experiment was to see if it is possible to ‘somewhat’ train a neural network without back propagation, and I think we did, to a certain extent. However, there is one HUGE problem. If we plot the cost values for each network, we get a graph like the one below.
As seen above, the a (Gumbel Distribution) network stops learning around a cost value of 0.2 (Red Box), while the n (Back Propagation) network is able to bring the cost down to 0 (Green Box). So I don’t think vanilla gradient descent is a proper optimization algorithm for Noise Training. It would be REALLY satisfying to know how to get out of local minimum points effectively, as well as the exact math behind all of this.
If any errors are found, please email me at firstname.lastname@example.org.
- Random sampling (numpy.random). (n.d.). Retrieved February 01, 2018, from https://docs.scipy.org/doc/numpy/reference/routines.random.html
- Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1985). Learning internal representations by error propagation (No. ICS-8506). California Univ San Diego La Jolla Inst for Cognitive Science.
- T. (2009, September 17). Know Your Distribution Types. Retrieved February 01, 2018, from https://www.youtube.com/watch?v=-PwugYB9Zjs
- V. (2011, January 07). Maths Tutorial: Describing Statistical Distributions (Part 1 of 2). Retrieved February 01, 2018, from https://www.youtube.com/watch?v=achLJ8PRyBw
- C. (2014, June 30). Understanding Random Variables — Probability Distributions 1. Retrieved February 01, 2018, from https://www.youtube.com/watch?v=lHCpYeFvTs0