Only Numpy: Deriving Forward feed and Back Propagation in Synthetic Gradient (Decoupled Neural Interfaces) with Interactive Code feat. iamtrask
So first thing! Merry Christmas Eve, and happy holidays. I hope everyone is having a lovely holiday. Anyway, let’s get right into it. Before reading, there are 2 very important things I want you guys to know.
- I have not read the paper (Decoupled Neural Interfaces using Synthetic Gradients) yet! I wanted to take a stab at the implementation first and then move on to the paper. I will read the paper soon and make my own implementation soon!!
- All of the mathematics in this post is based on iamtrask’s implementation of Synthetic Gradients. Please follow the link here to read the AMAZING tutorial.
Update: I forgot to mention that LR means Learning Rate.
Before we start, there are some notations I wish to clear up. Our cost function is the equation you guys see above: Cost = Layer3 - Y. Also, for now ignore the numbers at the top right corner, but they become very important very soon!!!! Now, the network architecture.
We only have three layers, L1, L2, L3, and three Synthetic Gradient Generators, L1SGG, L2SGG, L3SGG (the diamonds below each layer). The dimensions are [1000,24], [1000,128], [1000,64], [1000,12] for the input layer, layer 1, layer 2, and layer 3 respectively. Now let’s look at the dimensions of each weight and synthetic gradient.
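The layer shapes above pin down the shapes of the regular weights. Here is a minimal NumPy sketch of them. The generator weight shapes (one square matrix per layer) are an assumption on my part, based on how I understand iamtrask’s tutorial to set them up, and the names `SG1`, `SG2`, `SG3` and the random initialisation are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

batch, n_in, n_l1, n_l2, n_l3 = 1000, 24, 128, 64, 12

x = rng.random((batch, n_in))                # input layer: [1000, 24]

# Regular weights map one layer's width to the next.
W1 = rng.standard_normal((n_in, n_l1))       # [24, 128]
W2 = rng.standard_normal((n_l1, n_l2))       # [128, 64]
W3 = rng.standard_normal((n_l2, n_l3))       # [64, 12]

# Generator weights (assumption: one square linear map per layer).
SG1 = rng.standard_normal((n_l1, n_l1))      # [128, 128]
SG2 = rng.standard_normal((n_l2, n_l2))      # [64, 64]
SG3 = rng.standard_normal((n_l3, n_l3))      # [12, 12]

shapes = {name: arr.shape for name, arr in
          [("x", x), ("W1", W1), ("W2", W2), ("W3", W3),
           ("SG1", SG1), ("SG2", SG2), ("SG3", SG3)]}
for name, shape in shapes.items():
    print(name, shape)
```

Under that assumption, the synthetic gradient of a layer has the same shape as that layer’s output, for example [1000,128] for layer 1.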
That’s it! Again, for review: WkSG -> Layer K Synthetic Gradient. One last thing to note: the way I wrote the dot product and matrix transpose might confuse some people.
Notation CIRCLE DOT IN THE MIDDLE is dot product notation, so the above equation is Layer1 = logistic function (dot_product(x,w1)).
Notation CIRCLE IN THE BOTTOM RIGHT is matrix transpose, so the above equation is dot_product(L1WSG,W1_Transposed)
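In NumPy terms the two notations map onto `np.dot` and `.T`. This is a tiny sketch with a made-up batch of 5 rows; the `logistic` helper name is mine, and `L1WSG` here is just random numbers standing in for layer 1’s synthetic gradient.

```python
import numpy as np

def logistic(z):
    # Logistic (sigmoid) function, applied element-wise.
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.random((5, 24))                  # small stand-in for the [1000, 24] input
W1 = rng.standard_normal((24, 128))
L1WSG = rng.standard_normal((5, 128))    # placeholder values, not a learned gradient

# Circle-dot notation: Layer1 = logistic function(dot_product(x, w1))
layer1 = logistic(np.dot(x, W1))         # shape (5, 128)

# Transpose notation: dot_product(L1WSG, W1_Transposed)
passed_back = np.dot(L1WSG, W1.T)        # shape (5, 24)

print(layer1.shape, passed_back.shape)
```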
NOW LET’S PERFORM THE FORWARD FEED OPERATION AS WELL AS BACK PROPAGATION!!
Above is iamtrask’s implementation of synthetic gradients (again, link here). As seen, there are two parts for each layer, and we will look at each operation separately. I’ll describe what they are doing, starting with layer 1.
Forward and Synthetic Update: Performs the regular forward feed operation, and updates the current weight using the synthetic gradient.
Update Synthetic Weights: Using the true gradient, updates the weights of the synthetic gradient generator.
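The two parts can be sketched as a small class. This is my own rough version, not iamtrask’s code: the class name `DNILayer`, the linear generator `W_sg`, the plain gradient descent signs, and the averaging over the batch are all assumptions I made to keep the sketch stable.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_deriv(out):
    # Derivative of the logistic function, written in terms of its output.
    return out * (1.0 - out)

class DNILayer:
    """Hypothetical layer that owns its synthetic gradient generator."""

    def __init__(self, n_in, n_out, lr=0.01, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((n_in, n_out)) * 0.1
        self.W_sg = np.zeros((n_out, n_out))   # generator weights (assumed linear)
        self.lr = lr

    def forward_and_synthetic_update(self, x):
        self.x = x
        self.out = logistic(x.dot(self.W))
        # Predicted gradient of the cost with respect to this layer's output.
        self.synthetic_grad = self.out.dot(self.W_sg)
        delta = self.synthetic_grad * logistic_deriv(self.out)
        # Value handed back to the previous layer as its "true" gradient.
        grad_for_prev = delta.dot(self.W.T)
        # Update the regular weights right away, using only the prediction.
        self.W -= self.lr * self.x.T.dot(delta) / len(x)
        return grad_for_prev, self.out

    def update_synthetic_weights(self, true_grad):
        # Move the generator's prediction toward the gradient it was given.
        err = self.synthetic_grad - true_grad
        self.W_sg -= self.lr * self.out.T.dot(err) / len(self.out)

layer = DNILayer(n_in=24, n_out=128)
x = np.random.default_rng(1).random((1000, 24))
grad_for_prev, out = layer.forward_and_synthetic_update(x)
print(out.shape, grad_for_prev.shape)
```

The first method never needs anything from later layers, which is the whole point. Only the second method waits for a gradient to arrive.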
So, there are many things going on here. Firstly, the blue numbers in the square boxes describe the dimensionality of the resulting expression. Second….
C1 -> Forward and Synthetic Update
C2 -> Update Synthetic Weights
Basically, in the C1 step we perform the regular forward feed operation to get the output of each layer. For example, Layer1 = log(dot_product(x,w1)). Right after the forward feed operation, we update the weight W1 using W1SG (the synthetic gradient).
Log() => Short for Logistic Function
Two very interesting things to note from the above image: first, how similar the process is to a regular gradient update. Second, the Return statement written at the bottom. Don’t worry, I will explain both.
So below is the somewhat traditional method of performing a gradient update.
There are three main components in getting the derivative of the cost function with respect to a certain weight. Using the chain rule, we need the derivative of the cost with respect to the output of a certain layer, the derivative of that layer’s output with respect to its input, and the derivative of that input with respect to the weight. Now compare it with our forward feed process: it looks very similar! (This was obvious, but I still thought it was kinda cool.)
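To make the three components concrete, here is a small sketch for a single logistic layer, checked against a finite-difference gradient. I am assuming a squared-error cost, 0.5 * sum((out - y)^2), so that the derivative of the cost with respect to the output is out - y. The tiny shapes and variable names are made up for the example.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.random((4, 3))
y = rng.random((4, 2))
W = rng.standard_normal((3, 2))

def cost(weights):
    # Assumed cost: half the summed squared error.
    return 0.5 * np.sum((logistic(x.dot(weights)) - y) ** 2)

out = logistic(x.dot(W))
d_cost_d_out = out - y              # 1) cost with respect to the layer output
d_out_d_in = out * (1.0 - out)      # 2) layer output with respect to layer input
d_in_d_W = x                        # 3) layer input with respect to the weight
grad_W = d_in_d_W.T.dot(d_cost_d_out * d_out_d_in)

# Central finite differences as an independent check of the chain rule.
eps = 1e-6
numeric = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[i, j] += eps
        W_minus[i, j] -= eps
        numeric[i, j] = (cost(W_plus) - cost(W_minus)) / (2 * eps)

max_diff = np.max(np.abs(grad_W - numeric))
print(max_diff)
```

With synthetic gradients, the only thing that changes is where the first component comes from: the generator predicts it instead of waiting for the cost.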
Now the second part, the Return statement. If you read the original tutorial by iamtrask you will get this part. Basically, we declared each layer as a class, and we want to use what the next layer hands back to update the weights for our current layer, so each layer returns some numbers. (I won’t go into the math here, but if you look at L1WSG * W1_Transpose, it actually makes so much sense; it is very similar to a traditional gradient update.)
Now let’s see the operations for the rest of the layers!
Again, just to recap…
d1 -> Forward and Synthetic Update
d2 -> Update Synthetic Weight
e1 -> Forward and Synthetic Update
e2 -> Update Synthetic Weights
That’s it for the forward feed operation and back propagation. However, THAT’S NOT ALL! We will now observe the true power of Decoupled Neural Interfaces: why it is so useful and where that power comes from.
As you have noticed, every operation has a number and a roman numeral associated with it. This is the order in which we can perform the operations, and THIS is the reason why Synthetic Gradients are AMAZING. In a traditional NN, every layer is locked and cannot be trained asynchronously. But since our network is decoupled, we can train in a different order. I’ll show you.
Red Box Ordering : When we ARE NOT using MULTI THREADING
Blue Box Ordering : When we ARE using MULTI THREADING
Red Box Ordering Case : 1 -> 2 -> 3 -> 4 -> 5 -> 6
In other words, right after we perform the forward feed operation, we pass the resulting value to the next layer and wait for the next layer to return some value. Using that value from the next layer, we perform Update Synthetic Weights. For example, after calculating layer 1, we can give the calculated layer 1 directly to layer 2, which will give us L2WSG * W2_Transpose. Then without hesitation we can perform Update Synthetic Weights on layer 1! We didn’t even have to wait until the last layer (layer 3) calculated the cost function! This is great, but we can do even better!
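Here is a sketch of that single-threaded ordering. The figure with the numbered boxes is not reproduced here, so the mapping of steps 1 to 6 below is my reading of the description above. `StubLayer` is a hypothetical stand-in that does no math and only records the order of the calls.

```python
calls = []

class StubLayer:
    """Hypothetical stand-in: records call order, does no real math."""

    def __init__(self, name):
        self.name = name

    def forward_and_synthetic_update(self, x):
        calls.append(f"{self.name} forward and synthetic update")
        return x, x   # placeholders for (gradient for previous layer, output)

    def update_synthetic_weights(self, true_grad):
        calls.append(f"{self.name} update synthetic weights")

def red_box_step(l1, l2, l3, x, y):
    _, out1 = l1.forward_and_synthetic_update(x)           # 1
    grad1, out2 = l2.forward_and_synthetic_update(out1)    # 2
    l1.update_synthetic_weights(grad1)                     # 3: no need to wait for layer 3
    grad2, out3 = l3.forward_and_synthetic_update(out2)    # 4
    l2.update_synthetic_weights(grad2)                     # 5
    l3.update_synthetic_weights(out3 - y)                  # 6: last layer uses Layer3 - Y

red_box_step(StubLayer("L1"), StubLayer("L2"), StubLayer("L3"), x=1.0, y=0.0)
for number, call in enumerate(calls, start=1):
    print(number, call)
```

Any layer object that exposes these two methods could be dropped in place of the stub.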
Blue Box Ordering Case : I -> II -> III -> IV -> V
Now we are using multi threading, which means we can perform multiple tasks at the same time. So let’s look at an example again. After we calculate the layer 1 value we give that value to layer 2, which again will give us L2WSG * W2_Transpose. Here we can spawn a new process to calculate Update Synthetic Weights on layer 1 while passing the resulting layer 2 calculation on to layer 3! Now that’s sexy ;)
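A rough sketch of that overlap, using a thread pool from Python’s standard library rather than a separate process (that swap is my choice for the example). The layers are again hypothetical stubs; they only record which thread ran each step. The exact I to V mapping from the figure is not reproduced, so only the overlap described above is meant literally.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

ran_on = {}                       # step label -> name of the thread that ran it
ran_on_lock = threading.Lock()

class StubLayer:
    """Hypothetical stand-in: records the executing thread, does no real math."""

    def __init__(self, name):
        self.name = name

    def _record(self, step):
        with ran_on_lock:
            ran_on[f"{self.name} {step}"] = threading.current_thread().name

    def forward_and_synthetic_update(self, x):
        self._record("forward")
        return x, x   # placeholders for (gradient for previous layer, output)

    def update_synthetic_weights(self, true_grad):
        self._record("update")

def blue_box_step(pool, l1, l2, l3, x, y):
    _, out1 = l1.forward_and_synthetic_update(x)
    grad1, out2 = l2.forward_and_synthetic_update(out1)
    # Layer 1's synthetic weight update goes to a worker thread ...
    pending1 = pool.submit(l1.update_synthetic_weights, grad1)
    # ... while layer 3's forward pass carries on in the calling thread.
    grad2, out3 = l3.forward_and_synthetic_update(out2)
    pending2 = pool.submit(l2.update_synthetic_weights, grad2)
    l3.update_synthetic_weights(out3 - y)
    # Wait for the background updates before the next training step.
    pending1.result()
    pending2.result()

with ThreadPoolExecutor(max_workers=2, thread_name_prefix="sg") as pool:
    blue_box_step(pool, StubLayer("L1"), StubLayer("L2"), StubLayer("L3"), 1.0, 0.0)

for step, thread_name in sorted(ran_on.items()):
    print(step, "ran on", thread_name)
```

This only shows where the overlap sits in the ordering; it makes no claim about speed.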
I have modified the original code from iamtrask to meet the Red Box and Blue Box situations. I’ll give you the code right away, however please do read the next section, the results of Red Box Case vs Blue Box Case.
Red Box Case vs Blue Box Case
In theory spawning a child process to train the network sounds sexy as hell, however my experiments resulted in a different outcome.
As seen above, the Red Box method is far faster than the Blue Box method (around 98.27 percent faster). This is a disappointment LOL, and I honestly have no idea why this happens. But well, I will get back to this question after reading the original paper! (If you know why this happens, please leave a note in the comment section below!)
This is my third post on Medium and I am quite new to this, so constructive criticism would be very much appreciated, thanks! (But please do note, I wrote this post from 6 am to 11 am because I have a serious case of insomnia LOL, so my Engrish is very bad.)
Thanks for reading!