Only Numpy: Decoupled Recurrent Neural Network, modified NN from Google Brain, Implementation with Interactive Code

Jae Duk Seo
Feb 3, 2018 · 5 min read

So I was talking to one of my friends, Michael, who is also very interested in Machine Learning, about Google Brain’s Decoupled Neural Interfaces Using Synthetic Gradients. He inspired this idea: a Decoupled Recurrent Neural Network. Can we make it? And more importantly, will it even perform well?

Before moving on: IamTrask covered the Decoupled NN by Google Brain REALLY well, so check out his tutorial, it’s amazing. I also did my crappy take on the Decoupled NN, here.

Preparing Data / Declaring Hyper Parameter

As seen above, we are going to perform a simple classification task on the MNIST data set, using only the images of 0 and 1. We are going to process each image as a vector, not in its (28*28) image format. Also, please take note of the Green Box regions, because those are our two Gradient Jesus oracles, telling us their estimate of the true gradient.
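To make the data step concrete, here is a minimal sketch of the filtering and flattening described above. It is not the code from the notebook: the helper name `select_binary_digits` is made up for illustration, and random arrays stand in for the real MNIST images so the snippet runs without downloading anything.

```python
import numpy as np

def select_binary_digits(images, labels, keep=(0, 1)):
    """Keep only samples whose label is in `keep`, flattening each image to a vector."""
    images = np.asarray(images, dtype=np.float64)
    labels = np.asarray(labels)
    mask = np.isin(labels, keep)
    n_pixels = int(np.prod(images.shape[1:]))        # 28 * 28 = 784 for MNIST
    flat = images[mask].reshape(int(mask.sum()), n_pixels)
    targets = labels[mask].reshape(-1, 1).astype(np.float64)
    return flat, targets

# Stand-in data with MNIST's image shape (random pixels, NOT real digits).
rng = np.random.default_rng(0)
fake_images = rng.random((10, 28, 28))
fake_labels = np.array([0, 1, 2, 3, 1, 0, 7, 1, 9, 0])

x, y = select_binary_digits(fake_images, fake_labels)
print(x.shape, y.shape)
```

With real data you would pass the MNIST image and label arrays instead of the fake ones; the rest should stay the same.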

Network Architecture

As seen above, the architecture is very simple and plain. Before moving on to LSTM or GRU, I wanted to take my first baby step with a vanilla RNN. But as you can see highlighted in the Green Boxes, we have our two oracles, telling us their estimate of the true gradient.
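For readers who have not seen a vanilla RNN written out, a rough sketch of the recurrent part is below. Everything about the sizes is an assumption for illustration (the hidden size of 16, the tanh activation, three time steps, and feeding a fresh 784-vector at each step); only the names Wx and Wrec come from this post.

```python
import numpy as np

def rnn_step(x_t, h_prev, w_x, w_rec):
    """One vanilla RNN time step: h_t = tanh(x_t @ w_x + h_prev @ w_rec)."""
    return np.tanh(x_t @ w_x + h_prev @ w_rec)

rng = np.random.default_rng(1)
batch, n_in, n_hidden = 4, 784, 16                  # assumed sizes
w_x = rng.normal(scale=0.01, size=(n_in, n_hidden))
w_rec = rng.normal(scale=0.01, size=(n_hidden, n_hidden))

h = np.zeros((batch, n_hidden))                     # initial hidden state
for t in range(3):                                  # three time steps, assumed
    x_t = rng.random((batch, n_in))                 # stand-in input for step t
    h = rnn_step(x_t, h, w_x, w_rec)

print(h.shape)
```

The oracles are not part of this snippet; they sit on top of the hidden state and guess the gradient that would otherwise arrive from later time steps.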

Forward Feed / Back Propagation / Gradient Update

Pink Box → Every RNN has a notion of a Time Stamp. In this case, since there is A LOT going on within each time stamp, I decided to call them Time Boxes LOL.

Red Box → Order of execution. Please take note that we don’t have to wait for the forward feed process to fully finish before updating the weights, as seen in boxes 2, 4, 5 and 7.

Orange Star → True Gradient for each Time Stamp. We need this value to update our Oracles (the Green Diamonds that you saw in the first image) at each time stamp. The math equation for this value is very long, so I will show it as a code implementation in the next section.

True Gradient for Time Box 1 / 2: Orange Star A / B

Red Box → Forward Feed Process for Time Box 2

Blue Box → Back Propagation for Wrec and Wx at Time Box 2 using Synthetic Gradient 2 (w_sg_2).
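The Blue Box step can be sketched roughly as follows. This is a hedged illustration, not the notebook code: it assumes a tanh hidden layer, a purely linear oracle (`h_t @ w_sg`), plain gradient descent, and tiny made-up shapes. The function name `decoupled_step` is hypothetical.

```python
import numpy as np

def decoupled_step(x_t, h_prev, w_x, w_rec, w_sg, lr=0.01):
    """Forward one time box, then update w_x and w_rec from the oracle's guess."""
    h_t = np.tanh(x_t @ w_x + h_prev @ w_rec)       # forward feed
    synthetic_grad = h_t @ w_sg                     # oracle's estimate of dLoss/dh_t
    delta = synthetic_grad * (1.0 - h_t ** 2)       # back through tanh
    grad_h_prev = delta @ w_rec.T                   # gradient handed to the earlier time box
    new_w_x = w_x - lr * (x_t.T @ delta)
    new_w_rec = w_rec - lr * (h_prev.T @ delta)
    return h_t, new_w_x, new_w_rec, grad_h_prev

rng = np.random.default_rng(2)
x_t = rng.random((4, 8))                            # tiny stand-in input
h_prev = np.tanh(rng.normal(size=(4, 5)))
w_x = rng.normal(scale=0.1, size=(8, 5))
w_rec = rng.normal(scale=0.1, size=(5, 5))
w_sg = rng.normal(scale=0.1, size=(5, 5))

h_t, new_w_x, new_w_rec, grad_h_prev = decoupled_step(x_t, h_prev, w_x, w_rec, w_sg)
print(new_w_x.shape, new_w_rec.shape, grad_h_prev.shape)
```

Notice that no loss value appears anywhere in this function. That is the whole point of decoupling: the weights move as soon as the oracle gives its guess.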

Orange Box → Updating Synthetic Gradient 1 (w_sg_1) with the True Gradient obtained from Time Box 2.

And for updating our Synthetic Gradient 2 (w_sg_2), we use the exact same concept, but with the fully connected layer.
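Updating an oracle boils down to a small regression problem: push its guess toward the true gradient once that gradient is available. The sketch below assumes a linear oracle and a squared-error objective, which may differ from what the notebook does; `update_oracle` and all shapes are made up for illustration.

```python
import numpy as np

def update_oracle(h_t, w_sg, true_grad, lr=0.05):
    """One gradient-descent step on 0.5 * ||h_t @ w_sg - true_grad||^2."""
    predicted = h_t @ w_sg                          # oracle's current guess
    error = predicted - true_grad
    loss = 0.5 * np.sum(error ** 2)
    new_w_sg = w_sg - lr * (h_t.T @ error)
    return new_w_sg, loss

rng = np.random.default_rng(3)
h_t = np.tanh(rng.normal(size=(4, 3)))              # stand-in hidden state
true_grad = rng.normal(size=(4, 3))                 # stand-in "true" gradient
w_sg = np.zeros((3, 3))

losses = []
for _ in range(50):
    w_sg, loss = update_oracle(h_t, w_sg, true_grad)
    losses.append(loss)

print(losses[0], losses[-1])
```

In the real network the hidden state and the true gradient change at every iteration, so the oracle is always chasing a moving target. That may be part of why training is so jumpy.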

As seen above, very similar process but with Gradient from Time Box 3.

Training and Results

I did multiple training runs, since training was VERY inconsistent. Even with the same np.random.seed() value, sometimes the network would be able to converge very well, as seen in the left image, or fly off, as seen in the right image.

One very interesting case was when the cost spiked and then came somewhat back down. However, overall the model did well on the test set, a classification of 20 images.

With several tries, 100% accuracy was also possible.

Interactive Code

I moved to Google Colab for interactive code! So you will need a Google account to view the code. Also, you can’t run read-only scripts in Google Colab, so make a copy in your playground. Finally, I will never ask for permission to access your files on Google Drive, just FYI. Happy Coding!

To access the code, please click here.

Final Words

I am constantly amazed by the fact that we can still train a NN without following the strict Back Propagation rule. I also made a post about ‘somewhat’ training a NN with random noise distributions.

If any errors are found, please email me at

Meanwhile, follow me on my Twitter here, and visit my website or my YouTube channel for more content. I also did a comparison of Decoupled Neural Networks here if you are interested.


  1. Jaderberg, M., Czarnecki, W. M., Osindero, S., Vinyals, O., Graves, A., & Kavukcuoglu, K. (2016). Decoupled neural interfaces using synthetic gradients. arXiv preprint arXiv:1608.05343.
  2. Deep Learning without Backpropagation. (n.d.). Retrieved February 03, 2018, from
  3. Seo, J. D. (2017, December 24). Only Numpy: Deriving Forward feed and Back Propagation in Synthetic Gradient (Decoupled Neural… Retrieved February 03, 2018, from
  4. Seo, J. D. (2018, February 01). Only Numpy: Noise Training — Training a Neural Network without Back Propagation with Interactive… Retrieved February 03, 2018, from
