Neural networks that grow

Shamoon Siddiqui
Shallow Thoughts about Deep Learning
4 min read · Dec 30, 2018

Overview

I won’t get into the basics of what a neural network is because there are a TON of resources out there. If you don’t know what a neural network is, I suggest reading the following:

  1. Basics of Neural Network
  2. A Visual and Interactive Guide to the Basics of Neural Networks
  3. Neural Network Simplified

Go ahead. I’ll wait.

.

.

Done? Awesome! Welcome back! Now that you’re familiar with what neural networks are and how they work, you probably have a lot of questions about the various hyperparameters. What are hyperparameters? Things like how many hidden layers there are, how many units are in each hidden layer, the learning rate, mini-batch size, momentum and so on. There are a ton of articles on how to tune these various hyperparameters and a large corpus of “best practices” out there.

The thought occurred to me (and, I’m sure, to countless others): what is the best way to tune these parameters? I thought it might be a good idea if the network started out simple and grew only as complex as needed (but no more than that). So for the purposes of this thought experiment, we’re only considering the number of hidden layers and the number of units per hidden layer.

Error Rate Over Time

Figure 1: Typical error rate over a number of iterations

As neural networks train over a large number of iterations, it’s not uncommon to see a graph that resembles the one shown in Figure 1. There’s usually a large drop-off in error, and then it tapers off gradually. Right around x = 200, there seems to be an inflection point: the error rate changes more and more slowly with each iteration. What if, at that inflection point, the network changed its topology?
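The post treats the inflection point informally, so here is one hedged way to make it concrete in Python: watch the smoothed error curve and fire once the average improvement per iteration drops below a threshold. The window size and threshold below are illustrative placeholders, not values from the post.

```python
import numpy as np

def plateau_reached(errors, window=20, threshold=1e-3):
    """Heuristic 'inflection point' detector: fires when the average
    improvement over the last `window` iterations drops below
    `threshold`. Both values are illustrative, not tuned."""
    if len(errors) < 2 * window:
        return False  # not enough history to judge the slope yet
    recent = float(np.mean(errors[-window:]))
    previous = float(np.mean(errors[-2 * window:-window]))
    return (previous - recent) < threshold
```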

Start Simple

Figure 2: Basic Fully Connected Neural Network

We are assuming we have the following network (sketched in code after the list):

  • 5 inputs
  • 1 hidden layer with 5 units
  • 1 output unit
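
Here is a minimal sketch of that 5-5-1 network. The post doesn’t name a framework, so PyTorch and ReLU activations are my assumptions:

```python
import torch.nn as nn

# 5 inputs -> 1 hidden layer of 5 units -> 1 output unit
model = nn.Sequential(
    nn.Linear(5, 5),  # input layer -> hidden layer 1 (W1)
    nn.ReLU(),
    nn.Linear(5, 1),  # hidden layer 1 -> output
)
```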
Figure 3: New Hidden Layer Added

Once the network has been trained and the inflection point on the error curve is reached, we pause training and add a new hidden layer. We would initialize the weights of that second hidden layer, W2, so that the activations produced by W1 pass straight through to the output layer. Training then resumes, complete with backpropagation.
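One way to realize that pass-through initialization, continuing the PyTorch sketch above, is to splice in a new hidden layer whose weight matrix is the identity and whose bias is zero. Since the previous ReLU’s outputs are non-negative, the new ReLU leaves them unchanged, so the grown network initially computes the same function. This is my reading of the idea, not a prescribed implementation:

```python
import torch
import torch.nn as nn

def add_identity_hidden_layer(model, width=5):
    """Insert a new width x width hidden layer (W2) just before the
    output layer, initialized to the identity with zero bias so the
    existing activations pass straight through to the output."""
    new_layer = nn.Linear(width, width)
    with torch.no_grad():
        new_layer.weight.copy_(torch.eye(width))  # identity weights
        new_layer.bias.zero_()
    layers = list(model.children())
    # splice [new_layer, ReLU] between the last hidden layer and the output
    return nn.Sequential(*layers[:-1], new_layer, nn.ReLU(), layers[-1])
```

One practical detail: the optimizer would need to be re-created (or given a new parameter group) so the added weights actually get trained when training resumes.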

Continue Adding Complexity

Figure 4: A stepped error curve

The idea is that as more and more layers are added, learning accelerates, and we end up with an error curve that looks something like the one shown in Figure 4.

As soon as an inflection point is reached with a given network topology, the topology is changed (made more complex) so that learning can continue more quickly.

The learning will eventually stop accelerating with each layer addition, but I think it’s an interesting idea nonetheless. Some sort of stopping condition would need to be specified so that the network doesn’t grow to be pointlessly huge.
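Putting the pieces together, the whole procedure might look like the loop below, reusing the plateau_reached and add_identity_hidden_layer sketches from above. Here train_step is a hypothetical callback that runs one training iteration and returns the current error, and max_layers and min_gain stand in for the stopping condition:

```python
def grow_and_train(model, train_step, max_layers=5, min_gain=1e-3):
    """Train until the error curve plateaus, grow the network, repeat.
    Stops when a newly trained topology fails to improve the best error
    by at least `min_gain`, or after `max_layers` additions."""
    best_error, layers_added = float("inf"), 0
    while True:
        errors = []
        # train the current topology until the plateau detector fires
        while not plateau_reached(errors):
            errors.append(train_step(model))
        if best_error - min(errors) < min_gain or layers_added >= max_layers:
            break  # growth stopped paying off, or the network is big enough
        best_error = min(errors)
        model = add_identity_hidden_layer(model)
        layers_added += 1
    return model
```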

Benefits

To my naive and simple mind, it would seem that training happens a lot quicker on smaller networks. Deep learning works because there’s considerable complexity in the many layers. As of 2016, one of the deepest well-known networks, ResNet-152, had 152 layers. I don’t know what the input/output size was, but I’m sure the computational complexity was intense. With the sort of dynamic topological changes that I’m proposing, I think training would have finished faster, and perhaps we’d see that the incremental benefit of each additional layer diminishes rapidly as the network grows.

Additional Optimizations

Currently, we’re only considering adding hidden layers at each inflection point. Other things we could do are:

  • Change the learning rate (although this is already done with learning rate decay)
  • Add units to existing hidden layers (a sketch of this follows the list)
  • Change activation functions, either for an entire layer or for individual neurons
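
For the add-more-units option, here is a sketch in the same spirit as the identity-initialized layer: new hidden units get random incoming weights but zero outgoing weights, so the network’s output is initially unchanged. It assumes the simple Linear → ReLU → Linear model from the earlier sketch:

```python
import torch
import torch.nn as nn

def widen_hidden_layer(model, extra=2):
    """Grow the hidden layer of a Linear -> ReLU -> Linear model by
    `extra` units without changing the function it computes."""
    old_in, act, old_out = list(model.children())
    new_in = nn.Linear(old_in.in_features, old_in.out_features + extra)
    new_out = nn.Linear(old_out.in_features + extra, old_out.out_features)
    with torch.no_grad():
        # keep the existing units' weights exactly as they were
        new_in.weight[:old_in.out_features].copy_(old_in.weight)
        new_in.bias[:old_in.out_features].copy_(old_in.bias)
        new_out.weight[:, :old_out.in_features].copy_(old_out.weight)
        new_out.bias.copy_(old_out.bias)
        # new units: random incoming weights (default init), zero
        # outgoing weights, so they don't affect the output yet
        new_out.weight[:, old_out.in_features:].zero_()
    return nn.Sequential(new_in, act, new_out)
```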

Other Triggers for Topological Changes

We’re using the error curve as our trigger for when to change the topology, but other triggers could include:

  • Bias
  • Variance
  • Vanishing Gradient
  • Exploding Gradient
  • Something else?

Conclusion

This is all a thought experiment for now. One day, I may implement this and see what happens. In the meantime, if anyone has any insight, thoughts or research, please share!

To read more about deep learning, please visit my publication, Shamoon Siddiqui’s Shallow Thoughts About Deep Learning.
