Optimising Neural Nets With Biomimicry
This is part two in a series on algorithms inspired by nature. See part one, where we look at the implications that arise when we select an algorithm to solve a target problem. The next two parts, part three and part four, can be found at their links.
As mentioned in the first post, I'll be taking the topics learnt in the Natural Computing course taught at Goldsmiths and applying them to (or against) neural networks where possible, to keep things interesting and different from the other posts. Towards the end of this post, I will weave the first post into this one by reviewing how the No Free Lunch theorem bears on the suitability of Dispersive Flies Optimisation for finding network weights.
Dispersive Flies Optimisation
Nature is an integral source of inspiration for optimisation and search algorithms. Dispersive Flies Optimisation, or DFO for short, is inspired by the swarming behaviour of flies over food sources.
A swarm of flies may comprise anything from a single fly to thousands, or even millions, depending on the species. When disturbed, swarms of flies disperse and then regather, and this formation and weakening of swarms summarises the behaviour of the algorithm. The position vector of each member of the swarm is defined as follows:
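The original equation image has not survived here; as a reconstruction from the description that follows (so treat the exact notation as mine rather than the original's), each fly's position at time step 't' is a D-dimensional vector:

```latex
\mathbf{x}_i^{\,t} = \left[ x_{i1}^{t},\; x_{i2}^{t},\; \ldots,\; x_{iD}^{t} \right],
\qquad i \in \{1, 2, \ldots, NP\}
```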
The first thing to note is that the swarm is essentially a matrix with the shape 'NP' by 'D'. In the above equation, 't' is the current time step, 'D' is the dimensionality of the problem space and 'i' indexes the member of the swarm's population, which has a total size of 'NP'. When 't' is 0 (i.e. the first iteration), each fly is initialised as follows:
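The initialisation equation is likewise missing; reconstructed from the prose, with x_min,d and x_max,d the lower and upper bounds of the d'th dimension:

```latex
x_{id}^{0} = x_{\min,d} + r \left( x_{\max,d} - x_{\min,d} \right),
\qquad r \sim U(0, 1)
```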
The initialisation is very simple: assign the position a random location across all dimensions. To be more specific, for every dimension 'd' of the i'th fly's position, assign that element the lower bound of the d'th dimension plus a random number 'r', drawn from a uniform distribution, scaled by the difference between the upper and lower bounds of that dimension.
Each iteration, the fitness of each fly in the swarm is computed, and the swarm's best fly is stored. As a deviation from the original algorithm, I found it prudent to also store the best solution over all iterations. Once the iteration's best fly has been found, each fly's new position is calculated:
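The update equation image is also absent; reconstructed from the explanation that follows, with i_nb the fly's best nearest neighbour, sb the iteration's best fly, 'dt' the disturbance threshold, and r, u drawn fresh from U(0, 1) for each dimension:

```latex
x_{id}^{t+1} =
\begin{cases}
x_{\min,d} + r \left( x_{\max,d} - x_{\min,d} \right) & \text{if } r < dt \\[4pt]
x_{i_{nb}d}^{t} + u \left( x_{sb\,d}^{t} - x_{id}^{t} \right) & \text{otherwise}
\end{cases}
```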
There are a few things going on here, but it's not that complex. Firstly, if a random number drawn from a uniform real distribution is less than 'dt', our disturbance threshold, the i'th fly's d'th dimension is dispersed and reinitialised to a random position in the search space. This stochasticity is essential to prevent the flies from prematurely converging on local minima. If the random number is not within the disturbance threshold, the i'th fly's position is set to that of its best nearest neighbour in the array of flies (left or right), and then moved a random amount towards the iteration's best-performing fly.
One of the really fascinating things about this algorithm and its swarm intelligence comrades is the emergent, rich and deeply complex behaviour that arises from the simple rules that each individual fly must follow.
Shut Up And Show Me The Code
Below is a toy example of this algorithm in Python, optimising the flies until they find an acceptable solution for the target vector. It's commented fairly well so I won't go into it too much, but at a high level the flies must optimise towards the NumPy array called 'target_solution', and they have reached the global minimum (the optimal position) when a fly's position vector is at zero Euclidean distance from the target solution.
What About Something Harder?
That's obviously a fairly straightforward toy problem, so what about something a little more complex?
Instead of using standard Reinforcement Learning methods, I thought an interesting way to apply DFO would be to optimise a neural network's trainable weights to complete the cartpole problem. The cartpole problem is a standard task in the Reinforcement Learning literature, and the OpenAI gym library provides a Python cartpole environment for our neural net to interact with. The problem is described on their website as follows:
A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the centre.
Above is the network in action, running for 200 timesteps, and below is the code to reproduce it. Once the network has successfully balanced the pole for 200 steps, we consider the problem solved. I should add that this is not the place to learn about neural networks or the computational library TensorFlow; I will make no effort to explain the network itself or the TensorFlow-specific code. A good place to learn more about neural nets is here, and you can start with TensorFlow at their official docs. Have a look through, and then I'll break down the code in the last section of this post:
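The notebook code is not embedded here, but the core idea can be sketched without the TensorFlow and gym plumbing. Everything below (the layer sizes, and names like 'unflatten' and 'act') is my own illustrative stand-in, not the original code: each fly is a flat vector of weights that gets reshaped into the parameters of a small policy network.

```python
import numpy as np

# CartPole observations have 4 values; the action is one of 2 (left/right)
OBS, HIDDEN, ACTIONS = 4, 16, 2

# shapes of the trainable parameters of a tiny one-hidden-layer policy net
shapes = [(OBS, HIDDEN), (HIDDEN,), (HIDDEN, ACTIONS), (ACTIONS,)]
D = sum(int(np.prod(s)) for s in shapes)   # dimensionality of each fly

def unflatten(fly):
    """Slice a fly's flat position vector back into layer parameters."""
    params, offset = [], 0
    for shape in shapes:
        size = int(np.prod(shape))
        params.append(fly[offset:offset + size].reshape(shape))
        offset += size
    return params

def act(fly, observation):
    """Forward pass: treat the fly's position as the policy's weights."""
    w1, b1, w2, b2 = unflatten(fly)
    hidden = np.tanh(observation @ w1 + b1)
    logits = hidden @ w2 + b2
    return int(np.argmax(logits))   # 0 = push left, 1 = push right

rng = np.random.default_rng(1)
fly = rng.normal(0.0, 0.5, D)
print(f"weights per fly: {D}, action: {act(fly, np.zeros(OBS))}")
```

The fitness of a fly would then be the total reward collected by running 'act' against the gym environment for an episode.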
So WTF Are These Changes?
When I was hacking around with DFO to try to improve the results it yielded, I made a number of changes to the main DFO algorithm. Again, I won't cover the boilerplate code required for plotting or representing neural nets! In summary, the most notable changes made to DFO were:
- Methods of updating (seven in total).
- Elitism of flies.
- Decaying disturbance threshold.
- Multithreaded computation for quicker fitness estimation.
- Eternal god fly.
Updating
There were seven updating methods made available in the above code, and combined with the other modifications there were quite a few combinations to try out. I'll aim to cover most of them, but you can read the rest in the code from the above notebook.
The original method of updating was slightly changed: the flies' positions are randomly reinitialised using a Gaussian distribution, with a mean and deviation derived from the network's original weight settings, rather than a uniform distribution over the search bounds. Anecdotally, this seemed to deliver better performance.
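A sketch of that tweak (names and values are mine, not the notebook's): the disturbed dimension is drawn from a Gaussian fitted to the network's initial weights rather than from a uniform distribution over the bounds.

```python
import numpy as np

rng = np.random.default_rng(0)

# statistics derived from the network's original (initial) weight values
initial_weights = rng.normal(0.0, 0.05, 1000)
mu, sigma = initial_weights.mean(), initial_weights.std()

def reinitialise(mode="gauss"):
    """Draw a fresh value for a disturbed dimension of a fly."""
    if mode == "gauss":
        # centred on the initialisation statistics, not the search bounds
        return rng.normal(mu, sigma)
    return rng.uniform(-10.0, 10.0)   # the plain uniform reset

print(f"gaussian reset: {reinitialise():.4f}, uniform reset: {reinitialise('uniform'):.4f}")
```

The Gaussian reset keeps a disturbed weight in the same range as a freshly initialised network, instead of flinging it anywhere in the search space.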
Another updating method was dubbed the hybrid mode.
The major difference here is that the fly is not assigned its best neighbour's position, but retains its own and moves towards the average of the best neighbour and the global best. In practice this didn't lead to very good results, as the average of the two is not necessarily optimal: there are many configurations of network parameters that could work, which means the problem is multimodal rather than unimodal, and the mean of two optimal weights may not score well at all.
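Hybrid mode can be sketched like this (a hypothetical reconstruction, not the notebook's code):

```python
import numpy as np

rng = np.random.default_rng(0)

def hybrid_update(fly, best_neighbour, global_best):
    """Keep the fly's own position and move it towards the average
    of its best neighbour and the swarm's global best."""
    attractor = (best_neighbour + global_best) / 2.0
    return fly + rng.random() * (attractor - fly)

new_position = hybrid_update(np.zeros(3), np.ones(3), np.full(3, 3.0))
print(new_position)   # somewhere between [0, 0, 0] and [2, 2, 2]
```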
The next step up from the hybrid mode was the n_fittest mode.
This code assigned the fly its best nearest neighbour's position and moved it towards the average of the 'n' fittest flies; in the graph below, 'n' equals 20. It never optimised in time, but most interestingly the population's mean and deviation never really changed after the first iteration.
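A sketch of the n_fittest update, again with illustrative names and a stand-in fitness function:

```python
import numpy as np

rng = np.random.default_rng(0)

def n_fittest_update(flies, scores, i, n=20):
    """Take the best nearest neighbour's position, then move towards
    the mean position of the n fittest flies in the population."""
    NP = len(flies)
    left, right = (i - 1) % NP, (i + 1) % NP
    nb = left if scores[left] < scores[right] else right
    fittest_mean = flies[np.argsort(scores)[:n]].mean(axis=0)
    return flies[nb] + rng.random() * (fittest_mean - flies[i])

flies = rng.uniform(-1.0, 1.0, (100, 5))
scores = np.abs(flies).sum(axis=1)   # stand-in fitness; lower is fitter
new_pos = n_fittest_update(flies, scores, i=7)
print(new_pos.shape)   # prints: (5,)
```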
There were also a couple of random modes, such as random_uniform and random_gauss. These modes simply assigned random values to the flies using the specified distributions. This was generally less successful when searching for larger networks' weights, but had plenty of success in lower-dimensional search spaces.
Elitism
Elitism was introduced to ensure the fittest individuals were left just as they were.
In the code, the variable 'elitism' is an integer setting how many of the fittest population members should be safe from updating. The indices of the 'elitism' fittest flies are then stored in 'n_fittest' so that the program knows not to update them.
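A minimal sketch of that bookkeeping (the fitness values here are a stand-in; lower scores are fitter):

```python
import numpy as np

rng = np.random.default_rng(0)

elitism = 20                        # how many of the fittest flies to protect
flies = rng.uniform(-1.0, 1.0, (1000, 10))
scores = np.abs(flies).sum(axis=1)  # stand-in fitness; lower is fitter

# indices of the 'elitism' fittest flies, safe from updating
n_fittest = np.argsort(scores)[:elitism]

for i in range(len(flies)):
    if i in n_fittest:
        continue    # elite flies keep their positions exactly as they are
    # the normal DFO update for fly i would go here
    pass
```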
Above we can see two run-throughs of the program, one with elitism set to 20 and one set to 0. For both runs the mode was set to the original DFO update method, the population size was 1000, and the disturbance threshold was set to 0.1, which is very high!
The elitist approach took longer to find the best reward, likely because the best-scoring flies were forced to stay at local minima rather than being stochastically perturbed. The run with no elitism quickly moved the flies to the global minimum, but had trouble retaining that position from iteration to iteration, as the best flies were never safe from being updated.
Disturbance Threshold Decay
Another idea I tried was reducing the disturbance threshold by a specified amount each iteration. The rationale was to reduce the number of random position resets as the flies optimised closer to the global minimum. In practice, however, this is just another parameter that is hard to configure without intimately knowing the problem at hand.
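As a sketch, with a multiplicative schedule standing in for whatever exact decrement the notebook used (the post describes subtracting a specified amount each iteration, so an additive 'dt -= step' works just as well):

```python
dt = 0.1        # starting disturbance threshold
decay = 0.995   # multiplicative decay applied every iteration
min_dt = 1e-4   # floor, so some exploration always remains

history = []
for iteration in range(1000):
    history.append(dt)
    # ... one full DFO iteration would run here, using the current dt ...
    dt = max(dt * decay, min_dt)

print(f"dt after 1000 iterations: {dt:.6f}")
```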
Multi-Threading
When you have to wait for a population of 1000 flies to complete many steps of the cartpole simulation, it can get a little boring waiting for the environment to finish. A quick hack was to make the simulation multithreaded, which was managed mostly by the code here. As many threads are created as there are computing cores available, and this led to massive speedups.
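A sketch of that idea using Python's standard library (the real 'evaluate' would run a fly's network through a cartpole episode and return its total reward):

```python
import os
from concurrent.futures import ThreadPoolExecutor

def evaluate(fly_id):
    # stand-in for the expensive part: running one fly's network
    # through a full cartpole episode and returning its total reward
    return fly_id * fly_id

population = list(range(1000))

# create as many worker threads as there are computing cores available
workers = os.cpu_count() or 1
with ThreadPoolExecutor(max_workers=workers) as pool:
    fitnesses = list(pool.map(evaluate, population))

print(fitnesses[:5])
```

One caveat: for pure-Python fitness code, CPython's global interpreter lock limits how much threads can help; a process pool is the usual alternative when the work doesn't release the GIL.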
Eternal God Fly
Okay, so this is a silly name (it's late and I'm getting tired) for the fittest fly of all time during a given run. By keeping a memory of the best fly, we ensure that the best network is always available to test. Note that I use the '≤' operator; this is because we want to keep a newer network that has more 'experience' against environments over a luckier, earlier iteration.
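In sketch form (names mine), with the '≤' comparison written as '<=':

```python
best_reward = float("-inf")
god_fly = None   # the fittest fly of all time during this run

def update_god_fly(fly, reward):
    global best_reward, god_fly
    # '<=' rather than '<': a newer network scoring the same reward has had
    # more 'experience' against environments than a lucky earlier one
    if best_reward <= reward:
        best_reward = reward
        god_fly = fly

update_god_fly("fly_a", 150)
update_god_fly("fly_b", 150)   # ties the score but is newer, so it wins
print(god_fly, best_reward)    # prints: fly_b 150
```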
Thoughts On DFO’s Performance For Neural Weight Updates
DFO suffered a noticeable decrease in performance when the dimensionality became very large. For example, trying to optimise a network with 256 neurons in each layer would be practically impossible in a sensible time. Perhaps if it could run on a GPU, with a similar amount of time granted to it as to an algorithm like backprop or policy gradients with experience replay, it would be a fair contest. However, as per the supplied video, with enough tweaking DFO did eventually manage to produce a working network that solved the problem!
In the No Free Lunch post we looked at the relationship between an algorithm and its search space; it is this relationship that determines the performance of an algorithm on a given problem. So how well does DFO fit its search space of neural network weights?
I would argue that DFO is not well suited to the problem at hand. Because of the constant updates to the weight positions and the non-linear relationships between the weights, it seems a poor way to move towards a global minimum. The most grating issue is the random resetting of fly positions: an essential weight can easily be reset to an entirely random value and cause a massive chain reaction of bad output from the neural network.
That being said, with enough tweaking, it did work, and there are likely many further improvements that would better the performance.
Concluding Remarks
I’d like to wrap up with this great quote from AI heroes Stuart Russell and Peter Norvig:
The quest for “artificial flight” succeeded when the Wright brothers and others stopped imitating birds and started using wind tunnels and learning about aerodynamics.
In short, nature is a useful source of inspiration for novel algorithms; a great lesson to learn when considering the daunting task of devising novel algorithms to solve arbitrary problems. It is also important to remember that we only learnt to fly once we stopped building machines with feathers and flapping wings, so we must be willing to depart from the biology where necessary and work out the relevant maths or computations to obtain good results.
Thanks for reading!