My M2M : August - Week 4

This month’s challenge is to build a project for my local science fair. The project must use a neural network or genetic algorithm and be at the standard necessary to win the competition.

Last week’s post ended with the promise of some benchmarks. Unfortunately, my 2011 MacBook Pro has no GPU, so training was taking a very long time.

As in 2-weeks-to-train-the-model amount of time.

So I decided to wait until I got back to my normal workstation, with its GTX 1070 GPU, to run the models again in a more modest timeframe. This workstation also had far more spare space on its SSD, so I downloaded the full dataset of clips that appear more than 5 times in the dataset. I had to rewrite all of my image pre-processing scripts (splitting into test and train sets, resizing, grayscaling, etc.) for PowerShell. I actually learned some pretty handy tricks in PowerShell, which have made me love it almost as much as bash.

Once I finished all these tasks, I set to work crafting my algorithm. Luckily, Keras is the same on Windows as on Mac, so there was no need to rewrite my Convolutional Neural Network. After running this algorithm on my GPU, I was somewhat disappointed with the results: an accuracy of 1.6% after 30 epochs.
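For reference, a frame-level CNN classifier in Keras looks something like the sketch below. The layer sizes, frame shape, and class count here are illustrative assumptions, not the exact network from the project:

```python
# Minimal sketch of a per-frame CNN classifier in Keras.
# FRAME_SHAPE and NUM_CLASSES are assumptions for illustration.
from tensorflow.keras import layers, models

NUM_CLASSES = 100          # assumed number of sign classes
FRAME_SHAPE = (64, 64, 1)  # assumed grayscale frame size

def build_cnn():
    model = models.Sequential([
        layers.Input(shape=FRAME_SHAPE),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Because a model like this classifies each frame in isolation, it has no way to use motion between frames, which is exactly the limitation discussed below.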

Having said that, the increase in speed from using a GPU was so great that I had time to change my algorithm and get better results. It also finally gave me a figure to benchmark the rest of my results against.

My first theory was one I had previously explored in a past M2M post. A Convolutional Neural Network only takes advantage of spatial features: it looks at each frame individually and makes a judgement based on those pixels alone. It can’t relate one frame to the frames that came before, as humans do when recognizing sign language and most other visual input. Relationships between the current frame and past frames are known as temporal features.

To take advantage of temporal features, I decided to implement an LSTM (Long Short-Term Memory) neural network, which is a type of Recurrent Neural Network. But I also wanted to have my cake and eat it too, by keeping the spatial features as well. As such, my final network architecture was a Convolutional Neural Network that passes its final layer into an LSTM network. This is called an LRCN (Long-term Recurrent Convolutional Network).

While writing my LRCN network, I had some trouble passing sequences to my neural network. After asking on Stack Overflow and a small bit of tearing my hair out, I hit upon a promising solution: every video is converted to exactly 50 frames, which are then fed to the network as a single sequence.
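One simple way to turn variable-length clips into fixed 50-frame sequences is to sample evenly spaced frames from long clips and pad short ones. This is a sketch of that idea, not necessarily the exact approach from the Stack Overflow answer:

```python
# Normalize a clip (a list of frames) to exactly SEQ_LEN frames:
# long clips are downsampled with evenly spaced indices,
# short clips are padded by repeating the final frame.
import numpy as np

SEQ_LEN = 50

def to_fixed_length(frames, seq_len=SEQ_LEN):
    n = len(frames)
    if n >= seq_len:
        # pick seq_len evenly spaced frame indices across the clip
        idx = np.linspace(0, n - 1, seq_len).astype(int)
        return [frames[i] for i in idx]
    # pad short clips by repeating the last frame
    return list(frames) + [frames[-1]] * (seq_len - n)
```

Every clip then yields a sequence of the same length, which is what a batched LSTM input expects.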

For the rest of the month, I’m going to experiment with LSTMs and try to get the best possible result. I look forward to updating you with this month’s last post tomorrow.