Galaxy Zoo classification with Keras
The full code for this post is available on my GitHub at https://github.com/jameslawlor/kaggle_galaxy_zoo.
I am currently a few weeks into Jeremy Howard and Rachel Thomas’ excellent fast.ai MOOC on practical deep learning. I’ve done a couple of neural network courses in the past, and while I understood the maths and theory, I didn’t have much practical experience with staple tools like TensorFlow and Keras. This course fills that gap perfectly with a “top-down, code-first” approach that is quite refreshing.
So, to get straight down to business: the homework for Lesson 2 of the course was to create a deep learning model for a dataset of our choosing. I chose the Kaggle Galaxy Zoo competition because space is pretty cool. The competition was live about 4 years ago, though, so I’m a bit late to the party! In this post I’ll run through my solution, which achieved an RMSE score of 0.12875 and would have put me in the top half of the leaderboard. Not bad for a couple of hours’ work. There is of course plenty to improve on and experiment with, which I’ll talk about at the end, but I’m happy with the result as I’m now much more familiar and comfortable with Keras. I didn’t find any code or walkthroughs for the Galaxy Zoo competition with Keras on the first few pages of Google, perhaps because the competition is about 4 years old and Keras was only released 2 years ago, so hopefully this will be useful for someone attempting the problem.
There are around 60,000 images in the training set and 80,000 in the test set; each image is a 424x424 colour JPEG. Inspecting the images shows that only the central part of each one is useful, so I decided to crop to 212x212 around the centre and then downsample to half resolution, which reduces the number of parameters the network has to learn.
The code to do this is below:
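Roughly, the idea looks like this. This is a minimal sketch rather than the exact code from the repo: the function name `crop_and_downsample`, the use of NumPy, and the choice of 2x2 block averaging for the downsampling are all assumptions for illustration. It takes "half resolution" to mean half the width and half the height, so a 212x212 crop becomes 106x106.

```python
import numpy as np

def crop_and_downsample(img, crop_size=212, factor=2):
    """Centre-crop an (H, W, C) image array, then shrink it by block averaging."""
    h, w = img.shape[:2]
    top = (h - crop_size) // 2
    left = (w - crop_size) // 2
    cropped = img[top:top + crop_size, left:left + crop_size]

    # Average each factor x factor block of pixels (one simple way to downsample;
    # an assumption, not necessarily what the original code did).
    out = crop_size // factor
    blocks = cropped[:out * factor, :out * factor].reshape(out, factor, out, factor, -1)
    return blocks.mean(axis=(1, 3))

# Stand-in for one decoded 424x424 RGB JPEG with values scaled to [0, 1].
fake_image = np.random.rand(424, 424, 3)
small = crop_and_downsample(fake_image)
print(small.shape)  # (106, 106, 3)
```

In practice you would decode the JPEG with an image library first and pass the resulting array in; a library resize would work just as well as the block averaging used here.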
Another consideration is that we have a huge volume of training and testing data, far too much to load into GPU memory. I coded up some batch generators that sequentially grab a batch of images, run them through the image processing code, and then pass the decoded data to the convolutional neural network (CNN). Here’s an example of the batch generator code:
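As a minimal sketch of the pattern (not the repo's actual generator), the function below walks through a list of file paths one batch at a time and only ever holds a single batch in memory. The names `batch_generator`, `load_image` and `fake_loader` are hypothetical; `load_image` stands in for whatever reads, crops and downsamples one image.

```python
import numpy as np

def batch_generator(paths, load_image, batch_size=32):
    """Yield arrays of shape (n, H, W, C), loading only one batch at a time."""
    for start in range(0, len(paths), batch_size):
        batch_paths = paths[start:start + batch_size]
        yield np.stack([load_image(p) for p in batch_paths])

# Toy usage: a fake loader stands in for "read JPEG, crop, downsample".
def fake_loader(path):
    return np.zeros((106, 106, 3), dtype=np.float32)

paths = ["img_%d.jpg" % i for i in range(70)]
shapes = [batch.shape for batch in batch_generator(paths, fake_loader)]
print(shapes)  # three batches holding 32, 32 and 6 images
```

For training with Keras you would typically wrap the loop in `while True` so the generator never runs dry, and yield the matching labels alongside each image batch.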
I decided to go with a VGG16-like CNN architecture, that is, a stack of convolutional and max pooling layers followed by some large fully connected (FC) blocks and a final 37-class sigmoid activation layer for predicting the galaxy class probabilities. I chose this architecture because VGG16 is known to do very well on image problems, and the consensus online seems to be that it strikes a nice balance between ease of implementation, training time and decent results. There is a way to load a pre-trained VGG16 within Keras with a simple import, which you can then adapt to your problem via fine-tuning, but I opted to do things the hard way by building and training from scratch. So you don’t think I’m completely crazy wasting money on GPU time, my motivation was that I wanted to see how a ‘fresh’ neural net would perform on the problem. I reasoned that the features and classes we want the network to detect and discern in the galaxy dataset are far less varied than those in ImageNet, which the ‘real’ pre-loadable VGG16 was trained on over many weeks and which contains objects ranging from dogs to teeth to battleships. We therefore shouldn’t need to train for long to get good results, as the variance between inputs is not huge. I’ll try to visualise some of the model filters to test this idea and edit this in at some point when I have time. This is a good example of what I have in mind: https://blog.keras.io/how-convolutional-neural-networks-see-the-world.html.
Based on this idea of model simplicity, it’d be interesting to see how my model would work if we removed or reduced some of the layers. I have a hunch that the full VGG16 architecture is probably overkill for this problem, but YOLO, as the age-old saying goes.
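To put a rough number on that hunch, here is a back-of-the-envelope parameter count for a full VGG16-style stack. It is only a sketch under stated assumptions: a 106x106x3 input, 3x3 "same" convolutions, 2x2 pooling, and the standard VGG16 FC width of 4096, which may not match the layer sizes I actually used. `count_params` is a hypothetical helper, not a Keras function.

```python
# Standard VGG16 convolutional layout: numbers are filter counts, "P" is a 2x2 max pool.
VGG16_LAYOUT = [64, 64, "P", 128, 128, "P", 256, 256, 256, "P",
                512, 512, 512, "P", 512, 512, 512, "P"]

def count_params(layout, input_size=106, channels=3, fc_units=(4096, 4096), n_out=37):
    """Count weights and biases for 3x3 'same' convolutions plus dense layers."""
    conv_params, size = 0, input_size
    for layer in layout:
        if layer == "P":
            size //= 2  # each pool halves the width and height (rounding down)
        else:
            conv_params += 3 * 3 * channels * layer + layer
            channels = layer
    dense_params, n_in = 0, size * size * channels
    for units in list(fc_units) + [n_out]:
        dense_params += n_in * units + units
        n_in = units
    return conv_params, dense_params

conv, dense = count_params(VGG16_LAYOUT)
print(conv, dense)  # 14714688 35811365
```

Under these assumptions most of the weights sit in the dense layers rather than the convolutions, so the FC blocks look like the obvious first place to try trimming.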
Here’s the code that implements the architecture:
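The sketch below shows the general shape of such a network. It is not the exact model from the repo: it assumes TensorFlow's bundled Keras, a 106x106x3 input, the standard VGG16 block layout, and an FC width and dropout rate picked for illustration. `build_vgg_like` is a hypothetical name.

```python
from tensorflow import keras

def build_vgg_like(input_shape=(106, 106, 3), n_outputs=37, fc_units=4096):
    """VGG16-style stack: conv blocks with max pooling, two FC layers, sigmoid outputs."""
    layers = keras.layers
    model = keras.Sequential()
    model.add(keras.Input(shape=input_shape))
    # (number of conv layers, number of filters) for each of the five VGG16 blocks.
    for n_convs, filters in [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]:
        for _ in range(n_convs):
            model.add(layers.Conv2D(filters, 3, padding="same", activation="relu"))
        model.add(layers.MaxPooling2D(pool_size=2))
    model.add(layers.Flatten())
    for _ in range(2):
        model.add(layers.Dense(fc_units, activation="relu"))
        model.add(layers.Dropout(0.5))  # dropout rate is an assumption
    # One sigmoid output per target column, so each prediction lies in [0, 1].
    model.add(layers.Dense(n_outputs, activation="sigmoid"))
    return model
```

Calling `build_vgg_like()` returns an uncompiled model; a smaller `fc_units` value is an easy way to try a lighter variant.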
I used the RMSProp optimiser with a learning rate of 1.0e-6, and held out 4000 images from the training set in a separate folder for validation. Here’s how the training looked:
Training took about 90 minutes for 42 epochs (with early stopping) on an Azure NV6 machine with 1/2 of a K80 GPU.
This model achieved an RMSE score of 0.12875, which would have put me in the top half of the leaderboard with only a few hours’ work from starting coding to Kaggle submission. There are a lot of ways the model could be improved, particularly in how it handles the numerical predictions: the targets are not typical probabilities but are weighted between different classes. A straightforward sigmoid for each of the 37 classes is naive and doesn’t capture this condition. I’d also like to experiment with simpler architectures, improved image processing and much more. I may come back to this at some point as I did quite enjoy it, but for now I’m focused on finishing up the fast.ai course.
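For reference, the RMSE quoted above is the root mean squared error taken over every predicted value, i.e. all 37 outputs for every image. A minimal NumPy sketch, where the helper name `rmse` and the toy arrays are made up for illustration:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error over all entries of two equally shaped arrays."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

# Toy example: two "galaxies" with two outputs each, always predicting 0.5.
y_true = [[1.0, 0.0], [0.0, 1.0]]
y_pred = [[0.5, 0.5], [0.5, 0.5]]
print(rmse(y_true, y_pred))  # 0.5
```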
If you’re interested in adapting my methods, the code is up on my GitHub at https://github.com/jameslawlor/kaggle_galaxy_zoo. It’s also worth checking out the writeup by the competition winner, Sander Dieleman, which can be found on their blog over at http://benanne.github.io/2014/04/05/galaxy-zoo.html.