ChiPy Mentorship Blog Part 3

(aka The Best ChiPy Mentorship Spring 2017 Blog Post Ever — Part 3!)

Welcome to the last official blog post for the Spring 2017 ChiPy mentorship program. My project is not yet completed, but I am happy with what I have learned so far. When I started this project, I knew nothing about computer vision and nothing about cloud computing. I had a minimal understanding of machine learning. I certainly didn't expect to submit kernels for a Kaggle competition. Since March, I've gained knowledge of and exposure to all of these things, and I feel much stronger and more confident in my ability to figure things out on my own.

I am extremely grateful to my mentor Nolan Finch. He guided me through some rough technical patches and encouraged me to participate in a Kaggle competition. I know so much more as a result of his attention and the support of the other friends I've made since joining ChiPy.


Last time on The Best ChiPy Mentorship Spring 2017 Blog Post Ever …

We saw how to remove unwanted regions from an image file using filters and how to extract (partial) images of sea lions using blob detection.
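For anyone who missed that post, the snippet below is a rough sketch of the general shape of the blob-detection step using scikit-image. The file path and the detector parameters are placeholders for illustration, not the exact values from my project:

from skimage import io
from skimage.color import rgb2gray
from skimage.feature import blob_log

# Load one aerial image and convert it to grayscale for blob detection.
image = rgb2gray(io.imread('train/42.jpg'))  # placeholder path

# Each row of blobs is (row, col, sigma); the blob radius is roughly sigma * sqrt(2).
blobs = blob_log(image, max_sigma=30, num_sigma=10, threshold=0.1)

for row, col, sigma in blobs:
    print('possible sea lion near ({}, {})'.format(int(row), int(col)))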

Today, we’ll look at Google’s cloud environment and how to set up a neural network using Keras.


Happy Little Cloud

Last time, I was confronted with the question of how to deal with a data set that's 96 GB. I only have about 200 GB of free disk space on my machine, so I really don't want to give up almost half of it for a single data set. Also, just for fun, I started downloading it, and my machine informed me that it would take over a day to finish. Finally, even if I did download the files, any program I wrote and ran against them locally would probably take far too long to finish.

The solution is not to work locally (i.e. on your own machine), but to work in a cloud environment. There are several choices for cloud computing: AWS, Azure, etc. I chose Google Cloud Platform because my mentor Nolan pointed me toward the website for Stanford's CS231n course, which has a detailed tutorial on how to set up an environment for cloud computing. It's very helpful and takes around an hour to complete.

One caution is necessary: cloud computing costs money. One reason I chose Google is that they offer a $300 credit to use within one year. When you use a Google Compute Engine (GCE) instance, you are charged for the resources you've attached to it for as long as it is running. It is important to remember to turn your instance off when you are done; otherwise you'll continue to be charged, and before you know it you'll have run out of credits and will have to pay out of pocket for your usage.

Once I set up my GCE instance, I was able to install all of the libraries I needed using the pip command. Transferring files to the instance was almost painless. In my first attempt, I used wget as follows:

wget https://www.kaggle.com/c/noaa-fisheries-steller-sea-lion-population-count/dataKaggleNOAASeaLions.7z

where the above url leads to the data set I need. When I hit Enter, I didn't have to wait at all. I checked the directory on my instance and there was a file named KaggleNOAASeaLions.7z. That seemed much too fast a download for such a large file, so I checked the size: it was only 15 KB. When I opened the file, it contained the contents of the download page, not the data set itself. After some searching, I realized that only users registered on Kaggle are allowed to download competition files, which meant I needed to provide my login information as well. So I tried:

wget --user=USERNAME --password='PASSWORD' https://www.kaggle.com/c/noaa-fisheries-steller-sea-lion-population-count/dataKaggleNOAASeaLions.7z

where USERNAME and PASSWORD should be replaced by the appropriate information. This time I had to wait, so I was confident it was working. After about 45 minutes, though, the process terminated with an obscure error message. I found a partially downloaded file in my directory, and after a bit of thought realized that my instance's disk could only hold 10 GB. I needed to increase the disk size to accommodate such a large file. To play it safe, I increased it to 200 GB. I could have gone up to 1 TB, but you are charged extra for any resources you add to your instance (whether you use them or not), so you want to make sure you only add as much as you need.

Now when I entered the above command, I waited for an hour and got no errors. I assumed it would take a while to complete the download so I went out. When I got back (about 4 hours later) the download was complete and I finally had my data set!


Next Steps

Now that I (almost) have the processing power I need, I want to build a neural network that will search the image files for sea lions, classify them, and keep a count. To do this I am using Keras on top of TensorFlow, an open source library that handles the kind of numerically intensive computation a neural network needs.
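To give a sense of what I mean, here is a minimal Keras sketch of the kind of network I have in mind. The 64x64 patch size and the layer sizes are placeholders rather than a finished architecture; the five output classes correspond to the competition's sea lion categories (adult males, subadult males, adult females, juveniles, and pups):

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

# A small convolutional network that classifies image patches into five sea lion categories.
model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(64, 64, 3)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(5, activation='softmax'))  # one output per sea lion class

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Once the training patches and labels are prepared:
# model.fit(train_patches, train_labels, epochs=10, batch_size=32)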

In addition, I need to add GPUs to my cloud instance. As it stands, my instance is not powerful enough to run my code.

Once I've cleared these hurdles, I will need to refine my algorithms. At Nolan's suggestion, I'll be looking at the Exif metadata in the image files to see if I can use it to tweak the algorithms.
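I haven't dug into the metadata yet, but pulling Exif tags out of an image is straightforward with Pillow. Here is a quick sketch, assuming the tags survive in the training images (the file path is a placeholder, and not every tag will be present in every photo):

from PIL import Image
from PIL.ExifTags import TAGS

img = Image.open('train/42.jpg')  # placeholder path
raw_exif = img._getexif() or {}

# Map numeric Exif tag ids to readable names like 'DateTimeOriginal' or 'FocalLength'.
exif = {TAGS.get(tag_id, tag_id): value for tag_id, value in raw_exif.items()}

for name in ('DateTimeOriginal', 'FocalLength', 'Model'):
    print(name, exif.get(name))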


Conclusion

This mentorship has been an incredibly positive experience. Even though the mentorship is almost over, the learning is ongoing. I hope I will be able to serve as a mentor in a future round of the program.

Last but never least …

WATCH TWIN PEAKS!