Overcoming Challenges Working with Big Data — ChiPy blog 3
(This is part 3 of a 3-part series about my experiences in the ChiPy mentorship program.)
When I started working with the TSA Kaggle dataset, I knew there would be challenges (see this post for more info on the project). The project itself is challenging: 3D image recognition, a 3 TB total dataset, and unusual file formats that can’t be opened by normal image visualization programs. Some challenges I could anticipate before getting started. This is the first project I’ve ever done in Python, so I knew there would be a steep learning curve just to write basic Python code. The individual files are huge: 10 MB for even the lowest-resolution images (the highest-resolution images are 2.27 GB apiece), which means long run times. The scope of the project is immense, with 19,499 potential threat labels in the entire dataset. I don’t have a supercomputer at home (although that is my latest nerd fantasy), so I am mainly constrained to a normal desktop computer. It’s possible to run all the models in the cloud, but fully training a model there could get very expensive.
So I decided to start small with a subset of the data (just 120 samples). Soon, I started running into challenges I hadn’t anticipated. I decided to use Keras with a TensorFlow backend to develop the models. When I started downloading the packages, I learned from the documentation that TensorFlow currently doesn’t have GPU support on Mac, so I would have to run everything on the CPU, which would be a lot slower. My attempted solution was to build TensorFlow from source using an unsupported version on GitHub that is compatible with the Mac graphics card. Everything broke. The mistake I made was building from source in my stable root environment, which altered a lot of backend dependencies. Luckily I was able to override everything and fix it, but I learned an important lesson: always create a virtual environment before trying an experimental install, so that your stable root environment is preserved.
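As a rough illustration of that lesson, here is a minimal sketch that creates a throwaway environment with Python’s standard-library venv module. This is an assumption on my part about tooling: the same idea applies to conda environments, and the directory name tf-experiment is just a placeholder.

```python
import tempfile
import venv
from pathlib import Path


def make_sandbox(env_dir):
    """Create an isolated environment so experiments can't touch the main install."""
    # with_pip=False keeps this sketch fast and offline; use with_pip=True
    # for an environment you actually plan to install packages into.
    venv.create(env_dir, with_pip=False)
    return Path(env_dir)


# Hypothetical location: a temp directory, used here only for the demo.
sandbox = make_sandbox(Path(tempfile.mkdtemp()) / "tf-experiment")
print("Created environment at", sandbox)
print("Config file present:", (sandbox / "pyvenv.cfg").exists())
```

Anything built or installed inside an environment like this can simply be deleted if it breaks, without affecting the packages the rest of your projects depend on.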
Once I started developing my first model, the next challenge arrived. I was running everything in a Jupyter notebook, which worked great for data preprocessing and testing things out. When I started doing longer training runs in the notebook, though, it kept disconnecting from the server and generally ran slowly. I alleviated this by copying the model code into a normal .py file and running it as a script. Even when my computer went to sleep, the script would either keep running (slowly) or pause and then resume when I woke the computer up again.
The next challenge arose when running scikit-learn’s GridSearchCV. GridSearchCV takes all possible combinations of the designated parameters and runs through each resulting model to find the combination with the best accuracy. This is great because you can designate some parameters and let your model tune itself while you do something else in the meantime. It worked great in a test run I did on a straightforward business problem, but my computer ran out of memory when I tried it on my subset of 120 samples from the TSA dataset. All of the models appeared to be held in memory so that it could determine the best parameter combination and best accuracy. This problem is likely related to the limited computing power of my desktop and would be less of an issue on a more powerful machine. An alternative is to take an iterative approach, where each model is trained and tested in sequence but not held in memory.
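The iterative alternative might look something like the sketch below. It is not GridSearchCV itself, only a hand-rolled loop that keeps the best score and parameters and lets each model go out of scope. The function train_and_score is a hypothetical stand-in (with a toy formula) for building, training, and validating a real Keras model, and the parameter names are made up for illustration.

```python
import itertools


def iter_param_grid(grid):
    """Yield one dict per parameter combination, lazily."""
    names = sorted(grid)
    for values in itertools.product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


def train_and_score(params):
    """Hypothetical stand-in for training and validating one model.

    A real version would build a model from `params`, fit it, and return
    validation accuracy. This toy formula only makes the sketch runnable.
    """
    lr_penalty = abs(params["learning_rate"] - 0.01)
    batch_penalty = 0.001 * abs(params["batch_size"] - 32)
    return 1.0 - lr_penalty - batch_penalty


def sequential_search(grid):
    """Try each combination in turn, keeping only the best score and params."""
    best_score, best_params = float("-inf"), None
    for params in iter_param_grid(grid):
        # The model exists only inside this call, so it can be freed afterwards.
        score = train_and_score(params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score


grid = {"learning_rate": [0.001, 0.01, 0.1], "batch_size": [16, 32]}
best_params, best_score = sequential_search(grid)
print(best_params, best_score)
```

The trade-off is that you lose GridSearchCV’s built-in cross-validation and have to do your own bookkeeping, but only one model needs to fit in memory at a time.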
The next major challenges I faced were in scaling up. I decided to use a subset of data from body zone 6 below.
I scaled up from my original subset (120 of 1,148 total zone 6 images) to a larger subset of zone 6 (573 of 1,148 images). Aside from the challenges outlined above, things worked great. I started to run into problems when I scaled up to the entire zone 6 (1,148 images). All the images were read in and stacked into one giant x array. This had worked fine in every earlier run, but now it was too much: my code terminated before the x array was fully stacked. (Again, I anticipate this would not be a problem with greater compute power.) An alternative would be to move to an online-learning (partial-fit) model that iterates through the images one at a time. It would take longer, but use less memory.
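To make the online-learning idea concrete, here is a toy sketch in plain Python. It is not the model I used: stream_samples is a hypothetical stand-in for reading one scan at a time from disk (it generates random feature vectors with a made-up labeling rule), and the classifier is a tiny logistic regression updated one sample at a time.

```python
import math
import random


def stream_samples(n_samples, n_features, seed=0):
    """Hypothetical stand-in for loading one scan at a time from disk.

    Yields (features, label) pairs one by one, so the full dataset
    never has to sit in memory at once.
    """
    rng = random.Random(seed)
    for _ in range(n_samples):
        x = [rng.uniform(-1.0, 1.0) for _ in range(n_features)]
        y = 1 if x[0] + x[1] > 0 else 0  # toy labeling rule, illustration only
        yield x, y


class OnlineLogisticRegression:
    """Tiny logistic regression trained by stochastic gradient descent."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        return 1.0 / (1.0 + math.exp(-z))

    def partial_fit(self, x, y):
        """Update the weights from a single sample; the sample is not stored."""
        error = self.predict_proba(x) - y
        self.weights = [
            w - self.learning_rate * error * xi
            for w, xi in zip(self.weights, x)
        ]
        self.bias -= self.learning_rate * error


model = OnlineLogisticRegression(n_features=4)
for x, y in stream_samples(n_samples=2000, n_features=4):
    model.partial_fit(x, y)

# Evaluate on a fresh stream generated with a different seed.
held_out = list(stream_samples(500, 4, seed=1))
accuracy = sum(
    (model.predict_proba(x) >= 0.5) == bool(y) for x, y in held_out
) / len(held_out)
print(f"held-out accuracy: {accuracy:.2f}")
```

The key point is that memory use depends on the size of one sample, not on the size of the dataset. Several scikit-learn estimators expose a partial_fit method along these lines, and Keras models can be trained from a generator in a similar spirit.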
Connecting to Google Cloud Platform was another challenge and took some time. I was learning a lot of the basics of programming and Python while trying to connect, and I also found the official documentation a little confusing. Someone pointed me to a separate tutorial for connecting to Google Cloud, which was a lot more understandable for me. I learned that a third-party tutorial is sometimes easier to follow than the official documentation. Another thing to consider is that Google Cloud free credits are limited. The rates charged depend on both the compute power (overall speed) and the region in which you choose to run the code. A good strategy is to estimate the cost of a run before you start it. My plan is to save all my free credits and use them in 1–2 runs in the cloud.
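A back-of-the-envelope estimate could be sketched like this. Every number here is a placeholder I made up for illustration (the timing of the small run, the hourly rate, and the credit balance), and the extrapolation assumes run time grows linearly with the number of images, which may not hold in practice. Check the current pricing for your machine type and region before relying on anything like this.

```python
def extrapolate_hours(minutes_on_subset, subset_size, full_size):
    """Scale a measured run time up to the full dataset (assumes linear scaling)."""
    minutes_per_sample = minutes_on_subset / subset_size
    return minutes_per_sample * full_size / 60.0


def estimate_run_cost(hours, hourly_rate_usd, n_machines=1):
    """Cost of one run: hours x hourly rate x number of machines."""
    return hours * hourly_rate_usd * n_machines


def runs_affordable(credit_usd, cost_per_run_usd):
    """How many complete runs fit into the available credit."""
    return int(credit_usd // cost_per_run_usd)


# Placeholder inputs, not real measurements or real prices.
estimated_hours = extrapolate_hours(
    minutes_on_subset=30, subset_size=120, full_size=1148
)
cost_per_run = estimate_run_cost(estimated_hours, hourly_rate_usd=1.50)
n_runs = runs_affordable(credit_usd=300, cost_per_run_usd=cost_per_run)

print(f"Estimated run time: {estimated_hours:.1f} hours")
print(f"Estimated cost per run: ${cost_per_run:.2f}")
print(f"Full runs covered by credits: {n_runs}")
```

Timing a small run locally and plugging in the real numbers gives at least a rough idea of whether a planned cloud run fits within the free credits.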
Although the challenges I encountered could definitely be frustrating, they had a huge impact on my learning. By running into big data challenges even on a small sample size, I learned a lot of lessons I wouldn’t have learned on a more straightforward dataset. If you are just getting started with deep learning, don’t be afraid to jump right into a challenging project. It could be your best learning experience!