Welcome back y’all! It’s hard to tell from my writing, but that was said in a very thick American accent. All in the spirit of these blog posts, in which I describe my experience attending the Full Stack Deep Learning bootcamp at UC Berkeley, California in the summer of 2018. This is the second instalment. Find a quick overview of the entire series here, and the previous instalment here.
I will describe the most important things from each lecture / lab and include a small takeaways section with the most important things I learned from said lecture or lab. It’s definitely not meant as an exhaustive listing, but is rather focused on what I thought was most impressive and unique to this bootcamp.
Lecture 6 — Training & Debugging
The first day started off great (SPOILER ALERT: it was great throughout the rest of the day too) with this lecture on training and debugging your network. It discussed many of the issues you face as a deep learning practitioner, such as poor model performance, which is often caused by implementation bugs, misconfigured hyper-parameters, poor model-data fit, or badly constructed datasets.
A very powerful workflow was then provided for assessing the issues in your machine learning project. There are many parallels here with regular coding, with which I am personally far more experienced. When coding up an algorithm, for example, you would start off with a few simple instances of the problem, adding complexity incrementally so that you get a better idea of what works and what doesn’t. If you start off with a giant algorithm, or model in this case, it’s much harder to tell where the fault lies when it doesn’t work right from the very start.
A concrete example of this was provided: start with a simple architecture (LeNet for images) and a default set of hyper-parameters, overfit to a single mini-batch of data, and possibly simplify the problem itself.
For the actual implementation of your model three key steps were identified:
- Getting your model to run at all
- Overfitting to a single batch
- Comparing to a known result
The first one helps you find network construction issues such as incorrect tensor shapes, casting issues and out-of-memory errors. The second one helps you uncover issues with your loss function, data labeling, or learning rate configuration. If your model cannot overfit to a single batch, there’s no point moving on to the actual dataset. This was very insightful and useful to me, especially as all the causes were meticulously described.
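To make the second step concrete, here is a minimal sketch of the overfit-a-single-batch check, using a toy linear model trained by plain-Python gradient descent rather than a real network (the model, data and learning rate are my own illustrative choices, not from the lecture):

```python
# Sanity check: a model with enough capacity should drive the loss on one
# fixed batch to (near) zero. If it can't, suspect the loss function, the
# labels, or the learning rate before touching the full dataset.

def overfit_single_batch(xs, ys, lr=0.1, steps=500):
    """Fit y = w*x + b to a single fixed batch by gradient descent on MSE."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of mean-squared error for the toy linear model.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    loss = sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / n
    return w, b, loss

# One tiny batch generated from y = 2x + 1: the check passes only if the
# final loss is essentially zero.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b, loss = overfit_single_batch(xs, ys)
```

If the loss refuses to approach zero even on four points, the lecture’s reasoning applies: something upstream of the data is broken.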
It also discussed some of the most common DL bugs you can run into when even the overfitting doesn’t work, such as incorrect input normalisation or a misconfigured loss function.
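As a small illustration of the first of those bugs, here is a hedged sketch of per-feature input standardisation (zero mean, unit variance); the helper name and data are mine, not from the lecture:

```python
def normalise(batch):
    """Standardise each feature column of a list-of-lists batch."""
    n_features = len(batch[0])
    means = [sum(row[j] for row in batch) / len(batch) for j in range(n_features)]
    stds = []
    for j in range(n_features):
        var = sum((row[j] - means[j]) ** 2 for row in batch) / len(batch)
        std = var ** 0.5
        stds.append(std if std > 0 else 1.0)  # guard against constant features
    return [[(row[j] - means[j]) / stds[j] for j in range(n_features)]
            for row in batch]

# Features on wildly different scales, a classic source of slow or
# unstable training when left unnormalised.
raw = [[0.0, 100.0], [2.0, 300.0], [4.0, 500.0]]
clean = normalise(raw)
```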
Comparing to a known result relates to the baselines we learned about on Day 1. Whether your model is actually doing well depends not on your loss, which will often be quite arbitrary, nor even on your accuracy per se, but on how its accuracy compares to other solutions. By finding other solutions and comparing against them, you get a better sense of how well your model is actually performing. Possible known results were ranked from least to most useful: extremely simple baselines such as rudimentary statistical metrics (means, medians) at the bottom, up to official model implementations (AlexNet, etc.) on a dataset very similar to yours at the top.
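The least useful baseline in that ranking is also the easiest to write down. A sketch, with a purely hypothetical model error thrown in for comparison:

```python
from statistics import median

def mae(preds, ys):
    """Mean absolute error between predictions and targets."""
    return sum(abs(p - y) for p, y in zip(preds, ys)) / len(ys)

def median_baseline_mae(ys):
    """MAE of the dumbest possible model: always predict the median."""
    m = median(ys)
    return mae([m] * len(ys), ys)

targets = [1.0, 2.0, 2.0, 3.0, 10.0]
baseline = median_baseline_mae(targets)

model_error = 1.2  # hypothetical model MAE, for illustration only
beats_baseline = model_error < baseline
```

A model that cannot beat the constant-median predictor is not "79% accurate", it is broken.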
There was a lot more interesting stuff here on diagnosing over- and under-fitting issues using bias-variance decomposition, methods of addressing these issues, and strategies for finding hyper-parameters.
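The train/validation gap at the heart of that decomposition can be condensed into a tiny diagnostic rule; the thresholds below are illustrative assumptions of mine, not numbers from the lecture:

```python
def diagnose(train_err, val_err, fit_threshold=0.1, gap_threshold=0.05):
    """Rough over-/under-fitting diagnosis from train and validation error."""
    if train_err > fit_threshold:
        return "underfitting"   # high bias: can't even fit the training set
    if val_err - train_err > gap_threshold:
        return "overfitting"    # high variance: large train/validation gap
    return "ok"
```

Usage is as simple as `diagnose(0.02, 0.15)`, which flags a big train/validation gap as overfitting.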
Takeaways: Extremely insightful, practical and concrete lecture. Quite possibly my favourite one of the bootcamp. As with many practical guides you encounter, it all makes a lot of sense, but it’s very helpful to have it all written down concisely and to see it backed up with argumentation.
Lecture 7 — Infrastructure
GPUs, GPUs, Glorious GPUs. Ones you can touch, or a whole lot of them that you can’t, tucked away safely somewhere in the cloud. This lecture gave a quick overview of the entire infrastructure space, but focused mostly on one question: should I have my GPUs on premises or in the cloud? Do I want easy scalability and far fewer system management issues, or do I want to invest in my own hardware, reducing costs significantly, assuming my utilisation rate will be high enough?
A comprehensive overview of the players in both the on-prem solution space (NVIDIA, Lambda Labs) and the cloud one (Google, Amazon, Microsoft and Paperspace) was provided.
Cost is often the most important consideration. Some interesting calculations were shown that, perhaps unsurprisingly, showed the on-prem solution to be quite a bit cheaper. The downside is that managing your own resources is non-trivial, so they showed some solutions for this, ranging from simple spreadsheet logging to using Rise.ml (software built on top of Docker) to create and orchestrate containers, optimising resource utilisation.
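Such a back-of-the-envelope comparison is easy to sketch; all prices below are made-up placeholders of mine, not figures from the lecture:

```python
def breakeven_hours(on_prem_cost, cloud_rate_per_hour):
    """Hours of GPU usage after which owning beats renting."""
    return on_prem_cost / cloud_rate_per_hour

# Hypothetical numbers: a $7,000 workstation vs a $3/hour cloud GPU.
hours = breakeven_hours(7000.0, 3.0)
years_at_8h_per_day = hours / (8 * 365)
```

With these placeholder prices the machine pays for itself in well under a year of 8-hour workdays, which matches the lecture’s conclusion that high utilisation favours on-prem.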
A whole bunch of other solutions were discussed for managing distributed training (Horovod) and for managing and evaluating your experiments (TensorBoard, Losswise, Comet.ml and Weights & Biases). The final part of the lecture discussed all-in-one solutions that provide many or all of the above, such as Floyd, Paperspace, and CloudML. Interestingly enough, I didn’t see any solutions that you could run on-prem. If you know of any, please leave a comment!
Takeaways: Consider the costs of managing your own hardware, make a back-of-the-envelope calculation of your usage costs, and see if it’s worth the hassle of getting your own hardware, also considering the nice extra features cloud environments have to offer.
Lab 4 — Tooling
Back to work! In the tooling lab we continued working on our problem of classifying hand-written sentences using the EMNIST dataset. This time around we set it up to hook into the Weights & Biases account we made before coming to the bootcamp.
I specialised in computer graphics & data visualisation back at TU Delft, so for a nerd I am pretty visual. I enjoy using TensorBoard, but it is definitely pretty limited if you want to drill deep into your model metrics and figure out what’s happening. In comes W&B (see the gif above!). Basically, for every run you do, certain metrics are uploaded to W&B, which visualises them for you. I especially loved the parallel coordinates plot. In fact I used it myself in my academic work (shameless plug). This kind of plot visualises the relationships between the data. It is especially adept at filtering out data and showing correlations (correlations will appear as parallel lines!)
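That last point can be made concrete in code: a strong correlation between a logged hyper-parameter and a metric is exactly what shows up as near-parallel line segments in the plot. A hedged sketch with an invented toy run log:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Toy run log: learning rate vs validation accuracy across four runs.
lrs = [0.001, 0.01, 0.1, 1.0]
accs = [0.91, 0.93, 0.80, 0.35]
r = pearson(lrs, accs)
```

Here `r` comes out strongly negative: in a parallel coordinates plot, the line segments from the learning-rate axis to the accuracy axis would visibly cross in the same direction.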
Lecture 8 — Sequence Applications
This is a sequence. This is also a sequence. Sentences are sequences. As are audio files and basically anything time-related. This lecture discussed some advanced LSTM concepts like bi-directional LSTMs, attention and beam search, as well as applications such as translation and audio synthesis. Very clear and concise lecture, but nothing you wouldn’t find across many other resources on the internet, so considering the word count, I will leave it at this.
Takeaways: Worth investigating if you want to know about the concrete applications of LSTMs, the details to consider, and how to solve sequence data related challenges.
Lab 5 — Experimentation
Here we got to use our fancy new visualisation tool to visualise some experimentation in a free-form way. Sentences from the IAM dataset were given, constituting not just an image classification problem but also a sequential one. Using an LSTM and a basic CTC loss function we could try to solve it. Some pointers on how to improve this basic model were given. It was a nice way to get a feel for W&B, and rather fun to see your changes being visualised in this way.
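The loss side of CTC involves a dynamic-programming algorithm, but the decoding rule it is built around is easy to sketch: take the most likely label per timestep, collapse runs of repeated labels, then drop the blanks. A minimal illustration (the blank symbol and inputs are my own choices, not from the lab code):

```python
BLANK = "-"  # conventional CTC blank symbol (index 0 in most frameworks)

def greedy_ctc_decode(per_step_labels):
    """Collapse repeated labels, then remove blanks (the CTC decoding rule)."""
    out = []
    prev = None
    for label in per_step_labels:
        if label != prev:          # collapse runs of the same symbol
            if label != BLANK:     # drop blank steps
                out.append(label)
        prev = label
    return "".join(out)

# "hh-e-ll-lo" decodes to "hello": the blank between the two l-runs is
# what lets CTC emit a doubled letter.
decoded = greedy_ctc_decode(list("hh-e-ll-lo"))
```

This is why the blank symbol exists at all: without it, a collapsed sequence could never contain the same character twice in a row.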
Guest Lecture 1 — Andrej Karpathy
Fascinating talk on Software 2.0, Andrej’s theory of a new type of software that is not coded up explicitly but is rather created automatically from data using machine learning. The concept itself is not new: image classification, for example, has gone from millions of lines of hand-crafted code for understanding what a face is to far less deep learning code that learns its own feature set, at much better precision. What was very interesting about this talk is that the concept was extended to the entire software stack.
Andrej went on to mention that a lot of hand-crafted code at Tesla was used to process the input from the car’s sensors (cameras, radar, IMU, etc.) into steering and acceleration output. This whole stack of code was slowly being “eaten” by Software 2.0 code. Unfortunately, no concrete examples of this type of code were given. I would love to have seen some! In any case, he did show the importance of labeling and how he sees the role of AI people as mostly facilitating those who label all the images. I found this a slightly depressing way of looking at the fancy AI field, but it’s hard to argue with it. As a bonus he showed us some fun examples of labeling issues at Tesla: weird road markings, strange traffic signs, etc. Pretty hilarious and insightful.
Takeaways: Learning about a philosophical view of where ML is headed was interesting, and seeing some real-world examples of the issues Tesla faces was great. By the way, the talk is quite similar to this one, if you can’t wait for this publication.
Guest Lecture 2 — Jai Ranganathan
Jai Ranganathan, a quirky little guy with a big career as Lead of Product at Uber, discussed the challenges faced by his team when managing a model’s lifecycle, in the specific context of a project automating the processing of user complaints.
The example project shown was COTA (Customer Obsession Ticket Assistant), a tool that leverages machine learning and natural language processing (NLP) to process user tickets more efficiently. The tool helps employees resolve these tickets by suggesting replies to users.
Most interesting were the lessons learned and general pointers from each phase of the project.
Exploration: Identifying the right problem to solve, and understanding whether ML is actually a good fit for it.
Development: Due to the huge and still-growing space of possible ML solutions, you are well advised to decide how you weigh cost (compute time) against accuracy. It’s important to keep up with the literature (they showed quite a few interesting cutting-edge techniques) and to validate your results using visualisation.
Deployment: Covered some interesting data engineering techniques, for example a really nice Spark pipeline. The main difficulty here is that deep learning is still very slow, but distributed DL solutions can definitely help.
Monitoring: Very important, but often overlooked. This deals with the fact that business is dynamic, which means your data is dynamic, which means your models may become outdated. So it’s important to check for things like distribution shift in your data and retrain when necessary. It’s also very important to keep your labeling process going and to identify edge cases where your model is still failing.
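A hedged sketch of such a distribution-shift check: flag a feature whose live mean has drifted too many reference standard deviations away from its training-time value (the threshold and data are illustrative assumptions of mine, not Uber’s method):

```python
from statistics import mean, stdev

def drifted(reference, live, z_threshold=3.0):
    """True if the live mean is more than z_threshold reference stdevs away."""
    mu, sigma = mean(reference), stdev(reference)
    return abs(mean(live) - mu) / sigma > z_threshold

reference = [10.0, 11.0, 9.0, 10.5, 9.5]  # training-time feature values
stable = [10.2, 9.8, 10.1]                # looks like the same distribution
shifted = [25.0, 26.0, 24.5]              # clearly drifted; time to retrain
```

Real monitoring systems use proper statistical tests over whole distributions, but even a crude mean check like this catches the "business changed under the model" failure mode the lecture warned about.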
Interesting side-note: when we went for drinks later he joined us, and claimed that the real difficulty lies in finding good data engineers rather than data scientists. In fact, he exclaimed that data scientists are a dime a dozen, which I thought was funny considering these people are also quite rare, but possibly not that hard to come by for a company like Uber.
Takeaways: I think the Exploration, Development, Deployment and Monitoring paragraphs should cover it nicely.
Oh what a perfect day. Well, nothing is perfect, but it was definitely very, very good. I felt the bootcamp really kicked into gear today, showing a lot of the stuff many of us were craving to see: the practical nitty-gritty of debugging, coverage of the most important tools around, how to build your own DL setup, an almost philosophical lecture from Andrej Karpathy on the future of software, and a very practical, in-depth example project at Uber.