How a Self-Taught Programmer Taught Himself Machine Learning in 10 Weeks

Context

Paper Club began in 2017 as a group of friends with a shared interest in the trendy fields of AI and deep learning. Over the next year and a half, we eagerly dove into the latest research on neural network theory and architectures. We took courses, read papers, and fiddled around with Kaggle competitions.

As we continued down this path, some concerning patterns emerged:

  • Seemingly every few months, the field would change dramatically. New results would be published that beat the state of the art, and the authors would propose several reasons why. Then, sooner rather than later, counter-evidence would emerge and chaos would ensue. It didn’t help matters that the rigor and reproducibility of most research left a lot to be desired. Keeping up with everything felt like a big game of whack-a-mole. This may be the spirit of academia and research, but for a group of relative beginners, it felt like an inefficient learning path, to say the least.
  • For the few universally-accepted pieces of knowledge we were able to extract, retention became an issue. It felt like I had been dropped into the middle of a story and had nothing to anchor the new plot points to. This experience felt eerily familiar to me as a self-taught web developer who eventually learned the underlying computer science fundamentals. The fancy new stuff was exciting, but without a foundation to build upon, there was a definite ceiling to how much I could learn and retain.

[Image: Deep Learning was chaos. Photo by Peter Nguyen on Unsplash]

And so, last fall, we set out as a group to address these issues by taking a step backwards and learning “classical” machine learning together.

As a general rule, the algorithms that fall under “classical” machine learning are older and more interpretable than deep neural networks. My previous assumption was that these tradeoffs resulted in a massive performance downgrade — after all, why else would people be so excited about neural networks?

It turns out that for all but the most cutting-edge workloads (think Google/Facebook scale), classical machine learning algorithms are deployed very widely and to great effect. And after spending countless hours debugging neural networks where the issue could be in one of a million places with no indication of where to begin, I could certainly see the appeal of an interpretable and fast, if slightly-worse-on-paper, model.

Planned Curriculum

After doing some light research, we decided that we would pull from two primary machine learning resources:

  • Hands-On ML [hands-on]
  • Andrew Ng’s Coursera course [coursera]

as well as a few one-off pieces [misc]. As the member with the most prior machine learning experience, Jason Benn was our spirit guide on this journey. The original plan looked like this:

  • Chapter 2 End-to-End Machine Learning Project [hands-on]
  • Chapter 3 Classification (precision/recall, multiclass) [hands-on]
  • Text feature extraction (from sklearn docs) [misc]
  • Chapter 4 Training Models (linear/logistic regression, regularization) [hands-on]
  • Advice for Applying Machine Learning [coursera]
  • Chapter 5 SVMs (plus kernels) [hands-on]
  • Chapter 6 Decision Trees (basics) [hands-on]
  • Chapter 7 Ensemble Learning and Random Forests (xgboost, RandomForest) [hands-on]
  • Chapter 8 Dimensionality Reduction (PCA, t-SNE, LDA) [hands-on]
  • Machine Learning System Design [coursera]
  • (Google) Best Practices for ML Engineering [misc]

Jason picked these pieces to optimize for “real-world practicality and covering lots of interesting ground quickly”.

Completed Curriculum

We ended up deviating slightly from the original plan. This is a summary of the curriculum we went through. My learnings are split between notes and flashcards, as I was experimenting with different learning techniques throughout the course.

Chapter 2 End-to-End Machine Learning Project [hands-on]

This was a high-level overview of the eight steps of a machine learning project:

  1. Look at the big picture
  2. Get the data
  3. Discover and visualize the data to gain insights
  4. Prepare the data for ML algorithms
  5. Select a model and train it
  6. Fine-tune your model
  7. Present your solution
  8. Launch, monitor, and maintain your system

It set the stage for the rest of the curriculum.

Chapter 3 Classification (precision/recall, multiclass) [hands-on]

Classification is one of the two main types of machine learning tasks (regression being the other). This chapter walked through the creation and, more importantly, evaluation of classifiers trained to identify handwritten digits.
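
To make the evaluation side concrete, here is a minimal sketch of precision and recall computed by hand. The labels are made up for illustration (a hypothetical "is this digit a 5?" detector), and the helper name `precision_recall` is my own, not something from the book.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the class labelled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when nothing is predicted/labelled positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels: 1 = "five", 0 = "not five".
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Precision asks "of everything flagged as a 5, how much really was?", while recall asks "of all the real 5s, how many were found?". The two usually trade off against each other, which is why a single accuracy number can hide a lot.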

Chapter 4 Training Models (linear/logistic regression, regularization) [hands-on]

This chapter of Hands-On ML covers linear and logistic regression, simple but extremely powerful tools for predicting continuous values and class probabilities, respectively.
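
As a rough illustration of what "training" means here, the sketch below fits a one-feature linear model with batch gradient descent and an optional L2 (ridge) penalty. The data, learning rate, and function name `fit_linear` are assumptions chosen for the example; real libraries use more robust solvers.

```python
def fit_linear(xs, ys, lr=0.05, epochs=2000, l2=0.0):
    """Fit y ~ w * x + b by minimizing mean squared error + l2 * w**2."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the MSE; the L2 penalty applies to the weight only.
        grad_w = (2 / n) * sum(e * x for e, x in zip(errors, xs)) + 2 * l2 * w
        grad_b = (2 / n) * sum(errors)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (no noise).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = fit_linear(xs, ys)
w_ridge, b_ridge = fit_linear(xs, ys, l2=1.0)
print(w, b)              # close to the true slope and intercept
print(w_ridge, b_ridge)  # regularization shrinks the slope toward zero
```

The regularized fit deliberately gives up some accuracy on the training data in exchange for a smaller weight, which is the basic idea behind the regularization techniques the chapter covers.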

Text feature extraction (from sklearn docs) [misc]

Here, we took a quick break in between the super-basic machine learning techniques and slightly more technical ones. ML algorithms only take numbers as inputs, but some of the most useful ML tasks start with a text dataset. Understanding how to transform text into numerical features using techniques like bags-of-words and TF-IDF opens up all of these tasks at a low cost.
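
For a sense of how little machinery this takes, here is a small sketch of bag-of-words counts and the textbook TF-IDF weighting (term count times log(N / document frequency)). The three-sentence corpus and the function names are invented for the example. Library implementations such as scikit-learn's `TfidfVectorizer` use a smoothed IDF and normalize each row by default, so their numbers will differ from these.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, count vectors), one vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


def tf_idf(count_vectors):
    """Weight raw counts by log(N / df), the unsmoothed textbook IDF."""
    n_docs = len(count_vectors)
    n_terms = len(count_vectors[0])
    # Document frequency: in how many documents does each term appear?
    df = [sum(1 for vec in count_vectors if vec[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / d) for d in df]
    return [[count * weight for count, weight in zip(vec, idf)]
            for vec in count_vectors]


docs = ["the cat sat", "the dog sat", "the dog barked"]
vocab, counts = bag_of_words(docs)
weights = tf_idf(counts)

print(vocab)
print(counts[0])
print(weights[0])
```

A word like "the" that appears in every document gets a weight of zero, while a word unique to one document ("cat") is weighted most heavily. That is the whole trick: rare words carry more signal about what a document is about.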

Chapter 6 Decision Trees (basics) [hands-on]

This chapter covers decision trees, mainly as a stepping stone to random forests. In practice, random forests are preferred in almost every situation, but you don’t want to be the person who can’t see the forest for the trees.
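
The core of a decision tree is choosing a split that makes the resulting groups as "pure" as possible. Below is a minimal sketch of that single step using Gini impurity on one feature. The measurements are made-up toy values, and for simplicity the candidate thresholds are the observed values themselves rather than midpoints between them.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0.0 for a pure group, higher for mixed groups."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted impurity) of the best split on one feature."""
    best = (None, float("inf"))
    n = len(values)
    for threshold in sorted(set(values)):
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        if not left or not right:
            continue  # a split must put samples on both sides
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical petal lengths (cm) for two flower species.
petal_lengths = [1.4, 1.3, 1.5, 4.7, 4.5, 5.1]
species = ["setosa", "setosa", "setosa",
           "versicolor", "versicolor", "versicolor"]

threshold, impurity = best_threshold(petal_lengths, species)
print(threshold, impurity)
```

A full tree simply repeats this search recursively on each side of the split, across all features, until a stopping rule kicks in.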

Chapter 7 Ensemble Learning and Random Forests (xgboost, RandomForest) [hands-on]

In Jason’s humble opinion, the most useful set of machine learning techniques today. This chapter covers ensemble learning methods where multiple models are trained slightly differently on the same task and are aggregated to make a final prediction. These include gradient- and adaptive-boosted models as well as the aforementioned random forests.
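
The aggregation idea can be sketched in a few lines. The example below covers only the final voting step, with three hypothetical models whose predictions are hard-coded; how each model is actually trained (bootstrap samples for forests, sequential rounds for boosting) is out of scope here.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists into one label per sample."""
    return [Counter(sample_votes).most_common(1)[0][0]
            for sample_votes in zip(*predictions)]


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [1, 0, 1, 1, 0, 1]

# Three imperfect (hypothetical) models, each wrong on a different sample.
model_a = [1, 0, 1, 0, 0, 1]
model_b = [1, 1, 1, 1, 0, 1]
model_c = [0, 0, 1, 1, 0, 1]

ensemble = majority_vote([model_a, model_b, model_c])

for name, preds in [("a", model_a), ("b", model_b), ("c", model_c),
                    ("ensemble", ensemble)]:
    print(name, accuracy(y_true, preds))
```

Because the individual models make different mistakes, the vote outperforms each of them. This only works when the errors are not strongly correlated, which is why ensemble methods go out of their way to train each model "slightly differently".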

Chapter 8 Dimensionality Reduction (PCA, t-SNE, LDA) + Airbnb paper [hands-on]

The curse of dimensionality comes up over and over again in even slightly non-trivial machine learning problems. Being able to recognize and address it is a cheap way to get faster training and better model performance.

James also found this Airbnb paper about how the company applied deep learning within its organization, and it was compelling to read through and discuss with the added context of our machine learning studies.
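
One way to build intuition for the curse of dimensionality: in a high-dimensional unit hypercube, almost all of the volume sits close to the boundary, so randomly sampled points are nearly all "edge cases". The sketch below computes that fraction directly; the 0.05 margin and the function name are arbitrary choices for illustration.

```python
def shell_fraction(dims, margin=0.05):
    """Fraction of a unit hypercube's volume within `margin` of some face."""
    # The "safe" interior is a smaller cube with side 1 - 2 * margin.
    inner_side = 1.0 - 2.0 * margin
    return 1.0 - inner_side ** dims


fractions = {dims: shell_fraction(dims) for dims in (1, 2, 10, 100, 1000)}
for dims, fraction in fractions.items():
    print(f"{dims:>5} dimensions: {fraction:.5f} of the volume is near the edge")
```

With only a handful of features the interior dominates, but by a hundred features nearly every point lies near a face of the cube. Dimensionality reduction techniques try to sidestep this by projecting the data into a space with far fewer dimensions.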

Chapter 5 SVMs (plus kernels) [hands-on]

After some back-and-forth about the present-day usefulness of Support Vector Machines, we decided to go ahead and spend a week on this chapter. It turned out that, while backed by some interesting concepts, they seem to be too inflexible and easily replaced by other techniques.

(Google) Best Practices for ML Engineering + implementation [misc]

For this section, we took a read through Google’s published Rules of Machine Learning guide and each chose a pet task to implement one of the machine learning techniques we’d learned about. This was a chance to tie a bow around the last ~10 weeks of learning.

Notably, you’ll see that we decided to drop the Andrew Ng Coursera course from the original curriculum. After trying to get into the few lessons we had planned, we concluded that the course was not worth the extra effort, given its lack of tree-based methods and its use of Octave instead of Python.

Takeaways

If you recall, the two issues we had with deep learning that prompted our foray into classical machine learning were: a) constantly-changing best practices and results, leading to b) an inability for beginners to accurately judge useful and practical information.

I’m happy to report that these few months more than adequately addressed those concerns.

It’s hard to overstate how refreshing it was to work with the scikit-learn library after spending so much time wrestling with scattered and poorly integrated deep learning infrastructure. The Pipeline and Estimator abstractions are breathtakingly beautiful. It turned out that for a majority of workloads, most of the effort was in the data exploration and preparation steps, and a mere `estimator.fit(X_train, Y_train)` was all that it took to get results and iterate.
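
To show why that convention is so pleasant, here is a toy, pure-Python sketch of the fit/transform/predict pattern. To be clear, this is not scikit-learn's code: `ToyMeanCenterer`, `ToyNearestCentroid`, and `ToyPipeline` are hypothetical classes written for this example, and the data is made up.

```python
class ToyMeanCenterer:
    """Toy transformer: subtracts the per-column mean learned in fit()."""

    def fit(self, X, y=None):
        n = len(X)
        self.means_ = [sum(col) / n for col in zip(*X)]
        return self

    def transform(self, X):
        return [[v - m for v, m in zip(row, self.means_)] for row in X]


class ToyNearestCentroid:
    """Toy estimator: predicts the class whose centroid is closest."""

    def fit(self, X, y):
        groups = {}
        for row, label in zip(X, y):
            groups.setdefault(label, []).append(row)
        self.centroids_ = {
            label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in groups.items()
        }
        return self

    def predict(self, X):
        def dist2(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))

        return [min(self.centroids_, key=lambda c: dist2(row, self.centroids_[c]))
                for row in X]


class ToyPipeline:
    """Chains transformers and a final estimator behind one fit/predict."""

    def __init__(self, transformers, estimator):
        self.transformers = transformers
        self.estimator = estimator

    def fit(self, X, y):
        for step in self.transformers:
            X = step.fit(X, y).transform(X)
        self.estimator.fit(X, y)
        return self

    def predict(self, X):
        # Reuse the statistics learned during fit; never refit on new data.
        for step in self.transformers:
            X = step.transform(X)
        return self.estimator.predict(X)


X_train = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [4.8, 5.2]]
Y_train = ["small", "small", "large", "large"]

model = ToyPipeline([ToyMeanCenterer()], ToyNearestCentroid()).fit(X_train, Y_train)
predictions = model.predict([[0.9, 1.1], [5.1, 4.9]])
print(predictions)
```

Because every step exposes the same small interface, swapping the final estimator for a different model leaves the rest of the code untouched, which is what makes quick iteration possible.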

Taking the pulse of machine learning practitioners online and in-person, I get the sense that random forests or gradient boosted models are a perfectly acceptable, and often preferred, model choice for 90%+ of real-life workloads. The speed and interpretability benefits are significant, especially when you can take the opportunity cost of fiddling with a poorly-defined deep learning architecture and apply it to augment and improve your dataset (which is much, much more likely to lead to better results).

The bottom line is that I feel like a much more confident and prepared machine learning practitioner with these tools in my toolbox.

Suggested Curriculum

So, if I were to do it all over again, what would I do differently? Or — if you are also someone who got caught up in the deep learning craze and is interested in putting down stronger machine learning roots, what path would I suggest?

In general, I’m pleased with the amount of ground we covered, as well as the outcome. But as tends to happen with these sorts of things, our FOMO got the best of us and a lot of fluff made its way into the study plan (this is not inherently bad! But you should choose your own fluff).

As a rule, I would make the curriculum much more practice-heavy, replacing less-useful theory pieces like SVMs and dimensionality reduction with opportunities to actually apply the techniques that remain. This outline includes the highest-value chapters in Hands-On ML, and marks explicit time to apply your new knowledge on Kaggle:

  • Chapter 2 End-to-End Machine Learning Project
  • Chapter 3 Classification (precision/recall, multiclass)
  • Chapter 4 Training Models (linear/logistic regression, regularization)
  • Kaggle competition submission with logistic regression
  • Chapter 6 Decision Trees (basics)
  • Chapter 7 Ensemble Learning and Random Forests (xgboost, RandomForest)
  • Kaggle competition submission with Random Forest
  • Kaggle competition submission with gradient boosting
  • Your own project!

In the course of working on the Kaggle competitions, you will invariably stumble upon gaps in your knowledge or interesting sidebars to pursue; these will inherently be higher-value than any other prescriptive assignments that are laid out for you.

I would suggest choosing completed Kaggle competitions, since those are more likely to have published high-performing approaches which you can compare your own solution to.

Finally, for “your own project”, I would start with Edouard Harris’s advice: build a project with an interesting dataset that took obvious effort to collect, and make it as visually impactful as possible.

Cheers, and may your models always converge!