The Machine Learning Playbook for Data Engineers

A series on the knowledge required to transition from data engineer to machine learning engineer

Feb 21, 2023

In 2016, I was a software engineer at a top trading firm, struggling to break into machine learning. I had recently earned a master's degree in ML from a prestigious university, but what I had learned there was not enough to pass an ML engineering interview. I faced two options: abandon my ML dreams and stay in my comfortable data engineering role, or persevere and break into the ML industry through sheer force of will. I chose the latter, and five years later I was a senior staff ML engineer, leading the ML team at a publicly traded tech company. This is how I did it.

Intended Audience

This course is for experienced infrastructure and data engineers who want to break into machine learning. The ideal reader has already taken a few ML courses (my suggestion: Andrew Ng’s first two courses) but is struggling to pivot to a career as an ML engineer.

While many other “Intro to Machine Learning” courses exist and are a necessary prerequisite, they do not sufficiently teach you the skills needed to become an ML engineer. A common saying among ML practitioners is that “80% of the job is data wrangling”. However, no playbook exists to teach the engineers who are already experts in this 80% the skills necessary to be effective in the remaining 20%… until now.

Syllabus

Just as the material you learn in a computer science course is only tangentially related to the tasks you face as a software engineer, the material you learn in an “Intro to Machine Learning” course is only tangentially related to most of the work you do as an ML engineer. This is because other ML courses take a breadth-first approach, teaching a variety of algorithms (starting with linear models, moving to tree-based methods, and ending with neural networks) without ever teaching all of the steps needed to successfully deploy any model to production. In this course, we will instead remain model agnostic and walk through the steps to deploy any model. Throughout, I’ll throw in tips on how to use this content to ace your next ML interview.

Chapter 1, Create a Pipeline: Start simple, determine metrics, and iterate
Chapter 2, Build a Dataset: Clean, transform, validate, and store data
Chapter 3, Model Training: Speed up training and calibrate results
Chapter 4, Offline Evaluation: Ensure model quality before releasing to production
Chapter 5, Model Inference: Real-time and batch architectures for serving results
Chapter 6, Model Monitoring: Detect and fix issues as they occur
Chapter 7, Ace the Interview: Common questions and pitfalls during ML interviews