The Physics of Machine Learning Engineering

Ido Tamir
Outbrain Engineering
Oct 5, 2020

I’ve spent almost a decade with one foot in engineering and the other in data science. Watching engineers and data scientists discuss how to get things done often got me thinking that the two disciplines exist in entirely different universes, to the point of not believing in the same laws of physics.

Photo by JESHOOTS.COM on Unsplash

At the risk of a gross generalization, it seemed to me that software engineers lived in Einstein’s deterministic universe, while data scientists inhabited Bohr’s inherently random one. Engineering has evolved its ability to predictably deliver working solutions, while data-science work is open-ended by nature and subject to discoveries that might derail a project completely.

A “unified field theory” is needed to bind these fundamental forces (engineering and data science) into a single process for delivering machine-learning solutions consistently and at scale.

This post is the first in a series that aims to describe Outbrain’s approach to repeatable data science and the tools we use to achieve it.

From Theory to Observations

Presenting such a theory is beyond my skills (hey, even Einstein fumbled with it!), so let’s start with several key observations.

#1 Machine learning IS software engineering
A successful journey from raw data to serving a quality machine-learning model in production requires software. From data pipelines and ETL processes, through model selection, feature engineering, and model training, to model deployment — software is key. There is no difference between the code that drives your serving stack and the code that does automatic feature selection, even though the former is owned by the engineering team and the latter by the data-science team. Both should follow your organization’s software development best practices — from testing and code reviews to committing early and often. Data-science code is code! Once you start treating it as such, you realize that all the practices followed by the engineering organization (developer testing, pair programming, CI/CD, MVPs, etc.) apply to data-science projects as well, increasing productivity in the long run.
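To make that concrete, here is a minimal sketch of what “data-science code is code” can look like in practice: a plain unit test for a hypothetical feature-engineering function. The function, column names, and test are illustrative, not Outbrain’s actual code.

```python
# test_features.py -- a pytest-style unit test for a feature-engineering step.
# The feature function below is a made-up example; the point is that it gets
# tested and code-reviewed like any other production code.
import pandas as pd


def add_ctr_feature(df: pd.DataFrame) -> pd.DataFrame:
    """Add a click-through-rate column, guarding against division by zero."""
    out = df.copy()
    out["ctr"] = out["clicks"] / out["impressions"].clip(lower=1)
    return out


def test_add_ctr_feature_handles_zero_impressions():
    df = pd.DataFrame({"clicks": [3, 0], "impressions": [10, 0]})
    result = add_ctr_feature(df)
    assert result["ctr"].tolist() == [0.3, 0.0]
    # The original frame must not be mutated by the transformation.
    assert "ctr" not in df.columns
```

Hooking tests like this into the same CI/CD pipeline the serving stack already uses is usually the cheapest first step.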

#2 KISS your ML stack
KISS stands for “Keep it simple, silly”, “keep it short and simple”, “keep it simple and straightforward”, or “keep it simple, stupid”. The message is clear — simple software solutions are easier to develop, maintain, and reason about. Complexity — especially unintentional complexity (overly complex solutions to simple problems) — is a velocity killer. Since research projects are open-ended by nature, with the added complexity of ML modeling already in place, we should strive to eliminate complexity as much as possible to allow maximum velocity. Unfortunately, many ML projects begin with an all-or-nothing approach, building toward the state of the art in the hope of reaping phenomenal results in the long run. I have watched several such projects screech to a halt, a year in the making. Complexity should be added only after all the low-hanging fruit has been picked. To allow that, your ML stack should be as simple as possible. Now go KISS your ML stack!

#3 Simple algorithms, solid pipelines
You’ve looked at the data, built your pipeline, and deployed the model. Now your model is in production, outputting weird results. Is it the algorithm? Is it the data? Is it the pipeline? You might spend weeks debugging the wrong problem. Trust me — you don’t want to be in this position as an engineer, data scientist, or engineering manager. Coupling a weak data pipeline with a complex algorithm is a sure recipe for disaster. Pick the simplest algorithm that suits your ML needs — and I mean the simplest. Linear/logistic regression can be a powerful tool if you give it the right data. Make sure your pipelines are solid. Run the same online prediction code on your offline data to validate your pipeline. Log your online state and search for data discrepancies between online states and offline data aggregations. Build automation around these checks. I know cutting-edge machine learning is what you want to do — working incrementally builds intuition and confidence while getting you there. Want value delivered consistently? Use a simple algorithm and make sure your pipelines are solid.
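As a sketch of the kind of automation this suggests, a parity check between logged online state and offline reconstructions might look like the following. The `predict` callable, the feature representation, and the tolerance are illustrative assumptions, not a prescribed API.

```python
# A rough sketch of an online/offline parity check. `predict`, the feature
# names, and the tolerance are illustrative, not a specific framework's API.
from typing import Callable, Mapping, Sequence


def parity_report(
    predict: Callable[[Mapping[str, float]], float],
    logged_rows: Sequence[Mapping[str, float]],   # features logged at serving time
    offline_rows: Sequence[Mapping[str, float]],  # same requests, features rebuilt offline
    tolerance: float = 1e-6,
) -> dict:
    """Run the online prediction code on both feature sources and count mismatches."""
    feature_mismatches = 0
    score_mismatches = 0
    for online, offline in zip(logged_rows, offline_rows):
        if online != offline:
            feature_mismatches += 1
        if abs(predict(online) - predict(offline)) > tolerance:
            score_mismatches += 1
    return {
        "rows_checked": len(logged_rows),
        "feature_mismatches": feature_mismatches,
        "score_mismatches": score_mismatches,
    }
```

Scheduling a report like this against each day’s logs turns “search for data discrepancies” into an alert rather than a multi-week debugging session.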

#4 Make the Kessel Run in less than twelve parsecs
Investing in speed really pays off. Just ask Han Solo. Or our data-science team. Yes, the Millennium Falcon may not look like much, but it’s fast. Iterative development starts with an initial, simplified implementation, which then progressively gains complexity and a broader feature set until the final system is complete. The same applies to data science. Adding features to the pipeline, feature engineering, selecting models, tuning model parameters, and deploying models — faster iterations through these steps allow quicker KPI observations, leading you to the next performance improvement. They also let you pivot painlessly. If each model deployment takes a month, half a year will pass and all you have are six data points. If those points average out to a small lift (or none), you might be tempted to continue along this path (because a pivot is painful), only to realize after a year that you are left with no real improvement — and a large ML stack to maintain. Iterating faster (one test per week, for example) gives you the same data after six weeks, letting you consider a pivot almost painlessly. A fast build-measure-learn cycle for data science is crucial for delivering incremental value in the short run while building toward the long run. You need to kiss a lot of frogs, faster and faster, to find your KPI-boosting princess. Build the software you need to make it happen.

From Observations to Practice

Building on the four observations above, we at Outbrain built and deployed our own AutoML framework, treating it as a software engineering project from the get-go. AutoML isn’t a silver bullet, and it isn’t the point here. The point is the way we decided to approach this project — an approach that made it a huge success:

  • The goal was described as creating an MVP for AutoML and Online Learning in 3 months*.
  • We treated this project as a pure software engineering project from the get-go, using all engineering best-practices Outbrain is known for. And, no research!
  • We started with the simplest algorithm possible that matched our problem (in our first case, Logistic Regression for click prediction) — see the sketch after this list.
  • We made sure the pipeline — from raw data to training data to prediction and prediction validation — was rock solid, eliminating discrepancies only after making sure we knew exactly where they came from.
  • We iterated as fast as possible. The first working version (that provided a lift!) took 3 months from inception to production. More lifts (and new use-cases) came at a stunning rate soon after.
  • Only when we had exhausted all our simple tricks and lifts plateaued was it time for the cutting edge…
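As a rough illustration of the “simplest algorithm” bullet above, a click-prediction baseline can be only a handful of lines. The file path, column names, and features here are made-up placeholders rather than Outbrain’s actual setup.

```python
# A minimal click-prediction baseline: logistic regression over a few
# categorical features. The CSV path and column names are placeholders.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv("training_data.csv")                 # placeholder path
features = ["publisher_id", "widget_id", "country"]   # placeholder feature columns
X, y = df[features], df["clicked"]                    # binary click label

model = Pipeline([
    ("encode", ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"), features)]
    )),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X, y)
print(model.predict_proba(X[:5])[:, 1])               # predicted click probabilities
```

Once a baseline like this produces sane probabilities and the parity checks from observation #3 pass, adding complexity becomes a deliberate choice rather than a starting point.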

From Practice to Practitioners

“Everything should be made as simple as possible, but no simpler.”

This is one of the greatest quotes in science. Coming from Einstein, who paved the way to general relativity, it is a fitting statement of how to conduct science. This is why I believe the physics of machine learning is identical to the physics of software engineering. Bridge the gap, and you’ll get a high-performance team capable of delivering quality machine-learning solutions consistently and at scale.

Our next posts will take a closer look at our AutoML toolset and some of the advanced machine-learning algorithms we incorporated into our flows.

* 3 months to a working (and lifting) MVP in an environment in which mistakes cost money is a good starting point. Later iterations were much faster (a week in some instances).

Ido Tamir
Outbrain Engineering

Software Engineering and Dungeons and Dragons. Not always in this order. VP R&D at Zencity.