The Surprising Truth About What it Takes to Build a Machine Learning Product

When I was in college, an ice cream shop opened nearby, and a few friends and I went to check it out. We walked in, and it looked completely normal — they had all the usual flavors like mint, chocolate, and the like. However, at the end of the counter, they had this flavor called “The Broccoli Surprise”. A naturally curious individual, I had to try it. I asked the attendant behind the counter for a sample. It was white with little green specks, and it tasted sweet, creamy, and rich. I was confused — there was no broccoli flavor in here. So I asked, “what’s the surprise?” “There’s no broccoli,” she replied with a smile.

Machine learning (ML) has a surprise, too. One of the biggest misconceptions about ML deployment within organizations is comprehending the difficulty and the value.

Integrating ML into your business workflows can be broken down into five activities:

Defining KPIs — Key Performance Indicators allow us to measure and discuss what we are trying to improve. Common KPIs include customer retention, manufacturing yield, or employee turnover. Setting KPIs is a critical step in Machine Learning since they ultimately drive optimization along the way to a performant model.

Collecting Data — Collecting the data that will be used to train your ML algorithms. Yes, you could use ML models others have produced if you lack data. However those business considerations are similar to other SaaS offerings, so let’s leave them out of scope here.

Infrastructure — ML infrastructure includes various pieces of software: data management, annotation tools, model training, and testing environments. This infrastructure is an upfront investment, but makes iterating and improving the model and data set much more efficient.

Optimizing ML Algorithm — Here we consider factors like which model to use based on a given data set/problem, the amount of necessary training data, the layers in your neural net, and hyperparameter tuning. There are a plethora of choices.

Integration — Getting an ML model working in a vacuum is a great achievement, but it is not until the model is integrated with a real workflow that it starts to create a tangible business impact. Integration is the process of building pipes and structure which seamlessly pass information and data between users and computers.

Based on many conversations with companies interested in deploying machine learning, there is a high perceived effort required in, and pay off from, optimizing a machine learning algorithm.

There are a few possible reasons for this:

  • For most practitioners, optimizing ML models is the biggest “unknown” in the stack, so it’s easy to imagine it being more complicated and time-consuming than it really is.
  • Availability Heuristic — since ML algorithms and optimization are talked about more in literature and media, it is common for people to assume that they play larger roles than they do in the actual implementation process.

The Surprise

When I talk to practitioners that have had a lot of experience building and scaling these ML systems inside Google, I hear a very different story. Based on these conversations, optimizing an ML algorithm takes much less relative effort, but collecting data, building infrastructure, and integration each take much more work. The differences between expectations and reality are profound.

Defining KPIs — once we deploy data-driven systems, we spend less time and organizational resources selecting KPIs since there are constant streams of data feedback. This obviates the need for proxy KPIs. Since good ML is predicated on good data, we must have a great collection pipeline already in place.

Collecting Data — Collecting data is almost always an underestimated component of spinning up an ML project. Some factors to consider when building a data collection and processing strategy are described in a previous post.

Infrastructure — Infrastructure building, which is mostly a software engineering task as opposed to an “ML task,” is one of the most time-consuming parts of most projects.

Optimizing ML Algorithm — The task of training and optimizing ML models almost always takes less time and effort than anticipated for two reasons. First, performance is a strong function of what data you possess. Tweaking algorithms yields benefits, however, pales in comparison to cleaning up your data. Second, tools for optimizing ML algorithms (like AutoML) make it much easier and faster to train and optimize models based on labeled or unlabeled data.

Integration — integration is another underestimated part of the ML deployment process. Error and exception handling, redundancies, and the challenge of moving from a static product to one of continuous iterations presents a host of software, product, and engineering challenges. Just think of all the technical debt hidden inside of your training data!

— —

ML actually has two surprises.

First, many companies are wrong about which parts of the ML implementation process will be difficult. Tools and technical advances are dramatically changing ML optimization at a rate unmatched by software infrastructure for brute force data collection and management. Like the broccoli ice cream — there is usually not that much ML in an end-to-end ML system.

Secondly, the path of implementing ML (asking questions about your customers, building infrastructure to collect, interpret and act upon that data, etc.) is valuable, regardless of whether or not ML is actually implemented in the end. Not every problem has an ML-powered solution, but many do, and even those that do not will benefit from this journey.

Josh Cogan is a Tech Lead and Manager in the Cloud AI group at Google. He is passionate about intersection of business, history, information theory and ML.