Some Things I Wish I Had Known Before Scaling Machine Learning Solutions: Part I
I recently started a new newsletter focused on AI education. TheSequence is a no-BS (meaning no hype, no news, etc.) AI-focused newsletter that takes 5 minutes to read. The goal is to keep you up to date with machine learning projects, research papers, and concepts. Please give it a try by subscribing below:
Last year, I presented a series of talks at machine learning conferences about our experiences building machine learning solutions at scale. Part of the material is based on the lessons we learned building the IntoTheBlock platform. During that process, we quickly realized that many of our assumptions about machine learning apps were deeply flawed and that there was a huge gap between the advancements in AI research and the practical viability of those ideas. In this two-part article, I would like to summarize some of those lessons, which will hopefully prove valuable to machine learning practitioners and aspiring data scientists.
There are many challenges that surface in the implementation of real-world machine learning solutions. Most of them are related to the mismatch between the lifecycle of machine learning programs and that of traditional software applications. With some exceptions, traditional software applications follow a relatively sequential path from design to production. Machine learning models, on the other hand, follow a circular lifecycle that includes aspects such as regularization or optimization that have no equivalent in the current toolset of traditional software development.
Each of the stages in the lifecycle of machine learning solutions introduces a unique set of challenges with no equivalent in the traditional software world. Some of those challenges are non-trivial or even paradoxical and can be encountered in different shapes and forms. Some of the key areas of challenge are summarized in the following figure:
The good news is that most of those challenges are solvable with the current generation of machine learning frameworks and tools. However, some of the solutions are far from obvious. Let’s look at some of the key challenges and solutions across the lifecycle of machine learning programs.
Some Hard Lessons About Scaling Machine Learning Solutions
Strategy & Processes
Planning and strategizing is a key element in the adoption of machine learning best practices, specifically in large organizations. During the strategizing phase, there are a few challenges that become very visible:
1) Challenge: Data Scientists Make Horrible Engineers
No offense to the data science community intended 😉 but most data scientists don’t tend to think about engineering capabilities such as code readability, testing or deployment. As a result, many of the models created by data scientists need to be heavily refactored in order to be operationalized.
The most successful organizations I’ve seen address the data scientists’ code quality challenge by allocating a specific team to operationalize models. That team is often referred to as data engineering, and its responsibility is to refactor, and sometimes even rewrite, data science models to make them production-ready.
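To make the refactoring idea concrete, here is a hypothetical toy example of the kind of work a data engineering team does: a notebook-style snippet turned into a typed, documented, edge-case-safe function with a unit test. The function and data are invented for illustration.

```python
# Hypothetical example of operationalizing data science code: the original
# notebook version relied on globals and had no tests; this refactor is a
# pure, typed, documented function that a data engineering team can deploy.

from typing import List, Sequence

def normalize(values: Sequence[float]) -> List[float]:
    """Scale values to the [0, 1] range.

    Returns an all-zero list when every value is identical, to avoid
    a divide-by-zero in production pipelines.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Unit tests the operationalization team can run in CI:
assert normalize([2.0, 4.0, 6.0]) == [0.0, 0.5, 1.0]
assert normalize([3.0, 3.0]) == [0.0, 0.0]
```

The point is less the specific function than the properties the refactor adds: readability, defensive handling of degenerate inputs, and testability.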
2) Challenge: Neither Agile nor Waterfall Processes Work for Machine Learning
Agile and waterfall methodologies are the two biggest schools of thought when it comes to software development. When applied to machine learning applications, waterfall models fall short because most requirements are not known upfront and estimating the time to create a specific model is next to impossible. Similarly, agile methods fail because shorter iterations are often impractical for machine learning models.
Although I don’t claim to have the answer for the right methodology to use for machine learning applications, an approach that has been relatively effective is to divide the development process into segments that can be approached using agile and waterfall methodologies respectively.
Data Collection & Preparation
Collecting and preparing datasets is one of the most often underestimated efforts in machine learning solutions. In this phase, there are several challenges that machine learning teams need to confront.
3) Challenge: Feature Extraction can Become a Reusability Nightmare
Feature extraction is one of the common aspects in the lifecycle of machine learning solutions. Conceptually, feature extraction focuses on identifying the key aspects of the data that can be used by machine learning models. While feature extraction is conceptually simple for a single model, the picture gets really complicated for organizations building dozens of machine learning models that share a common set of features.
One of the most effective techniques I’ve seen to address the feature reusability challenge is to build a centralized feature store that maintains a persistent representation of the features used by the different machine learning models. This is the approach followed by stacks such as Uber’s Michelangelo.
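To illustrate the reuse pattern, here is a deliberately minimal, in-memory sketch of a feature store, assuming a simple (entity, feature name) layout. Production systems such as Uber’s Michelangelo add versioning, freshness metadata, and online/offline serving; the class, entities, and feature names below are invented for illustration.

```python
# Minimal in-memory sketch of a centralized feature store: features are
# computed and registered once, then assembled into model-specific vectors
# so that many models reuse the same definitions.

class FeatureStore:
    def __init__(self):
        self._features = {}  # (entity_id, feature_name) -> value

    def put(self, entity_id: str, name: str, value: float) -> None:
        """Register a computed feature once, for all models to share."""
        self._features[(entity_id, name)] = value

    def get_vector(self, entity_id: str, names: list) -> list:
        """Assemble a model-specific feature vector from shared features."""
        return [self._features[(entity_id, n)] for n in names]

# Two different models reuse the same stored features:
store = FeatureStore()
store.put("user_42", "avg_txn_usd", 130.5)
store.put("user_42", "txn_count_30d", 17.0)

fraud_vector = store.get_vector("user_42", ["avg_txn_usd", "txn_count_30d"])
churn_vector = store.get_vector("user_42", ["txn_count_30d"])
assert fraud_vector == [130.5, 17.0]
assert churn_vector == [17.0]
```

The design choice that matters here is that feature definitions live in one place, so a new model composes a vector from existing features instead of re-implementing the extraction logic.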
4) Challenge: Labeled Datasets are Incredibly Hard to Produce
Supervised learning models dominate the machine learning ecosystem, and they typically require large volumes of labeled data. However, producing those datasets is incredibly difficult, resource-intensive, and typically impractical for most organizations.
Automated data labeling is an effective approach to deal with the data labeling nightmare. The principle is to create routines that can probabilistically assign labels to training datasets. Among the technology stacks in the market, project Snorkel is one that has been steadily gaining traction in this area.
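A toy sketch in the spirit of Snorkel’s approach: several noisy heuristic “labeling functions” vote on each example, and the votes are combined (here by a simple majority; Snorkel fits a probabilistic model over the votes instead). The heuristics, labels, and texts below are made up for illustration.

```python
# Programmatic labeling sketch: each labeling function is a cheap, noisy
# heuristic that either votes for a label or abstains; combining many of
# them yields approximate labels without manual annotation.

from collections import Counter

ABSTAIN, SPAM, HAM = -1, 1, 0

def lf_contains_offer(text):
    return SPAM if "free offer" in text.lower() else ABSTAIN

def lf_has_greeting(text):
    return HAM if text.lower().startswith("hi") else ABSTAIN

def lf_many_exclamations(text):
    return SPAM if text.count("!") >= 3 else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_offer, lf_has_greeting, lf_many_exclamations]

def weak_label(text):
    """Majority vote over the non-abstaining labeling functions."""
    votes = [v for v in (lf(text) for lf in LABELING_FUNCTIONS) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return Counter(votes).most_common(1)[0][0]

assert weak_label("FREE OFFER!!! Click now!!!") == SPAM
assert weak_label("Hi Bob, see you at lunch") == HAM
```

Each individual heuristic is allowed to be wrong or to abstain; the value comes from aggregating many weak signals over a large unlabeled corpus.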
Experimentation
Experimentation is the cornerstone of any machine learning development lifecycle. The ability to try and test different models and architectures often represents the difference between success and failure in the machine learning world. However, experimentation also introduces its own set of challenges into the machine learning lifecycle.
5) Challenge: The Single Framework Fallacy
Large enterprises cherish the idea of technology consolidation and like to concentrate their efforts on a small number of machine learning tools and frameworks. However, frameworks that are good for experimentation often fall short for production workloads, and vice versa. As a result, it is very common for organizations to leverage different machine learning stacks for the experimentation and operationalization stages respectively, which introduces certain levels of technical debt and fragmentation.
When it comes to machine learning, optimizing for productivity is a better strategy than optimizing for consistency. As a result, accepting a world in which companies use different machine learning frameworks should be the standard. An approach that we’ve seen be effective in this area is to use an intermediate representation to port models across the different frameworks. ONNX is one of the most robust frameworks for facilitating that.
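The core idea behind an intermediate representation can be shown with a deliberately simplified toy: a model trained in one “framework” is exported to a neutral, serializable format that a different runtime can load and serve. The JSON schema and the two toy runtimes below are invented for illustration; real pipelines would use the ONNX format itself and framework-specific exporters.

```python
# Toy illustration of the intermediate-representation idea behind ONNX:
# framework A exports a model into a neutral serialized format, and a
# separate runtime loads it without knowing anything about framework A.

import json

# "Framework A" trains a tiny linear model and exports it to a neutral format.
model_a = {"weights": [0.5, -1.0], "bias": 2.0}
portable = json.dumps({"op": "linear", "params": model_a})

# A different runtime loads the neutral format and serves predictions.
spec = json.loads(portable)
assert spec["op"] == "linear"

def predict(features):
    p = spec["params"]
    return sum(w * x for w, x in zip(p["weights"], features)) + p["bias"]

assert predict([2.0, 1.0]) == 2.0  # 0.5*2.0 - 1.0*1.0 + 2.0
```

ONNX plays the role of the JSON string here, but for full computation graphs: as long as both sides agree on the interchange format, the training and serving stacks can evolve independently.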
In the second part, we will continue with more challenges and solutions for machine learning solutions in the real world.