ML Ops: The Toolchain and the Value Chain

Justine Humenansky, CFA
Published in the table_tech · Nov 29, 2020

Data science teams are doubling every year, and each year 2x-5x more models are delivered into products and businesses,¹ yet Gartner predicts that 85% of AI projects will fail to generate value for enterprises.² This is partly due to a machine learning (ML) pipeline that is surprisingly manual, a disconnect between the data scientists building models and the ML engineers deploying them, and a lack of basic tooling for this developing workflow. The emerging ML Ops (machine learning operations) ecosystem aims to facilitate better collaboration between data scientists and ML engineers and to create a more automated pipeline with greater transparency. The goal is a faster, more iterative workflow that results in better ML models and higher quality outcomes in production.

Source: Tecton.ai

The ML Ops ecosystem is emerging at the same time as the production ML market. The pain points addressed by the toolchain are clear, and it’s an exciting time because a canonical stack for ML Ops (the equivalent of MEAN in web development) has not yet been established. The corresponding value chain is still forming, and the market opportunity is still materializing. This article provides a high-level overview of both.

“We’re in the golden age of AI, but the dark ages of infrastructure.” — Determined AI

The Toolchain

Source: Valohai

1. Data Preparation (extraction, validation, labeling): Data preparation involves collecting, cleaning, and manipulating data into a form that can be effectively used by ML models.

  • Pain points: Data scientists spend up to 80% of their time on data prep.³ Current ML models are brittle, which is often addressed by adding ever more data. Some of this data is useful, some is useless, and some is actively harmful to model performance. Collecting, storing, labeling, and training on all this data is expensive and inefficient, and keeping a human in the loop during data preparation adds to these inefficiencies. Training and retraining models with billions of parameters can consume up to 25% of revenue in cloud resources.⁴ These dynamics result in diminishing marginal returns on data and lower margins relative to traditional software businesses.
  • Potential: Some data preparation tools overcome these dynamics, improving the economics of ML by reducing the cost of preparing data or by reducing the amount of data that needs to be prepared in the first place. Tools that sit earlier in the data preparation process can have a larger downstream impact. A minimal example of the kind of validation check these tools automate is sketched below.
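
As a rough illustration of the checks that data prep tooling automates, here is a minimal sketch in pandas; the columns, values, and thresholds are hypothetical and would depend entirely on the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data; in practice this would come from a warehouse or files.
df = pd.DataFrame({
    "user_id": [1, 2, 2, 3, 4],
    "purchase_price": [10.0, 30.0, 30.0, -5.0, np.nan],
    "label": [0, 1, 1, None, 0],
})

# Basic cleaning: drop exact duplicates and rows missing the label.
df = df.drop_duplicates().dropna(subset=["label"])

# Simple validation: flag columns with a high share of missing values
# before they silently degrade model performance downstream.
null_rates = df.isna().mean()
suspect = null_rates[null_rates > 0.2].index.tolist()
if suspect:
    print(f"Columns with >20% missing values: {suspect}")

# Catch implausible values (the harmful subset of the data) early.
bad_rows = df[df["purchase_price"] < 0]
print(f"Rows with negative prices: {len(bad_rows)}")
```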

2. Featurization: A feature is a measurable property of an object or phenomenon being observed and is used as an input to an ML model. Features can be thought of as attributes or variables in a data set (e.g., age, purchase price). Features are used throughout the ML workflow and therefore require their own iterative pipeline.

  • Pain points: Finding good features is one of the hardest parts of ML, complicated by the difficulty of building and managing effective data pipelines, which can also be one of the costliest pieces of ML infrastructure. A feature store addresses this by serving as a data pipeline that transforms raw data into feature values, stores and manages those values, and consistently serves feature data in training and inference workloads.
  • Potential: Tracking features can explain changes both downstream (outputs) and upstream (data ingestion and prep), positioning feature stores as a central point of visibility into the entire ML pipeline. Feature stores also enable economies of scale within ML organizations, since thousands of features can be reused across models and throughout the organization. This can allow ML teams to build and deploy new features in hours instead of months.⁶ A toy sketch of the concept follows below.
Source: Feast
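
To make the feature store concept concrete, here is a toy, in-memory sketch of the core idea: transform raw data into feature values, store them keyed by entity, and serve the latest values at inference time. This is not the API of Feast, Tecton, or any other product; all names are illustrative.

```python
from datetime import datetime, timezone

import pandas as pd


class InMemoryFeatureStore:
    """Toy feature store: keeps the latest feature values keyed by entity ID."""

    def __init__(self):
        # (entity_id, feature_name) -> (value, timestamp of ingestion)
        self._features = {}

    def ingest(self, entity_id, feature_name, value):
        self._features[(entity_id, feature_name)] = (value, datetime.now(timezone.utc))

    def get_online(self, entity_id, feature_names):
        # Serving path: return the latest stored value for each requested feature.
        return {name: self._features[(entity_id, name)][0] for name in feature_names}


# Transformation step: derive a feature from raw data and register it.
raw = pd.DataFrame({"user_id": [1, 1, 2], "purchase_price": [10.0, 30.0, 5.0]})
avg_spend = raw.groupby("user_id")["purchase_price"].mean()

store = InMemoryFeatureStore()
for user_id, value in avg_spend.items():
    store.ingest(user_id, "avg_purchase_price", value)

# The same feature values can now back both training sets and online inference.
print(store.get_online(1, ["avg_purchase_price"]))  # {'avg_purchase_price': 20.0}
```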

3. Model Development (tuning, training, testing, and tracking): Model development involves adjusting parameters, tracking experiments, and testing results. The goal of this step is to design the highest performing model.

  • Pain points: Model exploration requires a lot of iteration as hyperparameters are tuned to optimize model performance. The combinatorial complexity of hundreds of configurations makes this a complicated task, and each set of hyperparameters requires the model to be retrained. Given the cost of retraining, effective tuning and optimization tools are paramount to controlling costs at this stage.
Winning Kaggle teams often build hundreds of models before arriving at a winning model. Source: Verta.ai
  • Potential: Tooling focused on hyperparameter tuning lets researchers search through many hyperparameter combinations with a simple API call (sketched below) instead of spending endless hours manually testing configurations. Throughout the workflow, effective versioning of data, notebooks, and models is critical. Many startups are vying to become “the git for ML” given git’s inability to handle ML models (it struggles with large binary files, for example). This is a category where a single winner could emerge, and one in which the preference may be for open source tools built by the community.
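
As a minimal sketch of tuning via a single API call, the example below runs a random search over a hypothetical search space using scikit-learn; real searches often span far larger spaces and hand the retraining loop to managed tuning services.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data stands in for a real training set.
X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)

# A hypothetical search space; combinatorially this is already 48 configurations.
param_distributions = {
    "n_estimators": [50, 100, 200, 400],
    "max_depth": [None, 4, 8, 16],
    "min_samples_leaf": [1, 2, 4],
}

# One API call explores the space with cross-validation instead of
# manually retraining and comparing every configuration by hand.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=20,
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```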

4. Deployment (handoff, infrastructure orchestration, and model serving): The workflow thus far has been led by data scientists and researchers, who now have to hand their models off to the ML engineers who will deploy them into production. ML engineers are responsible for running A/B and canary tests, serving the model, sunsetting old models, and orchestrating the underlying infrastructure.

  • Pain points: One of the core challenges in this workflow is the hand-off from a research environment to a software engineering one. Data scientists and researchers prefer to use notebooks because they allow code and visualizations to be combined. In many cases, transitioning a model from research to production can actually involve translating a Python-based Jupyter notebook into production code.⁸ Once handed off, orchestrating distributed compute infrastructure adds an extra layer of complexity to the deployment process, which can be especially challenging for teams without Kubernetes expertise.
  • Potential: Tools that enable the productionization of training code can help facilitate collaboration, while tools that let data scientists own deployment and maintenance can increase the speed of iteration cycles. Deployment tools should integrate seamlessly with deep learning frameworks, automatically provision machines, handle efficient distributed data loading, and ensure fault tolerance (a minimal serving sketch follows below). Cloud service providers have a strong presence in this category.
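
A minimal model-serving sketch using FastAPI is shown below; the model artifact, feature schema, and endpoint are hypothetical, and production deployments layer batching, logging, canary routing, and infrastructure orchestration on top of something like this.

```python
import pickle
from typing import List

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Hypothetical artifact handed off from the training workflow; assumed to be
# a scikit-learn-style estimator serialized during model development.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)


class PredictionRequest(BaseModel):
    features: List[float]


@app.post("/predict")
def predict(request: PredictionRequest):
    # Single-row inference; real serving layers add validation and monitoring hooks.
    prediction = model.predict([request.features])[0]
    return {"prediction": float(prediction)}

# Run with: uvicorn serve:app  (assuming this file is saved as serve.py)
```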

5. Monitoring: ML models are not static. They constantly take in new data and update their predictions. Monitoring models for early signs of trouble and assessing differences in model performance is critical to preventing model decay and failure.

  • Pain points: There are two main ways data can cause problems in production: something goes wrong with the data itself (e.g., corrupted data), or the data changes as a result of a change in the environment (drift).⁹ When either occurs, the assumptions the model relied on during training are no longer valid at inference time. For example, the large shifts in data patterns resulting from the COVID-19 pandemic led to extreme drops in model accuracy, directly impacting business operations. Since ML models fail silently, it might be weeks before a failure is discovered. Without monitoring, it is impossible to detect data and model issues before business KPIs are negatively impacted.
Source: Evidently AI
  • Potential: Monitoring technologies provide real-time visibility into model failures; observability tools go one step further by identifying the root cause. This requires end-to-end visibility into the full data pipeline, including upstream and downstream dependencies. These tools leverage advanced statistical analysis (a simple drift check is sketched below), the results of which need to be read alongside model performance metrics such as accuracy, stability, reliability, fairness, conceptual soundness, segment analysis, and generalizability.¹⁰ The diversity of methods, applications, and data contexts reduces the likelihood that one dominant monitoring solution prevails and, if the APM market is any indication, multiple solutions may take hold in this category.
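
As a simple illustration of a drift check, the sketch below compares a feature’s training-time distribution against recent production values with a two-sample Kolmogorov–Smirnov test; the data is synthetic and the significance threshold is an arbitrary assumption.

```python
import numpy as np
from scipy.stats import ks_2samp

# Training-time distribution of a feature vs. what the model sees in production.
rng = np.random.default_rng(0)
training_values = rng.normal(loc=0.0, scale=1.0, size=5_000)
production_values = rng.normal(loc=0.4, scale=1.0, size=5_000)  # shifted environment

statistic, p_value = ks_2samp(training_values, production_values)
if p_value < 0.01:  # assumed alerting threshold
    print(f"Possible drift detected (KS statistic={statistic:.3f}, p={p_value:.4f})")
```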

6. Explainability: The underlying operations of ML models are opaque. This can lead to biased recommendations and unexplainable outcomes, which deters broader production ML adoption.

  • Pain points: To highlight one of many examples, the lack of explainability was on display when Goldman Sachs admitted that it couldn’t explain why the algorithm underpinning the Apple Card was resulting in gender discrimination. A coming wave of regulation could demand greater transparency and explainability in ML systems.
  • Potential: Explainability tools essentially use advanced statistical methods (e.g., Shapley values and integrated gradients) to break down the marginal contribution of any given feature to a prediction (see the sketch below). This should lead to more trust in ML models, improve the debugging process, and provide valuable feedback for model development. While the market for explainability tools is currently somewhat constrained to heavily regulated industries (such as financial services), if the development of the rest of the toolchain leads to increased adoption of production ML across all sectors, demand for these tools should increase.
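
As a minimal sketch of feature attribution with Shapley values, the example below uses the open-source shap library on an illustrative tree model; the dataset is synthetic, and real analyses pair these attributions with the performance metrics discussed above.

```python
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data and model stand in for a production model under review.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# TreeExplainer estimates the marginal contribution of each feature
# to each individual prediction.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:5])
print(shap_values)
```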

The Value Chain

1. The Value Proposition: By the time a model is in production, the company has already invested several million dollars in it, and executives find themselves responsible for assessing the precise impact the model is having on business outcomes. In theory, automating ML Ops lets skilled teams spend more time on research and create better models, but this must ultimately be tied to better business outcomes. Given the non-deterministic nature of ML systems, KPIs can be harder to define upfront, and it can take 6–12 months to prove out a tangible ROI on ML Ops solutions, according to Valohai. Traditional APM solutions have struggled to incorporate KPIs and tie results to business outcomes, but since ML use cases start with a specific business question, they should be easier to connect to a specific business outcome.¹²

2. The Build vs. Buy Calculation: Building ML Ops tools internally is not usually a successful strategy. There are many examples of internal efforts that ended up costing millions of dollars and wasting years of time that could instead have been funneled toward building a proprietary data and model development advantage. Building in-house tools may make sense for very specialized use cases¹¹ or for enterprises that want to develop AI/ML as a core capability, but even sophisticated tech organizations often find that building tools internally is not the highest-ROI activity for their engineers. Companies that are just testing ML, or using it for lower-scale and/or internal use cases, will probably adopt full-stack, off-the-shelf solutions.

3. The Product: The handling of data is the key axis of distinction when drawing any comparisons between ML Ops and traditional DevOps. Many tools focus on workflows around the model itself but lack support for the data.⁶ The handling of data should be a key consideration in the design of any solution in this space. To that end, ML Ops tools need to be easily integratable with existing data sources, infrastructure, and open-source frameworks. These integrations need to be lightweight and minimally intrusive while being able to scale to handle terabytes of data. Tools should require minimal access to underlying data.

Source: Tecton

An open source strategy will be critical to any ML Ops product. Strategic selection of which parts of a solution to open source and/or when and how to partner with fully open source tools will be a key determinant of success.

4. The Market: While the market beyond highly sophisticated tech companies (AVs, robotics, etc.) is still in the early stages of development, it is important to note that the size of the market opportunity varies by solution. The more generalizable the tool, the broader the customer base. The opportunity will be largest for tools that can be used across a variety of infrastructures, model types (e.g., classical ML and deep learning), data formats (e.g., text, images, and audio), and workloads (both training and inference). The more visibility a tool has into the entire workflow, the better its odds of expanding into adjacent areas of the toolchain. A comprehensive view of data throughout the workflow was a key determinant of success in the APM market. End-to-end (E2E) pipelines attempt to provide this but may struggle, as data science teams tend to build singular systems with unique requirements. For a best-in-class point solution to become a more fully integrated platform, it will need to be positioned to evaluate and/or automate production and performance in the context of the overall system.

5. The Competition: Cloud service providers (CSPs) may have a distribution advantage, but as more companies grow comfortable with ML, that advantage could wane. While platform solutions ease initial adoption, as organizations scale and their requirements become more complex, it can be hard to adapt these less flexible solutions to new technologies. For example, even though transformers first appeared in 2017 and are now the industry standard for NLP, they are still not supported by most popular platforms.⁵ Additionally, some platform solutions are priced at a significant premium to alternatives. While true multi-cloud adoption is nascent, it seems to have prevented CSP domination of the APM market and may do the same in ML Ops.

We are coming out of the “dark ages of ML infrastructure,” but we are still at the dawn of ML Ops. The landscape is currently crowded, and best practices are evolving every 6–9 months. Ultimately, a canonical stack must emerge, but the high degree of sectarianism in ML Ops suggests there may be room for multiple winners in some categories.¹³ Tools built by the community or by former production ML practitioners will define categories; whether or not these tools enable broader adoption of enterprise ML will define the market. Tools alone will be insufficient to solve some of the biggest pain points in ML Ops. A reorganization of the ML product teams equipped with these tools may also be required to better distribute influence among data scientists and ML engineers throughout the entire workflow. Organizational change may ultimately dictate market timing.

My own views on the space are also at the early stages of formation. If you’re working on something related, please get in touch, I’d love to learn more.

**Views are my own and do not represent the views of the table or Playground Global. Thank you to Shohini Gupta, Linus Lee, and Chelsea Goddard for feedback**

References

  1. https://arize.com/our-story/
  2. https://www.gartner.com/en/newsroom/press-releases/2018-02-13-gartner-says-nearly-half-of-cios-are-planning-to-deploy-artificial-intelligence
  3. https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/?sh=3685b8156f63
  4. https://a16z.com/2020/02/16/the-new-business-of-ai-and-how-its-different-from-traditional-software/
  5. Weights and Biases Guide to Building a tech stack for ML expertise
  6. https://www.tecton.ai/blog/what-is-a-feature-store/
  7. https://www.comet.ml/site/why-software-engineering-processes-and-tools-dont-work-for-machine-learning/
  8. https://towardsdatascience.com/ml-infrastructure-tools-for-production-1b1871eecafb
  9. https://evidentlyai.com/blog/machine-learning-monitoring-what-can-go-wrong-with-your-data
  10. https://truera.com/technology/
  11. https://blog.verta.ai/blog/ml-infrastructure-build-vs-buy-vs-open-source
  12. https://www.seldon.io/mlops-isnt-that-just-devops-ryan-dawson-speaks-at-mlops-coffee-session/
  13. https://www.youtube.com/watch?v=i6HZ2vjFLIs
