MLOps Stacks: How SWE Meets MLE

Aruna Srivastava
Published in tmobile-dsna
Sep 4, 2024 · 6 min read

As the excitement around machine learning and generative AI grows, companies are laser-focused on building smarter algorithms and squeezing every drop of performance from their models. Yet, in the rush to innovate, the tough reality of managing and deploying these models often gets lost in the shuffle. Even seemingly straightforward regression-based models can collapse in production due to poorly managed data, contributing to reports that many ML models become ineffective, or even unusable, if they make it to production at all:

“Around 90 percent of machine learning models never make it to production” -VentureBeat

To build ML systems at scale, there needs to be a systematic way to handle failures, model updates, and continuously changing data. Recent focus on building better infrastructure has allowed products like Databricks MLOps Stacks to come to fruition.

This summer I had the privilege of integrating the MLOps Stacks workflow to get an ML model into production. It was interesting to learn how software development practices carry over to ML engineering, revealing what translates well and what doesn't. This article will go over some interesting ways good software engineering practices have been used in MLOps Stacks and what stones have been left unturned.

MLOps Stacks Diagram

Continuous Integration and Deployment (CI/CD)

In traditional software engineering, continuous integration and deployment ensure that code changes are frequently merged, tested, and deployed to improve quality and speed. In machine learning, achieving this level of automation is particularly challenging due to the non-deterministic nature of models, extensive training times, complex dependencies, and waves of new data that can quickly render models useless. But these challenges make CI/CD more important, not less: without such practices, large models inevitably fail in production.

Using Databricks MLOps Stacks and Azure DevOps, we achieved automation across various stages:

Model lifecycle managed through Databricks
  • Model Versioning: Job runs are reproducible, with the best-performing model tracked under a ‘Champion’ alias. Comparing against historical models makes conditional retraining feasible, so the best-performing model is preserved rather than simply the most recent one. Developers can define what “best performing” means with customized performance metrics (a minimal sketch of this promotion logic follows after this list).
Model Versioning allows us to track best performing models
  • Workflow: Pipeline configurations in Azure DevOps streamline transitions from development to test and production environments. Rather than making individual adjustments across workspaces, changes — such as tweaks to hyperparameters or features — can be managed efficiently by triggering the pipeline. This approach ensures consistent updates and minimizes manual intervention.
Workflow managed through Azure DevOps
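The conditional promotion mentioned under Model Versioning is straightforward to express with the MLflow client. The sketch below is only a minimal illustration under assumptions (the candidate version, metric name, and threshold logic are hypothetical), not the exact code MLOps Stacks generates:

```python
import mlflow
from mlflow.tracking import MlflowClient

# Assumes the Unity Catalog model registry, as used by MLOps Stacks.
mlflow.set_registry_uri("databricks-uc")

MODEL_NAME = "dev.house_prediction.mlops_model"
client = MlflowClient()

# Hypothetical: the version just registered by the training job, whose run
# logged a "val_rmse" metric.
candidate = client.get_model_version(MODEL_NAME, version="7")
candidate_rmse = client.get_run(candidate.run_id).data.metrics["val_rmse"]

# Look up the current champion, if one exists.
try:
    champion = client.get_model_version_by_alias(MODEL_NAME, "Champion")
    champion_rmse = client.get_run(champion.run_id).data.metrics["val_rmse"]
except Exception:
    champion_rmse = float("inf")

# Promote only if the candidate beats the champion on the chosen metric,
# so the best-performing model is kept rather than the most recent one.
if candidate_rmse < champion_rmse:
    client.set_registered_model_alias(MODEL_NAME, "Champion", candidate.version)
```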

*An important correction to validation in MLOps Stacks: validation incorrectly pulls from the raw data rather than from the data containing the generated features the model was trained on. This means we encounter a schema mismatch error during the validation task, and the model is not evaluated correctly. To fix this, I created a validation set by making a train/val split on the same feature data. Once the validation dataset is correctly built, it can be referenced in the validation notebook (see the sketch below). Make sure to do the same for your use case!
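As a rough illustration of that fix, the snippet below splits the feature table (rather than the raw data) into train and validation sets. It assumes a Databricks notebook where spark is predefined; the table name and label column are placeholders for your own use case:

```python
from sklearn.model_selection import train_test_split

# Placeholder table name: read the table that already contains the generated
# features the model was trained on, not the raw source data.
# Assumes a Databricks notebook where spark is predefined.
feature_df = spark.table("dev.house_prediction.feature_table").toPandas()

X = feature_df.drop(columns=["label"])  # "label" is a placeholder column name
y = feature_df["label"]

# Hold out a validation set with the same schema as the training features,
# so the validation task no longer hits a schema mismatch.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```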

While there are still improvements to be made in these pipelines regarding model explainability, Databricks MLOps Stacks provides new assurance that high-quality models are deployed while maintaining the system’s robustness.

Version Control

Source: mlops-guide.github.io

When models unexpectedly fail, version control is critical. Databricks MLOps Stacks builds upon traditional software version control practices with Git while incorporating ML-specific tools like MLflow and Unity Catalog. These tools not only track code versioning but also datasets, model configurations, and trained models. This ensures that every iteration of a model, along with its associated data and parameters, can be tracked, compared, and reproduced.

Additional version control within the Azure DevOps deployment pipelines also allows for regularly scheduled deployments from a code repository artifact that remains independent of the workspaces.
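For context on what gets recorded, the sketch below shows a generic MLflow run that logs the model configuration, a tag for the dataset snapshot, a metric, and the trained model itself. It is a toy illustration with hypothetical parameter names and paths, not the MLOps Stacks training notebook:

```python
import mlflow
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Toy data standing in for the real feature table.
X, y = make_regression(n_samples=500, n_features=10, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

with mlflow.start_run():
    # The model configuration is recorded with the run; the Git commit can
    # also be captured as a run tag when the code runs from a versioned repo.
    params = {"n_estimators": 200, "max_depth": 8}
    mlflow.log_params(params)

    # Record which dataset snapshot was used (hypothetical path).
    mlflow.set_tag("training_data", "dbfs:/mnt/data/house_prices/2024-08-01")

    model = RandomForestRegressor(**params).fit(X_train, y_train)
    rmse = mean_squared_error(y_val, model.predict(X_val)) ** 0.5
    mlflow.log_metric("val_rmse", rmse)

    # The trained model itself is versioned as an artifact of the run.
    mlflow.sklearn.log_model(model, artifact_path="model")
```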

Infrastructure as Code (IaC)

Infrastructure as code (IaC) is the practice of provisioning and supporting computing infrastructure using code instead of manual processes and settings. It abstracts environment configuration for software engineers and translates especially well to machine learning engineering. By making environments easy to duplicate and reducing configuration errors, IaC is essential for consistently managing cloud resources and ML workflows programmatically.

Without effective infrastructure, you might manually set up a training environment with specific GPU instances, memory allocation, and data storage configurations. However, if you forget to replicate these exact settings when deploying the model to a production environment, the model might underperform due to insufficient resources or incompatible configurations. This can lead to issues like longer inference times, increased costs due to resource inefficiency, or even outright failures if dependencies aren’t properly aligned. IaC ensures that every environment — whether for training, testing, or production — has consistent and correct configurations, preventing these kinds of discrepancies and ensuring reliable model performance.

IaC has been integrated into MLOps Stacks through Databricks Asset Bundles.
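Asset Bundles declare this configuration in YAML; as a toy Python illustration of the same idea, the sketch below defines the environment configuration once in code and derives dev and prod variants from the same template, so nothing depends on remembering manual settings. All values here are hypothetical:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class TrainingEnv:
    # Every setting lives in code, so any environment can be recreated exactly.
    instance_type: str
    num_workers: int
    storage_path: str

# Single source of truth for the baseline configuration (values are hypothetical).
base = TrainingEnv(
    instance_type="Standard_NC6s_v3",
    num_workers=2,
    storage_path="dbfs:/mnt/house_prices",
)

# Dev and prod differ only where they are meant to differ; nothing relies on
# someone remembering to click the same settings in two workspaces.
dev_env = base
prod_env = replace(base, num_workers=8, storage_path="dbfs:/mnt/prod/house_prices")
```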

What’s Left:

Databricks MLOps Stacks is making strides in how ML models are productionized, but let’s look at what has yet to be covered:

End-to-End Integration Testing for Entire Pipelines:

  • While MLOps Stacks does support integration testing, conducting comprehensive end-to-end integration testing for entire ML pipelines remains complex and less standardized compared to traditional software applications. The dynamic nature of ML models, including variations in data, model updates, and external dependencies, makes full end-to-end testing more difficult to implement and maintain consistently.
  • We often ran into instances where pipelines reported jobs as successful, yet the jobs themselves contained hidden errors that went undetected. Tests that improve model visibility remain preliminary; a smoke-test sketch follows below.
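One possible mitigation is a lightweight end-to-end smoke test that loads the currently deployed model and scores a small, schema-correct sample after each run. The sketch below is a hypothetical pytest-style example (model name and columns are placeholders), not something MLOps Stacks ships with:

```python
import mlflow
import pandas as pd

def test_champion_model_scores_sample():
    # Assumes the Unity Catalog model registry is used, as in MLOps Stacks.
    mlflow.set_registry_uri("databricks-uc")

    # Load whatever version currently holds the Champion alias.
    model = mlflow.pyfunc.load_model(
        "models:/dev.house_prediction.mlops_model@Champion"
    )

    # A tiny, schema-correct sample (columns here are hypothetical).
    sample = pd.DataFrame({"sqft": [1500.0], "bedrooms": [3], "zip_code": ["98052"]})
    preds = model.predict(sample)

    # Fail the pipeline if the job "succeeded" but produced no usable output.
    assert len(preds) == len(sample)
    assert pd.notna(preds).all()
```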

Environment-Agnostic Variables and Functions:

  • Several instances of poorly generalized code throughout MLOps Stacks cause friction between the code infrastructure and how the pipelines move code between workspaces. In short, the code base is not oriented for development pipelines.
  • For example, model references in Unity Catalog follow a three-level naming scheme of catalog, schema, and model name, where the catalog typically encodes the target environment:
model_name: dev.house_prediction.mlops_model
  • These kinds of variables, along with other workspace-specific dependencies, make it difficult to transition seamlessly between workspaces via the pipeline, because the references point at a specific environment.
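A common workaround, though not something MLOps Stacks provides out of the box here, is to build the three-level name from an environment variable or bundle parameter so the same code runs unchanged in every workspace; the variable name below is hypothetical:

```python
import os

# The deployment pipeline sets this per target workspace: "dev", "test", or "prod".
# TARGET_CATALOG is a hypothetical variable name.
catalog = os.environ.get("TARGET_CATALOG", "dev")

# The schema and model name stay fixed; only the catalog changes per environment.
model_name = f"{catalog}.house_prediction.mlops_model"
print(model_name)
```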

Model Agnostic Interpretation Methods:

  • Ideally, once infrastructure for machine learning models is set, good infrastructure should generalize to more complex models.
  • The tricky part is creating model-agnostic interpretation methods so that models are evaluated in a generalizable way.
  • For example, tools like SHAP (SHapley Additive exPlanations) can be integrated into the MLOps pipeline to provide consistent interpretability across various models, from simple decision trees to deep neural networks. SHAP works by assigning importance values to each feature, explaining how they influence a model’s predictions regardless of the model type (see the sketch below). A standardized approach could be the cornerstone that lets companies push out models at incredible speed.
Example of SHAP; Source: data4thought
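As a rough sketch of how such a step might look, SHAP’s model-agnostic Explainer can be applied to whatever model the pipeline just trained; the model and data below are toy stand-ins, not the production pipeline:

```python
import shap
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Toy model standing in for whatever the pipeline just trained.
X, y = make_regression(n_samples=200, n_features=5, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X, y)

# Passing the predict function keeps the explainer model-agnostic: SHAP picks
# an appropriate algorithm without knowing the model type.
explainer = shap.Explainer(model.predict, X)
shap_values = explainer(X[:50])

# Per-feature importance values that could be logged alongside model metrics.
print(shap_values.values.shape)  # (50, 5)
```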

Conclusion

Overall, MLOps Stacks is moving in the right direction for how machine learning should be utilized at scale. As machine learning models evolve, the infrastructure must also adapt to handle increasingly complex workflows, from data preprocessing to conditional retraining and deployment. This balance is key to making machine learning a reliable, scalable, and understandable part of production systems. To help more engineers adopt MLOps Stacks for their use cases, I wrote extensive internal documentation to smooth the learning curve for new users.

As the excitement around machine learning continues to grow, it’s crucial to remember that solid software engineering practices are the backbone of successful machine learning systems. It’s these foundational principles that ensure advanced models are reliable, maintainable, and capable of delivering consistent value in real-world applications.
