Beyond Model Code: Essential Principles for Productionizing ML Models

Transforming ML models into robust and efficient production systems

Kemalcan Jimmerson
Slalom Build
Jul 10, 2024


Because of customer expectations and competition, there is growing pressure on the leadership of large companies to adopt AI technologies. Most leaders recognize the potential their AI initiatives can unlock, so they demonstrate their commitment to AI innovation by investing in technology and people.

However, one report suggests that there is as much as a US$1.5 trillion market cap risk for Fortune 500 companies if they fail to combine technology with the proper strategy and strong change capability. This tells us that technology alone may not be sufficient to realize the potential. Besides defining an AI strategy, organizations should get every team on board. That means engaging different business units and stakeholders and incorporating their input. It's often challenging to integrate developed models into existing business workstreams, which is why we commonly observe a gap between companies' ML/AI capabilities and the speed at which those capabilities deliver value.

When an ML model gets created, there is still so much work to do before the model creates actual value. This is a well-known problem, yet the execution of the remaining work may be confusing. In this post, I discuss the common problems of ML systems we see in the market and the available out-of-the-box solutions and frameworks. Recognizing that each use case (and organization) has unique needs, I will end with some examples of custom solutions.

In many organizations, data science teams drive the machine learning efforts. Sometimes these data scientists prepare the data with ETL processes; other times they collaborate with data engineers. Sometimes they are responsible for serving and maintaining the model; other times they collaborate with ML engineers.

One common problem we see across teams is the lack of a common ML framework. Within a single organization, different project teams may follow different workflows for the end-to-end process, from raw data ingress to model serving. This results in duplicated effort and collaboration blockages. Organizations that lack a standardized ML framework also tend to adopt overly complex systems, which are usually more costly.

Another common problem we see is that some companies lack automation entirely. Data scientists run their experiments, but they do not manage and track the results automatically, and sometimes not at all. Often they have some form of version control, yet models are not released through it and the end result is delivered manually. This lack of automation not only impedes consistency and speed but also poses significant obstacles to CI/CD integration, resulting in expensive ML operations.

AI/ML operations strategy is an essential factor in a successful digital transformation. For example, implementing a new iteration of an ML model without disrupting current operations requires careful planning and seamless integration into existing workflows. Another crucial aspect is making the right retraining decisions to ensure that models stay relevant. A well-defined AI/ML strategy streamlines these and similar processes and accelerates the path toward achieving business goals through AI. It also helps organizations allocate resources effectively. Because of these advantages, many cloud providers in the ML space publish principles and guidelines for ML operations.

Separation of Environments

A key aspect of MLOps is the separation of environments, such as dev/test/prod. The model is usually developed in an offline dev environment, where it is not connected to a live data source. Real data may be used, but realistic synthetic data is often preferred. Using synthetic data in dev environments prevents any unintentional impact of model development on production systems and limits the risk of exposing sensitive or personally identifiable data. In dev environments, data scientists experiment with different algorithms, feature engineering, and hyperparameter tuning.
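
As a small illustration, the snippet below sketches one way to generate real-like synthetic records for a dev environment. The customer-churn schema, column names, and distributions are hypothetical stand-ins for whatever the production data actually looks like.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)
n_rows = 10_000

# Hypothetical schema mirroring a production customer table,
# populated with synthetic values instead of real customer records.
dev_df = pd.DataFrame({
    "customer_id": np.arange(n_rows),
    "tenure_months": rng.integers(1, 72, size=n_rows),
    "monthly_charges": rng.normal(loc=65.0, scale=20.0, size=n_rows).round(2),
    "contract_type": rng.choice(["month-to-month", "one-year", "two-year"], size=n_rows),
    "churned": rng.integers(0, 2, size=n_rows),
})

# Written to local storage for use only in the offline dev environment.
dev_df.to_csv("dev_training_data.csv", index=False)
```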

In the test environment, teams conduct rigorous testing, including unit and performance testing, to identify issues and bugs. A test environment is still an offline environment, but it mimics the production settings. In the production environment, the model is served to make predictions on real data. In some organizations the model is trained again in the production environment; in others, a model that has already been trained and tested is promoted. Separation of environments facilitates the implementation of CI/CD pipelines and automatic deployment.
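
To make this concrete, here is a minimal sketch of the kind of unit test that might run in the test environment (or in a CI pipeline). The preprocessing function and column names are hypothetical; the point is that each pipeline step can be verified in isolation before the model ever touches production data.

```python
import pandas as pd


def add_tenure_bucket(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical preprocessing step: bucket customer tenure into bands."""
    out = df.copy()
    out["tenure_bucket"] = pd.cut(
        out["tenure_months"], bins=[0, 12, 36, 72], labels=["new", "mid", "long"]
    )
    return out


def test_add_tenure_bucket_adds_expected_labels():
    df = pd.DataFrame({"tenure_months": [3, 24, 60]})
    result = add_tenure_bucket(df)
    assert list(result["tenure_bucket"]) == ["new", "mid", "long"]


def test_add_tenure_bucket_does_not_mutate_input():
    df = pd.DataFrame({"tenure_months": [3]})
    add_tenure_bucket(df)
    assert "tenure_bucket" not in df.columns
```

Run with a test runner such as pytest, checks like these become automated gates in the pipeline rather than manual spot checks.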

GitOps and Continuous X

Before articulating the importance of CI/CD pipelines in ML operations, I'd like to underline the value of collaboration via version control. Many data science projects require different team members (sometimes cross-functional teams) to work together on model development, deployment, and monitoring. To enable this collaboration, ML teams use version control tools such as Git. Version control is essential for keeping a single source of truth for the project. While version control is widely used in ML projects, integrating it into CI/CD pipelines is far less common.

CI/CD is significant throughout the software development lifecycle, but it carries even greater importance in MLOps. It ensures the automation and streamlining of model development, testing, deployment, retraining, and monitoring. It automates testing and validating the model code as well as the data. As previously discussed regarding version control, CI/CD and version control are closely intertwined. When a code change is merged into the main branch, automated tests run; on success, the pipeline deploys the model, the prediction service, and the monitoring tools. Together with version control, CI/CD helps teams track modifications to the main codebase in a systematic way.

Modularization is a must for healthy CI/CD practice. Each component—such as data processing, model training, and evaluation—can be independently tested and deployed. During the testing phase, modular code simplifies debugging efforts because it isolates specific components. Because CI/CD has now expanded beyond continuous integration and continuous deployment, different cloud providers give it different names. For example, AWS calls it Continuous X: continuous integration, delivery, training, and monitoring. Overall, CI/CD in MLOps is very important to ensure agility, reliability, and consistency.
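
To illustrate what modularization can look like in practice, here is a minimal sketch of a pipeline split into independently testable steps. The function names, file path, and the simple scikit-learn model are illustrative assumptions rather than a prescribed structure.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def load_data(path: str) -> pd.DataFrame:
    """Data ingestion step, isolated so it can be tested against a small fixture file."""
    return pd.read_csv(path)


def preprocess(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.Series]:
    """Data processing step: split features from the label and keep numeric columns."""
    labels = df["churned"]
    features = df.drop(columns=["churned"]).select_dtypes("number")
    return features, labels


def train(features: pd.DataFrame, labels: pd.Series) -> LogisticRegression:
    """Model training step: returns a fitted estimator."""
    return LogisticRegression(max_iter=1000).fit(features, labels)


def evaluate(model: LogisticRegression, features: pd.DataFrame, labels: pd.Series) -> float:
    """Evaluation step: a single metric a CI gate can assert against."""
    return accuracy_score(labels, model.predict(features))


if __name__ == "__main__":
    df = load_data("dev_training_data.csv")  # hypothetical path from the dev environment
    X, y = preprocess(df)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    model = train(X_train, y_train)
    print(f"accuracy: {evaluate(model, X_test, y_test):.3f}")
```

Because each step has a narrow interface, a CI job can run unit tests against preprocess and evaluate without retraining the full model on every commit.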

Feature Store

One thing we often see is that data scientists spend too much time and too many resources on repetitive data wrangling when they develop models. Some of that time can be saved by using a feature store.

A feature store centralizes feature definitions and makes it easier to discover and reuse existing features across different teams. Instead of reinventing the wheel and recreating features from scratch, data scientists can leverage predefined features from the feature store. This reduces duplicated effort and speeds up development.

Additionally, a feature store defines a standardized schema for organizing feature data, specifying data types and formats. This guarantees uniformity and consistency in how features are represented in ML models. Overall, feature stores enhance robustness, efficiency, and collaboration in ML projects. One important aspect, though, is that feature stores require governance strategies to ensure data quality, security, and compliance. Effective governance helps maintain the integrity and usability of features. Key governance strategies include role-based access control, data quality validation and checks, lineage documentation (so teams can trace a feature back to its source data and transformations), compliance, and clear collaboration and communication norms that define how teams share one another's features.
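
As an illustration, here is a minimal sketch of registering a feature definition with the open-source Feast feature store. The entity, source file, and feature names are hypothetical, and the exact API differs between Feast versions.

```python
from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64

# Hypothetical entity: each feature row is keyed by a customer ID.
customer = Entity(name="customer", join_keys=["customer_id"])

# Hypothetical offline source holding precomputed feature values.
customer_stats_source = FileSource(
    path="data/customer_stats.parquet",
    timestamp_field="event_timestamp",
)

# The feature view declares a typed schema, so every team consuming these
# features sees the same names, types, and freshness (ttl) guarantees.
customer_daily_stats = FeatureView(
    name="customer_daily_stats",
    entities=[customer],
    ttl=timedelta(days=1),
    schema=[
        Field(name="avg_order_value", dtype=Float32),
        Field(name="orders_last_30d", dtype=Int64),
    ],
    source=customer_stats_source,
)
```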

Tracking and Monitoring

In addition to versioning software, ML projects require versioning models. The model development and evaluation phase requires reliable tracking tools to enforce project versioning. These tools record various aspects of experiments during model development, such as hyperparameters and metrics of interest. Tracking this information allows for easy comparisons and for identifying the best parameter settings. Additionally, some of these tools provide visualizations and plots. Cloud providers offer built-in solutions, such as Amazon SageMaker Experiments. There are also standalone products like MLflow.
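
As a small illustration, here is a minimal sketch of logging an experiment run with MLflow. The experiment name, parameters, and model are hypothetical, and the logging API may differ slightly between MLflow versions.

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

mlflow.set_experiment("churn-model")  # hypothetical experiment name

X, y = make_classification(n_samples=5_000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

params = {"n_estimators": 200, "max_depth": 8}

with mlflow.start_run():
    mlflow.log_params(params)  # hyperparameters for this run
    model = RandomForestClassifier(**params, random_state=42).fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    mlflow.log_metric("auc", auc)  # metric of interest
    mlflow.sklearn.log_model(model, "model")  # versioned model artifact
```

Each run is recorded against the experiment, so parameter settings and metrics can be compared side by side in the MLflow UI.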

In addition to tracking tools for experiments, monitoring plays a critical role in ML operations. We can group it into three areas: data monitoring, model monitoring, and service monitoring.

Model predictions are heavily influenced by the data the model is trained on and the data it makes predictions on. Data monitoring concerns the quality and lineage of that scoring data. When a model works in production, we must make sure that the scoring data meets certain expectations. One of these expectations is that the scoring data structure (schema) must match what the model expects. Another important one is data drift. We monitor data drift by comparing the statistics of the scoring data against the training data. If there is a significant change (the degree of significance is usually tunable by engineers), data drift monitoring tools flag the drift. Depending on the strategy, the model may need to be retrained when drift is detected. There are other areas in data monitoring as well, such as anomaly detection and data lineage.
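
As an illustration, here is a minimal sketch of a drift check using a two-sample Kolmogorov-Smirnov test. The significance threshold and the feature values are made up, and production drift detectors are typically more sophisticated (covering many features, categorical distributions, and alerting).

```python
import numpy as np
from scipy import stats


def feature_has_drifted(train_values: np.ndarray, scoring_values: np.ndarray,
                        alpha: float = 0.05) -> bool:
    """Flag drift when the scoring sample is unlikely to share the training distribution."""
    result = stats.ks_2samp(train_values, scoring_values)
    return result.pvalue < alpha


rng = np.random.default_rng(seed=0)
train = rng.normal(loc=0.0, scale=1.0, size=10_000)   # feature values at training time
scoring = rng.normal(loc=0.4, scale=1.0, size=2_000)  # shifted values seen in production

print(feature_has_drifted(train, scoring))  # True -> candidate trigger for retraining
```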

Model monitoring concerns the health of a model that is in production. Model monitoring tools track performance metrics such as accuracy and precision. Other aspects of model monitoring include model explainability and deployment. LIME and SHAP are common techniques for understanding a model's decision-making process.
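
For illustration, here is a minimal sketch of computing SHAP values for a tree-based model. The synthetic dataset and model are placeholders, and the exact explainer interface depends on the SHAP version and the model type.

```python
import shap
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Placeholder data and model standing in for a production model.
X, y = make_regression(n_samples=1_000, n_features=10, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)         # tree-specific explainer
shap_values = explainer.shap_values(X[:100])  # per-feature contribution for each of 100 rows
shap.summary_plot(shap_values, X[:100])       # global view of which features drive predictions
```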

One last area to monitor regarding the models is fairness. It’s extremely important to identify biases that a model may have during scoring, especially in business areas like healthcare, justice, or human resources. There are different metrics and techniques to quantify and address potential biases. The TensorFlow What-If Tool dashboard is a great visualization tool for potential biases.
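
As one concrete example, here is a minimal sketch of computing a demographic parity difference by hand. The predictions and the sensitive attribute are synthetic, and a real fairness assessment would consider several metrics, subgroup sizes, and statistical significance.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Synthetic binary decisions and a synthetic sensitive attribute with two groups.
y_pred = rng.integers(0, 2, size=5_000)     # model's positive/negative decisions
group = rng.choice(["A", "B"], size=5_000)  # group membership per individual

selection_rate_a = y_pred[group == "A"].mean()  # share of positive decisions for group A
selection_rate_b = y_pred[group == "B"].mean()  # share of positive decisions for group B

demographic_parity_diff = abs(selection_rate_a - selection_rate_b)
print(f"demographic parity difference: {demographic_parity_diff:.3f}")
```

Values close to zero suggest the model selects members of both groups at similar rates; larger gaps warrant investigation.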

Besides monitoring the health and quality of data and model, being aware of the health of underlying infrastructure components (servers, containers, networks, etc.) is very important. Regardless of the ML model’s health status, sometimes outages happen or applications go down. The model endpoints can be totally unreachable. In those cases, without proper infrastructure monitoring, crucial issues like server crashes or network disruptions may go unnoticed, leading to long downtimes and potential financial losses.

Therefore, incorporating comprehensive service monitoring tools into MLOps workflows is essential to ensure the seamless functioning of ML applications. Including metrics such as CPU and memory utilization, number of bytes sent/received, and 99th percentile latency among the monitored parameters is important. Prometheus and Grafana are considered gold standards for Kubernetes environments, offering robust monitoring and visualization capabilities to effectively track and analyze various metrics in real time.
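
As a small illustration, here is a minimal sketch of exposing service-level metrics from a Python prediction service with the prometheus_client library. The metric names and the stand-in predict function are hypothetical.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("prediction_requests_total", "Total prediction requests received")
ERRORS = Counter("prediction_errors_total", "Prediction requests that failed")
LATENCY = Histogram("prediction_latency_seconds", "Prediction latency in seconds")


def predict(payload: dict) -> float:
    """Hypothetical stand-in for invoking the real model."""
    time.sleep(random.uniform(0.01, 0.05))
    return random.random()


def handle_request(payload: dict) -> float:
    REQUESTS.inc()
    with LATENCY.time():  # records how long each prediction takes
        try:
            return predict(payload)
        except Exception:
            ERRORS.inc()
            raise


if __name__ == "__main__":
    start_http_server(8000)  # Prometheus scrapes metrics from http://localhost:8000/metrics
    while True:
        handle_request({"feature": 1.0})
```

Prometheus can then scrape these counters and histograms, and Grafana can chart request rates, error rates, and latency percentiles from them.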

Monitoring incoming requests, response times, and error codes is important for maintaining healthy endpoints. The figure above shows a Grafana dashboard reading these metrics from a Kubernetes service.

POC-to-Production Gap

The model creation code is only a small piece compared with the other pieces of MLOps. Some organizations do not fully understand the importance of these other components. As a result, they may underestimate the effort and resources these activities require and allocate a disproportionate share of their focus to model creation. In some cases, their initial focus on proof of concept does not adequately prepare them for full-scale production deployment. Famous AI/ML technologist Andrew Ng describes this reality as the "POC-to-production gap."

I have tried to underscore some crucial yet often overlooked facets of MLOps. Naturally, there are numerous other components that I have not addressed. In "Hidden Technical Debt in Machine Learning Systems," D. Sculley et al. emphasize the significance of the broader ML infrastructure.

Only a small fraction of real-world ML systems is composed of the ML code, as shown by the small black box in the middle. The required surrounding infrastructure is vast and complex. Sculley, D., et al. (2015) “Hidden Technical Debt in Machine Learning Systems”

At Slalom, we have helped numerous companies across industries create state-of-the-art ML solutions and streamline their MLOps. One example customer is a major airline that was facing challenges bringing its ML models to production. The Slalom team developed an MVP MLOps solution focused on CI/CD and a feature store. This solution enabled the client to realize broader business outcomes, increased collaboration between internal teams, and reduced costs and barriers to future ML innovation.

Another customer, a global marketplace company, grappled with sluggish and complicated ML development pipelines built on outdated in-house libraries that were never designed to work at scale. Slalom helped the customer migrate to the latest open-source TensorFlow frameworks and managed GCP services. At the end of the engagement, there was a 300% improvement in training time, and due to the improved model training performance, a "record-breaking" increase in advertising click-through rate was observed.

In this post, I talked about the often neglected aspects of machine learning operations involved in putting models into production. I discussed the common problems of ML systems we see in the market and the out-of-the-box solutions and frameworks that are available. I underlined the commonly accepted MLOps principles. Finally, I gave some examples of the custom solutions Slalom created for our customers.
