5 Lessons Learned From Building Machine Learning Systems

Rukayat Sadiq
Published in Loopio Tech · 7 min read · Dec 14, 2022

Introduction

Machine learning (ML) capabilities are increasingly integrated into products and services. However, it is not enough to integrate ML; it is also necessary to ensure that those ML solutions scale. There are many technical, infrastructure, and cost-related considerations to keep in mind when building industry-standard, production-grade ML systems. Having transitioned from being a backend engineer for many years into the field of applied machine learning, I now get to build ML systems in my day-to-day work at Loopio, and I have learned a lot along the way. This article documents the top five lessons I have learned while building ML systems. Before diving into these lessons, it is important to clarify what ML systems are.

What are ML Systems?

Overview of an ML System

ML systems go far beyond just the modeling components that recommend, predict, or forecast. The modeling components, or models, are machine learning algorithms trained to learn patterns from data given a set of objective functions. ML systems are the end-to-end systems developed around these models, and a complete ML system encompasses all of the following phases:

  • Data: components in this phase include data collection and ingestion, data validation and versioning, and data transformation, including the specific preprocessing that generates features for individual modeling solutions.
  • Model Development: components in this phase cover feature validation, model training, model performance analysis, evaluation, and validation, and model management in model registries.
  • Model Serving: this includes model deployment and monitoring, which may trigger further actions (see the next point).
  • Retraining: on top of these phases, ML systems should also be able to update their logic so they continue to produce good results over time.

The components highlighted above contribute to making the whole system reliable, scalable, maintainable, and adaptable to changing problem scenarios, business needs, and domain concepts.
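To make the phases above concrete, here is a minimal sketch of how they might be wired together. It is illustrative only: the synthetic data, the simple model, and the stage functions are assumptions standing in for real pipeline components, not Loopio's implementation.

```python
# A minimal, illustrative wiring of the phases above. The stage functions
# are stand-ins for real pipeline components.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def ingest():
    # Data phase: collection and ingestion (synthetic data stands in for a real source).
    X, y = make_classification(n_samples=1_000, n_features=10, random_state=0)
    return X, y

def train_and_evaluate(X, y):
    # Model development phase: training, evaluation, and validation.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    model = LogisticRegression(max_iter=1_000).fit(X_tr, y_tr)
    return model, accuracy_score(y_te, model.predict(X_te))

def serve(model, X_new):
    # Model serving phase: produce predictions for new inputs.
    return model.predict(X_new)

X, y = ingest()
model, accuracy = train_and_evaluate(X, y)
print(f"validation accuracy: {accuracy:.2f}")  # tracking this over time informs retraining
```

With these phases in mind, let's dig in!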

Lesson 1: Miss the problem, miss the solution

While this may seem like a no-brainer, it is overlooked surprisingly often. Understanding the business problem and its root cause may even reveal that ML isn't the only, or the best, option for solving it. ML is one tool in the technical toolbox. ML problems come in different forms and types, and so do the potential solutions. How you frame a problem as a learning problem goes a long way in determining the quality of the solution you get. Business problems can be framed as either classification or regression problems, and they can be approached as supervised, unsupervised, semi-supervised, or reinforcement learning problems.

How you frame a business need/problem as a learning problem determines your solution’s effectiveness.

For instance, to address customer retention, the problem can be framed as predicting which of the known causal factors of churn applies to each customer. In that case, the work would include identifying common trends among those factors and developing business strategies to tackle them. Alternatively, it can be framed as predicting each customer's likelihood of churning, in which case we would set up strategies to pay closer attention to customers with a higher churn risk. The former is a classification problem, while the latter is a regression problem. The outcomes, final solutions, and strategies from each framing may or may not solve the actual problem, and may even introduce additional complexity along the way.
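To illustrate the difference between a discrete-label framing and a continuous-score framing, here is a minimal sketch on synthetic data. The features, labels, and models are illustrative assumptions, not the retention solution described above.

```python
# Two framings of the same retention problem, sketched on synthetic data.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 5))                  # illustrative usage/engagement features
churned = (X[:, 0] + rng.normal(scale=0.5, size=1_000) > 1).astype(int)

# Framing 1: classification, i.e. which customers will churn (yes/no)?
clf = GradientBoostingClassifier().fit(X, churned)
will_churn = clf.predict(X[:5])

# Framing 2: regression-style scoring, i.e. how likely is each customer to churn?
reg = GradientBoostingRegressor().fit(X, churned.astype(float))
churn_risk = reg.predict(X[:5])                  # a continuous score to rank customers by

print(will_churn, churn_risk.round(2))
```

The same data supports both framings; which one you pick determines what the business can actually do with the output.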

Lesson 2: Learn the role of data in an ML system — Quantity vs. Quality

A machine learning system is only as good as the data you feed into it. This can be interpreted across different dimensions. One interpretation centers on quantity. This is rather straightforward: the more data points you feed into your models, the more patterns the models can learn, making for better predictions. Simple, right? There is only one problem: what you give is what you get, garbage in, garbage out! This is where quality enters the interpretation. To get reasonable outputs from ML systems, the dataset used must be error-free (i.e., correct); appropriate for the learning solution, with minimal to no missing data points (i.e., complete); and internally consistent, with its outliers properly handled (i.e., coherent).
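As a rough illustration of those three checks, here is a minimal sketch on a pandas DataFrame. The column names, the validity rule, and the outlier threshold are illustrative assumptions, not a general-purpose data-validation framework.

```python
# A minimal sketch of correctness, completeness, and coherence checks.
# Column names and thresholds are illustrative assumptions.
import pandas as pd

def basic_quality_report(df: pd.DataFrame, numeric_col: str = "amount") -> dict:
    report = {}
    # Completeness: how much of each column is missing?
    report["missing_ratio"] = df.isna().mean().to_dict()
    # Correctness: a simple validity rule, e.g. amounts should not be negative.
    report["negative_amounts"] = int((df[numeric_col] < 0).sum())
    # Coherence: flag values more than 3 standard deviations from the mean.
    z = (df[numeric_col] - df[numeric_col].mean()) / df[numeric_col].std()
    report["outliers"] = int((z.abs() > 3).sum())
    return report

df = pd.DataFrame({"amount": [10.0, 12.5, None, -4.0, 10_000.0],
                   "region": ["a", "b", "b", None, "a"]})
print(basic_quality_report(df))
```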

In summary, while data heavily influences the output of an ML system, quantity is only one factor; it is often better to have a limited amount of high-quality data than a large amount of inferior data.

Lesson 3: Data is King, and so are your technical design choices and strategies

Unless you are training models for competitions or other leaderboard-style situations, you will face several technical considerations and SLA requirements, such as latency, response time, and requests handled per second. These are often as challenging as they are interesting. System design choices are influenced by technical requirements, business needs, and what is most effective for the domain problem, and the design strategies you employ can make or break the efficiency of your model and overall ML solution.

Batch or real-time?

One such design dilemma is deciding between batch and real-time prediction for ML solutions. When latency is the top consideration, a pre-computed solution, also known as batch serving, might be the better approach. Real-time serving introduces dependencies such as feature retrieval, feature processing, and the inference itself before results reach the end user, all of which increase latency. When accuracy and freshness are the higher priority, real-time serving is worth considering. For instance, in a location-based solution where location changes frequently, end users want to see the most relevant predictions as their location changes, and batch serving would be highly inaccurate in that scenario.
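The contrast can be sketched in a few lines. Everything here is a placeholder: the "model" is a stub, and a plain dictionary stands in for whatever store would hold precomputed predictions.

```python
# A minimal sketch of the two serving patterns. The model and the
# key-value store are placeholders, not a real serving stack.
predicted_cache: dict[str, float] = {}

def model_predict(features: list[float]) -> float:
    return sum(features)  # stand-in for a trained model's inference

# Batch serving: precompute predictions offline, then serve them as lookups.
def nightly_batch_job(all_customers: dict[str, list[float]]) -> None:
    for customer_id, features in all_customers.items():
        predicted_cache[customer_id] = model_predict(features)

def serve_batch(customer_id: str) -> float:
    return predicted_cache[customer_id]          # fast: no feature work at request time

# Real-time serving: fetch fresh features and run inference per request.
def serve_realtime(fresh_features: list[float]) -> float:
    return model_predict(fresh_features)         # fresher results, more work per request

nightly_batch_job({"c1": [1.0, 2.0]})
print(serve_batch("c1"), serve_realtime([3.0, 4.0]))
```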

One such example is Loopio's own Recommended Experts feature, which leverages batch serving when recommending subject-matter experts (SMEs) to respond to RFP questions. We precompute and index word embeddings so that we can perform online similarity searches; computing those embeddings on the fly for every request would be impractical in real time.
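A minimal sketch of the precompute-and-index pattern follows. The random vectors stand in for embeddings computed offline, and the cosine-similarity lookup is a simplified stand-in for a real similarity-search index, not Loopio's implementation.

```python
# Precompute-and-index sketch: random vectors stand in for embeddings
# that would be computed and indexed offline.
import numpy as np

rng = np.random.default_rng(0)
sme_ids = ["sme_a", "sme_b", "sme_c"]
index = rng.normal(size=(3, 128))                          # "precomputed" offline
index /= np.linalg.norm(index, axis=1, keepdims=True)      # normalize once for cosine similarity

def recommend(question_embedding: np.ndarray, top_k: int = 2) -> list[str]:
    # Online step: embed the question, then rank indexed experts by similarity.
    q = question_embedding / np.linalg.norm(question_embedding)
    scores = index @ q                                      # cosine similarity via dot product
    return [sme_ids[i] for i in np.argsort(scores)[::-1][:top_k]]

print(recommend(rng.normal(size=128)))
```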

Global or personalized model?

Another design dilemma is knowing when to reuse models across customers versus adopting hyper-personalization. Hyper-personalization refers to personalizing models to a high degree of specificity in order to make more accurate predictions. Making the tradeoff between model reuse (the same global ML model for all customers) and hyper-personalization can be difficult. It depends on whether your customers can be grouped into subsets with similar behaviors. In business-to-business (B2B) software applications, your customers might fall within very distinct verticals, and your engineering team can then choose to develop a customized model for the same learning problem for each of those verticals. Your verticals might even be as granular as individual customers.

At Loopio, we are exploring a model-per-customer approach for some of our solutions, and we refer to this as multi-tenant MLOps.
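As a rough illustration of the idea, here is a sketch of routing predictions to a per-tenant model with a shared global fallback. The in-memory registry, dummy models, and tenant names are illustrative assumptions, not Loopio's multi-tenant MLOps setup.

```python
# A sketch of per-tenant model routing with a global fallback.
# The registry and models are in-memory placeholders.
import numpy as np
from sklearn.dummy import DummyClassifier

X, y = np.zeros((10, 3)), np.array([0, 1] * 5)
global_model = DummyClassifier(strategy="most_frequent").fit(X, y)
tenant_models = {"customer_42": DummyClassifier(strategy="uniform", random_state=0).fit(X, y)}

def predict_for_tenant(tenant_id: str, features: np.ndarray):
    # Use the tenant's personalized model if one exists, else the shared global model.
    model = tenant_models.get(tenant_id, global_model)
    return model.predict(features)

print(predict_for_tenant("customer_42", np.zeros((1, 3))))  # personalized model
print(predict_for_tenant("customer_7", np.zeros((1, 3))))   # falls back to the global model
```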

Cost-Value Analysis

Setting up the infrastructure for an ML solution can be expensive, depending on the technical design choices made. Building a highly sophisticated ML system is possible, but if it costs more than the value it delivers, it could be a net drag on the business.

Cost (monetary & effort) and Value need to be considered early on when building ML solutions.

At Loopio, our technical design research must include cost analysis before proposing implementations for different parts of our ML systems.

Lesson 4: The more you observe, the better.

Monitoring and observability are often used interchangeably, but they are very different concepts. Monitoring refers to tracking and measuring models' performance metrics, and logging those metrics along with user inputs and the ML system's outputs. These logs also help generate more training data for the models in the system.

Observability, on the other hand, involves giving visibility into the internal states of a machine learning system, allowing you to understand the models' data and performance across the different phases of the ML lifecycle. Observing ML systems makes it possible to detect faults like training-serving skew and data distribution changes (data and concept drift) that happen over time.
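One such check can be sketched in a few lines: compare a feature's training distribution against what the model is seeing in production using SciPy's two-sample Kolmogorov-Smirnov test. The synthetic data and the significance threshold are illustrative assumptions, not a prescription.

```python
# A minimal drift check: compare a feature's training distribution against
# what the model sees in production. The threshold is an illustrative choice.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
training_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)
serving_feature = rng.normal(loc=0.6, scale=1.0, size=5_000)   # distribution has shifted

statistic, p_value = ks_2samp(training_feature, serving_feature)
if p_value < 0.01:
    print(f"possible data drift detected (KS statistic={statistic:.3f})")
```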

[Image: Concept drift, by kDimensions]

Observing and identifying faults leads to faster detection of problematic systems and quicker resolution times. Monitoring is therefore necessary but not sufficient to ensure ML system performance; adding observability to your processes takes your ML system a step further.

Lesson 5: Always iterate on ML solutions. Always!

A rather unfortunate but common myth is that ML solutions stay the same and keep making correct predictions forever as new input data flows in. This is not true.

There is no way an ML solution can be built in one pass and left alone to perform well forever. ML models, and the overall system, typically degrade over time. Degradation may be due to ML-related factors, as mentioned in the previous section, or to deployment or dependency failures. These models often need to be retrained at frequencies that vary across problem spaces and solutions, and the retraining frequency should depend mainly on observations made during monitoring and observability, which is another reason to invest in a well-observed system.
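To make that concrete, here is a minimal sketch of a retraining trigger driven by monitored signals. The thresholds and the retraining step are illustrative assumptions rather than a recommended configuration.

```python
# Illustrative retraining trigger driven by monitored signals.
# Thresholds and the retrain step are placeholders.
def should_retrain(live_accuracy: float, drift_p_value: float,
                   accuracy_floor: float = 0.80, drift_alpha: float = 0.01) -> bool:
    # Retrain when measured performance degrades or input drift is detected.
    return live_accuracy < accuracy_floor or drift_p_value < drift_alpha

if should_retrain(live_accuracy=0.74, drift_p_value=0.20):
    print("kicking off retraining pipeline")   # e.g., re-run the training pipeline
```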

Final Words

As a recap, the lessons learned from building ML systems are summarized as follows:

  1. Keep the business problem front and center.
  2. Data quantity and quality are both important.
  3. Be intentional about technical design choices and strategies.
  4. It is critical to monitor and observe ML in the wild.
  5. Iterate on the ML solution for continued success.

This piece is by no means an exhaustive list of lessons to be learned from building machine learning systems. At Loopio, we have spent the last couple of years getting our ML systems up and running. It has been challenging, rewarding, and fulfilling, and we are continuously improving upon our implementation and learning from industry best practices. We aim to share more about the different components of our ML systems in subsequent posts.

If you want to chat about this post or your experience building ML systems, feel free to reach out to me on LinkedIn.

Also, if you are interested in our work at Loopio, check out career opportunities across our Engineering, Product, and Design teams!
