Principles of Good ML System Design

Sumit Saha
Jul 19, 2021 · 9 min read


Machine learning systems are designed to generate maximum business value from ML models used in services and products. If you believe the media hype around AI, you could think that data scientists only focus on achieving state-of-the-art (SOTA) performance and designing ingenious model architectures. The reality is a bit different, and data scientists have many more objectives to accomplish.

When building AI systems, it’s always good to take a divide-and-conquer approach to your goal. This means breaking down the problem statement into solvable components and studying how machine learning could help alleviate certain problems. A good understanding of its limitations can help you build better products.

Developing ML models is challenging, but deploying them to production creates a whole new set of challenges. Model performance can degrade over time, which can cause losses for the business.

So, throughout development and deployment, it might turn out that SOTA models aren’t even the right choice for your project. A good data scientist won’t try to build a SOTA model for a marginal improvement in a business case, because models like this are complex, take up a lot of memory, and usually predict slowly.

Compute costs associated with training SOTA models

If SOTA models aren’t the main focus, what is? Here’s what Machine Learning and Product Development teams need to consider:

  1. What should the end product look like?
  2. What are some of the tradeoffs we can make to improve models?
  3. Does any kind of marginal improvement create a significant impact on user experience?
  4. How do we replicate the experimental performance in production?
  5. When could the current model(s) possibly stop giving the best output, and how do we upgrade them?

Designing a Machine Learning System

Machine Learning Project Flow

As mentioned by Chip Huyen, designing a Machine Learning System is an iterative process. Although it’s not a one-size-fits-all scenario, it’s usually good practice to keep updating your system. In the above figure, we can see how the output of each step feeds back to the previous stages depending on the:

  • optimizations we want to make,
  • errors we want to solve,
  • improvements we want to make to the overall quality of the product.

Project Setup

Before we even begin building models, we should gather as much information as possible about the project:

  1. Goals — What are we trying to achieve?
  2. User Experience — How will our clients interact with the product?
  3. Performance Constraints — What’s the acceptable tolerance for errors, and how fast do these predictions need to be?
  4. Evaluation — Which metric(s) do we use to maintain a solid understanding of our model in production?
  5. Personalization — How personalized do we need the models to be for our clients?
  6. Project Constraints — How much compute is available for training and deploying? Who would be working on the project? What are some of the readily available tools that could be used? What sort of timeline do we have to work with?

Data Pipeline

A machine learning model is only as good as the data it’s trained on. Any unexpected changes to the data could force us to restart our experimentation process. So, it’s necessary to have a complete understanding of your data and to design pipelines accordingly:

  1. Data availability and collection — How do we gather quality data? What are the costs? If the process is manual, can we automate parts of it?
  2. User data — What sort of data do users provide? How can we use this as feedback to improve our models?
  3. Storage — How are we currently storing data? What are the Consistency Availability Partition-tolerance (CAP) tradeoffs?
  4. Data pre-processing — In what form does the data arrive — unstructured or structured? How do we prepare the best features to be ingested by the model? What’s the distribution of the individual independent and dependent variables in the train and test data?
  5. Challenges and Privacy — How do we handle data in a safe and secure way? Are there any PII leaks we should be concerned about? How do we anonymize Personally Identifiable Information (PII)?
  6. Biases — Our models should be fair to build user trust. Are there any biases present in the data itself?
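
Several of these questions can be turned into automated checks that run before data ever reaches a model. Below is a minimal sketch in Python, assuming pandas DataFrames and hypothetical column names, that flags heavy missingness, columns that look like raw PII, and obvious train/test distribution gaps:

```python
import pandas as pd

def basic_data_checks(train: pd.DataFrame, test: pd.DataFrame,
                      pii_columns=("email", "phone", "ssn")):
    """Run a few sanity checks on the data before it reaches the model.

    The column names here are hypothetical placeholders; adapt them to your schema.
    """
    issues = []

    # 1. Missing values that could silently degrade features.
    for col, ratio in train.isna().mean().items():
        if ratio > 0.05:
            issues.append(f"{col}: {ratio:.1%} missing values in train")

    # 2. Columns that look like raw PII and should be anonymized upstream.
    for col in pii_columns:
        if col in train.columns:
            issues.append(f"{col}: possible PII column present in raw data")

    # 3. Rough train/test distribution comparison for numeric features.
    for col in train.select_dtypes("number").columns:
        if col in test.columns:
            gap = abs(train[col].mean() - test[col].mean())
            if gap > 2 * (train[col].std() or 1.0):
                issues.append(f"{col}: train/test means differ by more than 2 std")

    return issues
```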

Model Selection

Not all models are the same, so what sort of problem statement are you trying to solve? Is it:

  • A supervised learning problem — Classification, Regression, etc.
  • An unsupervised learning problem — Clustering, etc.
  • A data generator — GANs, etc.

Simpler models may not perform as well, but they can still provide a baseline to compare against the next models you experiment with. There are generally 3 types of baselines:

  1. Random baseline — What’s the expected performance if all predictions are at random?
  2. Human baseline — How well do humans perform the same task?
  3. Simple heuristic — Are there any simple rules that can serve as a simple model? If a heuristic performs well enough, a machine learning model has to beat it by more than a small margin to justify the extra complexity.
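
As an illustration, the random and heuristic baselines can be computed in a few lines. This is a minimal sketch using scikit-learn’s DummyClassifier for the random baseline and a hand-written rule as the heuristic; the feature index and threshold are hypothetical:

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

def baseline_scores(X_train, y_train, X_test, y_test):
    # Random baseline: predicts labels according to the training class distribution.
    random_model = DummyClassifier(strategy="stratified", random_state=42)
    random_model.fit(X_train, y_train)
    random_acc = accuracy_score(y_test, random_model.predict(X_test))

    # Simple heuristic: one hand-written rule. Here we assume (hypothetically)
    # that column 0 is a "message length" feature and long messages are positive.
    heuristic_preds = (X_test[:, 0] > 100).astype(int)
    heuristic_acc = accuracy_score(y_test, heuristic_preds)

    return {"random": random_acc, "heuristic": heuristic_acc}
```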

It’s quite easy to drift from our goals by trying unnecessarily complex models. They’re costly to train, and they lower the returns from the project. Usually, it’s better to take a much simpler approach and build on top of that.

Training

Most AI systems aren’t a basic .fit() and .predict(). Training usually creates new issues to solve, but you can be prepared for them before they happen. Those issues include:

  • underfitting and overfitting,
  • delayed or no convergence during gradient descent,
  • dead neurons.
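
Many of these issues show up in the loss curves, so it helps to check for them programmatically during training. Here’s a rough, framework-agnostic sketch that inspects recorded per-epoch losses; the thresholds are illustrative, not universal rules:

```python
def diagnose_training(train_losses, val_losses, patience=5, gap_ratio=1.5):
    """Flag common training issues from per-epoch train/validation losses."""
    issues = []

    # Possible non-convergence: training loss barely moved in the last `patience` epochs.
    if len(train_losses) > patience:
        recent = train_losses[-patience:]
        if max(recent) - min(recent) < 1e-4:
            issues.append("training loss has plateaued (possible non-convergence)")

    # Possible overfitting: validation loss far above training loss.
    if train_losses and val_losses and val_losses[-1] > gap_ratio * train_losses[-1]:
        issues.append("validation loss far above training loss (possible overfitting)")

    # Possible underfitting: both losses still high and close together.
    if train_losses and val_losses and train_losses[-1] > 1.0 \
            and abs(val_losses[-1] - train_losses[-1]) < 0.1:
        issues.append("both losses high and similar (possible underfitting)")

    return issues
```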

Models can have bugs. These bugs might not break the system, but they can surely cause poor results and destroy the user experience. Some common bugs are:

  • Theoretical constraints — wrong assumptions about the data.
  • Sub-par model implementation — model too complex, or too simple for the problem.
  • Poor choice of hyperparameters
  • Data problems — unclean data, or overly pre-processed data.

Scalability

Scaling is difficult. When you have limited computing resources, it’s even harder. But, even as your company grows, it’s good to be frugal and maximize output while optimizing resource usage.

For large datasets that don’t fit into memory at once, distributed training methodologies are being actively researched and show promise for training complex models. If compute resources aren’t available, we can pre-process the data to optimize the tradeoff between model performance and compute, for example by shuffling the data or training in batches.
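
One common pattern for datasets that don’t fit into memory is to stream the data from disk in chunks and train a model incrementally. Here’s a minimal sketch with pandas and a scikit-learn estimator that supports partial_fit; the file path and label column are hypothetical:

```python
import pandas as pd
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()
classes = [0, 1]  # all classes must be declared up front for partial_fit

# Read the CSV in chunks so the full dataset never has to fit in memory.
for chunk in pd.read_csv("training_data.csv", chunksize=10_000):
    X = chunk.drop(columns=["label"]).to_numpy()
    y = chunk["label"].to_numpy()
    model.partial_fit(X, y, classes=classes)
```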

Also, using a distributed setup for training and prediction forces us to think from a load-balancing point of view. The master node typically consumes more resources than the other workers. A quick way to address this is to use a smaller batch size on the master and a larger one on the other workers.

Serving

Once the model(s) are live in production, it’s important to get feedback from our users and figure out a way to improve our models — either automatically through active learning, or through offline updates later downstream. This is a critical step from an MLOps point of view, as model quality usually gets worse over time after deployment.

Also, if models are performing well but users don’t trust them, how do we show users how they work? Low trust is not good for user adoption. That’s why explainability in AI is important. We need to be able to explain why our model made any individual prediction.

The way we serve our model is also not obvious. There are different types of trade-offs, for example:

  • Inference on edge devices consumes more battery and memory, making it harder to collect user feedback.
  • Inference on the cloud increases latency between prediction and user consumption and adds plenty of engineering challenges as well.
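
As one concrete option for cloud serving, a model often sits behind a small HTTP endpoint. Here’s a minimal sketch with Flask; the model artifact and the request format are hypothetical placeholders:

```python
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("model.joblib")  # hypothetical pre-trained model artifact

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()
    features = [payload["features"]]  # expects a flat list of feature values
    prediction = model.predict(features)[0]
    return jsonify({"prediction": str(prediction)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

In a real deployment this would sit behind a production WSGI server, with request validation, logging, and model versioning layered on top.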

MLOps


Machine learning models in production don’t perform the same over time. In real life, factors like changes in the nature of data, or the relationship between independent and dependent variables, influence the predictive quality of the model.

It’s safe to say that the work of data scientists doesn’t end at just building experimental machine learning models. It’s important to also maintain them, and try to improve them gradually.

Performance Tracking

There are four methods for performance tracking:

  1. Offline Methods

It’s been a while since your last trained model was deployed to production. How do you evaluate the quality of your model if you don’t have clean, labeled data?

A basic way to check how much your model might degrade is to retrain the model on current data and check the performance delta between the metrics from both trainings. If this gap is not acceptable, it’s time to experiment again.
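
One way to implement this check is sketched below; retrain_fn and score_fn are project-specific placeholders rather than real library calls:

```python
def retrain_delta(deployed_model, current_X, current_y, retrain_fn, score_fn):
    """Retrain on current data and measure how far the deployed model lags behind.

    retrain_fn(X, y) returns a freshly trained model;
    score_fn(model, X, y) returns a single evaluation metric (higher is better).
    """
    deployed_score = score_fn(deployed_model, current_X, current_y)
    fresh_model = retrain_fn(current_X, current_y)
    fresh_score = score_fn(fresh_model, current_X, current_y)
    return fresh_score - deployed_score  # a large gap means it's time to experiment again
```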

2. Examining the distribution of data (Data Drift)

Data Drift due to sudden surge of spammers

Real-life data is messy and often unpredictable. The nature of the data can change over time, and this can degrade model performance. New data might even come from the same source as before, but unseen events can still add noise that throws off the model’s ability to perform.

It becomes important to analyze these changes in distribution and learn to anticipate them. Various test cases can be sprinkled throughout the data layer and feature engineering layer, to make sure that the model ingests data like it’s supposed to. These can be logical checks for cases when we have domain knowledge, or even rough estimates for unbound feature values.
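
One lightweight way to implement such a check is to compare a feature’s distribution in recent data against the training data, for example with a two-sample Kolmogorov-Smirnov test from SciPy; the significance level below is illustrative:

```python
from scipy.stats import ks_2samp

def detect_feature_drift(train_values, recent_values, alpha=0.01):
    """Flag drift if the two samples are unlikely to come from the same distribution."""
    statistic, p_value = ks_2samp(train_values, recent_values)
    return {"drifted": p_value < alpha, "statistic": statistic, "p_value": p_value}
```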

3. Changing patterns and relationships (Concept Drift)

Concept Drift

When modeling a set of independent variables for a dependent variable, the model will give expected performance if the nature of the relationship remains the same between the two. What if, for example in a classification problem, a new class is introduced? We’d have to retrain the model if identifying this new class was important for the business.

4. Error tolerance

There are various metrics data science teams use to keep track of model performance, for example:

  • precision,
  • recall,
  • accuracy,
  • F1-score.

Maximizing these numbers is important, but they’re not the only numbers you need to get right.

For example, say you’re building a classification pipeline to detect early signs of malignant tumors. Error metrics here count False Positives and False Negatives. We’d want to limit both, but since no positive case should go unnoticed, we’re ready to accept a higher rate of False Positives than False Negatives.
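
In practice this usually means moving the decision threshold instead of using the default 0.5. Here’s a minimal sketch with scikit-learn that picks the highest threshold still keeping recall above a target; the target value is illustrative:

```python
from sklearn.metrics import precision_recall_curve

def threshold_for_min_recall(y_true, y_scores, min_recall=0.99):
    """Pick a decision threshold that keeps recall at or above min_recall.

    y_scores are predicted probabilities for the positive (malignant) class.
    Lowering the threshold trades more False Positives for fewer False Negatives.
    """
    precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
    # thresholds has one fewer entry than precision/recall; align with recall[:-1].
    valid = [t for t, r in zip(thresholds, recall[:-1]) if r >= min_recall]
    return max(valid) if valid else 0.0  # fall back to predicting everything positive
```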

Model Retraining

The best way to retrain your models depends on the business use case, available compute, and acceptable costs. You should consider things like:

  • When is the right time to retrain the model? — If, based on domain knowledge, we expect the model’s useful life to be around six months, we should ideally retrain every six months.
  • How much data do we use to retrain the model?
  • If the model needs to learn continuously, it might be helpful to update it daily, or in small batches at short intervals, with new data (on top of the existing model).
  • If the nature of data completely changes after a period, we would need to retrain a new model on an entirely new dataset.
  • Model training can be expensive. If retraining doesn’t add considerable value from the user’s point of view, you can hold off for a while.
  • How do we know that the model has degraded? — Some leading metrics can help us identify early signs and act as triggers for model retraining. If a metric breaches its acceptable threshold, it’s time to update the model.
  • Does all of this have to be done manually? — Not necessarily. The process can be automated to fit our requirements if all the necessary information is available.
  • Model monitoring tools — neptune.ai, Comet, MLflow, etc.
  • Orchestration tools — Airflow, Kubeflow, etc.
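
Putting the last few points together, a retraining trigger can be as simple as a scheduled check that compares a monitored metric against its threshold. The metric source and retraining job below are hypothetical placeholders:

```python
def maybe_trigger_retraining(current_metric, threshold, retrain_job):
    """Kick off retraining when a monitored metric falls below its acceptable threshold.

    current_metric would come from a monitoring tool (e.g. neptune.ai or MLflow);
    retrain_job is a placeholder for an orchestrated pipeline, e.g. an Airflow DAG run.
    """
    if current_metric < threshold:
        retrain_job()
        return True
    return False
```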

Conclusion

Before you do any modeling, it’s useful to gather as much information as possible about the problem you’re trying to solve. Chasing state-of-the-art performance to gain marginal improvements over simpler models doesn’t improve user experience.

A model is only as good as the data it’s trained on. It’s best to learn everything about your data by doing exploratory data analysis. Keep computing resources and costs in mind when you choose your model type.

Always try to automate things that speed up daily work and lower costs — for example, automatically labeling new data to reduce manual tasks. Know the limitations of the existing workflow and models, and update them periodically to meet business requirements.

Keep learning about the industry. Domain knowledge always helps in building great data science products. Anticipate changes in the data that the model is used to.

Hope this advice helps you build better models. Thank you for reading!

References

  1. https://huyenchip.com/machine-learning-systems-design/toc.html
  2. https://neptune.ai/blog
  3. https://machinelearningmastery.com/
  4. https://eng.uber.com/
  5. https://www.explorium.ai
