Architecture of a real-world Machine Learning system

This article is the 2nd in a series dedicated to Machine Learning platforms. It was supported by Digital Catapult and PAPIs.

In the previous article, I presented an overview of ML development platforms, whose job is to help create and package ML models. Model building is just one capability, out of many, required in ML systems. I ended that article by mentioning other types of ML platforms, and limitations when building real-world ML systems. Before we’re able to discuss these, we need to review all the components of these systems, and how they’re connected to each other.

The diagram above focuses on a client-server architecture of a “supervised learning” system (e.g. classification and regression), where predictions are requested by a client and made on a server. (Side note: it might be preferable in certain systems to have client-side predictions; others might even motivate client-side model training, but the tools to make this efficient in an industrial ML application don’t exist yet.)

Overview of an ML system’s components

Before going further, I recommend downloading the diagram above, and splitting your screen so you can see the diagram at the same time as you're reading the rest of this article.

Let’s assume that ‘Databases’ would already exist prior to the creation of the ML system. The components in dark grey and purple would be new components to be built. Those that apply ML models to make predictions are represented in purple. Rectangles are used to represent components that would be expected to provide micro-services, typically accessed via representational state transfer (REST) APIs and run on serverless platforms.

There are two ‘entry points’ to our ML system: the client requesting predictions, and the orchestrator creating/updating models. The client represents the application used by the end-user who will benefit from the ML system. This can be the smartphone app you used to order dinner, e.g. UberEats requesting an Expected Time of Delivery — big use case during the COVID-19 lockdown!

Pre-lockdown picture, courtesy of Wikipedia. There’s a complex ML system that predicts when this guy will reach his destination, thousands of times every day, in hundreds of cities throughout the world! Let’s hope the models used by this system were updated in the last few weeks…

The orchestrator would usually be a program called by a scheduler (so that models could be updated periodically, for example, every week) or called via an API (so that it could be part of a continuous integration / continuous delivery pipeline). It’s in charge of evaluating models made by the model builder, on a test dataset that is kept secret. For this, it sends test predictions to the evaluator. If a model is deemed good enough, it is passed to the model server, which makes it available via an API. This API can be directly exposed to the client software, but domain-specific logic is often needed and implemented in a front-end.

Assuming one or several (baseline) models would be available as APIs, but would not be integrated into the final application yet, you would decide which model to integrate (and whether it’s safe) by tracking performance on production data and visualizing it via a monitor. In our dinner-delivery example, it would let you compare a model’s ETD with the actual time of delivery, on orders that were just delivered. When a new model version becomes available, client requests for predictions would be progressively directed to the new model’s API, via the front-end. This would be done for an increasing number of end-users while monitoring performance and checking that the new model is not “breaking” anything. The owner of the ML system and the owner of the client application would be accessing the monitor’s dashboard on a regular basis.

Let’s recap all the components found in the diagram above, in a list:

  1. Ground-truth Collector
  2. Data Labeller
  3. Evaluator
  4. Performance Monitor
  5. Featurizer
  6. Orchestrator
  7. Model Builder
  8. Model Server
  9. Front-end

We’ve briefly mentioned #3, 4, 6, 7, 8 and 9. Let’s now provide a bit more information, and go over #1, 2 and 5!

Ground-truth Collector

In the real world, it is key to be able to continuously acquire new data for the machine to learn from. One type of data is particularly important: ground-truth data. This corresponds to what you want your ML models to predict, such as the sale price of a real-estate property, a customer-related event (e.g. churn), or a label to assign to input objects (e.g. 'spam' on incoming messages). Sometimes, you observe an input object and you just need to wait a certain amount of time to observe the thing you wanted to predict about it: you wait for the property to get sold, for the customer to renew or cancel their subscription, or for the user to interact with the emails in their inbox. You may also want users to let you know when your ML system got a prediction wrong (see illustration below). If you want to give your users the ability to provide that kind of feedback, you'll need a micro-service to send it to.
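To make this concrete, here is a minimal sketch of the core logic such a feedback micro-service could wrap (the function name, field names, and in-memory store are illustrative assumptions; a real service would sit behind a REST endpoint and write to the ground-truth database):

```python
from datetime import datetime, timezone

def collect_feedback(store, input_id, actual):
    """Record the outcome observed ('ground truth') for a given input."""
    store[input_id] = {
        "actual": actual,
        # timestamping makes it possible to join feedback with predictions later
        "received_at": datetime.now(timezone.utc).isoformat(),
    }
    return store[input_id]
```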

In case you don’t think ML platforms are important… (what?!)

Data Labeller

Sometimes, you’ll have access to plenty of input data, but you’ll need to create the associated ground-truth data manually. This is the case when building a spam detector, or an object detector from images. There are ready-made and open-source web apps to make data labeling easier (such as Label Studio), and dedicated services for outsourcing the manual task of labeling data (for example, Figure Eight and Google’s Data Labeling Service).

Airplane classification: Label Studio in action

Evaluator

When you have an initial dataset for the machine to learn from, it's important to define how to evaluate the planned ML system, before setting out to build any ML models. In addition to measuring prediction accuracy, it's desirable to evaluate short-term and long-term impact via application-specific performance metrics, as well as system metrics such as lag and throughput.

There are two important objectives behind model evaluation: comparing models, and deciding whether it is safe to integrate a model into an application. Evaluation can be performed on a predetermined set of test cases, for which it is known what the prediction should be (i.e. the ground truth). The error distribution can be examined, and errors can be aggregated into performance metrics. For this, the evaluator needs access to the test set’s ground truth, so that when it gets predictions in input, it can compute prediction errors and return performance metrics.
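As a rough sketch, the evaluator's core computation could look like the following (function and metric names are illustrative; a real evaluator would expose this behind an API and support several application-specific metrics):

```python
def evaluate(predictions, ground_truth):
    """Compare predictions to the held-out ground truth."""
    errors = [p - t for p, t in zip(predictions, ground_truth)]
    abs_errors = [abs(e) for e in errors]
    return {
        "errors": errors,                          # full error distribution
        "mae": sum(abs_errors) / len(abs_errors),  # mean absolute error
        "max_error": max(abs_errors),
    }
```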

I recommend making it a priority to implement this evaluator, well before building ML models. Evaluate predictions made by a baseline model, to provide a reference. Baselines are usually heuristics that are based on input characteristics (a.k.a. features). They can be super-simple, hand-crafted rules…

  1. for churn prediction, your baseline could say that if a customer logged in less than 3 times in the last 30 days, they are likely to churn;
  2. for food delivery-time prediction, your baseline could average the delivery times for the order's restaurant and rider over the last week.

Before developing sophisticated ML models tomorrow, see if your baseline could create value today!
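The second baseline above could be sketched as follows (simplified to restaurant-only averaging, with an assumed in-memory data layout):

```python
def baseline_etd(order, last_week_deliveries):
    """Predict delivery time as last week's average for the order's restaurant.

    last_week_deliveries: list of (restaurant_id, minutes) tuples.
    """
    times = [m for rid, m in last_week_deliveries if rid == order["restaurant_id"]]
    if not times:
        # fall back to the global average for restaurants with no recent orders
        times = [m for _, m in last_week_deliveries]
    return sum(times) / len(times)
```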

Performance Monitor

The next step towards deciding if a (baseline) model can be integrated into an application is to use it on the inputs encountered in production (called ‘production data’), in a production-like setting, and to monitor its performance through time.

Computing and monitoring performance metrics on production data requires acquiring and storing production inputs, ground truths, and predictions in a database. The performance monitor would consist of a program that reads from that database, calls the evaluator, and of a dashboard that shows how performance metrics evolve with time. In general, we want to check that models behave well through time and that they keep having a positive impact on the application in which they’re integrated. The monitor could also be augmented with data visualization widgets that show production data distributions, so we can make sure they are as expected, or we can monitor drift and anomalies.
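In essence, the monitor's program could boil down to something like this (the record schema is an assumption, and a real monitor would call the evaluator service rather than compute errors inline):

```python
from collections import defaultdict

def performance_over_time(records):
    """records: dicts with 'date', 'prediction' and 'actual' keys."""
    by_day = defaultdict(list)
    for r in records:
        by_day[r["date"]].append(abs(r["prediction"] - r["actual"]))
    # one mean-absolute-error point per day, for the dashboard to plot
    return {day: sum(errs) / len(errs) for day, errs in sorted(by_day.items())}
```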

A monitoring dashboard for churn models (source)

Featurizer

When designing a prediction API, a decision needs to be made as to what the API should take as input. For example, when making predictions about customers, should the input be the full feature representation of the customer, or just the customer id?

In any case, it is common that the full numerical representation would not be readily available (as it would be for text or image inputs), but it would have to be computed before it can be passed to a model. For a customer input, some features would be already stored in a database (for example, date of birth), and others would require computation. This would be the case for behavioral features that describe how the customer interacted with the product over a certain period of time: they would be computed by querying and aggregating data that logged customer interactions with the product.

If, by nature, features do not change too often, they could be computed in batches. But in ML use cases such as UberEats’ Expected Time of Delivery, we could have “hot” features that would change rapidly and would need to be computed in real-time; for instance, the average delivery time of a given restaurant over the last X minutes.

This calls for the creation of at least one featurization micro-service that would extract features for a batch of inputs, based on their ids. You may also need a real-time featurization micro-service, but this would add to the complexity of your ML system.

Featurizers may query various databases and perform various aggregations and transformations on the queried data. They may have parameters (such as the number of minutes X in the example above), which may have an impact on the performance of models.
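Here is a sketch of such a batch featurizer (the data layout is simplified to in-memory dicts, and `window_days` stands in for the "X" parameter mentioned above):

```python
from datetime import date, timedelta

def featurize(customer_ids, profiles, login_log, window_days=30, today=None):
    """Join stored attributes with a computed behavioral feature, per customer."""
    today = today or date.today()
    cutoff = today - timedelta(days=window_days)
    rows = []
    for cid in customer_ids:
        # behavioral feature: logins within the window, computed from raw logs
        recent_logins = sum(1 for d in login_log.get(cid, []) if d >= cutoff)
        rows.append({
            "id": cid,
            "age_days": (today - profiles[cid]["signup"]).days,  # stored feature
            "recent_logins": recent_logins,                      # computed feature
        })
    return rows
```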


Orchestrator

The orchestrator is at the core of the ML system and interacts with many other components. Here are the steps in its workflow/pipeline:

  1. Extract-Transform-Load and split (raw) data into training, validation, test sets
  2. Send training/validation/test sets for featurization (if any)
  3. Prepare featurized training/validation/test sets
  4. Send URIs of prepared train/validation sets, along with metric to optimize, to model builder
  5. Get optimal model, apply to test set, and send predictions to evaluator
  6. Get performance value and decide if it’s OK to push the model to the server (for canary-testing on production data, for instance).

Some more details on step #3 (“prepare featurized training/validation/test sets”):

  • Augment training data (for example, oversample/undersample, or rotate/flip/crop images)
  • Pre-process training/validation/test sets, with data sanitization (so that it can be safely used for modeling or predicting) and problem-specific preparation (for example, de-saturate and resize images).
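The six steps above can be sketched as a runnable skeleton (every component is passed in as a plain function here; in a real system each would be a request to the corresponding micro-service, and the push decision would involve more than a single threshold):

```python
# Skeleton of the orchestrator's pipeline. Component names mirror the
# diagram; the metric is assumed to be an error (lower is better).
def run_pipeline(raw_data, split, featurize, build_model, evaluate, push, threshold):
    train, val, test = split(raw_data)                                   # 1. ETL and split
    train_f, val_f, test_f = (featurize(s) for s in (train, val, test))  # 2-3. featurize/prepare
    model = build_model(train_f, val_f, metric="mae")                    # 4. build optimal model
    predictions = [model(x) for x, _ in test_f]                          # 5. predict on test set
    score = evaluate(predictions, [y for _, y in test_f])                # 5. evaluate predictions
    if score <= threshold:                                               # 6. decide whether to push
        push(model)
        return "pushed", score
    return "rejected", score
```

Note that the orchestrator owns the pipeline logic but none of the component logic, which is what keeps each component independently replaceable.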

Ways to run the workflow

The whole workflow could be executed manually, but to update models frequently, or to tune hyperparameters of the featurizer and of the modeler jointly, it will have to be automated. This workflow might be implemented as a simple script and run on a single thread, but computations could be made more efficient by parallelizing runs. End-to-end ML platforms allow you to do that and can provide a single environment in which to define and run full ML pipelines. With Google AI Platform, for instance, you can use Google Cloud data products such as Dataprep (a data wrangling tool provided by Trifacta), Dataflow (a simplified stream and batch data processing tool), and BigQuery (a serverless cloud data warehouse), and you can define a training application based on TensorFlow or built-in algorithms (e.g. XGBoost). When processing large volumes of data, Spark is a popular choice. Databricks, the company founded by Spark's creators, also provides an end-to-end platform.

Alternatively, each step of the workflow might run on a different platform or in a different computing environment. One option is to execute these steps in different Docker containers. Kubernetes is one of the most popular open-source container orchestration systems among ML practitioners. Kubeflow and Seldon Core are open source tools that allow users to describe ML pipelines and turn them into Kubernetes clustered applications. This can be done in a local environment, and the application can be run on a Kubernetes cluster, which could be installed on-premise or provided in a cloud platform — Google Kubernetes Engine for instance, which is used by Google AI Platform, or Azure Kubernetes Service, or Amazon EKS. Amazon also provides an alternative to Kubernetes with Fargate and ECS. Apache Airflow is another open-source workflow management tool, originally developed by Airbnb. Airflow has become a popular way to coordinate the execution of general IT tasks, including ML ones, and it also integrates with Kubernetes.

Active learning for more advanced workflows

As hinted earlier, domain experts may be required to access a data labeller, where they would be shown inputs and asked to label them. These labels would be stored in a database, and would then be available to the orchestrator for use in training/validation/test data. The choice of which inputs to present for labeling could be made manually, or it could be programmed into the orchestrator. This could be done by looking at production inputs where the model was right but not confident, or where it was very confident but wrong — which is the basis of "active learning".
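A minimal selection rule of that kind could look like this (for a binary classifier exposing predicted probabilities; names are illustrative):

```python
def select_for_labeling(items, k):
    """Pick the k inputs whose predicted probability is closest to 0.5.

    items: list of (input_id, predicted_probability) pairs.
    """
    least_confident = sorted(items, key=lambda t: abs(t[1] - 0.5))
    return [input_id for input_id, _ in least_confident[:k]]
```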

Model Builder

The model builder is in charge of providing an optimal model. For this, it trains various models on the training set and evaluates them on the validation set, with the given metric, in order to assess optimality. Note that this is identical to the OptiML example explored in the previous article:

$ curl "https://bigml.io/optiml?$BIGML_AUTH" \
       -d '{"dataset": "<training_dataset_id>",
            "test_dataset": "<test_dataset_id>",
            "metric": "area_under_roc_curve",
            "max_training_time": 3600}'

BigML automatically makes the model available via its API, but with other ML development platforms you might want to package the model, save it as a file, and have your model server load that file.

Result of an “Automated ML” experiment on Azure ML. You can download the best model that was found, or deploy it on Azure.

If you use a different ML development platform, or no platform at all, it's worth architecting your system in such a way that models are automatically created by a dedicated service, which takes in a training set, a validation set, and a performance metric to optimize.

Model Server

The role of a model server is to process API requests for predictions against a given model. For this, it loads a model representation saved in a file and applies it, thanks to a model interpreter, to the inputs found in the API request; predictions are then returned in the API response. The server should allow for serving multiple API requests in parallel and for model updates.

Here is an example request and response for a sentiment analysis model that takes just one textual feature as input:

$ curl <model_api_url> \
  -H 'X-ApiKey: MY_API_KEY' \
  -d '{"input": "I love this series of articles on ML platforms"}'
{"prediction": 0.90827194878055087}

Different model representations exist, such as ONNX and PMML. Another standard practice is to persist models, which live as objects in a computing environment, in a file. This also requires saving a representation of the computing environment, in particular of its dependencies, so the model object can be re-created. In that case, the model "interpreter" just consists of something like model.predict(new_input).
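For instance, with Python objects, the save/load round trip could be sketched with the standard pickle module (the model class below is a stand-in; any picklable object with a predict method works the same way, and the environment's dependencies would still need to be captured separately, e.g. in a requirements file):

```python
import pickle

class ThresholdModel:
    """Stand-in for a trained model object living in memory."""
    def __init__(self, threshold):
        self.threshold = threshold
    def predict(self, new_input):
        return 1 if new_input >= self.threshold else 0

def save_model(model, path):
    with open(path, "wb") as f:
        pickle.dump(model, f)

def load_model(path):
    # the "interpreter" is just the unpickled object's predict method
    with open(path, "rb") as f:
        return pickle.load(f)
```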

Front-end

The front-end can serve multiple purposes:

  • simplify the model’s output, for instance by turning a list of class probabilities into the most likely class;
  • add to the model’s output, for instance by using a black box model explainer and providing a prediction explanation (in the same way as Indico does);
  • implement domain-specific logic, such as decisions based on predictions, or a fallback when receiving anomalous inputs;
  • send production inputs and model predictions for storage in the production database;
  • test new models, by also querying predictions from them (in addition to the “live” model) and storing them; this would allow the monitor to plot performance metrics over time, for these new candidate models.

Model lifecycle management

If a new candidate model gives better performance than the current one on the test dataset, it's possible to test its actual impact on the application by having the front-end return this model's predictions for a small fraction of our application's end-users (canary testing). This requires the evaluator and monitor to implement application-specific performance metrics. Test users could be taken from a list, or they could be chosen by one of their attributes, by their geolocation, or purely randomly. As they monitor performance and grow confident that the new model is not breaking anything, developers can gradually increase the proportion of test users and perform an A/B test to further compare the new model against the old one. If the new model is confirmed to be better, the front-end would simply "replace" the old model by always returning the new model's prediction. If the new model ends up breaking things, rollback can also be implemented via the front-end.
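The traffic split itself can be as simple as hashing the user id into a bucket (a sketch under the assumption that routing should be deterministic per user, so the same end-user always sees the same model):

```python
import hashlib

def pick_model(user_id, fraction_to_b):
    """Route a stable fraction of users to candidate model B, the rest to A."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "B" if bucket < fraction_to_b * 100 else "A"
```

Increasing `fraction_to_b` gradually shifts traffic from A to B without any user flip-flopping between models.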

Gradually directing traffic to model B and phasing model A out (source)


If you’re curious about real-world ML, I hope this article was useful in showing why ML development platforms, and model building in general, aren’t enough to create a system that has a real impact on end-users.

If you’re serious about building high-value, proprietary ML systems for your company, and if you’re looking to learn more, check out my online course that I will be launching on April 21, 2020: OWN MACHINE LEARNING!

Sharing the power to create value with Machine Learning systems 💪🦾 Author of the ML Canvas. Course creator at
