The Machine Learning Lifecycle I: Business Objectives to Data Pipelines

Technoid Editor · Published in Technoid Community · 8 min read · Mar 17, 2023

This is a series about machine learning (ML) from a system design perspective. It is intended for readers who are familiar with engineering software systems and have a basic background in machine learning, the fundamentals of which will not be covered here [0].

In the present post, we will introduce the machine learning lifecycle and its initial stages up to constructing data pipelines.

In the next post, we will cover developing models, serving them, and monitoring them in production settings.

In future instalments in this series, we will explore key components of machine learning systems, deployment patterns, and MLOps.

Machine learning is more than ever a staple of our everyday experience with technology. Especially with online services, we are likely to encounter predictions generated by models or to have our user journeys supported in myriad ways by such predictions. These ‘predictions’ might be in the form of recommendations from streaming services, the quiet assessment of your transactions as non-anomalous when using financial services, wearable devices detecting and classifying your activities, writing assistance for work emails, and even creative prose or images from generative systems like ChatGPT or Stable Diffusion. We will continue to use the term ‘predictions’ loosely to mean any non-trivial continuation of a pattern in this vein [2].

In each of these cases, the underlying systems learn non-trivial functions from data to produce models which, when fed new data points, produce approximations of the function evaluated at those points.

As a concrete example, we might be interested in learning to tell when a particular object (or class of objects) appears in an image. Using readily available libraries, we can train an object detection model on many pairs of images and annotations for the presence (or absence) of this object. Our application might simply offer an interface to the trained model: a user uploads an image, the model produces an output approximating the true function evaluated on that image (the true function being an oracle that is never wrong about whether the object is present), i.e., its best guess as to whether the image contains the object, and the application prints this assessment for the user to read.
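
As a rough illustration of what such a training step might look like, here is a minimal sketch that fine-tunes a pretrained image classifier for a binary "object present / absent" task. It assumes PyTorch and torchvision (0.13 or later) are installed and that annotated images are laid out under `data/train/present` and `data/train/absent`; these paths and the hyper-parameters are placeholders, not a prescription.

```python
# Minimal sketch: fine-tune a pretrained backbone for binary
# "object present / absent" classification with torchvision.
# Assumes images are organised as data/train/<class_name>/*.jpg.
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

train_set = datasets.ImageFolder("data/train", transform=transform)
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)

model = models.resnet18(weights="IMAGENET1K_V1")
model.fc = nn.Linear(model.fc.in_features, 2)  # two classes: present / absent

optimiser = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(3):  # a handful of epochs, for illustration only
    for images, labels in train_loader:
        optimiser.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimiser.step()

torch.save(model.state_dict(), "object_presence_model.pt")
```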

While modern machine learning systems can be significantly more complex, this example touches on the key aspects of the lifecycle of these systems.

The Machine Learning Lifecycle [1]

Figure 1: The Machine Learning Lifecycle

As the cyclic nature of the illustration suggests, the scoping, design, build, and release of this kind of system is an iterative process. You might also notice that while some arrows are one-way, suggesting a straightforward dependence, others are bi-directional: in completing a later step, the previous step might have to be revisited and updated.

Let us consider each step below grouped into four basic areas: translation, data processing, model development, and serving. Keep in mind, however, that these steps need not always be sequential.

We will return to our concrete example for some plausible scenarios to anchor our understanding.

Translation

The translation step is about fixing on the project scope and hence our design scope. It is the time we spend turning over the problem and hypothesising approaches. There are three kinds of constraints here that need to be traded off against each other: what the business needs, what data is available and how, and what models are available and what they need.

Setting Business Objective(s)

A natural entry point to the diagram is in the setting of one or more business objectives. What is sought in terms of business value? Is the business hypothesis at hand well-posed, reasonable, and backed by sufficient consensus to be worth exploring?

If these questions are satisfied, we can then pose the hypothetical: if an outcome were made available in the form of a ‘prediction’, how accurate should it be and how ought it to be made available?

The answer here ought to be nuanced, considering a range of scenarios rather than a single target.

In our object detection example, we might ask how much user traffic we might expect, how many users will be retained with 80% accuracy in detecting the object versus 90%, or how this retention is affected by the latency between a user uploading an image and being served an answer, for instance whether it is on the order of seconds versus minutes. This will inform trade-offs to be made in subsequent steps.

Finally, it is worth asking how success, i.e., how aligned a solution is to actual business value, would be measured once an outcome is available, and where reasonable, how discrepancies can be fed back into the system to better align with expectations.

Constraints on the project itself should also be considered here since they can capture the cost of exploring the business hypothesis. These can include budget and resource availability (both personnel and compute), timeline, and others, and their determination can help scope subsequent steps.

Gaining data understanding

Guiding questions for gaining understanding of the available data include: what data is available and through what means; whether it is annotated; its provenance and reliability; whether there are material differences between how it is stored and how it will be initially entered into or captured by the operational system; what related datasets or metadata may be helpful; and any sensitivity around the data.

Moreover, the same dataset can be approached with different objectives in mind to produce different analyses or to pick up on different aspects of it. Looking at the data with the lens of the business hypothesis may even reveal heuristics that can compete with a machine learning based solution. Having considered different scenarios for acceptable outcomes in the previous step, you can decide if such a heuristic is good enough to take to production instead, at least for a first pass.

In our object detection example, if we notice that, in the images users upload, the object we are interested in is almost always the only round one, then instead of training a machine learning model we can use fast, training-free image processing techniques such as edge detection and circle detection as a heuristic. However, if our understanding of the data changes (for instance, if we start seeing more round objects in user-uploaded images), we may need to change tack.
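
As a sketch of what such a heuristic might look like, the snippet below uses OpenCV's Hough circle transform to flag an image as containing the object whenever at least one circle is found. The thresholds, and the "single round object" assumption itself, are illustrative only.

```python
# Minimal sketch of a training-free heuristic: declare the object
# "present" if the Hough transform finds at least one circle.
# Parameters below are illustrative and would need tuning on real data.
import cv2

def object_probably_present(image_path: str) -> bool:
    image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if image is None:
        raise ValueError(f"Could not read image: {image_path}")
    blurred = cv2.medianBlur(image, 5)  # reduce noise before circle detection
    circles = cv2.HoughCircles(
        blurred,
        cv2.HOUGH_GRADIENT,
        dp=1.2,        # inverse ratio of accumulator resolution
        minDist=50,    # minimum distance between detected centres
        param1=100,    # upper Canny edge threshold
        param2=30,     # accumulator threshold (lower detects more circles)
        minRadius=10,
        maxRadius=200,
    )
    return circles is not None

print(object_probably_present("upload.jpg"))
```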

Framing the machine learning problem

The machine learning problem may be addressed through an end-to-end solution or a more modular solution for the output prediction, but in either case the way it is framed will be constrained by both the business objective and the state of the data we can utilise in service of that objective.

Once a formal task is conceived, a suitable validation strategy must be put in place. This will be the testbed for experimentation with different model architectures. For our example, we can make do with randomly splitting the annotated images into a train set (say 70% of the images, which can itself be split further into a train set and a validation set, the latter used during training to track performance or to tune (hyper-)parameters) and a test set (the remaining images). We must then choose a metric with which to assess the fit of the model: since we are classifying whether or not the object is contained in a given image, we can count true positives, false positives, false negatives (and true negatives) and compute, for instance, an F1 score, or a different metric that trades off precision against recall. This single metric will vary within some range (here between 0 and 1) with a clear direction of improvement (here higher is better). It may not be the loss we optimise directly during model training (for instance, because it is not differentiable), but it should track the loss reasonably well, be intuitive for stakeholders, and have their agreement.
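
A minimal sketch of this split-and-score setup, assuming scikit-learn and a binary label per image (the paths, labels, and predictions below are placeholders):

```python
# Minimal sketch: hold-out split and an F1-based evaluation,
# assuming `image_paths` with binary `labels` (1 = object present).
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

image_paths = [f"images/img_{i}.jpg" for i in range(1000)]  # placeholder data
labels = [i % 2 for i in range(1000)]                       # placeholder labels

# 70% train (to be split further into train/validation), 30% held-out test.
train_paths, test_paths, train_labels, test_labels = train_test_split(
    image_paths, labels, train_size=0.7, stratify=labels, random_state=42
)
train_paths, val_paths, train_labels, val_labels = train_test_split(
    train_paths, train_labels, train_size=0.8, stratify=train_labels, random_state=42
)

# After training a model, compare its predictions on the test set
# against the ground truth with a single headline metric.
predicted = [1] * len(test_labels)  # stand-in for real model predictions
print("F1 on held-out test set:", f1_score(test_labels, predicted))
```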

Armed with our understanding of the data from the previous step, we can now also enumerate the possible input features available to us and have a sense of what it would take to access and transform them.

Data Processing

Here we consider both potential data collection where an analytical store or at least an operational store exists, and the transformation of this data for consumption by modelling pipelines. More involved data collection needs may necessitate purpose-built systems and are not considered here.

Building the data pipeline(s)

Data pipelines are intended to get data to training or serving pipelines from their generation or storage points. They need to be correct and robust as the rest of the system is dependent on them, with data quality tests, log generation, error handling and monitoring put in place with due care and deliberation.

Their development needs to consider the access and connectivity protocols supported by source systems and whether the throughput and latency imposed by these support the use case as discovered in the data understanding step. During development, it is important to ensure that schema mappings are predictable or, where they are not, that changes are handled appropriately. Where appropriate, anonymisation or pseudonymisation of sensitive data should also be applied at this point.
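
As an indication of what lightweight schema and data quality checks might look like at ingestion, here is a sketch using pandas; the column names and rules are hypothetical examples rather than a recommended schema.

```python
# Minimal sketch of schema and data quality checks at ingestion,
# using pandas. Column names and rules are hypothetical examples.
import pandas as pd

EXPECTED_COLUMNS = {"image_id", "upload_ts", "label"}

def validate_batch(df: pd.DataFrame) -> pd.DataFrame:
    # Schema check: fail fast if expected columns are missing.
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {missing}")

    # Explicit type coercion with errors raised, rather than silent drift.
    df = df.astype({"image_id": "string"})
    df["upload_ts"] = pd.to_datetime(df["upload_ts"], errors="raise")

    # Basic quality rules: no duplicate ids, labels restricted to {0, 1}.
    if df["image_id"].duplicated().any():
        raise ValueError("Duplicate image_id values in batch")
    if not df["label"].isin([0, 1]).all():
        raise ValueError("Unexpected label values outside {0, 1}")

    return df
```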

There may be some back-and-forth iteration between this step and the model development step, as different models require different forms of input, and many also need issues like missing values or class imbalances to be handled. You may also identify opportunities to improve the data itself during this iteration.

Often, the difference that higher-quality data, or simply more data, can make will eclipse the improvement gained by swapping models. Modern data-centric techniques [3] leverage this, and this is the step where such transformations or augmentations might best be implemented.
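
For instance, a simple augmentation pipeline (shown here with torchvision transforms, as one possible choice) can enlarge the effective training set without touching the model at all:

```python
# Minimal sketch: enlarging the effective training set with simple
# augmentations, using torchvision transforms as one possible choice.
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(p=0.5),                 # mirror half the time
    transforms.ColorJitter(brightness=0.2, contrast=0.2),   # mild lighting changes
    transforms.RandomRotation(degrees=10),                  # small rotations
    transforms.ToTensor(),
])

# Plugging this into the earlier ImageFolder dataset applies the
# augmentations on the fly, so each epoch sees slightly different images:
# train_set = datasets.ImageFolder("data/train", transform=train_transform)
```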

In terms of feature engineering, we may need to consider batch features that can be pre-computed and stored as well as real-time or near real-time features that need to be computed on data streams. Features that can be used across different models may be maintained in a feature store to ensure consistency and correctness.
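
As a rough sketch, a batch feature might be precomputed on a schedule and written to a store keyed by entity, while a real-time feature is computed on the incoming request. The feature names below are hypothetical, and a plain dictionary stands in for the feature store.

```python
# Rough sketch: a batch feature precomputed offline plus a real-time
# feature computed at request time. Names and the store are hypothetical.
from datetime import datetime, timezone

# Stand-in for a feature store: a mapping from user_id to batch features.
batch_feature_store: dict[str, dict[str, float]] = {}

def compute_batch_features(upload_history: dict[str, list[datetime]]) -> None:
    """Precompute per-user upload counts on a schedule (e.g., nightly)."""
    for user_id, timestamps in upload_history.items():
        batch_feature_store[user_id] = {"uploads_last_30d": float(len(timestamps))}

def compute_request_features(user_id: str, request_time: datetime) -> dict[str, float]:
    """Combine stored batch features with features known only at request time."""
    features = dict(batch_feature_store.get(user_id, {"uploads_last_30d": 0.0}))
    features["hour_of_day"] = float(request_time.hour)  # real-time feature
    return features

compute_batch_features({"user_1": [datetime.now(timezone.utc)]})
print(compute_request_features("user_1", datetime.now(timezone.utc)))
```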

There may also be a need to build separate pipelines for training and inference. In such cases, implementing reusable modules and functions that can be used across both can minimise inconsistencies. There may need to be post-processing of results as well as pre-processing, especially where inferences are required that are readily interpretable by end-users.
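
One way to keep training and serving consistent is to route both through the same pre- and post-processing functions. A minimal sketch, with illustrative function names, assuming the PyTorch-based setup from the earlier examples:

```python
# Minimal sketch: shared pre- and post-processing used by both the
# training pipeline and the inference pipeline, so the two cannot drift.
import torch
from PIL import Image
from torchvision import transforms

_preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

def preprocess(image: Image.Image) -> torch.Tensor:
    """Identical transformation for training batches and live requests."""
    return _preprocess(image)

def postprocess(logits: torch.Tensor) -> str:
    """Turn raw model outputs into a message an end-user can read."""
    probability = torch.softmax(logits, dim=-1)[..., 1].item()
    if probability > 0.5:
        return f"Object detected ({probability:.0%} confident)"
    return f"No object detected ({1 - probability:.0%} confident)"
```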

Once the pipelines are prepared for transforming data into a format that can be processed by the models to be tested, model development can commence. This too is an iterative process with model-side needs determining the evolution of the pipeline.

In the next post, we will continue to explore the machine learning lifecycle.

Author

Credits for this article must go to the following author for contributing their time, effort, and knowledge to the Technoid Community.

Author: Yasiru Ratnayake

