Chapter 1 — Getting Started: Full Stack ML Course (Part 3: Model Creation, Model Assessment, Model Deployment)

Shivam Kaushik
14 min read · Mar 31, 2024


In the previous two segments of Chapter 1, I aimed to establish a foundation for upcoming practical chapters, which will feature notebooks and code examples. However, before delving into practical applications, it’s crucial to familiarize readers with key concepts and provide insights and examples to develop intuition. In Part 1, I briefly introduced the model lifecycle, while in Part 2, I delved deeper into specific components such as data acquisition and exploratory data analysis (EDA), demonstrating how the type of data collected influences the output. Additionally, I highlighted some noteworthy Kaggle gems, which serve as exemplary EDA notebooks.

In this segment, I will focus on introducing model creation and, more importantly, model assessment. I will try to develop some intuition about deep learning model designs for various tasks, and then cover the components required for model creation. I will also illustrate with an example where model assessment revealed a bug in our pipeline.

3.4 Model creation

Honestly, while this step receives significant attention from the research community, it tends to require the least amount of time in practice. Nonetheless, the effectiveness of your approach heavily relies on this stage, provided you have quality data. In practice, there are numerous tools and methods at your disposal, and deciding which ones to employ largely depends on various factors. Determining whether to utilize machine learning (such as logistic regression, tree-based methods, SVM, etc.) or deep learning (including CNN, Transformers, etc.) is another aspect to consider. Additionally, understanding the key components of model development is crucial. These are among the questions we aim to address.

Model types available for tackling a problem

Consider all the methods you studied in ML101, Intro to Machine Learning, the ML Specialization from deeplearning.ai, or whatever MOOC you took as tools in a Swiss Army knife. Which one to use depends on various factors like

  1. Amount of data
  2. Feature set
  3. Complexity of features
  4. Hardware available

I won’t delve deeply into this topic here, as a better understanding will come with familiarity with the methods and hands-on practice. However, here’s a brief overview.

Amount of data

When dealing with limited data, it’s wise to steer clear of deep learning techniques to avoid the risk of overfitting, given their higher parameter count. Instead, you might opt for machine learning (ML) methods like logistic regression, tree-based models, or support vector machines (SVM), weighing their respective advantages and disadvantages. This caution arises from the concept known as the curse of dimensionality. Essentially, it posits that as the number of features increases, the necessary number of examples grows exponentially. In sparse, high-dimensional spaces, models are more susceptible to overfitting. For a clearer grasp, take a look at the image below.

Illustration of the “Curse of Dimensionality”. On the left: an image containing one data point and two features, showing multiple potential lines as solutions; these lines may not accurately represent the solution for the validation set or production data. On the right: an image with two data points and three features, presenting multiple planes as potential solutions, which likewise may not fit the validation set or production data. The idea: to represent a problem properly, the training dataset must contain a large number of data points for a method to learn the problem. Too few data points can be fitted easily by many solutions, most of which will not be correct.

Feature Set

If you’ve already defined features or crafted them manually and aim to discern relationships with the output variable, opting for a machine learning (ML) based system is advisable. ML algorithms are particularly well-suited for tabular data, and many Kaggle competitions centered around tabular data still rely on ML methods. Conversely, when you lack a robust feature set, employing deep learning (DL) methods is preferable as they can learn features autonomously. In essence, for structured data, ML methods are preferred, whereas DL methods may be chosen for unstructured data. However, it’s important to note that this is a very general guideline and may vary depending on the specific circumstances.

Complexity of features

This ties into the previous point: deep learning methods are employed when the manual feature engineering required would be highly intricate.

Hardware Available

Deep learning methods demand substantial computational resources, primarily because they heavily involve matrix multiplication operations. Utilizing GPUs or other forms of accelerated computing is often necessary to meet these computational demands.

3.4.1 Understanding various deep learning model designs

I won’t delve into specifics here, but let’s aim to develop an intuitive understanding of how models are designed for different use cases. Now, let’s explore this section more deeply.

Classification Network

Illustration displaying a classification network featuring an encoder that transforms an image into floating point numbers. These numbers are then utilized by the classifier head to classify whether the image depicts a hot dog or not.

Classification networks represent one of the simplest design paradigms. To grasp their design, it’s essential to establish certain parameters specific to the task, a checklist we’ll adhere to for other use cases as well:

  1. Input: The input comprises unstructured data, such as images, text, sound data, or time series.
  2. Output: The desired output is a category.
  3. Objective: The goal is to classify the provided input and produce a category corresponding to the example.
  4. Design Overview: Initially, we encode the given data into a hidden representation consisting of floating-point numbers. This representation is then fed into the classification head, typically a linear layer with optional activation functions, to produce output probabilities within the range of 0 to 1 (see the sketch after this list).
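
To make the pattern concrete, here is a minimal PyTorch sketch of the encoder + classification-head design. The tiny CNN encoder and the class name are illustrative choices, not a prescribed architecture; any pretrained backbone could stand in for the encoder.

```python
import torch
import torch.nn as nn

class HotdogClassifier(nn.Module):
    """Encoder + classification head, as in the diagram above."""
    def __init__(self, num_classes: int = 2):
        super().__init__()
        # Encoder: turns an image into a vector of floating-point numbers.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # -> (B, 32, 1, 1)
            nn.Flatten(),             # -> (B, 32) hidden representation
        )
        # Classification head: a linear layer mapping features to class logits.
        self.head = nn.Linear(32, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = self.encoder(x)    # hidden representation
        return self.head(features)    # raw scores; softmax gives probabilities

model = HotdogClassifier()
probs = torch.softmax(model(torch.randn(1, 3, 224, 224)), dim=1)  # hot dog / not hot dog
```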

Object Detection

Image showing the components of an object detection network: an encoder, a head that predicts the class, and a head that predicts the bounding box coordinates.

Object detection networks represent a more complex design compared to classification networks. To delve into their design, it’s crucial to establish specific parameters tailored to this task, following a checklist similar to that of classification networks:

  1. Input: The input typically consists of images or videos, where the goal is to detect and locate objects within the scene.
  2. Output: The desired output includes bounding boxes specifying the location of detected objects along with their corresponding class labels.
  3. Objective: The aim is to accurately identify and localize objects within the input image or video.
  4. Design Overview: Initially, the input data undergoes feature extraction using a convolutional neural network (CNN) to capture relevant spatial information. Subsequently, these features are passed to a region proposal network (RPN) or a similar mechanism to generate candidate object bounding boxes. These proposals are then classified by a classification head and refined by a box-regression head to produce the final predictions, consisting of bounding boxes and class labels for detected objects (see the sketch after this list).
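
The full two-stage RPN pipeline is involved, so here is a deliberately simplified single-stage sketch (closer in spirit to SSD/RetinaNet than to the RPN design above) that shows the essential idea of parallel classification and box-regression heads. All sizes and names here are illustrative:

```python
import torch
import torch.nn as nn

class TinyDetector(nn.Module):
    """Backbone + parallel class/box heads (single-stage, one anchor per cell)."""
    def __init__(self, num_classes: int = 20):
        super().__init__()
        # Backbone CNN: extracts a spatial feature map from the image.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Classification head: class logits at every feature-map location.
        self.cls_head = nn.Conv2d(64, num_classes, kernel_size=1)
        # Regression head: 4 box offsets (x, y, w, h) at every location.
        self.box_head = nn.Conv2d(64, 4, kernel_size=1)

    def forward(self, x):
        feats = self.backbone(x)
        return self.cls_head(feats), self.box_head(feats)

model = TinyDetector()
cls_logits, box_deltas = model(torch.randn(1, 3, 224, 224))
# cls_logits: (1, 20, 56, 56), box_deltas: (1, 4, 56, 56)
```

In practice you would more likely start from a ready-made detector, such as torchvision's Faster R-CNN implementation, than build one by hand.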

Segmentation Task

For segmentation tasks, the network architecture is tailored to the specific requirements of segmenting objects within an image. Here’s an outline akin to the previous ones:

Image showing the components of a segmentation network: an encoder that first downsamples the image while increasing the number of channels, and a decoder that upsamples the intermediate result back to the original resolution in one channel.
  1. Input: The input comprises images where each pixel needs to be assigned a class label or a category indicating the object or region it belongs to.
  2. Output: The desired output is a segmentation mask or a pixel-wise classification map that delineates the boundaries of objects or regions within the input image.
  3. Objective: The goal is to accurately delineate and classify each pixel in the input image, thus partitioning the image into meaningful segments corresponding to different objects or regions.
  4. Design Overview: The network architecture typically follows an encoder-decoder framework. The input image is processed through an encoder, which extracts hierarchical features through a series of convolutional layers. These features are then decoded, or upsampled, by the decoder to generate a dense pixel-wise prediction map. The final output is a segmentation mask where each pixel is assigned a class label representing the object or region it belongs to (see the sketch after this list).
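
A minimal encoder-decoder sketch in PyTorch follows. Real segmentation networks (U-Net, for instance) add skip connections between encoder and decoder stages; those are omitted here for brevity, and the layer sizes are arbitrary:

```python
import torch
import torch.nn as nn

class TinySegNet(nn.Module):
    """Encoder-decoder: downsample while increasing channels, then upsample back."""
    def __init__(self, num_classes: int = 1):
        super().__init__()
        # Encoder: halves the resolution twice while growing the channel count.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Decoder: upsamples back to the input resolution.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(16, num_classes, 2, stride=2),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))  # (B, num_classes, H, W) pixel-wise logits

mask_logits = TinySegNet()(torch.randn(1, 3, 128, 128))  # -> (1, 1, 128, 128)
```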

Translation Task

For translation tasks, the network architecture is designed to translate text from one language to another. Here’s a brief overview:

Image showing a dialogue from the series “Narcos” being translated to English. There are two components: an encoder, which encodes the Spanish sentence, and a decoder, which iteratively translates the sentence to English.

  1. Input: The input consists of text in one language that needs to be translated into another language.
  2. Output: The desired output is the translated text in the target language.
  3. Objective: The goal is to accurately translate each word or sequence of words from the source language to the target language.
  4. Design Overview: Translation tasks often utilize an encoder-decoder architecture tailored for text. The input text is first processed through an encoder, which converts it into a numerical representation capturing its semantic meaning; this encoder typically consists of recurrent neural network (RNN) layers or transformer layers. The encoded representation is then fed into a decoder, which generates the translated text in the target language, using attention mechanisms to focus on relevant parts of the input during translation. Overall, the encoder-decoder framework converts text from one language to another while preserving its meaning and context (see the sketch after this list).
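
Here is a minimal sketch of that encoder-decoder design using PyTorch's built-in nn.Transformer. Positional encodings, tokenization, and the training loop are omitted for brevity, and the vocabulary sizes are arbitrary placeholders:

```python
import torch
import torch.nn as nn

class TinyTranslator(nn.Module):
    """Encoder-decoder translation model built on nn.Transformer."""
    def __init__(self, src_vocab: int, tgt_vocab: int, d_model: int = 256):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, d_model)
        self.tgt_embed = nn.Embedding(tgt_vocab, d_model)
        # Encoder-decoder with attention, as described above.
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True,
        )
        self.generator = nn.Linear(d_model, tgt_vocab)  # next-token logits

    def forward(self, src_ids, tgt_ids):
        # Causal mask so the decoder cannot peek at future target tokens.
        tgt_mask = self.transformer.generate_square_subsequent_mask(tgt_ids.size(1))
        out = self.transformer(
            self.src_embed(src_ids), self.tgt_embed(tgt_ids), tgt_mask=tgt_mask
        )
        return self.generator(out)

model = TinyTranslator(src_vocab=8000, tgt_vocab=8000)
logits = model(torch.randint(0, 8000, (1, 12)),   # source-language token ids
               torch.randint(0, 8000, (1, 10)))   # target tokens generated so far
# logits: (1, 10, 8000); at inference the decoder runs iteratively, one token at a time.
```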

3.4.2 Ingredients of a Model training routine

Visual representation displaying various elements within a model training framework. If you are a 90s kid with some memories of Cartoon Network, you will recognise this image. I hope it triggers nostalgia in you.

There are several components essential for model training, and I’ll outline some of them here. These modules typically form the foundation when training a deep learning model, and I consistently refer to this checklist when building a model, especially within a PyTorch Lightning module, which I highly recommend for anyone embarking on model training. Now, let’s dive into the checklist (a minimal Lightning sketch tying the pieces together follows the list):

  1. Data / Dataloader: The dataset class is responsible for tasks such as reading, augmenting, and returning individual examples; the data loader wraps it and collates examples into batches, sampled randomly or according to user-defined logic.
  2. Model: This is the primary model class defining the model architecture and the forward pass. Essentially, it encapsulates the logic for processing input from the data loader and executing the forward pass.
  3. Loss function: This component is pivotal, dictating how incorrect predictions are penalized and guiding the learning process.
  4. Metric: Metrics provide crucial quantitative insights into the model’s performance. Initially, one might wonder why both loss and metrics are necessary. While the loss function is designed to be differentiable for optimization purposes, metrics offer a non-differentiable evaluation of model performance, closely aligned with the task at hand.
  5. Optimizer: The optimizer uses the gradients computed during the backward pass to update the model’s parameters. Typically, a standard optimizer is employed across various tasks; the choice influences the speed of convergence, for example by damping oscillations on the same loss surface. This animation gives a perspective into this: https://miro.medium.com/v2/resize:fit:720/format:webp/1*47skUygd3tWf3yB9A10QHg.gif
  6. Scheduler: This component governs the management of the learning rate throughout the training process, determining whether to maintain it constant or adjust it according to a predefined schedule.
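
Here is a minimal sketch of how these six ingredients map onto a PyTorch Lightning module. The model, learning rate, and scheduler choices are placeholders; the point is where each ingredient lives:

```python
import torch
import torch.nn as nn
import pytorch_lightning as pl
import torchmetrics

class LitClassifier(pl.LightningModule):
    def __init__(self, model: nn.Module, lr: float = 1e-3):
        super().__init__()
        self.model = model                                    # 2. Model (architecture + forward pass)
        self.lr = lr
        self.loss_fn = nn.CrossEntropyLoss()                  # 3. Loss function (differentiable)
        self.metric = torchmetrics.Accuracy(                  # 4. Metric (task-aligned, not differentiable)
            task="multiclass", num_classes=2)

    def training_step(self, batch, batch_idx):                # batch comes from 1. Data / Dataloader
        x, y = batch
        logits = self.model(x)
        loss = self.loss_fn(logits, y)
        self.log("train_loss", loss)
        self.log("train_acc", self.metric(logits, y))
        return loss

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=self.lr)                  # 5. Optimizer
        scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)  # 6. Scheduler
        return [optimizer], [scheduler]

# trainer = pl.Trainer(max_epochs=10)
# trainer.fit(LitClassifier(HotdogClassifier()), train_dataloader)
```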

3.5 Model assessment

Alright, let’s discuss what to do after training your model. You’ve achieved convergence and obtained satisfactory or even improved results on the validation data compared to previous experiments. But can you confidently deploy your model based solely on these metrics? In this section, I’ll share my experiences and lessons learned from training numerous models and making mistakes along the way. It’s all part of the learning process.

So, what steps should you take after your model is trained? Here are some checks to ensure you understand what the model has learned:

1. Check Your Metrics Implementation

It’s essential to double-check your metrics before training your model. Make sure your metrics are working correctly by testing them with actual data and comparing the results against the expected outcome. This step is crucial because open-source implementations can sometimes change unintentionally, causing differences in how metrics behave. By verifying your metric implementation, you’ll feel more confident and better understand how the loss affects the metric. Ideally (for a metric where higher is better), you want to see a negative correlation between loss and metric: as the loss decreases, the metric should increase. If that’s not the case, it may mean one of two things (a small verification sketch follows the list below):

  1. Either the loss is not appropriate for learning the task, so reducing it does not actually improve the model, or
  2. The wrong metric is used, and hence even though the loss is reduced, the metric does not represent our objective. (For example, the loss of a classification model may be going down while some hypothetical metric you defined, say ‘ugx-metric’, is not improving. This means ‘ugx-metric’ is not in line with the loss, and whatever the model is learning is not captured by the metric.)
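
A quick way to build that confidence is to run the metric on a tiny hand-crafted case whose expected value you can compute by hand. A sketch using scikit-learn's f1_score (assuming that is the implementation you plan to rely on):

```python
import numpy as np
from sklearn.metrics import f1_score

# Tiny hand-crafted case where the expected value is known in advance.
y_true = np.array([1, 1, 1, 0, 0, 0])
y_pred = np.array([1, 0, 0, 0, 0, 1])  # TP=1, FN=2, FP=1, TN=2

# By hand: precision = 1/2, recall = 1/3, F1 = 2*(1/2 * 1/3)/(1/2 + 1/3) = 0.4
assert np.isclose(f1_score(y_true, y_pred), 0.4)
```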

2. Evaluate Metric Selection

Again, this should have been done beforehand, but it’s essential to reinforce. Consider the example of a classification task for a rare disease like cancer, where class imbalance is prevalent. Suppose only 1% of a dataset of 1000 people are cancer-positive (10 individuals), and the model correctly identifies just 2 of them, raises 1 false alarm, and marks the remaining 989 as true negatives. Accuracy yields a high value: (2 + 989) / 1000 = 99.1%. However, this does not reflect the true performance, as only 2 of the 10 cases were actually caught. In contrast, a metric like the f-measure yields F1 = 2TP / (2TP + FP + FN) = 4/13 ≈ 0.31, indicating the actual performance far more accurately. It’s crucial to ensure that your metrics align with the business requirements; the snippet below reproduces these numbers.
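
A short sketch reproducing the arithmetic above with scikit-learn:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# The numbers from above: 1000 people, 10 cancer-positive (1%).
y_true = np.array([1] * 10 + [0] * 990)
# Model finds only 2 positives, raises 1 false alarm, gets 989 true negatives.
y_pred = np.array([1] * 2 + [0] * 8 + [1] * 1 + [0] * 989)

print(accuracy_score(y_true, y_pred))  # 0.991 -> looks great
print(f1_score(y_true, y_pred))        # ~0.31 -> reveals the real picture
```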

Consider the following examples, each with different metric requirements:

  • Spam Email Detection: In email filtering systems, correctly identifying legitimate emails (negative class) is vital to prevent false positives, prioritizing precision to avoid marking important emails as spam, even at the expense of potentially missing some spam emails (lower recall).
  • Search and Rescue Operations: In search and rescue missions, maximizing recall is essential to ensure no survivors are overlooked, even if it means including some false alarms (lower precision) that may require further verification.

3. Don’t trust the metrics alone

Let’s delve into the crux of this section. Once you’ve finalized your metrics and gained confidence in their reliability, it’s crucial to exercise caution when interpreting metric values. What does that mean? Well, at times, metrics can be misleading. Now, you might wonder if this contradicts my earlier point. Yes, it does, but hear me out for a moment. After completing the training phase, follow these steps:

  1. Verify that your training and validation metrics are closely aligned, indicating minimal overfitting. It’s natural for validation metrics to slightly lag behind training metrics, but there shouldn’t be a significant difference.
  2. Evaluate your metrics on the test set. Is the value substantially lower than the validation metrics? If so, examine the test data. Does it closely resemble the validation data, or are there noticeable differences? This comparison will provide valuable insights.
  3. Also, scrutinize if both validation and test metrics are exceptionally high, approaching an ideal scenario. While not always indicative of a problem, it’s worth investigating further. Plot your predictions to assess the model’s performance visually. You might uncover discrepancies where the metrics don’t align with the actual predictions.
  4. If everything seems fine, conduct additional testing on a subset of the validation data. Infer on each example, creating a pandas dataframe pairing each sample with its corresponding metric. Utilize ipywidgets for interactive visualization to observe how the model performs; personally, I use Weights & Biases. Categorize examples based on performance (good, moderate, and poor) to identify areas for improvement, and plot the examples where the model struggles to provide accurate predictions for in-depth analysis (a minimal sketch of this per-example bookkeeping follows this list). For example, look at this wandb report, which shows how you can analyse a classification model and will give you a new perspective on model assessment: https://wandb.ai/stacey/mendeleev/reports/Tables-Tutorial-Visualize-Data-for-Image-Classification--VmlldzozNjE3NjA?galleryTag=computer-vision . I use wandb reporting when I need to showcase my results to a larger team of data science people.
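
Here is a minimal sketch of that per-example analysis. The file names, scores, and bucket thresholds are hypothetical placeholders; the idea is one row per sample so you can sort, bucket, and plot the failures:

```python
import pandas as pd

# Hypothetical per-example records collected while inferring on the validation set:
# one row per sample with its identifier, prediction, target, and per-sample metric.
records = [
    {"sample": "img_001.png", "pred": 1, "target": 1, "score": 0.97},
    {"sample": "img_002.png", "pred": 0, "target": 1, "score": 0.12},
    # ... one entry per validation example ...
]
df = pd.DataFrame(records)

# Bucket samples by per-example score to find where the model struggles.
df["bucket"] = pd.cut(df["score"], bins=[0, 0.5, 0.8, 1.0],
                      labels=["poor", "moderate", "good"])
worst = df.sort_values("score").head(20)  # plot these for in-depth analysis
```
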
Visual depiction illustrating a typical fetal ultrasound alongside an abnormality characterized by a lemon-shaped skull.

Allow me to share an instance where the above checklist proved invaluable; this methodology was born out of a specific incident. I was tasked with classifying lemon-shaped skulls, an anomaly that occurs in a few infants and can be detected during a second-trimester scan. Upon training the model, I was pleasantly surprised to achieve impressive results, boasting an F1 score of 99% on the validation dataset. However, a pivotal moment ensued before our presentation to upper management. We decided to test our model on textbook examples of lemon-shaped anomalies downloaded from the internet. To our dismay, the model not only failed to classify these examples correctly but also misidentified negative samples, i.e., normal skulls, as lemon-shaped anomalies. To investigate, we employed a technique called Grad-CAM, which generates a heatmap highlighting the pixels influencing the model’s prediction. To our astonishment, the model was not focusing on the lemon-shaped anomaly at all but rather on a faded annotation in the corner that read “lemon shape detected.” This annotation had been overlooked during training and could have led to a humiliating outcome during our presentation.

a) Image displaying select samples from the dataset featuring lemon-shaped skulls. (b) Illustration showcasing the Grad-CAM visualization of these dataset samples. The highlighted red region indicates the focal point of the neural network’s attention for the output. Surprisingly, it is not emphasizing the characteristic skull pinching associated with lemon-shaped skulls, but rather focusing on the bottom-right area. ( c ) Zoomed-in view of the bottom-right section of the image, revealing a textual annotation present in those specific samples.
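
For reference, here is a compact, hand-rolled version of the Grad-CAM idea in PyTorch, assuming a ResNet-style classifier. In practice a maintained library such as pytorch-grad-cam is a sturdier choice; this sketch just shows the mechanics of weighting activations by their gradients:

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(weights=None).eval()
target_layer = model.layer4[-1]  # last conv block of the backbone

# Hooks capture the layer's activations (forward) and gradients (backward).
activations, gradients = {}, {}
target_layer.register_forward_hook(lambda m, i, o: activations.update(v=o))
target_layer.register_full_backward_hook(lambda m, gi, go: gradients.update(v=go[0]))

x = torch.randn(1, 3, 224, 224)
logits = model(x)
logits[0, logits.argmax()].backward()  # gradient of the predicted class score

# Weight each activation channel by its average gradient, sum, then ReLU.
weights = gradients["v"].mean(dim=(2, 3), keepdim=True)
cam = F.relu((weights * activations["v"]).sum(dim=1, keepdim=True)).detach()
cam = F.interpolate(cam, size=x.shape[2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # 0-1 heatmap to overlay
```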

3.6 Model deployment

Image showing two ways to deploy your service as an API.

Model deployment is the step in the full stack model development process where the trained machine learning model is made accessible for inference by integrating it into a production environment. One of the simplest ways to deploy a model is by using a framework like Flask, which lets you serve predictions through a web API (a minimal sketch follows below). Another option gaining popularity is TorchServe, a PyTorch model serving library, which simplifies the deployment process with features like multi-model serving, model versioning, and metrics monitoring; it offers advantages such as scalability, ease of deployment, and compatibility with various deep learning models. Regardless of the deployment method chosen, it’s essential to consider factors like latency, particularly P99 (the response time under which 99% of requests complete, i.e., what your slowest users experience), and concurrency, the number of simultaneous requests the API can handle efficiently. Ensuring optimal performance on both is crucial for a seamless user experience. We will explore this further in later chapters.
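
A minimal sketch of the Flask option described above. The model path, input format, and endpoint name are placeholders for your own pipeline:

```python
import torch
from flask import Flask, jsonify, request

app = Flask(__name__)
model = torch.jit.load("model.pt").eval()  # hypothetical TorchScript-exported model

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body like {"inputs": [[...feature values...]]}.
    inputs = torch.tensor(request.get_json()["inputs"], dtype=torch.float32)
    with torch.no_grad():
        probs = torch.softmax(model(inputs), dim=1)
    return jsonify({"probabilities": probs.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)  # run behind a real WSGI server in production
```

For concurrency, you would typically put this behind a production WSGI server such as gunicorn with multiple workers rather than Flask's built-in development server.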

3.7 Model Monitoring

Monitoring your machine learning models is crucial for keeping them reliable in real-world situations. One big challenge is spotting data shift, which happens when the kind of data coming in changes over time, making the model perform worse.

In real life, we check how well the model’s predictions match reality and how its behaviour evolves over time. If there are big differences or unexpected changes in performance, we know something might be wrong; regular monitoring helps catch these problems early and lets us adjust the model to keep it accurate. Monitoring tools keep an eye on metrics like accuracy, precision, and others to see how well the model is doing, and also track operational signals like how fast the model responds.

Deciding when to update the model depends on a few things, like how much the data has changed, how it’s affecting the model’s performance, and if we have the resources to update it. If the model isn’t performing well or the data has shifted a lot, it’s probably time for an update. Keeping an eye on the model and updating it when needed makes sure it stays reliable for real-world use.
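
As a concrete starting point, here is a small sketch of one common data-shift check: comparing a feature's training distribution against a recent window of production inputs with a two-sample Kolmogorov-Smirnov test. The synthetic data and the threshold are illustrative; in practice the threshold is a judgment call per feature and business need:

```python
import numpy as np
from scipy.stats import ks_2samp

# Stand-ins for the real distributions: training data vs. live traffic.
train_feature = np.random.normal(0.0, 1.0, size=5000)
prod_feature = np.random.normal(0.4, 1.0, size=1000)  # shifted mean simulates drift

stat, p_value = ks_2samp(train_feature, prod_feature)
if p_value < 0.01:
    print(f"Possible drift (KS={stat:.3f}, p={p_value:.1e}): investigate / consider retraining")
```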

4. When Not to use ML

Guidelines for determining when to avoid using Machine Learning:

  1. Opt for traditional software development if it can solve the problem at a lower cost.
  2. Avoid Machine Learning if obtaining accurate data is overly complex or unfeasible.
  3. Steer clear if the task demands a substantial amount of manually labeled data.
  4. Reconsider if the potential cost of errors made by the system is prohibitively high.
  5. Avoid ML if the system does not need frequent improvement or incremental learning from new data over time.
  6. Reconsider if the interpretability of every decision made by the model is essential.

5. Conclusion

In conclusion, the journey through the typical machine learning lifecycle, from problem identification to model deployment and monitoring, is a dynamic and iterative process. Each stage presents its unique challenges and opportunities for learning and improvement. Problem identification sets the stage for the entire process, guiding the direction of data acquisition, exploratory data analysis, and ultimately model creation. Model assessment ensures the quality and reliability of the developed model, paving the way for successful deployment into production environments. However, deployment is not the end of the road; ongoing monitoring and evaluation are essential to ensure that the model continues to perform effectively over time. By embracing the iterative nature of the ML lifecycle and incorporating feedback from monitoring, practitioners can continually refine and enhance their models, driving greater value and impact in real-world applications.
