Evaluating Model Performance: Key Metrics to Assess Before Transitioning to Production

Stefan Mićić
4 min read · Mar 22, 2023
https://censius.ai/blogs/challenges-in-deploying-machine-learning-models

In the rapidly evolving world of machine learning and artificial intelligence, deploying models to production is a critical step in the lifecycle of a machine learning model. Ensuring a safe transition requires a comprehensive evaluation of a model’s performance to guarantee its reliability, accuracy, and efficiency in real-world scenarios. This article delves into the metrics that should be calculated and analyzed before pushing a model from development to production. By understanding and using these metrics, data scientists and engineers can optimize their models, mitigate potential risks, and ensure that their solutions consistently deliver value to end-users and stakeholders.

Basic metrics

Upon the completion of each training process, data scientists assess the performance of the recently trained model by computing a collection of established metrics, including precision, recall, accuracy, and F1-score, among others. The significance of these metrics may vary depending on the specific business requirements. While this evaluation serves as a valuable initial benchmark, it is insufficient on its own to determine whether the new model should replace its predecessor. An existing model with a 95% F1-score, for instance, clearly should not be replaced by one scoring 80%, but a higher score alone does not justify promotion. This initial assessment should therefore be regarded as a preliminary hurdle that a new model must overcome before it is deemed suitable for deployment in a production environment.
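As a minimal illustration, these basic metrics can be computed with scikit-learn on a hold-out test set; the labels below are placeholders rather than real data.

```python
# Minimal sketch: computing basic metrics on a hold-out test set.
# y_true and y_pred are placeholder label arrays; in practice they come
# from your own evaluation pipeline.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0, 1, 1, 0, 1, 1, 0, 1]   # ground-truth labels (placeholder data)
y_pred = [0, 1, 0, 0, 1, 1, 1, 1]   # model predictions (placeholder data)

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1-score :", f1_score(y_true, y_pred))
```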

Inference speed

Consider a scenario in which the existing model boasts a 90% F1-score (assuming it is the most pertinent metric) and the newly trained model achieves 92%. It would still be imprudent to deploy the new model to production without further scrutiny. Factors such as latency and throughput could create significant problems in the system if they deteriorate after deployment.

Latency can be understood as the duration required for a request to be processed (i.e., the waiting time after submitting a request to the system), while throughput represents the number of records that can be processed by the system within a given time frame.

For instance, if there is a requirement to process 100,000 articles per day and each article must be processed within 5 seconds, it is crucial to incorporate these calculations into the evaluation metrics prior to transitioning the model to a production environment.
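A rough sketch of how such a check might look is shown below. The predict_fn here is a hypothetical stand-in for the model’s serving call, and the sleep merely simulates inference time; the requirement thresholds are the ones from the example above.

```python
import time

# Hypothetical stand-in for the model's serving call.
def predict_fn(article: str) -> str:
    time.sleep(0.05)  # placeholder for real inference work
    return "label"

articles = ["some article text"] * 100   # placeholder workload

start = time.perf_counter()
for article in articles:
    predict_fn(article)
elapsed = time.perf_counter() - start

latency_s = elapsed / len(articles)        # average seconds per request
throughput_per_day = 86_400 / latency_s    # records one replica could handle per day

print(f"avg latency: {latency_s:.3f}s (requirement: <= 5s)")
print(f"est. daily throughput per replica: {throughput_per_day:,.0f} (requirement: >= 100,000)")
```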

Resources

Even when the newly trained model demonstrates superior performance in terms of fundamental metrics and speed, there may still be obstacles to address. Consider the following situation.

Suppose an EC2 instance possesses 50GB of storage, with the current model occupying 5GB. In theory, this allows for 10 replicas of the model. However, if a newly trained model exhibits a 3% improvement in F1-score, yet requires 10GB of storage, only five replicas would be possible. This raises the question: is this truly an enhancement? The likely answer is no, as the overall throughput may suffer as a consequence.
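A back-of-the-envelope sketch of this trade-off, using the illustrative numbers from the text and an assumed per-replica throughput:

```python
# Replica trade-off check (numbers are illustrative, not measurements).
storage_gb = 50

def total_throughput(model_size_gb: int, per_replica_throughput: int) -> int:
    replicas = storage_gb // model_size_gb
    return replicas * per_replica_throughput

# Assume each replica serves ~10 records/s regardless of model version.
print(total_throughput(5, 10))    # current model: 10 replicas -> 100 records/s
print(total_throughput(10, 10))   # new model    :  5 replicas ->  50 records/s
```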

Another potential issue is memory leakage. For example, if records are continuously accumulated within a single list, excessive memory may be consumed over time, leading to system instability.
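A simplified sketch of this anti-pattern and one common mitigation, which keeps only a bounded window of recent records (the window size below is arbitrary):

```python
from collections import deque

# Anti-pattern: every processed record is kept forever, so memory grows without bound.
processed_records = []

def handle_record_leaky(record):
    processed_records.append(record)   # never cleared -> memory climbs over time

# Mitigation: retain only a bounded number of recent records.
recent_records = deque(maxlen=1_000)   # arbitrary window size

def handle_record_bounded(record):
    recent_records.append(record)      # oldest entries are dropped automatically
```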

Potential solution evaluation process

One approach to establishing a rigorous evaluation process and mitigating potential issues upon transitioning to production involves the following steps:

  1. Create a consistent hold-out test set that remains unchanged and serves as the basis for evaluating every new version of the model.
  2. As soon as the model is deployed to a test environment (e.g., development or staging), compute standard metrics such as F1-score, precision and recall. Concurrently, monitor memory and CPU usage. For instance, record the consumption levels before and after testing and determine the percentage increase in consumption.
  3. Measure the model’s latency and estimate its throughput (a sketch combining steps 2 and 3 follows this list).
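One possible way to implement steps 2 and 3, using psutil to track resource consumption around a test run; model.predict and the test set are assumed placeholders for your own interfaces:

```python
import time
import psutil

process = psutil.Process()  # the current serving/evaluation process

def evaluate(model, test_set):
    # Resource usage before the test run.
    mem_before = process.memory_info().rss / 1e6     # MB
    cpu_before = process.cpu_times().user            # CPU seconds

    start = time.perf_counter()
    predictions = [model.predict(x) for x in test_set]   # hypothetical predict API
    elapsed = time.perf_counter() - start

    # Resource usage after the test run.
    mem_after = process.memory_info().rss / 1e6
    cpu_after = process.cpu_times().user

    return {
        "latency_s": elapsed / len(test_set),
        "throughput_per_s": len(test_set) / elapsed,
        "memory_increase_pct": 100 * (mem_after - mem_before) / mem_before,
        "cpu_seconds": cpu_after - cpu_before,
    }
```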

Ideally, this process should be automated and incorporate a tool for experiment versioning. A suitable technology stack for this purpose may include Python, Kubernetes (K8s) and MLflow. The resulting data can then be presented in a comprehensive table for ease of analysis and comparison; one example of such a table is shown below.
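For instance, the collected results could be logged to MLflow so that every candidate run is versioned and comparable; the experiment name, run name, and metric values below are hypothetical.

```python
import mlflow

# Hypothetical metric values gathered by the evaluation steps above.
results = {"f1": 0.92, "latency_s": 1.3, "throughput_per_s": 12.4, "memory_increase_pct": 4.0}

mlflow.set_experiment("model-promotion-checks")   # hypothetical experiment name
with mlflow.start_run(run_name="candidate-v2"):   # hypothetical run name
    mlflow.log_param("model_version", "v2")
    for name, value in results.items():
        mlflow.log_metric(name, value)
```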

Example of experiment tracking

Conclusion

In this article, we have introduced a potential solution for constructing a more secure environment for deploying machine learning models to production. Naturally, there are numerous additional metrics that can be assessed depending on specific requirements. It is advisable to incorporate as many relevant metrics as feasible (without overextending) to confidently determine whether a new model outperforms its predecessor and is ready for deployment.

If you found this article insightful, we invite you to visit our website and contact us for further discussions and exploration of this topic.
