Quality Assurance in a Machine Learning Environment
Rapid technological change has made it necessary to adapt quickly to this dynamic world. Rather than simply providing a service, the main question has become how the service can be provided optimally in terms of speed, quality, and customer orientation. These have therefore become the crucial factors driving action, especially for companies that monitor user activities closely, such as e-commerce businesses. This is where machine learning, or artificial intelligence, comes into play. As it becomes impossible to manually infer valuable information from huge amounts of data, feeding a machine learning algorithm with that data enables business processes to be automated easily. All these smart processes allow companies to provide faster, higher-quality, and personalized services to each customer individually.
As shown in the image below, the whole process starts with addressing a business problem. After choosing a suitable dataset and a machine learning / deep learning model for the particular problem, the model is evaluated against a training objective. The offline metrics stage is where model performance is analyzed through common metrics (e.g., accuracy, precision, recall, hit rate) and/or custom ones. For instance, accuracy is simply the ratio of correctly predicted observations to the total number of observations, while precision indicates how often the model is right when it claims to be right. Even though each step up to this point gives insight into the quality and performance of the ML model, they are not sufficient. The model should go through an experimental analysis phase where the results are examined from different aspects, especially from a customer-oriented perspective. Depending on the decision taken here, the model may be deployed to production after post-processing and filtering (1); it can be returned to the offline metrics stage (2) or the model & training stage (3) and evaluated with different metrics; or it may be necessary to update the dataset that feeds the model (4) with a data-oriented approach. Depending on your model type, the changes that can be made in the model phase include modifying the hyperparameters, changing the objective function, or replacing the entire model structure.
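The two offline metrics defined above can be sketched in a few lines. This is a minimal illustration with made-up labels, not part of any particular library:

```python
def accuracy(y_true, y_pred):
    # Ratio of correctly predicted observations to total observations.
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def precision(y_true, y_pred, positive=1):
    # Of the cases the model labels positive, the fraction that truly are:
    # "how often the model is right when it says it is right".
    predicted_pos = [t for t, p in zip(y_true, y_pred) if p == positive]
    if not predicted_pos:
        return 0.0
    return sum(t == positive for t in predicted_pos) / len(predicted_pos)

# Hypothetical ground-truth and predicted labels.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(accuracy(y_true, y_pred))   # 4 of 6 correct -> 0.666...
print(precision(y_true, y_pred))  # 2 of 3 positive predictions correct -> 0.666...
```

In practice these would come from an evaluation library rather than hand-rolled functions, but the definitions are exactly these ratios.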
Let’s say the model is deployed to production after particular post-processing and filtering actions. This time, measurement continues with online metrics such as CTR (click-through rate) and CR (conversion rate). This is where we can observe the short-term customer experience. CTR is simply clicks divided by views, while CR represents the ratio of orders placed after products are clicked.
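The two online metrics are plain ratios; a quick sketch with hypothetical counts:

```python
def ctr(clicks, views):
    # Click-through rate: clicks divided by views.
    return clicks / views

def cr(orders, clicks):
    # Conversion rate: orders placed after the recommended products are clicked.
    return orders / clicks

# Hypothetical daily figures for a recommendation widget.
print(ctr(50, 1000))  # 0.05 -> 5% of views led to a click
print(cr(5, 50))      # 0.1  -> 10% of clicks led to an order
```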
As the flow progresses, it becomes difficult not to deviate from the selected business problem and the desired goal. Here, quality assurance is very important in order not to get lost along the way. QA can be included in this flow at many points: data preparation, the entire experimental analysis, and short-term and long-term metrics tracking. In this article, the main focus is one of the most important stages, namely the experimental analysis. Experimental results are very important in demonstrating the quality and success of a model, since they are easier for everyone to interpret and understand. Thanks to the inferences made here, we can get an idea of which configuration gives better results by modifying the parameters used in our machine learning model.
Let’s consider a text similarity model trained with a particular set of parameters for a recommendation system. The business problem and the corresponding goal are to recommend products that have similar attributes to the source/main product. Accordingly, the dataset contains product names, category information, and product attributes. Assume that we selected cross-entropy loss as the training objective, which is essentially the negative sum of the target labels multiplied by the logarithm of the predicted probabilities, and “same category ratio” as the offline metric, which measures whether the recommendations belong to the same category as the source product. A higher “same category ratio” indicates that our model may produce better results, since products in the same category will have more attributes in common.
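The “same category ratio” metric is straightforward to compute once each recommendation is mapped to its category. A minimal sketch, assuming we already have the category labels at hand:

```python
def same_category_ratio(source_category, recommended_categories):
    # Fraction of recommendations that share the source product's category.
    if not recommended_categories:
        return 0.0
    matches = sum(c == source_category for c in recommended_categories)
    return matches / len(recommended_categories)

# Hypothetical recommendations for a product in the "laptops" category.
recs = ["laptops", "laptops", "monitors", "laptops"]
print(same_category_ratio("laptops", recs))  # 3 of 4 match -> 0.75
```

Averaging this value over all source products in an evaluation set gives a single offline score per trained model.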
Suppose we have trained two versions of this text similarity model using different parameter values.
Note that the offline metrics could also include measures such as accuracy, precision, and hit rate; here, only the “same category ratio” metric will be discussed. As can be seen from the image below, the “same category ratio” of the second model is much better than that of the first. If we drew a conclusion just by looking at this value, we would obviously think that the second model is better.
After the offline metrics stage is completed, there are some points to consider when choosing a sample set for the experimental phase. First of all, the purpose the model serves should be clear before evaluating the quality of its results. This may be to recommend alternative products or, instead, complementary ones. In line with this objective, the sample set should be selected from as many different categories as possible, from different price ranges, and with different characteristics. Returning to the text similarity model above, the purpose is to provide alternative products to customers by using attribute similarity. The sample product set was therefore selected from different categories and types, with varying numbers of attributes, and the recommendations for these products were examined.
It is clear that this text similarity model will learn and work better in multi-attribute categories, since the model receives richer inputs there. Take the computers category: products in it share many common attributes such as screen size, graphics card, processor, RAM, hard-disk capacity, and resolution.
When the results of the two models trained with different parameters were compared on the same sample set, it was observed that the recommendations produced by both models achieved comparable results in terms of attribute and category similarity. However, when the results were analyzed from a different aspect, namely price, it was clearly seen that the price range of the second model’s recommendations (the model with the higher same category ratio) diverges from the source product’s price. Recommending products well above the source product’s price will likely lead to a loss of customer interest. In fact, price comparison is not within the scope of our business problem and goal; it is an unexpected negative output of the model that could only be noticed thanks to the experimental process. Note that price information is not even included in the dataset. At this point, it is possible to proceed in four different ways, as follows.
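This kind of price divergence can be quantified during experimental analysis. Below is a simple sketch with invented prices, using mean relative deviation from the source product’s price as an ad-hoc check (not a metric from the original pipeline):

```python
def price_divergence(source_price, recommended_prices):
    # Mean relative deviation of recommendation prices from the source price.
    deviations = [abs(p - source_price) / source_price for p in recommended_prices]
    return sum(deviations) / len(deviations)

# Hypothetical figures: the first model stays near the source price,
# while the second model (higher same category ratio) drifts far above it.
model1 = price_divergence(100.0, [90.0, 110.0, 105.0])   # small deviation
model2 = price_divergence(100.0, [250.0, 300.0, 280.0])  # large deviation
print(model1, model2)
```

A large gap between the two values would surface exactly the problem described above, even though price was never a training input.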
Suppose the first way (1) is chosen. If the first model’s results are good enough for production, we can move forward with it. In this case, it seems more logical to continue with the first model, even though its offline metric is lower than the second one’s. Additionally, a price calculation can be added as a post-filtering step to prevent high-priced recommendations, even for cases not encountered during the experimental phase. When the second way (2) is followed, a price-related offline metric can be added as well. The reason for adding this metric is to understand whether the current model has an overall problem with prices or merely a bias in the selected experimental samples. The decision made here will answer the question of whether to continue with post-processing or to return to earlier stages in the flow. By selecting the third way (3), either the model’s objective function or the whole model type can be changed. Finally, using a data-driven approach, the dataset can be updated to include price information along with the other data by following the fourth way (4). During dataset preparation, negative samples can also be given as input so that the model can learn better. This allows us to bring price awareness to our ML model.
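The price post-filter mentioned in option (1) could look like the following sketch. The threshold and product structure are assumptions for illustration:

```python
def filter_by_price(source_price, recommendations, max_ratio=2.0):
    # Post-filtering step: drop recommendations priced more than
    # max_ratio times the source product's price.
    return [r for r in recommendations if r["price"] <= source_price * max_ratio]

# Hypothetical model output before post-filtering.
recs = [
    {"name": "A", "price": 120.0},
    {"name": "B", "price": 450.0},  # well above the source price, filtered out
    {"name": "C", "price": 95.0},
]
print(filter_by_price(100.0, recs))  # keeps A and C only
```

Because this runs after the model, it protects the customer experience without retraining, at the cost of leaving the model itself unaware of price.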
What is described here is just one example highlighting the importance of experimental analysis. Depending on the purpose of your model, it will be necessary to investigate it from different aspects. It should always be kept in mind that the final decision will directly affect online metrics such as CTR and CR, which are fed by your customers’ actions. In fact, although we can infer the success of our model from the metrics and experimental results, that is probably not enough to make the final decision. To measure model success from the customers’ point of view, we can continuously improve the model with the outputs obtained through A/B testing strategies. Short-term online metrics such as CTR and CR can also be deceptive. For instance, an increase in customers’ clicks on the recommended products also increases the CTR metric; however, it would not be correct to conclude that the model is definitely successful based on this alone. It is beneficial to support these metrics with others such as diversity, coverage, and serendipity. More important still is aiming to optimize the long-term metrics that increase customer engagement.
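Simple versions of two of these supporting metrics can be sketched as follows. These are common textbook formulations (coverage as catalog share, diversity as distinct categories per list), not definitions taken from the pipeline above:

```python
def coverage(recommended_items, catalog_size):
    # Share of the whole catalog that ever appears in recommendations.
    return len(set(recommended_items)) / catalog_size

def diversity(recommendation_lists):
    # Average number of distinct categories per recommendation list.
    return sum(len(set(lst)) for lst in recommendation_lists) / len(recommendation_lists)

# Hypothetical data: 3 distinct products recommended out of a 10-item catalog,
# and two recommendation lists tagged by category.
print(coverage(["p1", "p2", "p2", "p3"], 10))            # 0.3
print(diversity([["tv", "tv", "audio"], ["pc", "pc"]]))  # (2 + 1) / 2 = 1.5
```

A model that maximizes CTR while recommending the same few popular products would score poorly on both, which is exactly the blind spot these metrics are meant to expose.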
In conclusion, throughout all these processes quality assurance bears a great responsibility, both in pursuing the desired goal from beginning to end and in interpreting the model outcomes from the customer’s perspective.
Ultimately, the most critical indicator of your model’s success will be the short- and long-term reactions of the users to whom you provide the service.