Test Automation of Query Understanding AI/ML based Model for an eCommerce Search

Kushal Kumar Verma
Published in Analytics Vidhya
8 min read · Feb 6, 2020

“A year spent in artificial intelligence is enough to make one believe in God.” Alan Perlis

There has been tremendous growth in applying Artificial Intelligence & Machine Learning models in various fields to solve complex problems that could not have been solved using simple code & algorithms. Hence, it has become a necessity for companies to have a better quality process around this growing technology.

Machine learning (ML) is a subset of artificial intelligence and is defined as the scientific study of mathematical/statistical & algorithmic models that computers use to do a specific task without using explicit instructions, relying on patterns and inference instead. Machine learning algorithms are trained using sample data, known as “training data”, to predict and provide output without being explicitly programmed to do the required tasks.

Different types of Machine learning models used in solving various problems can be illustrated using the below diagram.

A more detailed explanation of the diagram, along with the top 10 most used ML models, is available in the linked references.

Taking an example of an eCommerce groceries application, this article explains the essentials of the Quality Assurance strategy for testing one of the AI / ML models commonly used in Search.

There can be multiple layers in which these machine learning models could be integrated into eCommerce Search applications. These could be -

1. Before the user input reaches the Search engine.

2. After the Search engine results are received.

Before the user input reaches the Search engine :

Examples of such integrations are:

  1. Predictive Text
  2. Autocomplete
  3. Smart Compose
  4. Query Understanding

These could be based on Deep Learning, Natural Language Processing (NLP), Neural Machine Translation (NMT), sequence-to-sequence models (Seq2Seq models), etc.

After the Search engine results are received :

Examples of such integrations are:

  1. Dynamic Relevancy Tuning
  2. Personalized Search
  3. Related Searches
  4. Similar Searches

These could be based on Add to Cart, Order Conversion Algorithms, Taxonomy Affinity with re-ranking based on Normalisation or Score Computation, Word2Vec embedding using CBOW or Skip-Gram model, seq2seq models with Linear Neural Network (LNN), Facebook’s FAISS.

With these machine learning models coming into the picture, a Quality Assurance Engineer has to make sure that the concepts, formulas and techniques used in the models are tested thoroughly, so that the models provide the expected results once they are integrated with the services. Hence, the PDCA (Plan-Do-Check-Act) cycle has to be used to test & deploy them successfully in Production. Refer to the link below for more details.

QA machine learning models PDCA cycle

However, these cycles are not easy to maintain as the tester has to come up with different testing techniques to sign off the model’s output. Some of the problems which the testers have to face are –

1. Defining a Test Oracle. Identify the right expected-output validators to certify the results. The oracle has to be built, or an existing one reused, according to the expected functionality, using utilities, customized functions or Metamorphic Testing.

Metamorphic testing is a technique for creating follow-up test cases from existing ones, particularly those that have not revealed any failure, to find issues that are still uncovered. It is not an approach for test case selection. Instead, it is a method of reusing input test data to create multiple test cases whose outputs can be predicted.

In metamorphic testing, a transformation function T is applied to an input x to produce a follow-up input T(x). The transformation is chosen to reflect a metamorphic property of the function f under test, so that the output f(T(x)) can be predicted from the value of f(x).

Metamorphic test example.
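
To make the idea concrete for search queries, here is a minimal Python sketch. Everything in it is an assumption for illustration: `extract_filters` is a toy stand-in for the function f (a real test would call the model service), and the two transformations T are just examples of relations one might choose. The first relation says that changing case and padding should not change the detected filters. The second says that appending "without nuts" should add exactly one filter to the original result.

```python
def extract_filters(query: str) -> frozenset:
    """Toy stand-in for the model under test (hypothetical, not the real model).

    Detects nutrition filters written as "no X", "without X" or "Xless".
    """
    tokens = query.lower().split()
    found = set()
    for i, tok in enumerate(tokens):
        if tok in {"no", "without"} and i + 1 < len(tokens):
            found.add(f"{tokens[i + 1]}-free")
        elif tok.endswith("less") and len(tok) > 4:
            found.add(f"{tok[:-4]}-free")
    return frozenset(found)


def check_metamorphic_relations(f, source_queries):
    """Return a list of (relation name, query) pairs that violate a relation."""
    failures = []
    for q in source_queries:
        base = f(q)
        # Relation 1: T(x) changes case and adds padding, so f(T(x)) == f(x).
        if f(f"  {q.upper()}  ") != base:
            failures.append(("case/whitespace", q))
        # Relation 2: T(x) appends a new filter, so f(T(x)) == f(x) + one filter.
        if f(f"{q} without nuts") != base | {"nuts-free"}:
            failures.append(("added filter", q))
    return failures


if __name__ == "__main__":
    queries = ["eggless chocolate with no sugar", "chocolate"]
    print(check_metamorphic_relations(extract_filters, queries))
```

Note that neither relation needs a hand-written expected value per query: the expected output of the follow-up test is derived from the output of the source test.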

2. Adversarial examples: these are inputs to machine learning models that have been intentionally crafted to make the models err. They act like an optical illusion for the program. They are sometimes grouped with Data Poisoning, although data poisoning strictly refers to tampering with the training data rather than with the inputs at prediction time. In the case of Search queries, you can add noise by swapping the input search terms to see if the model can still correctly predict the user's intention and provide the required output. Refer here for more details.

3. Features Testing: this plays a pivotal role in testing different aspects of machine learning models. The 7 criteria below help in the evaluation and confirmation of the model’s desired output. Refer here for more details.

a) Feature Thresholds
b) Feature Relevance
c) Feature Relationship
d) Feature Suitability
e) Feature Compliance
f) Features Unit Testing
g) Feature Static Review

4. Defining Test Data: this is the heartbeat of any data-driven test automation or testing approach. The Quality Engineer should ensure that the data partitioning (e.g. Training [65%], Validation [20%] & Test [15%] sets) is done correctly, and that the test data set is not referenced or used while the model is being trained. Refer here for more details.

5. Cross Validation technique: in this technique, the machine learning model goes through multiple test iterations over the data subsets created by the data sampling activity. This confirms that the results are consistent and that the model predicts correctly most of the time. Refer here for more details.

6. Model Evaluation Metrics: these are the metrics on which the model’s efficiency & relevancy are evaluated. The key terms used here are:

a) Threshold
b) True Positives
c) True Negatives
d) False Positives
e) False Negatives

Using the above key terms, the following evaluation metrics are derived. Refer here for more details.

a) Classification Accuracy
b) Confusion Matrix
c) Precision
d) Recall
e) F1 Measure
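
The metrics above follow from the four counts using their standard definitions. The Python sketch below shows the arithmetic. The function name `evaluation_metrics`, the layout of the confusion matrix and the choice to return 0.0 when a denominator is zero are assumptions made for illustration.

```python
def evaluation_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Derive the standard metrics from confusion counts.

    Returning 0.0 when a denominator is zero is a convention chosen here,
    not something prescribed by the article.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": accuracy,
        # Rows are the actual class, columns the predicted class:
        # [[TP, FN], [FP, TN]].
        "confusion_matrix": [[tp, fn], [fp, tn]],
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


if __name__ == "__main__":
    print(evaluation_metrics(tp=6, tn=10, fp=2, fn=2))
```

With 6 true positives, 10 true negatives, 2 false positives and 2 false negatives, the accuracy is 16/20 = 0.8, and precision, recall and F1 are all 6/8 = 0.75.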

So, now let’s start with the Automation approach for one of the above-mentioned Search use cases in an online grocery eCommerce application.

Testing Query Understanding Model :

The two steps below are carried out before the search engine scores and ranks its results:

  1. The searcher’s method of conveying an intent in the form of a query.
  2. The search engine’s method of finding that intent.

Our application under test here is an ML model called Crocodile Model & Entire Intent of the Query. It is based on artificial neural networks, in particular sequence (time-series) networks, i.e. Recurrent Neural Networks that use LSTM (Long Short-Term Memory), with a CRF (Conditional Random Field) as the probabilistic model.

Reference — https://guillaumegenthial.github.io/sequence-tagging-with-tensorflow.html

This model could be based on Nutrition, Brand & Category, etc. in an online grocery eCommerce.

Now, let’s take the example of Nutrition-based Query Understanding models & see how the test data should be formed from different permutations and combinations of a given string.

Data Generation

Find the user’s base query intent.
E.g. “chocolate”.

Add prepositions and postpositions (together called adpositions).
E.g. with, free, less, no, without, etc.

Join filter query string with adpositions using single or multiple nutritional filter components. (E.g. egg, sugarfree, without egg, with soya, non-veg, no gluten, free from, nuts free, etc.)

Output user query string = “eggless chocolate with no sugar”.

Data Generator using Permutation of a given input

To generate the data set, build custom randomization functions that use a Recursion Tree for Permutations of the given Base Query, Filter Query & Adpositions. The filter queries could also be of other types such as Brand, Category, etc., where Category follows a different model & test approach. This will be covered in the next part on testing Query Understanding models.

Using Recursion Tree
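
As a rough sketch of the recursion-tree idea in Python: each level of the tree picks one remaining component as the next word, and each leaf is one complete ordering. The function names `permutations` and `generate_queries` are hypothetical and are not the names used in the actual test suite.

```python
def permutations(items: list):
    """Yield every ordering of `items` by walking a recursion tree.

    Each call fixes one element as the head and recurses on the rest.
    """
    if len(items) <= 1:
        yield list(items)
        return
    for i, head in enumerate(items):
        rest = items[:i] + items[i + 1:]
        for tail in permutations(rest):
            yield [head] + tail


def generate_queries(base: str, adposition: str, filter_term: str) -> list:
    """Build one query string per ordering of the three components."""
    parts = [base, adposition, filter_term]
    return [" ".join(p) for p in permutations(parts)]


if __name__ == "__main__":
    for query in generate_queries("chocolate", "without", "egg"):
        print(query)
```

Three components give 3! = 6 orderings, including "chocolate without egg" and "without egg chocolate". Python's standard library offers `itertools.permutations` for the same job. The recursive version is written out here only to mirror the recursion tree.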

Hence, the data sets can be classified into the following patterns.

1. Base Query + Preposition + Filter Query (Nutrition or Brand)*

2. Base Query + Filter Query + Postposition

3. Filter Query + Postposition + Base Query

4. Filter Query + Base Query + Postposition

5. Preposition + Filter Query + Base Query

6. Preposition + Base Query + Filter Query

7. Spelling mistakes in the Base Query, the Filter Query components, the Adpositions, or all of them.

8. With or Without Space between Base Query & Filter Query components.

9. Use No, Non, Less, Free From, With, Without, And, Special Characters ( , — etc.) in between Base Query or multiple Filter components (E.g. Nutrition here).
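
Patterns 7 to 9 add noise on top of the clean orderings. The Python sketch below shows one way such variants could be produced. The helper names, the choice of an adjacent-character swap as the "spelling mistake" and the fixed seed are all assumptions for illustration, not the generator used in the article.

```python
import random


def swap_adjacent(word: str, rng: random.Random) -> str:
    """Simulate a typo by swapping two neighbouring characters.

    If the two characters are identical (e.g. "gg" in "egg"), the word
    comes back unchanged, so callers may want to retry in that case.
    """
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def noisy_variants(base: str, filter_term: str, postposition: str,
                   seed: int = 0) -> dict:
    """Return noisy versions of 'filter postposition base', keyed by noise type."""
    rng = random.Random(seed)  # fixed seed keeps test data reproducible
    return {
        "with_space": f"{filter_term} {postposition} {base}",
        "no_space": f"{filter_term}{postposition} {base}",
        "misspelled_base":
            f"{filter_term} {postposition} {swap_adjacent(base, rng)}",
        "misspelled_filter":
            f"{swap_adjacent(filter_term, rng)} {postposition} {base}",
        "special_chars": f"{filter_term}-{postposition}, {base}",
    }


if __name__ == "__main__":
    for kind, query in noisy_variants("chocolate", "sugar", "free").items():
        print(f"{kind}: {query}")
```

Because every variant is derived from the same three components, the expected filter (here, sugar free) stays known even though the surface form of the query changes.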

Automation Flow Chart

On a high level, we followed the below approach to validate test results & evaluate the model’s output of Nutrition based Query Understanding.

Automation Flow Diagram

Our Model Validator tests start from the Random Query Generator, the search-term generator method that uses the recursion tree for permutations of the Base Query, Filter Query & Adpositions explained above. Its output acts as the input to the Data Provider. The test then makes a REST call to the Model Service, which is our AUT (Application Under Test).

The Filter Attribute validator, a custom function that acts as the Test Oracle, is then called to generate the expected output for the user query created by the tester. The actual and expected values are then passed to the isMOPcorrect (is model output correct) method, which compares the actual output from the Model Service with the expected output from the Filter Attribute validator. This method also helps in calculating the True Positive, True Negative, False Positive and False Negative counts for the automation test data, as per the Threshold set.
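
The article does not show the internals of isMOPcorrect, so the Python sketch below is only one plausible reading of the comparison step. It assumes the model returns a confidence score per filter label, that a label counts as predicted when its score reaches the threshold, and that the counts are taken per label over a known label set. The names `is_mop_correct` and `confusion_counts` and the response shape are hypothetical.

```python
def predicted_labels(scores: dict, threshold: float) -> set:
    """Labels whose (assumed) model confidence reaches the threshold."""
    return {label for label, score in scores.items() if score >= threshold}


def is_mop_correct(expected: set, scores: dict, threshold: float = 0.5) -> bool:
    """True when the thresholded model output equals the oracle's output."""
    return predicted_labels(scores, threshold) == set(expected)


def confusion_counts(expected: set, scores: dict, labels: list,
                     threshold: float = 0.5) -> dict:
    """Count TP/TN/FP/FN per label for one query."""
    predicted = predicted_labels(scores, threshold)
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for label in labels:
        if label in expected and label in predicted:
            counts["tp"] += 1
        elif label in expected:
            counts["fn"] += 1
        elif label in predicted:
            counts["fp"] += 1
        else:
            counts["tn"] += 1
    return counts


if __name__ == "__main__":
    labels = ["egg-free", "sugar-free", "gluten-free", "nut-free"]
    expected = {"egg-free", "sugar-free"}           # from the test oracle
    scores = {"egg-free": 0.9, "sugar-free": 0.3,   # hypothetical model response
              "gluten-free": 0.7}
    print(is_mop_correct(expected, scores))
    print(confusion_counts(expected, scores, labels))
```

Summing these per-query counts over the whole generated data set gives the totals from which the evaluation metrics are computed.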

Based on the above counts for each test, the data is shared with the Model Evaluator, which computes the model’s metrics: Classification Accuracy, Confusion Matrix, Precision, Recall and F1 Measure. These metrics give confidence in the model’s output, allow a comparison between the output on the Data Scientists’ training data and on the QA test data, and help in fixing any defects, shortcomings or limitations of the model.

Conclusion

In this article, we covered the important aspects of testing an AI/ML-based model, using the Query Understanding model of an eCommerce Search as the example. Further aspects of testing different AI/ML use cases for an eCommerce Search will be covered in the next part of this blog.

Happy reading !! and appreciate your patience if you have read until here. Please do give feedback & stay tuned for more to come..!! :)

References & Inspirations

[1] Sonu Sharma (2019), Understanding the Search Query

[2] Sonu Sharma (2019), Crocodile Model & Entire Intent of the Query

[3] Murphy et al (2008)

Kushal Kumar Verma
Analytics Vidhya

Quality Engineering Evangelist || AI ML Enthusiast @WalmartLabs