Improving Product Search using Machine Learning

Semantic Query Expansion

Matthys du Toit
Takealot Engineering
12 min read · Oct 17, 2022


During my recent internship on Prosus’ Artificial Intelligence team, my primary project focused on improving the customer product search experience for our portfolio company Takealot. Prosus, one of the largest technology investors in the world, uses AI and machine learning (ML) to bring value to customers and businesses across its key technology segments: classifieds, fintech, food delivery and edtech. This project is a powerful example of how collaboration between our AI team and Takealot’s engineers has led to exciting results that could translate into revenue growth for the business and improved satisfaction for its customers.

Search at Takealot

Takealot is South Africa’s leading e-commerce retailer, with over 6,000 employees across the Takealot Group. Their mission is to be the most customer-centric online shopping destination in Africa. With product search at the core of a platform serving over 3.4 million active customers, the search engine is essential in helping customers find the products they are looking for, and hence to the company’s success.

An example of Takealot’s product search experience.

Developing an effective search system is a challenging task, to say the least. It is particularly so for Takealot, given the difficult market characteristics it faces in South Africa: the country’s wide variety of home languages, local jargon, and range of education levels can make understanding the intent behind a customer’s query quite problematic. Takealot’s search system currently takes a primarily lexical approach, meaning the engine looks for literal matches between the query words and product names or descriptions. Consequently, the system has no real understanding of the query’s meaning or the customer’s search intent.

Such a system falls short in several ways: it has no automatic understanding of semantics (hypernyms, synonyms, and antonyms), it is fragile to morphological variants, and it is sensitive to spelling errors, jargon, and translation. Takealot currently uses a query expansion and replacement mechanism, especially for abbreviations and synonyms, but it must be maintained manually and so easily becomes outdated as new queries or products appear. These shortcomings lead to critical search issues, the two most significant being null searches (queries that return zero products) and the retrieval of irrelevant content. Such issues ultimately translate into fundamental business problems for Takealot: lower discovery rates as customers struggle to find the products they are looking for, and hence reduced overall customer satisfaction.

Our project aimed to use ML to add an element of semantic search to Takealot’s search system, so that it could recognise the meaning of a customer’s query. The goal was to surface better-matching products for a customer’s search and reduce the probability of null searches. An additional objective was to avoid large-scale modifications to the existing search system, and instead develop a model that could be easily integrated into what Takealot currently has in place. We strongly believe that enhancing product search in this way will lead to significant improvements in customer satisfaction and, in turn, an increase in purchases.

Semantic query expansion

Our proposed solution was to use semantic query expansion. This approach involves finding queries semantically similar to a customer’s search and using them to find more products that may be relevant to what the customer is looking for. Our problem was therefore essentially reduced to expanding a search query into multiple synonymous queries. Takealot’s existing search algorithm would then return a set of products for each of these queries, along with the original query, to form more extensive and relevant results.

An example of how query expansion can be used in Takealot’s search system to improve product retrieval.

Thanks to Takealot’s readily available data, we had access to the entire previous year’s (2021) anonymised customer search session interactions. This contained user actions such as the queries made, products clicked on, and corresponding client IDs and timestamps. To approach the problem of query expansion, we intended to learn the relationships between user queries based on this past data. We believed that user search sessions best reflected how customers made queries for similar search intentions based on the products they consequently viewed. To follow this intuition, we first had to make two fundamental assumptions.

Firstly, we needed to define our interpretation of a search session. A session is the group of interactions a user takes on the site within a given time frame, which we defined using a time-based expiration approach of 30 minutes. Therefore, after half an hour of inactivity from the same customer, a new session would be defined. Secondly, we assumed that all interactions within a session share a common user search intent. Following this logic, we can say that queries resulting in users clicking on the same product must be similar. By way of example, if three different users search for “trainers”, “sneakers” and “tekkies” (South African slang for sports or casual shoes), and all users consequently click on the product “Adidas Men’s Alphatorsion 2.0 Running Shoe”, then we can conclude that all three of these queries must be related in some way.
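The 30-minute expiration rule above can be sketched as follows. This is a minimal illustration; the field names and event layout are assumptions for the example, not Takealot’s actual schema:

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)

def sessionise(interactions):
    """Split one client's interactions (sorted by timestamp) into sessions:
    a gap of 30+ minutes of inactivity starts a new session."""
    sessions = []
    current = []
    last_ts = None
    for event in interactions:
        ts = event["timestamp"]
        if last_ts is not None and ts - last_ts > SESSION_TIMEOUT:
            sessions.append(current)
            current = []
        current.append(event)
        last_ts = ts
    if current:
        sessions.append(current)
    return sessions

events = [
    {"timestamp": datetime(2021, 5, 1, 9, 0), "query": "trainers"},
    {"timestamp": datetime(2021, 5, 1, 9, 10), "query": "sneakers"},
    {"timestamp": datetime(2021, 5, 1, 11, 0), "query": "tekkies"},
]
print([len(s) for s in sessionise(events)])  # → [2, 1]: the 11:00 query starts a new session
```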

How we define similar queries from different search sessions.

To implement this, we proposed creating embeddings from the search session data to form accurate high-dimensional vector representations of queries. These vector representations could then be used to determine the relationships between different queries, such as how similar they are (their distance in embedding space). To create embeddings from the search sessions, a neural word embedding model would have to be trained on the data. The focus throughout the project was therefore on developing, training, and deploying a word embedding model able to expand queries into semantically similar alternatives using past Takealot user data.

An example of representing search session data as embeddings.

Implementation

A fundamental part of ML is data pre-processing: the data provided almost always needs to be manipulated before being used as input to a model. In our case, we needed to extract all the useful parts of the search session interactions and clean users’ queries into a standard format. We established two alternative pre-processing approaches, methods A and B. The key difference between them was that method B sorted the words within each query alphabetically, reducing the total number of distinct queries in our training set, whereas method A did not.
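As a rough sketch of the two cleaning approaches (the exact normalisation rules used in the project are not described above, so those below are an assumption):

```python
import re

def clean_query(query: str) -> str:
    """Method A (sketch): lower-case, strip punctuation, collapse whitespace."""
    query = query.lower()
    query = re.sub(r"[^a-z0-9\s]", " ", query)
    return " ".join(query.split())

def clean_query_sorted(query: str) -> str:
    """Method B (sketch): as method A, but also sort words alphabetically so
    that word-order variants collapse to a single canonical query."""
    return " ".join(sorted(clean_query(query).split()))

print(clean_query("Red Running Shoes!"))        # → red running shoes
print(clean_query_sorted("shoes running red"))  # → red running shoes
```

Under method B, “running shoes red” and “red running shoes” become the same training token sequence, which is exactly why it yields a smaller query vocabulary.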

The pre-processed data then needed to be prepared into the correct format for the selected model. Having researched the top-performing word embedding models, we decided to use the Gensim implementation of Word2Vec, the algorithm published by Google in 2013 that learns word representations in vector space. Gensim is an open-source library offering many natural language processing capabilities. The Word2Vec model is designed to take a corpus of text as input, iterate over sentences within the text, and output a set of word vectors for the entire corpus’ vocabulary. In our case, we had to prepare the data into uniformly formatted search session groupings, each containing a set of actions: queries made and products viewed. For search, the order of actions is not relevant, but the constituents of each session are crucial; considering the order in which actions occur would introduce bias, as users generally click on whatever appears higher in the results rather than what is actually most relevant. Consequently, we created unique pairs of queries and products from the session groupings, which were then used as input to the model as though they were sentences in a corpus of text. We experimented with five different modes of constructing pairs from our search sessions: modes 1–5. Once the data was pre-processed and prepared, we could begin training models with different configurations.
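A minimal sketch of the pairing step might look like this (the `product_id` token format is hypothetical, and this shows only one of the five pairing modes mentioned above):

```python
from itertools import product

def session_to_pairs(session):
    """Turn one session's queries and viewed products into unique
    (query, product) pairs; each pair is fed to Word2Vec as a
    two-token 'sentence'."""
    queries = {a["query"] for a in session if "query" in a}
    products = {a["product_id"] for a in session if "product_id" in a}
    return [[q, p] for q, p in product(sorted(queries), sorted(products))]

session = [
    {"query": "sneakers"},
    {"query": "tekkies"},
    {"product_id": "PLID123"},
]
print(session_to_pairs(session))
# → [['sneakers', 'PLID123'], ['tekkies', 'PLID123']]
```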

Word2Vec learns high dimensional vector representations of queries and products from pairs.

As previously mentioned, the models were trained using the Gensim library, whose embeddings module implements the Word2Vec family of algorithms using highly optimised C routines, data streaming and Pythonic interfaces. Our models were trained using the Skip-gram Word2Vec algorithm, which trains a single-hidden-layer neural network on the synthetic task of predicting, given an input word, a probability distribution over nearby words. Word2Vec accepts several parameters that significantly affect both training speed and embedding quality, so selecting optimal values for these parameters was crucial in ensuring that high-quality embeddings were formed. An important one to mention is min_count, as the model ignores all queries or products in the data with a lower frequency than this number. Since analysis showed that we would lose an exceptional amount of data if we introduced a frequency threshold (nearly 42% with a minimum count of 2), we decided against such a threshold and set the minimum count to 1. This led to far more nuanced embeddings and a higher chance that new, more irregular queries would be present in the vocabulary.

In the end, we trained 10 different models, varying the pre-processing and pair-generation methods; all were trained with the same set of Word2Vec parameters. A summary of the various model configurations is displayed below. Because different amounts and forms of the search session data were used to train the different models, their total sizes and training times varied hugely. The two charts displayed below present how large the effect of these variations was.

Overview of the different models trained.

Results

The results we observed from our fully trained models were beyond our expectations. It was genuinely impressive to observe how well the models could generate queries semantically similar to a provided original query, often without any lexical relationship. To give a sense of how semantic query expansion could positively impact Takealot’s search experience for its customers, here is an example of how it might work for the query “plakkie”.

“Plakkie” is a widely used South African slang term for flip-flops. Currently, due to its lexical approach, Takealot’s search retrieves only 22 products, all from a single flip-flop brand called Plakkie. The search algorithm therefore presents no alternative flip-flop or sandal products that would be relevant to the customer’s query, instead returning a small selection of items from one label. With query expansion, the result could be very different. Below I have shown how one of our models generates several semantically similar alternative queries, and how, by feeding these queries individually into the search algorithm, a more extensive and relevant set of products can be shown to a customer searching for “plakkie”.

To further demonstrate this, I have selected a variety of different queries and will present how the model expands these below. These include a product-type search, queries containing spelling errors, and queries that require translation.

A) Product-type search: “dirtbike plastics” (common term used to describe dirtbike protection plates)

Takealot’s results:

  • Only 1 out of 36 products displayed are relevant to the query.
  • All products displayed are either related to plastics or dirtbikes, but the search algorithm fails to capture the meaning of the two words together.

Our query expansion model’s results:

  • 2 out of the top 3 expanded queries are directly related to dirtbike protective plates.
  • All expanded queries are related to dirtbike equipment or accessories.

B) Spelling errors:

C) Translation: “fiets” (Afrikaans translation of bicycle)

Evaluation

To initially evaluate and observe the performance of our models, we created several visualisations of our embeddings using t-SNE (t-Distributed Stochastic Neighbour Embedding), a dimensionality-reduction technique particularly well suited to visualising high-dimensional datasets such as our query embeddings. We experimented with visualising both small and large amounts of query data across various categories. If the embeddings were trained well, clusters should be visible, as semantically similar queries are grouped closer together due to their similar vector representations. Here are a few visualisations which nicely demonstrate one model’s ability to identify similar queries and group them accordingly:

t-SNE embedding visualization of Takealot grocery related queries.
t-SNE embedding visualization of Takealot wine related queries.
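A minimal sketch of how such a projection can be produced. Random vectors stand in here for the real query embeddings (which would come from the trained model, e.g. `model.wv[queries]`), and the query list is invented:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
queries = ["red wine", "merlot", "pinotage", "milk", "cheese", "yoghurt"]
embeddings = rng.normal(size=(len(queries), 100)).astype(np.float32)

# Project the 100-dimensional embeddings down to 2-D for plotting;
# t-SNE requires perplexity to be smaller than the number of samples.
coords = TSNE(n_components=2, perplexity=2.0, random_state=0).fit_transform(embeddings)
print(coords.shape)  # → (6, 2)
```

The resulting 2-D coordinates can then be scatter-plotted with each point labelled by its query, which is how the visualisations above were produced.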

Although inference from the models yielded very promising results, a good evaluation method was still needed to define and compare model performance with a common metric. We therefore carried out a detailed offline evaluation procedure that consisted of creating multiple test sets of query pairs (1,500 in total), editorially grading each pair according to its relevance, and calculating the ranking metric DCG (Discounted Cumulative Gain) for each model by comparing how well it ranked the pairs against the editorial gradings. A query pair was judged relevant if the grader believed that searching for a product with the alternative query would share the same intent as searching with the primary query. We used three grades: a score of 1 represents perfect relevance, 0.5 represents partial relevance, and 0 represents no relevance in terms of search intent.

An example of a single set of 5 graded query pairs and the resulting DCG score.
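The DCG computation itself is straightforward. Below is a sketch with a hypothetical set of five graded pairs (the grades are invented for illustration):

```python
import math

def dcg(relevances):
    """Discounted Cumulative Gain: graded relevance discounted by log2 of rank."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

# Editorial grades (1 = perfect, 0.5 = somewhat, 0 = not relevant), listed
# in the order a hypothetical model ranked five expanded queries:
model_ranking = [1, 0.5, 1, 0, 0.5]
ideal_ranking = sorted(model_ranking, reverse=True)

print(round(dcg(model_ranking), 3))  # → 2.009
print(round(dcg(ideal_ranking), 3))  # → 2.096 (upper bound for these grades)
```

The closer a model’s DCG is to the ideal ordering’s score, the better it ranked the relevant expansions towards the top.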

Something we found while experimenting with inference was that the models performed very differently depending on the type of query they were given. For example, some models were great at expanding popular queries, whereas others were better at expanding less frequent, more abstract queries. We therefore divided all queries in the dataset into three groups (“grades”) according to their frequency: grade A (high frequency), grade B (medium frequency) and grade C (low frequency). To evaluate which queries the different models performed better on, evaluation sets of query pairs from each grade were created. We could then effectively determine how well the models expanded different types of queries by observing the DCG scores each one attained. The table and chart below illustrate how widely these scores varied across the models.

Next steps

I am very excited by the potential this project has for improving Takealot’s product search, given the encouraging results we have obtained. The models are in a sense a tool, which can be implemented in a variety of ways, that Takealot’s Discovery machine learning team can use to positively impact the existing search system. Of course, this project can also be extended and improved in several ways, such as incorporating more information from search sessions into the embeddings (e.g., dwell time and implicit negatives). The Discovery team can now experiment and test how semantic query expansion changes their users’ behaviour, and what the resulting impact might be. I hope that this project brings Takealot’s search one step closer to better understanding their customers’ queries.

Acknowledgements

This project would not have happened without the support and advice of my team members in Prosus AI. They provided me with great feedback and fresh ideas that brought real value to my research. I would particularly like to thank Dmitri Jarnikov for his supervision during my internship and for assigning me to this very exciting project. Furthermore, I am very grateful for the awesome collaboration we had with Takealot’s Discovery machine learning team, led by Pieter Rautenbach; their eagerness and enthusiasm to work together made this project possible. I wish them all the best in continuing to improve product search for Takealot’s customers!
