Utilizing GAN Models for Query Expansion

Mert Gurkan
Insider Engineering
Aug 8, 2022

A well-performing search engine is important for any website that wants to turn traffic into successful visits. It is even more crucial for companies in the e-commerce industry, as they aim to make their products as visible to their customers as possible.

The cold start problem is a well-known problem for search engines and recommendation systems. The problem can be defined as the inability of search engines or recommendation systems to produce accurate results for search queries that were not introduced to the system before. As these systems often rely on previous occurrences of user actions and their results, the lack of references hinders the system’s ability to generate inferences for the user.

The cold start problem, which we face in our own systems, is one of the issues that can negatively affect the performance of search engines. There are various machine learning and deep learning techniques for addressing it. This post summarizes our recent research, which takes an alternative approach by utilizing Generative Adversarial Networks.

Research and Problem Definition

Generative Adversarial Networks (GANs) propose a different methodology for training a machine learning model. Traditionally, training a machine learning model involves evaluating its performance on validation sets: during training, the model is periodically evaluated against the validation data.

Unlike the training process described above, a GAN consists of two different models called the generator and the discriminator. While the generator is responsible for creating synthetic outputs from given data instances, the purpose of the discriminator is to distinguish these synthetic examples from real data instances. With this structure, the main difference in the training process is that the discriminator serves as the feedback mechanism for the performance of the generator.

This adversarial process often takes place after both models are pre-trained: the generator is trained on real data instances, and the discriminator is trained on both real and generated instances. After pre-training, the generator and the discriminator compete against each other in a minimax game.

min_G max_D V(D, G) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))]

The objective function of adversarial learning between the generator (G) and the discriminator (D)

As the objective function above depicts, the generator aims to minimize the likelihood that the discriminator classifies its generated instances as fake; equivalently, it aims to maximize the probability that generated examples are judged to be real. The discriminator, conversely, aims to maximize the probability of classifying real data instances as real, thus minimizing the likelihood of mistaking generated examples for real ones.
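To make the objective concrete, here is a minimal PyTorch sketch of the two loss terms, assuming the discriminator outputs probabilities in (0, 1). This is an illustrative formulation of the standard GAN objective, not the exact code used in our project.

```python
import torch

# d_real: discriminator outputs D(x) for real instances, in (0, 1)
# d_fake: discriminator outputs D(G(z)) for generated instances, in (0, 1)

def discriminator_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    # D maximizes log D(x) + log(1 - D(G(z))); we minimize the negation.
    return -(torch.log(d_real) + torch.log(1.0 - d_fake)).mean()

def generator_loss(d_fake: torch.Tensor) -> torch.Tensor:
    # G minimizes log(1 - D(G(z))); in practice the equivalent
    # "non-saturating" form -log D(G(z)) is often preferred.
    return torch.log(1.0 - d_fake).mean()
```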

In the recent literature, GAN models are most commonly employed in the image generation and computer vision domains. In these domains, the generator often transforms an input image into another image, and the discriminator is used to classify generated images against real examples. As a well-known example, the StyleGAN architecture published by NVIDIA Labs can generate realistic-looking images across different domains thanks to transfer learning approaches.

Synthetic images of human faces generated with StyleGAN; image taken from https://github.com/NVlabs/stylegan.

Another well-known GAN model for image translation is the CycleGAN model. Without delving into details, the reader is encouraged to check out the magic of transforming horses into zebras or landscape photographs into Monet paintings.

The examples above demonstrate the success of GAN models in the computer vision domain. However, this adversarial learning structure is not limited to image-based problems: although less popular there, GAN models are employed in the natural language processing domain as well. The relative lack of popularity can be attributed mainly to two reasons. First, language models and transformer architectures are already successful solutions to text generation and machine translation problems. Second, the discreteness of text data makes it complicated for a GAN to carry feedback from the discriminator back to the generator.

The second issue, in particular, must be addressed with additional mechanisms introduced into the adversarial learning structure. How we solved it is discussed in the next section, where we detail the proposed GAN model.

We hope this section was sufficient to familiarize ourselves with GAN models. Now, let’s get back to our problem!

In our case, the cold start problem manifests as unsatisfactory performance when matching less common user search queries with products. Hence, the desired outcome of the project was to generate expanded sequences from user queries to improve matching performance with the products in customer databases. We chose a GAN model to address this problem for the following reasons:

  • A generator model allowed us to obtain expanded word sequences from user queries. The intuition here was that expanding problematic user queries with words related to these queries would increase the likelihood of matching them with items in the product databases.
  • A discriminator model creates a way to evaluate the sequences produced by the generator. This way, evaluation is not performed only by the generator’s own metrics, but also by a model specifically designed to distinguish real and synthetic sequences.

To this end, the following section presents the proposed GAN model.

The Proposed GAN Model

Before detailing the generative adversarial model, let’s describe the datasets used in this research. Our datasets consisted of user search queries and their matching items in the product databases. We gathered anonymized datasets in this format for selected pilot partners; thus, the final datasets included only the search queries performed by users and the matching products. With these query-product pairs, our GAN architecture was as follows.

The generator model of the proposed GAN architecture

The generator model of the architecture is an encoder-decoder model that produces expanded sequences from shorter input sequences. Both the encoder and decoder parts of the model are built with LSTM cells.

As the first step, user search queries are fed to the generator model. The encoder transforms the user query into a fixed-length encoded representation that carries the semantic and word-order information of the input sequence. The decoder then expands this representation into a longer sequence. Here, two different decoding methods are utilized. During pre-training of the generator, decoding is performed in a word-by-word iterative fashion, with words obtained through softmax operations and a cross-entropy loss. During the adversarial learning process, the decoder does not immediately produce full-length sequences; instead, it employs Monte Carlo simulations to roll out complete sequences at intermediate steps.
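For illustration, a minimal PyTorch sketch of such an encoder-decoder generator might look like the following. All layer sizes and the greedy decoding loop are assumptions for the sketch, not the project’s exact implementation.

```python
import torch
import torch.nn as nn

class Seq2SeqGenerator(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 128, hidden_dim: int = 256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, query_ids: torch.Tensor, max_len: int = 20) -> torch.Tensor:
        # Encode the user query into a fixed-length hidden state.
        _, state = self.encoder(self.embedding(query_ids))
        # Decode word by word, feeding each prediction back in.
        token = query_ids[:, :1]  # e.g. start from a <bos>/first token
        logits = []
        for _ in range(max_len):
            dec_out, state = self.decoder(self.embedding(token), state)
            step_logits = self.out(dec_out)      # (batch, 1, vocab)
            logits.append(step_logits)
            token = step_logits.argmax(dim=-1)   # greedy choice, for the sketch
        return torch.cat(logits, dim=1)          # (batch, max_len, vocab)
```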

This means the generator simulates complete sequences from incomplete ones in an iterative manner. During the simulation process, the generator creates multiple sequences from the current state, where a state is defined as the sequence generated so far together with the possible word choices to expand it. The discriminator evaluates these rollouts as batches to provide an average reward for the generator. This average reward informs the generator about the quality of its current state and acts as the back-propagation signal for updating it. The more the discriminator is confused by the simulated sequences, the more the generator is rewarded for its word choices and updated in the right direction. A sketch of this rollout procedure is shown below.
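The following sketch illustrates Monte Carlo rollouts for intermediate rewards, in the spirit of SeqGAN. `generator.sample_continuation` and `discriminator` are hypothetical interfaces assumed for the example: the first completes a partial sequence, the second returns the probability that a sequence is real.

```python
import torch

def rollout_reward(partial_seq: torch.Tensor,
                   generator,
                   discriminator,
                   n_rollouts: int = 16) -> torch.Tensor:
    rewards = []
    for _ in range(n_rollouts):
        # Simulate a complete sequence from the current partial state.
        full_seq = generator.sample_continuation(partial_seq)
        # The discriminator scores the simulated sequence.
        rewards.append(discriminator(full_seq))
    # The average score acts as the reward signal for the generator's
    # word choices at this intermediate step.
    return torch.stack(rewards).mean(dim=0)
```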

The chosen discriminator is an LSTM-based classifier. Model comparisons and hyper-parameter optimization demonstrated that a model consisting of two LSTM layers followed by a linear layer performs best for our problem. The diagram of the discriminator model can be observed below.

The discriminator model of the GAN architecture
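As a concrete illustration, a minimal PyTorch sketch of a discriminator matching this description (two LSTM layers followed by a linear layer) might look as follows; the layer sizes are assumptions for the sketch.

```python
import torch
import torch.nn as nn

class LSTMDiscriminator(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 300, hidden_dim: int = 128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, 1)

    def forward(self, seq_ids: torch.Tensor) -> torch.Tensor:
        _, (h_n, _) = self.lstm(self.embedding(seq_ids))
        # Probability that the input sequence is real rather than generated.
        return torch.sigmoid(self.classifier(h_n[-1]))
```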

As our generator was more complex than the chosen discriminator, in many test scenarios we observed that the discriminator failed to classify the generated sequences correctly. To balance the performance difference between the two models, we decided to use pre-trained word embeddings as the input layer of the discriminator. To this end, we used the Turkish word embeddings provided by the FastText library. The word embeddings allowed the discriminator to capture the context and similarity of words in the sequences, and we were able to increase its performance to the point where it produced meaningful feedback for the generator. A sketch of this initialization follows.
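Here is one way such an initialization could look with the `fasttext` Python library, assuming the pre-trained Turkish vectors (`cc.tr.300.bin` from fasttext.cc) have been downloaded; the toy vocabulary is purely illustrative.

```python
import fasttext
import torch

vocab = {"<pad>": 0, "spor": 1, "ayakkabı": 2}  # illustrative toy vocabulary

ft = fasttext.load_model("cc.tr.300.bin")  # assumes the file is downloaded
weights = torch.zeros(len(vocab), ft.get_dimension())
for word, idx in vocab.items():
    weights[idx] = torch.from_numpy(ft.get_word_vector(word))

# Replace a randomly initialized embedding layer with the pre-trained one.
embedding = torch.nn.Embedding.from_pretrained(weights, freeze=True)
```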

Results We Obtained

The goal of the project was to match expanded user queries with products in customer databases. To evaluate the various trained GAN models, we also used a homebrew performance metric named word coverage, defined as the average ratio of the number of distinct words to sequence length. The metric is inspired by the BLEU score and aims to provide a straightforward comparison between generated sequences and the vocabularies obtained from the datasets.
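As a sketch, one plausible reading of the definition above computes, per generated sequence, the ratio of distinct words to sequence length, averaged over all sequences; the exact formulation used in the project may differ.

```python
from typing import List

def word_coverage(sequences: List[List[str]]) -> float:
    # Per sequence: distinct words / sequence length, then average.
    ratios = [len(set(seq)) / len(seq) for seq in sequences if seq]
    return 100.0 * sum(ratios) / len(ratios)

# Example: a repeated word lowers coverage.
print(word_coverage([["spor", "ayakkabı", "spor"]]))  # ~66.7
```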

Compared to conventional performance metrics or monitoring the model’s loss, word coverage allowed us to observe the diversity of the generated sequences in terms of the models’ word choices. An additional advantage of this metric was that it did not depend on vocabulary size.

Validation set word coverage over epochs for a trained model. The x-axis depicts epochs and the y-axis depicts the word coverage metric as a percentage.

For successful models, this metric approached 100% after the adversarial training process. Achieving high word coverage can be interpreted as increasing the likelihood of matching user queries with items in the customer databases, since, as a design choice, our models expand user queries only with in-vocabulary words. Thus, a model with high word coverage is more likely to generate sequences that enable a user’s search to match items in the database. Additionally, a value higher than 100% indicates that the model is likely expanding validation-set queries with words that do not appear in the validation set.

Things We Learned!

Conducting a research project on a fairly complex subject allowed us to gain new perspectives on the challenges of the problem. During this project:

  • We explored potential solutions for the back-propagation problem that GAN models have with text data. Among the possible solutions, we chose the Monte Carlo rollout technique, as it involves evaluation of the sequences by the discriminator and is hence a suitable method for our architecture.
  • We had a chance to test and monitor different deep learning models for both the generator and the discriminator. These trials demonstrated that LSTM-based models outperformed the alternatives. In addition, compared to most studies in the literature, we utilized a more complex model for our generator. Because of this, in our case the generator tended to outperform the discriminator, whereas the literature usually reports the opposite.
  • We came up with custom evaluation metrics better suited to the problem at hand. Realizing that dataset size affected the performance metrics of models trained with different datasets led us to derive a metric that was not reliant on the underlying dataset. With the word coverage metric, we were able to compare the performance of models trained with datasets gathered from different partners.
  • We also studied ways of generalizing the proposed GAN model. Ideally, we wanted a single model that could address the cold start problem for different partners by being trained on a combined dataset from multiple partners. However, we observed two crucial problems with this idea. The first was the data size required to train such a model: we realized we would need a longer data collection period to train a model with real generalization capability. The second, more practical problem was that our prototype generator models performed worse when we introduced combined datasets in the training phase.

Further Resources

This project has been a journey of learning and of applying that new knowledge to our problem. Along the way, we were inspired by existing resources and findings from related research articles. Below are the primary resources that aided us during this project.

Monte Carlo simulation and rollout procedures are intentionally kept simple in this post. There is a very informative and well-written series of Medium posts detailing these techniques; we strongly suggest it for readers who are curious to learn more.

The SeqGAN and QE-CGAN studies were our main guidelines for this research, as they approach a similar problem definition. For more detailed and formal information about utilizing GAN models for query expansion, we encourage you to check these articles.

We hope you liked this post about our recent research findings. We have many more exciting posts on various subjects at Insider Engineering. Please check them out too. Thank you for reading!
