Improving Retail Media through Natural Language Processing at PromoFarma by DocMorris

Tech at DocMorris · Jun 7, 2023

At PromoFarma by DocMorris, we believe in providing our customers with the most relevant products for their health. With our search engine and in-house recommender systems, we are able to serve our customers the best product suggestions in various areas of our webshop/app. Nevertheless, it doesn’t have to stop there. We believe that collaboration with external partners through Retail Media can help us introduce our visitors to top-performing and new-to-the-market products, boosting particular items from those partners while maintaining a high level of relevance. In this article, we will show how we are using Natural Language Processing to improve our Retail Media campaigns.

What is relevance?

The concept of relevance is key to e-commerce. Given a user u and a context c, how can we retrieve the set of items I that best suit the user’s intent, that is to say, the items that are most relevant to the user in the given context? Depending on the application, both variables u and c can take different forms, such as user preferences, session events or search queries.

In Search applications, the context is given by a query q, and the user is often represented by a set of factors Kᵤ extracted from items Iᵤ that the user has interacted with in the past. Unfortunately, these factors usually need to be computed beforehand, not in real time. So, in an environment characterized by growth and a high ratio of new or anonymous users (for whom we do not have information about their past interactions), we cannot rely on having user factors. We are thus left with the context, in this case the query q, and the array of items that compose our catalog I. Search engines such as Elasticsearch do a decent job of indexing the content of the items and scoring them against the user’s query, solving the equation:
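
I^{*}(q) = \arg\max_{i \in I} \text{rel}(q, i)

where rel(q, i) is, schematically, the relevance score the engine assigns to item i from the catalog I for the query q.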

However, when we also want to boost items from external partners, we need to find a way to combine this relevance function with the potential benefit of prioritizing those items, keeping both users and partners happy.

Relevance in the context of Retail Media campaigns

Now that we understand the basics of relevance and why it is important, we can face the problem at hand: how to rank items that don’t necessarily maximize the relevance function but are interesting to have in the top positions? Many e-commerce platforms implement a bidding system, allowing partners to bid for the keywords which are of most interest to them, using supply and demand to put a price on the top positions of search results for queries containing those keywords. This is a valid solution, but relevance seems to get lost in the process. For example, a partner could naively bid for a keyword that is not related to their product, just because they know it is trendy. This is particularly troublesome if the catalog contains health and personal care products, where a user could be looking for “baby powder” and get presented with a “painkiller” ad. This is not a good experience for the user, and not really for the partner either, since they are paying for a position where their product is not relevant, and thus likely won’t generate many sales.

Enter Natural Language Processing

Natural Language Processing (NLP) is a field of computer science that deals with the interaction between computers and human language. It is a very broad field; here we will only focus on the part that is relevant to our problem: the ability to extract meaning from text. In this case, we will use NLP to extract the meaning of the queries that users have searched for in our webshop/app, as well as of the product descriptions that we have in our catalog. We will then use this information to find the most relevant keywords for each item that we might want to boost, so that our partners can choose from these recommended keywords.

Data acquisition and preprocessing

The first step of any NLP project is to acquire the data. In our case, it might appear simple since we own both the queries and the catalog data. However, what happens if a user searches for a word that doesn’t yet exist in our catalog? For example, the COVID-19 outbreak caught everyone off guard, so it’s easy to imagine that people started looking for “covid”-related products before that word was in use in product descriptions. To fill these gaps, we developed a pipeline which runs daily to detect the popular keywords in our users’ queries that do not yet exist in our catalog. It then downloads a corpus of text about each of those keywords, from sources such as Wikipedia or DuckDuckGo (using the libraries wikipedia and duckduckgo_search), so that our model can learn the meaning of those words.

One quick warning regarding this approach: it can be tricky when a word has multiple meanings. For example, when you search DuckDuckGo for the keyword “xls”, it returns a corpus of text about the Excel spreadsheet software, not about “xls medical” (a well-known brand of weight loss products), which is what our users are really looking for. We solved this issue by searching DuckDuckGo not only with a general query containing only the keyword, but also with some context-specific queries including words such as “pharmacy” or “treatment”. In this way, we ensure that the DuckDuckGo results are related to the context of interest to our users.

# Some parts of the code presented from now on have been simplified for the sake of the article

import logging

import lxml.etree
import pandas as pd
import wikipedia
from duckduckgo_search import ddg


def __search_ddg(qry, lang="es", max_results=5):
    """Searches DuckDuckGo for a given query and returns the results

    :param qry: Query to search
    :param lang: Language of the query
    :param max_results: Maximum number of results to return
    :return: List of results
    """
    try:
        results = ddg(qry, region=f"{lang}-{lang}", safesearch="Moderate", time=None, max_results=max_results)
        if len(results) > 0:
            return pd.DataFrame(results)["body"].drop_duplicates().values
        else:
            return []
    # If the results page cannot be parsed, just return an empty list
    except lxml.etree.ParserError:
        return []
    # If some other exception, write to log in addition to returning an empty list
    except Exception as e:
        logging.info(f"DuckDuckGo search for {qry} has failed: {e}")
        return []


def __search_wikipedia(qry, lang="es"):
    """Searches Wikipedia for a given query and returns the results

    :param qry: Query to search
    :param lang: Language of the query
    :return: List of results
    """
    wikipedia.set_lang(lang)
    try:
        doc = wikipedia.summary(f"{qry}")
        return [doc]
    # If there is no page for the query, just return an empty list
    except wikipedia.PageError:
        return []
    # If some other exception, write to log in addition to returning an empty list
    except Exception as e:
        logging.info(f"Wikipedia search for {qry} has failed: {e}")
        return []


def search_web(qry, lang="ES", max_length=200, max_results=5):
    """Searches DuckDuckGo and Wikipedia for a given query and returns the results

    :param qry: Query to search
    :param lang: Language of the query
    :param max_length: Maximum length of the results
    :param max_results: Maximum number of results to return
    :return: String with the concatenated results
    """
    searches_lang_dict = {
        "ES": [
            f"productos tratamiento {qry} farmacia",
            f"marcas tratamiento {qry} farmacia",
            f"que es {qry}",
        ],
        "FR": [
            f"produits de soin {qry} pharmacie",
            f"marques de soin {qry} pharmacie",
            f"qu'est-ce que {qry}",
        ],
    }
    doc = ""
    # Query DuckDuckGo with the context-specific searches until enough text is gathered
    for search in searches_lang_dict[lang]:
        if len(doc) < max_length:
            doc += " |||| ".join(__search_ddg(search, lang=lang.lower(), max_results=max_results)).lower()
    # Fall back to Wikipedia if the corpus is still too short
    if len(doc) < max_length:
        doc += " |||| ".join(__search_wikipedia(f"{qry}", lang=lang.lower()))

    return doc.replace('"', "'")
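
As a hypothetical illustration, the “xls” case discussed above could be handled with a call like the following (the actual text returned depends on what DuckDuckGo and Wikipedia serve at query time):

corpus_for_xls = search_web("xls", lang="ES", max_length=200, max_results=5)
print(corpus_for_xls[:300])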

The second step is to preprocess the acquired data. We will be using the nltk and bs4 Python libraries to clean and tokenize the text, creating an array of tokens for each product, query and web search result. Other approaches could be considered (state-of-the-art models such as Transformers use a subword tokenization approach), but for our case we will keep it simple, just splitting the text into words and removing stop words and punctuation.

More specifically, the nltk library provides us with the list of stop words for each language that our users speak, which we augment with a set of custom stop words that we have identified for our use case. Moreover, the bs4 library automates the removal of HTML tags, while tokenization and lowercasing take care of punctuation and capitalization (to avoid counting the same word multiple times due to inconsistent casing).

import re

import bs4
import nltk
from nltk.tokenize import RegexpTokenizer
from unidecode import unidecode


def __get_dw_stopwords(language, data_warehouse):
    """Retrieves a list of custom stopwords from the data warehouse

    :param language: One of the languages accepted by the nltk.corpus.stopwords module
    :param data_warehouse: SnowflakeHook object
    :return: List of stopwords
    """
    query = """
        SELECT word
        FROM int.nlp_stopwords
        WHERE language = %(language)s
    """
    stopwords = data_warehouse.get_pandas_df(query, parameters={"language": language}).word.values
    return stopwords


def get_stopwords(language, get_dw_words=True, data_warehouse=None):
    """Retrieves a list of stopwords for the provided language

    :param language: One of the languages accepted by the nltk.corpus.stopwords module
    :param get_dw_words: Boolean to select if custom stopwords from the data warehouse are to be retrieved
    :param data_warehouse: SnowflakeHook object, only needed when get_dw_words is True
    :return: List of stopwords
    """
    nltk.download("stopwords")
    stopwords = nltk.corpus.stopwords.words(language)
    if get_dw_words:
        dw_stopwords = __get_dw_stopwords(language, data_warehouse)
        stopwords += list(dw_stopwords)
    return stopwords


def clean_text(text, stopwords=[]):
    """Cleans the provided text (by applying transformations such as strip accents, lower, strip HTML tags, strip punctuation, avoid duplicated consecutive words, and remove stopwords)

    :param text: String to clean
    :param stopwords: List of words to exclude from the tokens
    :return: String preprocessed
    """
    text = unidecode(text)
    text = text.lower()
    text = bs4.BeautifulSoup(text, features="lxml").text
    text = re.sub(r"\b(\w+)( \1\b)+", r"\1", text)  # avoid consecutive duplicated words

    tokenizer = RegexpTokenizer(r"[\w]+")  # only alphanumeric characters, no punctuation
    tokens = tokenizer.tokenize(text)
    final_text = " ".join([w for w in tokens if w not in stopwords])

    return final_text
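
As a quick, hypothetical illustration of the preprocessing step (the product text and the expected output below are illustrative):

spanish_stopwords = get_stopwords("spanish", get_dw_words=False)
raw_text = "<p>Heliocare 360 Color Gel Oil-Free SPF 50+ Beige, 50 ml</p>"
print(clean_text(raw_text, stopwords=spanish_stopwords))
# Roughly: "heliocare 360 color gel oil free spf 50 beige 50 ml"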

Data Modeling

Now that we have created a curated corpus for each entry of I and each q from our users, how do we link them? With an NLP model.

NLP models have advanced greatly in recent years (yes, we have heard all the fuss around ChatGPT and GPT-4), but these powerful models are simply way too expensive to train from scratch in-house, and fine-tuning one instead would restrict us to only a couple of hundred new tokens at best. We, however, need to be able to account for all the possible tokens that can be found in our corpus, which contains specific terms in different languages, healthcare vocabulary and trademarked brand and product names (with new ones appearing every day). For this reason, we will be using a more flexible model, which is also cheaper to implement and simpler to debug.

This model is the gensim Doc2Vec skip-gram model, an implementation built on top of the Word2Vec model described in Efficient Estimation of Word Representations in Vector Space. It is a very simple neural network that learns to predict the window of N words around a particular word, fitting K factors that are used to embed implicit features of the words. In other words, for every word in the corpus it learns a vector representation that is optimized to predict words in the same context. Since similar words are expected to appear in the same contexts, they end up having similar embeddings and are therefore closer in the vector space. Applying this same technique at a higher level, to documents instead of words, allows Doc2Vec to return document vectors/embeddings in addition to the Word2Vec word vectors/embeddings.

We train this model on the corpus that we prepared in the previous step (the corpus of tokens from all the products, queries and web search results), and store it so that at inference time we can use it to calculate the similarities between products and queries:

import logging

import gensim


def get_documents(corpus):
    """Preprocesses the corpus into a list of documents for training

    :param corpus: Pandas Series with the corpus
    :return: List of documents
    """
    documents = []
    for row in corpus:
        pre_processed_doc_text = clean_text(str(row))
        documents.append(pre_processed_doc_text)

    return documents


def get_corpus(documents, tags):
    """Generates TaggedDocument objects from a list of documents and tags

    :param documents: List of documents
    :param tags: List of tags
    :return: Generator of TaggedDocument objects
    """
    for document, tag in zip(documents, tags):
        # Each cleaned document is a whitespace-separated string of tokens at this point
        yield gensim.models.doc2vec.TaggedDocument(document.split(), [tag])


def train(
    corpus,
    vector_size,
    min_count,
    window,
    dbow_words,
    workers,
    num_epochs,
    data_warehouse,
    model_path,
):
    """Trains a Doc2Vec model on the provided corpus

    :param corpus: Pandas Series with the corpus
    :param vector_size: Dimensionality of the feature vectors
    :param min_count: Ignores all words with total frequency lower than this
    :param window: Maximum distance between the current and predicted word within a sentence
    :param dbow_words: If set to 1 trains word-vectors (in skip-gram fashion) simultaneous with DBOW doc-vector training
    :param workers: Use these many worker threads to train the model
    :param num_epochs: Number of epochs to train the model
    :param data_warehouse: SnowflakeHook object to connect to the data warehouse
    :param model_path: Path to store the trained model
    :return: None
    """
    logging.info("Generating train corpus...")
    documents = get_documents(corpus)
    tags = list(corpus.index.astype(str))
    train_corpus = list(get_corpus(documents, tags))

    model = gensim.models.doc2vec.Doc2Vec(
        vector_size=vector_size,
        min_count=min_count,
        sample=0,
        window=window,
        dm=0,
        dbow_words=dbow_words,
        workers=workers,
    )

    model.build_vocab(train_corpus)
    model.train(
        train_corpus,
        total_examples=len(train_corpus),
        epochs=num_epochs,
        compute_loss=True,
    )

    logging.info(f"Saving model into {model_path}...")
    # save_model_to_s3 and the s3 client are internal helpers, simplified out of this article
    save_model_to_s3(s3, model, model_path)
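
To give an idea of what the trained embeddings capture, here is a minimal, self-contained toy sketch (the documents are illustrative, not our production corpus):

toy_corpus = [
    "fotoprotector facial oil free spf 50",
    "crema solar proteccion spf 50 piel sensible",
    "valeriana capsulas para dormir mejor",
]
tagged_docs = [
    gensim.models.doc2vec.TaggedDocument(doc.split(), [str(i)])
    for i, doc in enumerate(toy_corpus)
]

toy_model = gensim.models.doc2vec.Doc2Vec(
    vector_size=32, min_count=1, window=3, dm=0, dbow_words=1, workers=1
)
toy_model.build_vocab(tagged_docs)
toy_model.train(tagged_docs, total_examples=len(tagged_docs), epochs=100)

# The two sun-protection documents should typically score closer to this query
# than the sleep-supplement one (model.docvecs is model.dv in gensim >= 4)
query_vector = toy_model.infer_vector("fotoprotector spf 50".split())
print(toy_model.docvecs.most_similar([query_vector], topn=3))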

We will be using cosine similarity to calculate the similarities between products and queries, which is defined as:
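
\text{sim}(a, b) = \cos(\theta) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}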

where a and b are the embedding vectors of the product and the query respectively. This measure of similarity has a value between -1 and 1, where 0 means that the vectors are orthogonal, 1 means that they point in the same direction and -1 means that they point in opposite directions. This similarity is used to rank the queries for each product; the top N queries are then selected as the most relevant ones.
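
As a quick sanity check outside of gensim, cosine similarity can be computed directly with numpy (a minimal sketch with toy vectors):

import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy 3-dimensional "embeddings" for a product and a query
product_vector = np.array([0.2, 0.7, 0.1])
query_vector = np.array([0.25, 0.6, 0.05])
print(cosine_similarity(product_vector, query_vector))  # close to 1: very similar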

This measure of similarity is already implemented in gensim under the model.docvecs.most_similar method, which returns the most similar vectors to a given one, along with their cosine similarity. These similarities are bounded to the range 0 to 1 as per its implementation. We can use this method to calculate the most similar queries for each product:

def __infer_vector_for_product_text(
    product_text,
    model,
    num_epochs=None,
):
    """Infer vector for a product text

    :param product_text: Text of the product
    :param model: Doc2Vec model
    :param num_epochs: Number of epochs to infer the vector
    :return: Vector representation of the product
    """
    pre_processed_doc_text = clean_text(product_text)
    # infer_vector expects a list of tokens, not a raw string
    return model.infer_vector(pre_processed_doc_text.split(), epochs=num_epochs)


def get_most_similar_queries_for_product(
    queries,
    product_text,
    model,
    num_similar=1000,
    threshold=0,
    num_epochs=None,
    max_results=100,
):
    """Retrieve most similar queries for a product

    :param queries: Pandas dataframe with the queries
    :param product_text: Text of the product
    :param model: Doc2Vec model
    :param num_similar: Number of similar queries to evaluate
    :param threshold: Threshold to filter the results
    :param num_epochs: Number of epochs to infer the vector
    :param max_results: Maximum number of results to return
    :return: Pandas dataframe with the most similar queries for a product
    """
    returned_queries = queries.copy()

    positive_vector = __infer_vector_for_product_text(product_text, model, num_epochs)
    most_similar_results = model.docvecs.most_similar(
        positive=[positive_vector],
        topn=num_similar,
    )
    results = pd.DataFrame(
        {str(x[0]): x[1] for x in most_similar_results}.items(), columns=["search_query", "gensim_scoring"]
    )
    returned_queries = returned_queries.merge(results, on="search_query", how="inner")

    # Keep only matches above the indicated threshold which contain letters, not just numbers
    # (so they are search queries, not product_ids)
    queries_filter = (returned_queries["gensim_scoring"] >= threshold) & (
        returned_queries["search_query"].str.contains(r"[a-zA-Z]+")
    )
    returned_queries = returned_queries[queries_filter].sort_values("gensim_scoring", ascending=False).head(max_results)
    return returned_queries
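
A hypothetical call, assuming the trained model has been loaded back from storage and the popular queries have been read from the data warehouse (the path, the query list and the parameter values are illustrative):

model = gensim.models.doc2vec.Doc2Vec.load("doc2vec_es.model")
queries = pd.DataFrame({"search_query": ["fotoprotector 50", "protector solar facial", "valeriana"]})

suggested_terms = get_most_similar_queries_for_product(
    queries=queries,
    product_text="Heliocare 360 Color Gel Oil-Free SPF 50+ Beige 50ml",
    model=model,
    num_similar=1000,
    threshold=0.3,
    max_results=20,
)
print(suggested_terms[["search_query", "gensim_scoring"]])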

Results

So, how do we actually use the result of all these formulas, data acquisition and calculations? What does the model output look like? For us, the output is a table in our data warehouse with the gensim scoring for the most similar queries per language and per product. This table is refreshed each time our batch inference process runs, but we can also use the model to do inference on demand, in real time. For our in-house business team, the output is a dashboard in our Business Intelligence tool with an ordered list of the most similar queries per language and per product. For our partners, the output is a list of suggested search terms for each of their products in our Retail Media Platform.

Here is an example, looking at that dashboard in our Business Intelligence tool, for a specific product and language, in this case, for the product “Heliocare 360° Color Gel Oil Free SPF 50+ Beige 50ml” for the Spanish language:

As you can see, the model ends up recommending search queries that are relevant to the product. They seem to be related to the brand of the product (“heliocare …”), to the functionality of the product (“fotoprotector”, “oil free”, …) and to another main player within that category of products in our webshop/app (“isdin …”).

Changing perspectives, here is another example, this time looking in our Retail Media platform, as if we were a brand partner with the product “Kneipp Valeriana Classic 60 Grageas”:

As you can see, the model ends up recommending that the partner advertises for search terms which are relevant to the product. Again, they seem to be related to the brand (“kneipp …”), functionality (“valeriana”, “sueño”, “gominolas”, …) and main players within that category (“Zzzquil”, “Aquilea”, …). In this way, our partners can choose search terms with confidence, knowing that these options are both popular in our search engine and relevant to users potentially interested in buying their product.

Once the partner begins advertising for this search term, you can see how their product then appears in our search results, boosted, but not out of place, clearly related to the other products found for that search:

Conclusion

As you have seen, by taking a simpler NLP approach, we are able to create a model which accommodates our custom vocabulary and can be re-trained frequently in order to avoid drift and react faster to market and user behavior changes. Additionally, by using third-party libraries and encapsulating most of our implementation details within idempotent functions, we reduce the engineering effort required to maintain this model and make it more robust to future changes. All of this serves our goal of continuing to provide our customers with the most relevant products for their health while collaborating with external partners through Retail Media.

Our real, in-production pipeline builds on the one demonstrated here and has some additional intelligence, such as weighting popular keywords more heavily than niche ones in the model. If you want to see it and help us develop for tomorrow’s healthcare, join us at PromoFarma by DocMorris!

If by chance you are a potential partner in the Health and Personal Care sector interested in seeing how we can help you boost your products, don’t hesitate to contact us at advertising@promofarma.com.

Thank you all for your interest and happy coding!

By Josu Alonso, Data Scientist

References

- Gensim: Topic Modelling for Humans

- Natural Language Toolkit

- The Illustrated Word2Vec

- Efficient Estimation of Word Representations in Vector Space

- Airflow in PromoFarma
