Combining predictive and generative AI for review moderation

Axel Tidemann
Takealot Engineering
Jun 20, 2024

After purchasing on takealot.com, buyers are asked to write a review about the product. An accepted review describes the product itself, for example:

This product stinks, the colour is ugly, and it easily breaks.

A rejected review does not describe the product. Often, reviewers describe the service instead. This would be a rejected review:

Takealot is the best company ever, and the delivery service rules!

This does not help the customers understand what the product is like.

Thousands of reviews are submitted each day. Earlier, when volumes were much lower, human moderators would accept or reject each review, which could take hours or days. Although this tedious approach slowed down getting customer reviews on site, it left Takealot with a wealth of labelled data.

The following sections describe the review process.

First port of call: check for foul language

This is the easiest test: a simple check against a list of unacceptable words. In addition, the reviews are run through an off-the-shelf model that flags inappropriate language.
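In its simplest form, the word-list check could look something like the sketch below (the word list, function name, and the placeholder for the off-the-shelf model are illustrative assumptions, not Takealot's actual implementation):

```python
# Minimal sketch of a word-list check; the list itself is hypothetical.
import re

BANNED_WORDS = {"badword1", "badword2"}  # placeholder word list

def contains_foul_language(review: str) -> bool:
    # Lowercase and tokenise the review, then match against the word list.
    tokens = re.findall(r"[a-z']+", review.lower())
    return any(token in BANNED_WORDS for token in tokens)

# Reviews that pass the word list would then go through an off-the-shelf
# inappropriate-language model before reaching the classifier.
```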

Accept or reject with predictive AI

A deep learning model is trained every month (to deal with data drift) to accept or reject reviews, using TensorFlow Extended. (To differentiate from generative AI, this approach is now often referred to as “predictive” or “traditional” AI.) The model is a binary classifier, using a frozen BERT layer to provide embeddings of the review text, followed by one-dimensional convolutions.

In addition, the review characters are tokenised and passed through their own convolutions. This caters for emojis and other language peculiarities in a South African context. Reviews are required to be in English, but many words used in an English context are loan words from some of South Africa's many languages (my favourite being “lekker”, which means “great” or “awesome”). This addition increased the classifier's performance slightly.

The product rating (1–5, where 5 is best) is treated as a categorical variable, to help the classifier understand the difference between a low and a high rating.

Neural network architecture for classification of reviews.
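For illustration, a minimal Keras sketch of a classifier along these lines could look as follows. The BERT handle, layer sizes, and vocabulary sizes are assumptions, not the production model:

```python
# Sketch of a classifier combining frozen BERT embeddings, character
# convolutions, and the rating as a categorical input. All sizes are assumed.
import tensorflow as tf
import tensorflow_hub as hub

MAX_CHARS = 512
NUM_RATINGS = 5

# Word-level branch: frozen BERT embeddings followed by 1D convolutions.
text_in = tf.keras.Input(shape=(), dtype=tf.string, name="review_text")
preproc = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
encoder = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4", trainable=False
)
seq = encoder(preproc(text_in))["sequence_output"]
word_feat = tf.keras.layers.GlobalMaxPooling1D()(
    tf.keras.layers.Conv1D(64, 3, activation="relu")(seq)
)

# Character-level branch: caters for emojis, loan words, and other peculiarities.
char_in = tf.keras.Input(shape=(MAX_CHARS,), dtype=tf.int32, name="review_chars")
char_emb = tf.keras.layers.Embedding(input_dim=1024, output_dim=16)(char_in)
char_feat = tf.keras.layers.GlobalMaxPooling1D()(
    tf.keras.layers.Conv1D(64, 5, activation="relu")(char_emb)
)

# Product rating (1-5) as a one-hot categorical feature.
rating_in = tf.keras.Input(shape=(NUM_RATINGS,), dtype=tf.float32, name="rating_onehot")

merged = tf.keras.layers.Concatenate()([word_feat, char_feat, rating_in])
hidden = tf.keras.layers.Dense(64, activation="relu")(merged)
output = tf.keras.layers.Dense(1, activation="sigmoid", name="accept_probability")(hidden)

model = tf.keras.Model([text_in, char_in, rating_in], output)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```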

This is a rather lightweight model that runs perfectly fine on a CPU, as a TensorFlow Serving deployment in a Kubernetes cluster. The machine learning model moderates about 90% of the submitted reviews, and a submitted review is normally moderated within a minute. In cases where the model is unsure (i.e. the output is not close to either 0 or 1), the review is sent for manual moderation.
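As a sketch, the routing around the TensorFlow Serving endpoint might look like this (the endpoint name and the thresholds for "unsure" are illustrative assumptions):

```python
# Sketch of scoring a review via TensorFlow Serving's REST API and routing it.
import requests

SERVING_URL = "http://tf-serving.reviews.svc:8501/v1/models/review_moderation:predict"
LOWER, UPPER = 0.2, 0.8  # assumed band; scores in between count as "unsure"

def moderate(features: dict) -> str:
    resp = requests.post(SERVING_URL, json={"instances": [features]}, timeout=5)
    resp.raise_for_status()
    score = resp.json()["predictions"][0][0]  # sigmoid output of the classifier
    if score >= UPPER:
        return "accept"
    if score <= LOWER:
        return "reject"
    return "manual_moderation"
```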

Volume of reviews submitted in the past year. The coverage shows the percentage moderated by the machine learning model. Notice the peak in December and January, as that coincides with Black Friday and the holiday season.

Previously, when a review was rejected, customers were presented with a standard error message that urged them to read the review guidelines. This was a frustrating experience for many customers. However, building a classifier for the different rejection reasons is a hard task, due to the lack of labelled data and the severe imbalance between the classes. Enter generative AI.

Generative AI to provide rejection reasons

The introduction of large language models (LLMs) and, more importantly, their availability through APIs have been a real game changer for text-based AI applications. Whereas before you’d need hundreds of thousands of examples to train a model, it is now possible to do prompt engineering, zero- or few-shot learning, or fine-tuning with only a few hundred examples. The time to release a product has been greatly reduced.

The first attempt was to assign rejection categories through prompt engineering, using the PaLM 2 model (more specifically, text-bison 001) on Google Cloud Platform’s Vertex AI. This approach was very prone to hallucinations, and it was simply not possible to get the desired results.
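A prompt-engineering attempt of this kind, using the Vertex AI SDK, might look like the sketch below (the prompt wording and rejection categories are illustrative, not the prompts used in production):

```python
# Sketch of assigning a rejection category via prompt engineering on text-bison.
import vertexai
from vertexai.language_models import TextGenerationModel

vertexai.init(project="my-gcp-project", location="us-central1")  # hypothetical project
model = TextGenerationModel.from_pretrained("text-bison@001")

prompt = """Classify why this product review should be rejected.
Choose one category: SERVICE_FEEDBACK, DELIVERY_FEEDBACK, NOT_ABOUT_PRODUCT, OTHER.

Review: "Takealot is the best company ever, and the delivery service rules!"
Category:"""

response = model.predict(prompt, temperature=0.0, max_output_tokens=16)
print(response.text)
```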

After more trial and error, it was decided to fine-tune the PaLM 2 model. This is a simple approach: the only requirement is to specify input and output text in JSONL format and submit the training job to Vertex AI.
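As a sketch, the training data and tuning call could look like this (the field names follow the Vertex AI supervised-tuning format for PaLM 2; the example, bucket, and step count are hypothetical):

```python
# Sketch of a PaLM 2 fine-tuning job on Vertex AI.
#
# training_data.jsonl (one example per line), e.g.:
# {"input_text": "Takealot is the best company ever, and the delivery service rules!",
#  "output_text": "The review describes the delivery service rather than the product."}

from vertexai.language_models import TextGenerationModel

model = TextGenerationModel.from_pretrained("text-bison@001")
model.tune_model(
    training_data="gs://my-bucket/training_data.jsonl",  # hypothetical bucket
    train_steps=100,
    tuning_job_location="europe-west4",
    tuned_model_location="us-central1",
)
```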

Andrew Ng recommends using a small dataset when fine-tuning, and there were around 200 samples in the training set. Since Vertex AI’s pricing for tuning is opaque, this also turned out to be a wise (or rather, lucky) decision: it was quite the surprise to see the training job cost $200, around $1 per training sample.

The fine-tuning job yielded good results, and the tuned model is now live in production, providing detailed rejection reasons to customers, who are hopefully less frustrated and can adjust their reviews accordingly.

Cost/benefit tradeoff analysis with regard to traditional methods and new, shiny toys

It is impossible to avoid the hype around LLMs and their promise of speeding up development of any application that needs some form of natural language understanding and generation. However, there are still two very important considerations to keep in mind: cost per token and rate limiting. Even though token costs are dropping and rate limits are increasing, from a financial perspective (keeping in mind that Takealot operates in the South African economy, which is sensitive to changing exchange rates), what made the most sense in this case was to combine the traditional approach with the generative approach.
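Putting the pieces together, the combined pipeline could be sketched as follows (the helper functions are placeholders referring back to the earlier sketches; build_features and generate_rejection_reason are hypothetical names):

```python
# Sketch of the combined moderation pipeline: predictive model first,
# generative model only where it adds value.

def handle_review(review: dict) -> dict:
    # 1. Foul-language check (see the word-list sketch above).
    if contains_foul_language(review["text"]):
        return {"status": "rejected", "reason": "Inappropriate language."}

    # 2. Predictive classifier via TensorFlow Serving (see the routing sketch).
    decision = moderate(build_features(review))  # build_features is hypothetical
    if decision == "manual_moderation":
        return {"status": "pending", "reason": "Sent for human moderation."}
    if decision == "accept":
        return {"status": "accepted"}

    # 3. Only rejected reviews incur an LLM call, which keeps token costs
    #    and rate limits manageable. generate_rejection_reason would wrap
    #    the fine-tuned PaLM 2 model.
    reason = generate_rejection_reason(review["text"])
    return {"status": "rejected", "reason": reason}
```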
