
How to create your own AI Chat Moderation model

Lessons learned from building a chat message classifier internally

Aurélien Houdbert
Vestiaire Connected

--

1. Introduction — Why monitoring and moderating your platform’s chat is crucial

On Vestiaire Collective, buyers and sellers can discuss through a chat interface to get more information about products, negotiate prices, clarify item conditions, and agree on sales. This direct communication enables more personal and efficient transactions.

However, this chat feature is also an open door for scammers or users trying to avoid platform fees. Despite our platform’s security and protection measures, some users still try to exchange private information to finalize deals in person or on less secure platforms.

  • Scammers negatively impact user experience, resulting in decreased trust in our platform.
  • Circumvention directly affects GMV (Gross Merchandise Value, i.e. revenue) and Cost Per Order by shifting transactions outside the platform.
  • Toxic behavior significantly degrades user experience and engagement.

Challenges

Automatically blocking messages and banning users from the chat raises several challenges:

  • Our AI model must strike a balance between recall and precision: it must capture a sufficient proportion of unwanted messages (recall) while ensuring that the messages it flags really are unwanted (precision). Incorrectly banning users can lead to poor user experience and decreased engagement.
  • Scam messages and circumvention attempts require different banning processes. Scammers should be identified quickly and permanently banned, while legitimate users attempting to trade outside our platform should undergo an educational process with progressive banning measures.

2. Chat message classification — A short review of available solutions

There are various approaches you can use to classify text.

Regex

Regex, or pattern matching, is often the easiest way to classify text. It involves defining a set of prohibited words/patterns and creating regex rules around them. However, developing a comprehensive list of patterns can be time-consuming and is not robust enough to counteract sophisticated scammers. Regex also lacks semantic understanding, leading to misinterpretations.

For example, if your regex rules include the word “Instagram,” you will not be able to differentiate between:

Circumvention attempt: “Do you have Instagram?”
Legit information: “I bought it 2 years ago from someone I met on Instagram.”

In our case, the first message intends to move the discussion to Instagram, while the second message only provides information on the item’s origin.
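To make the limitation concrete, here is a minimal sketch of a regex moderation filter; the pattern list is illustrative, not our production rules. Both messages above get flagged, because the regex has no notion of intent:

```python
import re

# Illustrative patterns only, not production rules: social platform
# mentions and loosely obfuscated phone numbers.
SUSPICIOUS_PATTERNS = [
    r"\binsta(gram)?\b",
    r"\bwhats\s?app\b",
    r"(\d[\s.\-~]*){9,}",  # 9+ digits, possibly separated by spaces/dots/dashes
]

def flag_message(message: str) -> bool:
    """Return True if any prohibited pattern matches the message."""
    return any(re.search(p, message, re.IGNORECASE) for p in SUSPICIOUS_PATTERNS)

print(flag_message("Do you have Instagram?"))                                    # True
print(flag_message("I bought it 2 years ago from someone I met on Instagram."))  # True
```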

Classic Machine Learning

Machine Learning is a powerful tool for text classification. Most traditional ML techniques utilize word counts and co-occurrence methods, such as TF-IDF (Term Frequency-Inverse Document Frequency). Common algorithms include Naive Bayes, Support Vector Machines (SVM), and Logistic Regression. In many cases, traditional machine learning can achieve performance comparable to larger deep learning models, especially when datasets are not very large or complex.
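Such a pipeline can be assembled in a few lines with scikit-learn; the training data below is just a placeholder (a real dataset holds many thousands of labeled messages):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder data: a real dataset holds many thousands of labeled messages.
messages = ["Do you have Instagram?", "Is the bag still available?"]
labels = [1, 0]  # 1 = unsafe, 0 = safe

# Word unigrams + bigrams: the n-gram range is where vocabulary size explodes.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), lowercase=True),
    LogisticRegression(),
)
model.fit(messages, labels)
print(model.predict(["do you have insta?"]))
```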

However, they face three main challenges in the context of chat message moderation:

  1. Scammers’ pattern adaptability: Scammers can quickly adapt their language and strategies, rendering your model’s vocabulary and features obsolete.
  2. Multilingual setting: In a multilingual environment, the vocabulary can grow rapidly, resulting in very large embeddings. This can lead to increased computational resources and complexities in model management.
  3. Broader context comprehension: capturing context beyond single words requires n-grams (“word groups”), which makes the vocabulary size grow even faster.

Deep Learning LLMs

Large Language Models (LLMs) such as BERT excel in text classification tasks due to their transformer-based architecture, which captures text semantics and nuances.

These models are relatively easy to use and fine-tune using the transformers library from HuggingFace. HuggingFace provides pre-trained models and user-friendly tools to customize them for your application.

However, there are several considerations to keep in mind when using deep learning models:

  1. Computational Resources: Training, fine-tuning, and deploying BERT or other deep learning models can be resource-intensive.
  2. Data Requirements: Deep learning models often require large amounts of labeled training data to achieve optimal performance. Acquiring and labeling this data can be time-consuming and expensive.
  3. Interpretability: Deep learning models, especially those based on transformers, do not easily provide insight into which features are used to make decisions, which can be an issue in applications requiring high levels of transparency.

Fine-tuning pre-trained models can help tackle the first two points: you will need far less data and far fewer computing resources. Using the HuggingFace model repository, you can find open models pre-trained on general tasks or on tasks similar to yours.
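As an example, here is roughly what the starting point looks like with the transformers library; the checkpoint name is an assumption (any multilingual BERT-style model from the Hub would do):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed checkpoint: any multilingual BERT-style model from the Hub would do.
checkpoint = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint,
    num_labels=2,  # a fresh classification head is added on top of the encoder
)

inputs = tokenizer("Do you have Instagram?", truncation=True, max_length=512, return_tensors="pt")
logits = model(**inputs).logits  # the head is untrained here: fine-tuning is still required
```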

3. Data collection — How we built a dataset out of poorly labeled data

Vestiaire Collective historically used a regex moderation system to identify suspicious messages, which were then manually reviewed by human annotators. This process resulted in a dataset of manually verified messages, providing an excellent source of information for the model to understand semantic nuances in messages.

But there were still two main issues:

  1. This data only represents messages flagged by regex, leaving blind spots for patterns the regex never caught.
  2. Human labelers are not 100% accurate.

Heuristics

To tackle the first issue, we came up with various heuristics to enrich our dataset with additional safe, circumvention, and scam messages drawn from the full set of chat messages sent on the app.

For instance, one heuristic we employ to identify scam messages combines the number of line breaks in the message, the number of messages already in the channel, and the account age in days.

We have implemented similar heuristics for phone numbers and safe message identification, among others.
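As an illustration, a toy version of such a heuristic might look like this; the thresholds are hypothetical, not our production values:

```python
def looks_like_scam(message: str, n_channel_messages: int, account_age_days: int) -> bool:
    """Toy scam heuristic; all thresholds are hypothetical.

    Long copy-pasted messages (many line breaks) sent at the very start of a
    conversation from a freshly created account are typical of scam campaigns.
    """
    many_line_breaks = message.count("\n") >= 5
    early_in_channel = n_channel_messages <= 2
    new_account = account_age_days <= 7
    return many_line_breaks and early_in_channel and new_account
```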

Message relabeling

Because human labelers are not 100% accurate at classifying messages, they introduce label noise that results in less stable training and lower performance.

To improve labeling accuracy, we tested self-training and clustering/majority voting techniques.
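A minimal sketch of the clustering/majority-voting idea, assuming message embeddings have already been computed; in practice you would only overwrite labels in sufficiently pure clusters:

```python
import numpy as np
from sklearn.cluster import KMeans

def majority_vote_relabel(embeddings: np.ndarray, labels: np.ndarray, n_clusters: int = 50) -> np.ndarray:
    """Cluster message embeddings and replace each label with the majority
    label of its cluster, smoothing out individual annotator mistakes."""
    clusters = KMeans(n_clusters=n_clusters, n_init="auto").fit_predict(embeddings)
    relabeled = labels.copy()
    for c in np.unique(clusters):
        mask = clusters == c
        relabeled[mask] = np.bincount(labels[mask]).argmax()
    return relabeled
```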

LLM relabeling

When exploring data relabeling solutions, we also tried to use the latest LLMs to label our dataset. Generative AI LLMs have strong semantic understanding capabilities. With a little prompt engineering, it is possible to describe our moderation rules and chat guidelines to the model.

In our experiments, we used proprietary models such as ChatGPT (3.5 and above) from OpenAI and Claude (version 1 and above) from Anthropic, in addition to trying open-source models such as Mistral (Mixtral-8x7b, Mistral-7b) and Llama 2.

These models have strong semantic understanding capabilities but often fail to grasp the intent behind a single message. To reuse the same example, “Do you have Instagram?”, most of these models fail to understand the real underlying intention, which is to move the discussion outside the platform.

To improve performance, we tried various prompting strategies:

  • Few-shot prompting: providing a few examples of correct classifications along with the expected output format.
  • Reasoning strategy: providing a reasoning framework that forces the model to describe the message content and interpret its intent before committing to a label (a combined sketch follows this list).
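Here is a condensed illustration of a prompt combining both strategies; the guidelines and examples are illustrative, not our full moderation policy:

```python
# Condensed prompt combining few-shot examples with a reasoning step.
# Guidelines and examples are illustrative, not our full policy.
PROMPT_TEMPLATE = """You moderate a fashion marketplace chat. Classify the last message
as SAFE, CIRCUMVENTION (moving the deal off-platform) or SCAM.
First explain the intent in one sentence, then give the label.

Message: "Do you have Instagram?"
Reasoning: the user wants to move the discussion off the platform.
Label: CIRCUMVENTION

Message: "I bought it 2 years ago from someone I met on Instagram."
Reasoning: the user is describing the item's origin, not moving the deal.
Label: SAFE

Message: "{message}"
Reasoning:"""
```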

These methods improved the raw performance of the model but weren’t good enough to relabel our entire dataset.

We also fine-tuned small open-source models (Llama 2 in its 7B and 13B versions, and Mistral 7B) on a small sample of curated messages and labels. With fine-tuning, we tried to teach the model our moderation guidelines and some reasoning strategies. The model did learn our rules and reasoning strategy, but it still could not understand underlying intents.

Even though the results are not yet satisfactory, this work is still ongoing.

Contextualization of messages

The first versions of our model were performing inference at the message level. This strategy works well but sometimes lacks the context of previous messages. For example, the message “33” could be an answer to a user inquiring about the size of the item, but it could also be the first part of a phone number sent over multiple messages (+33 is the French country code).

🙍‍♀️: “Hey, what is the size?”
🙎‍♂️: “33”

🙎‍♂️: “Here is my”
🙎‍♂️: “number”
🙎‍♂️: “33”
🙎‍♂️: “06~”
🙎‍♂️: “82”

Such a model can work on messages concatenated with a few past messages, but performance suffers: messages with their context are much longer than single messages, and the first version of the model had not been trained on such input lengths, so it often got lost when a large context carried too much information.

Re-building labeled messages within entire conversations

To achieve great performance for both individual messages and messages within their context (past messages), we needed to rethink our dataset. We redefined our heuristics to identify conversations where all messages are safe.

For conversations with unsafe messages, we ideally need perfect labels for all messages in the context. This allows us to create data batches with varying context lengths, helping the model understand what makes a conversation unsafe.

However, as the dataset size increases, this requirement becomes less critical since larger datasets naturally capture more nuances and variations.

Data processing

We mentioned that using models and tokenizers from HuggingFace requires very few pre-processing steps.

However, some special characters might not be handled correctly by the tokenizer and get assigned the [UNK] unknown token (the default token for elements/words/subwords missing from the tokenizer’s vocabulary). This is typically the case for emojis 😀. If your pre-processing doesn’t handle emojis correctly, scammers might be able to communicate information through emojis.

“0️⃣6️⃣4️⃣2️⃣…” and “🔥🔥🔥🔥…” will both be tokenized as “[UNK] [UNK] [UNK] [UNK]…”, making it extremely difficult to correctly predict the message label.

But if you convert emojis to text before tokenization, you will end up with a far better emoji representation in your model.

“0️⃣6️⃣4️⃣2️⃣…” is converted to “:zero: :six: :four: :two: …”, which yields a precise tokenized sentence.
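One way to perform this conversion is the `emoji` Python package; the exact aliases it produces depend on the library version:

```python
import emoji  # pip install emoji

text = "0️⃣6️⃣4️⃣2️⃣"
print(emoji.demojize(text))
# e.g. ':keycap_0::keycap_6::keycap_4::keycap_2:' depending on the version:
# the digits become plain text the tokenizer can split into known subwords.
```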

As part of the tokenizer choice, you can also experiment with the cased (sensitive to case) or uncased (all characters are lowercased, accents are removed, etc.) versions. If your use case involves only English text classification, you might want to go with an uncased tokenizer, whereas in a multilingual setup a cased tokenizer is better suited to keep all accents and other language specificities.
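A quick way to see the difference, assuming the standard multilingual BERT checkpoints from the Hub:

```python
from transformers import AutoTokenizer

cased = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
uncased = AutoTokenizer.from_pretrained("bert-base-multilingual-uncased")

# The uncased tokenizer lowercases and strips accents, losing signal
# that can matter in a multilingual setting.
print(cased.tokenize("Téléphone"))
print(uncased.tokenize("Téléphone"))
```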

4. BERT — Efficient text classification using Transformers Architecture

BERT

For text classification tasks, many open-source architectures and pre-trained models are available. We chose a pre-trained BERT, mostly because it was trained on multilingual data (>100 languages) and had different tokenizers available.

The BERT model we use is a relatively small model of 179 million parameters, which requires only 700 MB to fit into memory. Although you need only one small GPU for efficient fine-tuning, this model can be deployed on a CPU and still guarantee a short response time: in our case, the 95th-percentile response time is below 100 ms. In comparison, a 70-billion-parameter LLM (such as Llama 3 70B) would require around 260 GB to fit into memory.

The very first version of our model was a binary classifier: the poor initial data quality led us to merge the circumvention and scam labels into a single category.

Performance was great at this point, but we needed to differentiate between a scam message and a circumvention attempt.

BERT multi-class

Distinguishing between scam messages and circumvention attempts enables targeted blocking and banning procedures. Ideally, soft bans can be implemented for legitimate users unaware of chat guidelines, while scammers should be subject to a more stringent hard ban procedure.

Therefore, the next versions of our classifier were trained on multi-class data and refined using the methods detailed in previous sections.

We observed a slight performance decrease when distinguishing between circumvention and scam messages, likely due to the semantic similarity between both classes. The model struggles to differentiate between the two categories, leading to lower confidence in each class.

BERT + ML classifier

From the two previous versions of our model, we learned that binary classification yielded better results but failed to distinguish between circumvention attempts and scam messages. To address this limitation, we incorporated a CatBoost classifier to predict the likelihood of a message being a scam. This approach leverages categorical features such as account age, sender type, and purchase history to improve our model’s accuracy.
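A minimal sketch of this second stage; the feature names and training data are hypothetical, and the real feature set is richer:

```python
from catboost import CatBoostClassifier, Pool

# Hypothetical features: BERT's "unsafe" probability plus account metadata.
# Columns: [bert_unsafe_proba, account_age_days, sender_type, n_purchases]
features = [
    [0.97, 2, "seller", 0],
    [0.91, 800, "buyer", 14],
]
labels = [1, 0]  # 1 = scam, 0 = circumvention

model = CatBoostClassifier(iterations=200, verbose=False)
model.fit(Pool(features, labels, cat_features=[2]))  # 'sender_type' is categorical

scam_probability = model.predict_proba([[0.95, 1, "seller", 0]])[0][1]
```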

5. Model training — Technical details

Batch creation

Given a set of conversations, how can we create training samples? We want the model to train on single messages and on messages within context. To illustrate our sampling process, let’s take an example conversation that ends with the message “Are you on Instagram?”.

“Are you on Instagram?” is a circumvention attempt: the user is trying to move the conversation to Instagram to continue negotiating prices and avoid platform fees.

From this conversation, we can create various data samples:

  • A single message with a positive label
  • A message plus its context, with a positive label
  • A message plus its context, with a negative label

From a single conversation, we can create multiple training samples of various lengths, with various numbers of messages included in the context, and so on (see the sketch below). By doing so, your model will learn to deal with varying message lengths and understand what makes a set of messages or a conversation unsafe.
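A minimal sketch of this sampling process, assuming message-level labels are available for the whole conversation; the separator and context limit are design choices, not our exact settings:

```python
import random

def make_samples(conversation, max_context=5):
    """Create training samples of varying context lengths from one conversation.

    `conversation` is a list of (message, label) pairs with message-level labels.
    Each sample concatenates a message with 0 to `max_context` previous messages.
    """
    samples = []
    for i, (message, label) in enumerate(conversation):
        k = random.randint(0, min(i, max_context))  # how many past messages to include
        context = [m for m, _ in conversation[i - k:i]]
        text = "\n".join(context + [message])
        samples.append((text, label))  # the sample inherits the current message's label
    return samples
```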

This sampling process also highlights the need for clean labels at the message level. As previously mentioned, having labels for every message in a conversation is more beneficial than having a dataset with labels on individual messages sampled from different conversations.

If you have a large amount of data, it will naturally capture more nuances and variations without needing such a sampling strategy.

Data augmentation

Data augmentation in NLP tasks is less straightforward than for computer vision tasks. The sampling process we use can already be viewed as a dataset augmentation technique.

Chat messages are often misspelled, either through inattention or because scammers deliberately misspell words to try and bypass the model. Based on this observation, we came up with three augmentation techniques, sketched in code after this list:

  • Random character deletion: randomly removing characters in messages. This will break words or tokens, forcing invariance on word misspelling.
  • Random character insertion: randomly inserting characters in messages. This will break words or tokens, forcing invariance on word misspelling.
  • Digit replacement: replacing all digits in a message introduces invariance regarding phone numbers, prices, sizes, etc.
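Minimal sketches of these three augmentations; the probabilities are illustrative:

```python
import random
import string

def random_char_deletion(text: str, p: float = 0.05) -> str:
    """Drop each character with probability p to simulate misspellings."""
    return "".join(c for c in text if random.random() > p)

def random_char_insertion(text: str, p: float = 0.05) -> str:
    """Insert a random letter after each character with probability p."""
    out = []
    for c in text:
        out.append(c)
        if random.random() < p:
            out.append(random.choice(string.ascii_lowercase))
    return "".join(out)

def digit_replacement(text: str) -> str:
    """Replace every digit with a random one so that phone numbers,
    prices and sizes are not memorized verbatim."""
    return "".join(random.choice(string.digits) if c.isdigit() else c for c in text)
```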

AWS training job and hardware choice

To train our BERT model, we use an AWS SageMaker training job with the Hugging Face estimator on a g4dn GPU instance. The model is small enough to fit on the NVIDIA T4 GPU: it requires as little as 700 MB of memory. The limitation comes from the maximum input length and the batch size. In our case, training with a maximum length of 512 tokens (the upper bound on BERT’s input length), we are limited to a batch size of 12 to fit in GPU memory.

We trained the model for 3 epochs using the AdamW optimizer with a linear learning-rate decay starting from lr = 1e-5, a batch size of 12 with 4 steps of gradient accumulation (larger batches don’t fit on a T4 GPU), a weight decay of 3e-4 for regularization, and a few warmup steps. We use the default values of beta1, beta2, and epsilon for the AdamW optimizer.
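Translated into HuggingFace Trainer arguments, this configuration looks roughly like the following (dataset and model setup omitted; the warmup value is a placeholder since the exact number isn’t given):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="bert-chat-moderation",
    num_train_epochs=3,
    learning_rate=1e-5,
    lr_scheduler_type="linear",
    per_device_train_batch_size=12,
    gradient_accumulation_steps=4,  # effective batch size of 48
    weight_decay=3e-4,
    warmup_steps=100,  # placeholder for "a few warmup steps"
)
# The Trainer's default optimizer is AdamW with default beta1, beta2 and epsilon.
```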

6. Evaluation — How do we compare models?

Evaluation is a tricky process.

Training evaluation

We use a test set to measure classification metrics such as precision, recall, F1 score, and AUC. However, the ground-truth labels in our test set are not 100% accurate, which introduces noise and makes models hard to compare: the metric differences between models are often smaller than what is required for statistical significance, so an observed gap may be due to mislabeled test data rather than a true performance difference.

Inference evaluation

To evaluate model performance in production, we track metrics such as precision and recall, but also customer-support contacts and the number of users targeted by bad messages.

We achieve this by having human labelers review a sample of the model’s predictions.

7. Serving — How to serve this model in production?

Even though the model needs to be trained on a GPU, we can perform inference on CPU for real-time moderation (with a 95th-percentile latency below 100 ms).

Our API, written in Python with FastAPI, is served on Kubernetes for optimal service scaling and cost optimization.
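A stripped-down sketch of what such a service can look like; the endpoint name and response schema are illustrative:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ChatMessage(BaseModel):
    text: str

def classify(text: str) -> tuple[str, float]:
    """Placeholder for the real inference (BERT, then CatBoost for scam vs. circumvention)."""
    return "safe", 0.99

@app.post("/moderate")  # illustrative endpoint name
def moderate(message: ChatMessage) -> dict:
    label, confidence = classify(message.text)
    return {"label": label, "confidence": confidence}
```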

8. Conclusion

Chat moderation is crucial to prevent GMV loss due to platform circumvention and improve user experience.

Data is often the main driver of success. More data and granular labels can truly unlock great performances. In our case, data collection, processing, and relabeling were our biggest challenges.

Through continuous refinement and evaluation, our AI-powered moderation system can adapt to evolving threats, maintaining a secure and engaging environment for all users.

Even though BERT is not the latest LLM out there, it is a better fit for our use case: smaller, faster, a bidirectional encoder, etc. Large GenAI models can be a good way to get started on a subject, but keep in mind that there are many other great models designed specifically for your use case.

--
