Efficiently flag toxic or harmful texts and prompts: create a simple text-based classifier via the csfy tool

Sean Ryan
8 min read · Jun 9, 2024

Identifying harmful or toxic prompts is important to prevent an AI service such as an LLM from generating problematic content. While this can be handled by a pre-trained LLM (OpenAI tends to have strong ‘protective’ measures) or via prompt engineering, those approaches can be expensive to host or raise concerns around the intellectual property of users’ data.

A cheaper yet performant option is to train an SLM (Small Language Model) such as DistilBERT to classify texts as neutral, harmful or toxic.

[Image generated by an AI — Adobe Firefly]

Classifying texts can be used either to flag certain texts as negative and exclude them, or to decide how to further process them. A simple SLM such as BERT can detect ‘toxic’ language, or detect whether a user is attempting to misuse a service, for example by asking an LLM (Large Language Model) to perform a harmful action such as recommending how to hack a server.
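To make this concrete, here is a minimal sketch of what using such a classifier looks like at inference time with the Hugging Face transformers pipeline. The checkpoint name unitary/toxic-bert is only an illustrative public toxicity model (it is not part of the article or the csfy tool); in practice you would load whichever fine-tuned model you have trained.

```python
# Minimal inference sketch: flag toxic prompts with a fine-tuned classifier.
# "unitary/toxic-bert" is an illustrative public checkpoint, used here as a placeholder.
from transformers import pipeline

classifier = pipeline("text-classification", model="unitary/toxic-bert")

prompts = [
    "What is the weather like in Amsterdam today?",
    "Explain step by step how to hack into a server.",
]

for prompt in prompts:
    result = classifier(prompt)[0]  # a dict with 'label' and 'score'
    print(f"{prompt!r} -> {result['label']} ({result['score']:.2f})")
```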

What is BERT?

BERT is a smaller language model (about 500 MB) which can be hosted cheaply with high performance. Although it was originally pre-trained to predict masked words and whether one sentence follows another, it can be fine-tuned to predict a label for a whole text.
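Fine-tuning adds a small classification head on top of the pre-trained model and trains it on labelled examples. Below is a minimal sketch of that workflow, assuming CSV files with text and label columns (0 = neutral, 1 = toxic); the file names, model choice and hyperparameters are placeholders rather than the article’s exact setup, and this is roughly the kind of training loop that a tool like csfy is intended to simplify.

```python
# Minimal fine-tuning sketch: turn DistilBERT into a binary neutral/toxic classifier.
# File names, column names and hyperparameters are illustrative placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Expects CSV files with "text" and "label" columns (0 = neutral, 1 = toxic).
dataset = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="toxic-classifier",
        num_train_epochs=3,
        per_device_train_batch_size=16,
    ),
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)
trainer.train()
trainer.save_model("toxic-classifier")  # reload later with the pipeline shown earlier
```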
