How ANWB classifies chat topics using TF-IDF and CatBoost in Python

Published in ANWB data-driven · 10 min read · Jun 12, 2023

Recently the ANWB launched a new energy service which gives members access to the energy market at cost price. Because the service is new, we receive a lot of questions from our members, including via WhatsApp. To improve our services, we want to monitor the issues and questions our members have so that we can prioritise future improvements.

One way to gather insights from these chats is to go through them manually and keep a tally of the topics our members ask about. Obviously this is a very time-consuming activity, and herein lies an opportunity for the ANWB to automate (part of) it.

As a Data Scientist at the ANWB, I was asked to come up with a simple-to-implement model to automate (part of) this categorisation process. Since at that point it was still done by hand, even a small increase in automatic categorisation would result in quite some time saved.

Luckily the business could provide a manually labelled dataset of roughly 3300 conversations. The data contained the combined text received from the customer (usually in Dutch) and the topic that it was labelled as. I could use this dataset to train a model to predict topics of chats.

The approach I have taken for this topic classification use-case is the following:

  1. Clean text, remove stopwords and any non-relevant data (emails / client numbers etc.)
  2. Lemmatize text using the spaCy language model for Dutch
  3. Create features using TfidfVectorizer
  4. Train a CatBoostClassifier model using these features
  5. Evaluate model

At the end I will also show how you can use the TF-IDF vectorizer to extract the top 10 most important keywords per topic.

Below I will go through the steps that I took to clean the data and build a simple initial topic classification model in Python.

Data preparation

Before I begin, I want to note that the results of this project will be used purely to gain insight into the questions and wishes of our members and will not impact any member negatively. Still, due to privacy concerns, I will not display any of the actual conversations. Instead I will use several “dummy” conversations for demonstration purposes. The dummy data has the same structure as the real data: a text column with the combined messages from the customer and a topic column with the label.

There are many different topics our members contact us for but to keep things simple we are only interested in the top 10 most frequently occurring topics for now. We will therefore keep these topics in our dataset and replace all other topics with “overige” (meaning “other”).

# Set any topics not in the top 10 to the value 'overige'
top_topics = (
    df.query('(topic != "overige") & (topic != "onbekend")')
    .topic.value_counts()
    .nlargest(10)
    .index
)
df = df.assign(topic=lambda df_: df_.topic.where(df_.topic.isin(top_topics), 'overige'))
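The chained expression above is fairly dense, so here is a plain-Python sketch of the same idea on made-up labels. The helper name `keep_top_topics` and the example labels are illustrative assumptions and not part of the original code.

```python
from collections import Counter


def keep_top_topics(topics, n=10, fallback="overige", ignore=("overige", "onbekend")):
    """Keep the n most frequent topics and map every other label to `fallback`."""
    # Count only the 'real' topics, mirroring the filter on overige/onbekend
    counts = Counter(t for t in topics if t not in ignore)
    top = {topic for topic, _ in counts.most_common(n)}
    return [t if t in top else fallback for t in topics]


# Made-up labels, purely for illustration
labels = ["startdatum", "startdatum", "inloggen", "onbekend",
          "annuleren", "startdatum", "inloggen"]
relabelled = keep_top_topics(labels, n=2)
print(relabelled)
```

With `n=2` only the two most frequent topics survive; the rare topic and the "onbekend" label both end up as "overige".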

The resulting topic count is as follows:

overige                      1719
startdatum                    333
meternummers                  288
slimme meter                  162
opzeggen oude leverancier     157
190 euro                      147
annuleren                     144
kosten in de app              130
inloggen                      119
automatische incasso           97
zonnepanelen                   90

I have written a function to clean up the text by removing any non-ASCII characters, converting the text to lowercase, removing stopwords and removing non-relevant data using regular expressions. For the stopwords I used a predefined list of stopwords from the nltk library together with an additional user-defined list specific to this use-case.

from nltk.corpus import stopwords
from unidecode import unidecode

# manual_stopwords_nl is the user-defined list of use-case specific stopwords (not shown here)
stopwords_nl = stopwords.words('dutch') + manual_stopwords_nl


def clean_text(df, stopwords):

    # the rules are applied in the order in which they are listed
    regex_rules = {
        # remove linebreaks
        r'\n': ' ',
        # remove return characters
        r'\r': ' ',
        # remove postal code
        r'(?<!\d)\d{4}\s?[a-zA-Z]{2}(?![a-zA-Z])': '',
        # remove email
        r'([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})': '',
        # remove phone numbers
        r'((\+|00(\s|\s?\-\s?)?)31(\s|\s?\-\s?)?(\(0\)[\-\s]?)?|0)[1-9]((\s|\s?\-\s?)?[0-9])((\s|\s?-\s?)?[0-9])((\s|\s?-\s?)?[0-9])\s?[0-9]\s?[0-9]\s?[0-9]\s?[0-9]\s?[0-9]': '',
        # remove bank account info
        r'[a-zA-Z]{2}[\s.-]*[0-9]{2}[\s.-]*[a-zA-Z]{4}[\s.-]*[0-9]{2}[\s.-]*[0-9]{2}[\s.-]*[0-9]{2}[\s.-]*[0-9]{2}[\s.-]*[0-9]{2}[\s.-]*': ' ',
        # remove URLs
        r'http\S+': '',
        # remove .,;:-/ if not between digits
        r'(?<!\d)[.,;:\-/](?!\d)': ' ',
        # remove remaining symbols
        r'[!?#$%&\'()*\+<=>@\\^_`{|}~"\[\]]': ' ',
        # remove dates (day-month-year)
        r'[0-9]{1,2}[-/\s]([0-9]{1,2}|januari|februari|maart|april|juni|juli|augustus|september|oktober|november|december|jan|feb|mar|mrt|apr|mei|jun|jul|aug|sep|sept|okt|nov|dec)([-/\s][0-9]{2,4})?': ' ',
        # remove lidmaatschapsnummer / klantnummer / contractnummer etc.
        r'((lid|lidmaatschap(s)*|klant|contract|relatie)\s*(nummer|nr))*\s*[a-z:-]*\s*(anwb)*[\s-]?[0-9]{5,6}': ' ',
        # remove 'ANWB' in text
        r'\sanwb\s': ' ',
        # remove any remaining non-alphanumeric characters
        r'[^a-zA-Z0-9]': ' ',
        # replace multiple spaces by one
        r'\s+': ' ',
    }

    # one pattern that matches any stopword as a whole word
    stopword_pattern = {'|'.join([r'\b{}\b'.format(w) for w in stopwords]): ''}

    return (df
        # convert to lowercase
        .assign(text_cleaned=lambda df_: df_.text.str.lower())
        # remove accents from letters and remove any non-ascii characters
        .assign(text_cleaned=lambda df_:
            df_.text_cleaned.apply(lambda x: unidecode(x)))
        # remove stopwords
        .assign(text_cleaned=lambda df_:
            df_.text_cleaned.replace(stopword_pattern, regex=True))
        # use regex rules to replace text that we are not interested in
        .assign(text_cleaned=lambda df_:
            df_.text_cleaned.replace(regex_rules, regex=True))
    )

Applying the clean_text() function leaves us with an additional column with the cleaned up text called text_cleaned.
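To get a feeling for what the regex rules do, the sketch below applies a small subset of them (postal code, email, URL and whitespace) to a made-up sentence using only the standard library. The helper `apply_rules`, the dummy text and the final `strip()` are assumptions added for illustration; the real function works on a dataframe column.

```python
import re

# A subset of the rules from clean_text(), applied in the same order
rules = {
    r'(?<!\d)\d{4}\s?[a-zA-Z]{2}(?![a-zA-Z])': '',                    # postal code
    r'([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})': '',  # email
    r'http\S+': '',                                                   # URL
    r'\s+': ' ',                                                      # collapse whitespace
}


def apply_rules(text, rules):
    """Apply each (pattern -> replacement) rule in turn."""
    for pattern, replacement in rules.items():
        text = re.sub(pattern, replacement, text)
    return text


# Made-up, already lowercased text (not a real conversation)
dummy = ("mijn postcode is 1234 ab en mijn email is jan@example.com "
         "zie https://example.com/info")
cleaned = apply_rules(dummy, rules).strip()
print(cleaned)
```

The order of the rules matters: the email rule has to run before the generic symbol rules in the full function, otherwise the "@" would already be gone and the address would no longer be recognised.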

The next step is to lemmatize the cleaned text. By lemmatizing you are converting words to their base form. For example “best” becomes “good” and “walked” becomes “walk”. I have defined the following function to lemmatize the cleaned text:

import spacy

nlp = spacy.load('nl_core_news_lg', exclude=["parser","ner","textcat","custom"])

def lemmatize_text(df):

    return (df
        # lemmatize text
        .assign(text_lemmatized=lambda df_: df_.text_cleaned.apply(
            lambda x: ' '.join([token.lemma_ for token in nlp(x)])))
        # convert to lowercase
        .assign(text_lemmatized=lambda df_:
            df_.text_lemmatized.str.lower())
        # fill any empty cells with the empty string
        .assign(text_lemmatized=lambda df_:
            df_.text_lemmatized.fillna(''))
    )

Make sure that you have already downloaded a pre-trained spaCy language pipeline, in this case for Dutch, using the following command:

python -m spacy download nl_core_news_lg

Running the lemmatize_text() function on our dataframe saves the lemmatized text in the new column text_lemmatized.

The helper function below combines both steps above into one function to be used as a preprocessing step:

def process_data(df, stopwords):
    return (df
        .pipe(clean_text, stopwords=stopwords)
        .pipe(lemmatize_text)
    )

df = process_data(df, stopwords=stopwords_nl)

Building features

Now let’s build a simple classification model to predict the topics for our (actual) chat conversations. First we will define our text column and target and split the data into a train and test set.

from sklearn.model_selection import StratifiedShuffleSplit

# define text column to be used in training
X = df['text_lemmatized']
# define target
y = df['topic']

# split data in train and test set
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2)
train_index, test_index = next(sss.split(X,y))

X_train = X.iloc[train_index]
y_train = y.iloc[train_index]

X_test = X.iloc[test_index]
y_test = y.iloc[test_index]
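The point of a stratified split is that every topic keeps roughly the same share in the train and test set, which matters for the small topics. The sketch below illustrates that idea in plain Python on made-up labels; it is a simplified stand-in and not the algorithm scikit-learn uses internally, and the function name `stratified_split` is hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=0):
    """Split indices so that each label contributes ~test_size of its rows to the test set."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Made-up, imbalanced labels
labels = ["overige"] * 50 + ["startdatum"] * 10 + ["inloggen"] * 5
train_idx, test_idx = stratified_split(labels)
test_counts = Counter(labels[i] for i in test_idx)
print(test_counts)
```

Note that StratifiedShuffleSplit also accepts a random_state argument; setting it makes the split, and therefore the reported scores, reproducible between runs.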

I’m going to use a TF-IDF vectorizer to build my features. I have set a maximum document frequency (max_df) of 0.75, meaning that any words (or word combinations) that occur in more than 75% of all documents are excluded. Additionally, I have set the minimum document frequency (min_df) to 0.001, meaning that any words (or word combinations) occurring in fewer than 0.1% of all documents are excluded as well. The vectorizer considers single words and word combinations of up to four words (n-grams of length 1 to 4) and keeps at most the 3,500 most frequent of these as features.

from sklearn.feature_extraction.text import TfidfVectorizer

# defining the tf-idf vectorizer
tfidf_vectorizer = TfidfVectorizer(
    strip_accents='unicode',
    stop_words=stopwords_nl,
    lowercase=True,
    max_df=0.75,
    min_df=0.001,
    ngram_range=(1,4),
    max_features=3500,
)
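To make the effect of ngram_range, max_df and min_df more tangible, here is a small plain-Python sketch on four made-up documents. It splits on whitespace, which is simpler than the tokenisation scikit-learn applies, and it uses an exaggerated min_df of 0.5 so that the filter is visible on such a tiny corpus; both are assumptions for illustration only.

```python
from collections import Counter


def word_ngrams(text, n_min=1, n_max=4):
    """Return all word n-grams of length n_min..n_max in a text."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]


# Made-up documents
docs = ["slimme meter plaatsen", "slimme meter doorgeven",
        "meter annuleren", "meter doorgeven"]
n_docs = len(docs)

# Document frequency: in how many documents does each n-gram occur?
doc_freq = Counter(term for doc in docs for term in set(word_ngrams(doc)))

max_df, min_df = 0.75, 0.5  # min_df is exaggerated for this toy corpus
kept = sorted(term for term, count in doc_freq.items()
              if min_df * n_docs <= count <= max_df * n_docs)
print(kept)
```

The word "meter" occurs in every document and is dropped by max_df, while n-grams that occur in only one document are dropped by min_df. What remains are the terms that can actually help to tell documents apart.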

Model

As the model I chose the CatBoostClassifier. With a CatBoost model you can assign a weight to each class (topic), which is useful in the case of an imbalanced dataset. Here I have weighted each class inversely proportional to how often it occurs. I will use both the TF-IDF vectorizer and the CatBoostClassifier as steps within a scikit-learn pipeline.

import numpy as np
from sklearn.utils.class_weight import compute_class_weight
from catboost import CatBoostClassifier
from sklearn.pipeline import Pipeline

# defining our classifier
classes = np.unique(y_train)
weights = compute_class_weight(class_weight='balanced',
                               classes=classes,
                               y=y_train)
class_weights = dict(zip(classes, weights))

clf = CatBoostClassifier(class_weights=class_weights, verbose=False)

# defining our pipeline with a preprocessing step and a classifier step
pipeline = Pipeline(steps=[
    ('vectorizer', tfidf_vectorizer),
    ('classifier', clf)
])
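For reference, the 'balanced' option of compute_class_weight assigns each class the weight n_samples / (n_classes * class_count). The sketch below works this out by hand for the topic counts shown earlier. Keep in mind that those counts cover the full dataset, whereas the pipeline computes the weights on the training split only, so the actual numbers will differ slightly.

```python
# Topic counts of the full dataset, as listed earlier in this post
topic_counts = {
    "overige": 1719,
    "startdatum": 333,
    "meternummers": 288,
    "slimme meter": 162,
    "opzeggen oude leverancier": 157,
    "190 euro": 147,
    "annuleren": 144,
    "kosten in de app": 130,
    "inloggen": 119,
    "automatische incasso": 97,
    "zonnepanelen": 90,
}

n_samples = sum(topic_counts.values())
n_classes = len(topic_counts)

# 'balanced' weighting: n_samples / (n_classes * class_count)
balanced_weights = {topic: n_samples / (n_classes * count)
                    for topic, count in topic_counts.items()}

for topic, weight in balanced_weights.items():
    print(f"{topic:<28}{weight:.2f}")
```

The large overige class gets a weight well below 1, while the smallest topics get a weight above 3, so a mistake on a rare topic costs the model more than a mistake on overige.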

Now I can use this pipeline to train the model on actual data and use it to make predictions.

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

Evaluation

To evaluate the performance of the model I predicted on the test set. I have printed the classification report below. We see a weighted-average F1-score of 0.70, which I think is not too bad for a first model. The model performs particularly well on the 190 euro topic but not so well on the topic opzeggen oude leverancier.

from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))

We can also visualise these results by plotting a confusion matrix. The confusion plot below is normalised across the Predicted axis, meaning that each column adds up to 1. This means that the cells on the diagonal show the precision per topic.

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

disp = ConfusionMatrixDisplay.from_predictions(y_test, y_pred,
    normalize='pred',
    values_format='.1f',
    text_kw={'fontsize':10}
);
plt.title('Confusion plot\nNormalised vertically (precision)')
disp.ax_.get_images()[0].set_clim(0, 1);
plt.xticks(rotation=45, ha='right');

We can also plot the confusion plot normalised across the True axis (recall).

disp = ConfusionMatrixDisplay.from_predictions(y_test, y_pred,
    normalize='true',
    values_format='.1f',
    text_kw={'fontsize':10}
);
plt.title('Confusion plot\nNormalised horizontally (recall)')
disp.ax_.get_images()[0].set_clim(0, 1);
plt.xticks(rotation=45, ha='right');
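To make the difference between the two normalisations concrete, here is a small hand calculation on a made-up 2x2 confusion matrix, with true labels in the rows and predicted labels in the columns. The counts are invented for illustration and are not results from the model.

```python
# Rows are true labels, columns are predicted labels (made-up counts)
labels = ["startdatum", "meternummers"]
cm = [[8, 2],
      [4, 6]]
n = len(labels)

col_sums = [sum(row[j] for row in cm) for j in range(n)]
row_sums = [sum(row) for row in cm]

# normalize='pred': divide each cell by its column total
by_pred = [[cm[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
# normalize='true': divide each cell by its row total
by_true = [[cm[i][j] / row_sums[i] for j in range(n)] for i in range(n)]

# The diagonal holds precision and recall respectively
precision = {labels[k]: by_pred[k][k] for k in range(n)}
recall = {labels[k]: by_true[k][k] for k in range(n)}
print(precision)
print(recall)
```

In this example startdatum has a recall of 0.8 but a precision of only about 0.67, because a number of meternummers conversations were also predicted as startdatum. The off-diagonal cells show where those mistakes come from.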

From the confusion plots we can notice a couple of things. The model has a high precision and recall for 190 euro. There are several topics that the model cannot quite distinguish: startdatum, opzeggen oude leverancier and meternummers are often confused with each other, and the same goes for slimme meter and meternummers. These confusions are understandable since the topics are quite close to each other and might occur in tandem. One could dive into this more deeply and improve the performance on these topics by merging some of them, gathering more data on them, adding business rules and/or improving the cleaning step.

Another thing to note is that many conversations are predicted as overige while in reality they had a different label. For this use-case we do not mind some conversations being mislabelled as overige, as long as we keep the mislabelling between the other topics to a minimum. Conversations that are labelled as overige will be labelled manually anyway.

Predictions

Though we cannot show the real texts with predictions, running the model on our dummy dataset gives an indication of what the model predicts. The model correctly predicted 4 out of 5 texts and mislabelled one as overige, which means that this one will be labelled manually at a later moment.

Bonus: extracting top 10 keywords per topic

One interesting aspect of using a TF-IDF vectorizer is that one can extract the top-n most indicative keywords according to the term-frequency-inverse-document-frequency statistic. This can provide a useful insight for several applications, for example when designing chatbots or when you want to perform some kind of text classification using business rules or some sort of “dictionary” list.
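As a reminder of what this statistic measures, the sketch below computes TF-IDF weights by hand for a few made-up tokenised documents. It follows the formula scikit-learn uses with its default settings (smooth_idf=True and norm='l2'), namely idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation per document. The documents and the helper name `tfidf` are assumptions for illustration.

```python
import math
from collections import Counter

# Made-up tokenised documents
docs = [["slimme", "meter"], ["meter", "doorgeven"], ["contract", "annuleren"]]
n_docs = len(docs)

# Document frequency: number of documents that contain each term
doc_freq = Counter(term for doc in docs for term in set(doc))


def tfidf(doc):
    """TF-IDF weights for one document, with smoothed idf and L2 normalisation."""
    term_freq = Counter(doc)
    raw = {term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
           for term, count in term_freq.items()}
    norm = math.sqrt(sum(value ** 2 for value in raw.values()))
    return {term: value / norm for term, value in raw.items()}


weights = tfidf(docs[0])
print(weights)
```

In the first document "slimme" gets a higher weight than "meter", because "meter" also occurs in another document and is therefore less specific.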

In order to print the top-10 most indicative keywords per topic I first fitted another TF-IDF vectorizer, this time on the entire dataset, since I’m not training a model here and therefore do not need to set aside a test set.

tfidf_vectorizer = TfidfVectorizer(
    strip_accents='unicode',
    stop_words=stopwords.words('dutch')+manual_stopwords_nl,
    lowercase=True,
    max_df=0.75,
    min_df=0.001,
    ngram_range=(1,4),
    max_features=3500,
)

tfidf_vectorizer.fit(X, y)

I then used this vectorizer to transform the dataset into a matrix with the TF-IDF ratings. The column names will be the 3500 features (word combinations) and the index our target values (the topics). We will then group-by the index and sum along the rows and transpose the result. This will leave us with a dataframe with the topics as our columns, the keywords as our index and in each cell a numerical value that represents how indicative that keyword is for the corresponding topic.

import pandas as pd

# Get transformed dataframe with correct feature names as column names
tfidf_out = pd.DataFrame.sparse.from_spmatrix(
    tfidf_vectorizer.transform(X),
    columns=tfidf_vectorizer.get_feature_names_out(),
)

# set targets to be the index
tfidf_out.index = y
# groupby topic and sum over the rating to get a rating per word per topic
tfidf_out = tfidf_out.groupby(tfidf_out.index).sum().T
tfidf_out

If we then take the top 10 highest rated keywords for each column we will get a list of the most indicative keywords per topic.

import numpy as np

# choose number of keywords to show
top_n_keywords = 10
n_topics = tfidf_out.columns.shape[0]

# create dataframe with top n most indicative keywords per topic
pd.DataFrame(tfidf_out.index.values[np.argsort(-tfidf_out.values, axis=0)[:top_n_keywords,:n_topics]].copy(),
    columns=tfidf_out.columns,
    index=[f'keyword_{n}' for n in range(1,top_n_keywords+1)])
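The group-by, sum and sort steps above can be hard to follow in matrix form, so here is the same idea as a plain-Python sketch: add up the TF-IDF scores per keyword within each topic and then rank the keywords. The scores, keywords and the helper `top_keywords` are made up for illustration.

```python
from collections import defaultdict

# One entry per conversation: (topic, {keyword: tf-idf score}); numbers are made up
rows = [
    ("inloggen", {"wachtwoord": 0.8, "app": 0.6}),
    ("inloggen", {"inloggen": 0.9, "app": 0.4}),
    ("annuleren", {"annuleren": 0.7, "contract": 0.7}),
    ("annuleren", {"contract": 0.5, "app": 0.2}),
]

# Sum the scores per keyword within each topic (the group-by and sum step)
totals = defaultdict(lambda: defaultdict(float))
for topic, scores in rows:
    for keyword, score in scores.items():
        totals[topic][keyword] += score


def top_keywords(scores, n):
    """Return the n keywords with the highest summed score."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [keyword for keyword, _ in ranked[:n]]


top = {topic: top_keywords(scores, 2) for topic, scores in totals.items()}
print(top)
```

Because the scores are summed, a keyword that appears in many conversations of a topic with a moderate score can outrank a keyword that appears once with a high score, as "app" does for inloggen here.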

Conclusion

Using the method above we were able to predict the topic of chat conversations. After implementation, this model is already saving multiple hours of manual labelling per week.

This very simple model is a first step in making our services more automated. A possible next step, one that I’m particularly interested in, is the application of Large Language Models (LLMs) to use-cases like this. A very popular example of this is ChatGPT, which is able to hold entire conversations and summarize large pieces of text. It would be very interesting to see how accurately it would be able to predict the topic of conversations. That might be a fun topic for a future blog post!

References

If you want to learn more about method chaining using Pandas, here’s a blogpost that explains it very well.

Thanks for reading! From the Analytics Center of Excellence at ANWB

Author: Yalda Mohammadian | Data Scientist
