French review analysis at Hexamind — Part 2: classify the reviews using a transformer model
Recall from our last tutorial how we scraped customer reviews of the company Carrefour. We saved the reviews into a CSV file and labelled them into four classes (Buying Experience, Product, Delivery Mode and After Sales).
The whole code can be found on GitHub.
In this tutorial we will explore our data, compute some statistics to better understand it, and train a classifier for our reviews. So, let’s go :)
1- Do some statistics around our data
Recall that each of our reviews can belong to one or several of four classes. This is what we call a multi-label classification.
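To make this concrete, here is a purely illustrative sketch of what multi-label rows look like: each review gets a 0/1 indicator per class, and several indicators can be 1 at once. The review texts and values below are made up for illustration and are not taken from our dataset.
import pandas as pd

# Hypothetical rows, for illustration only: one review can activate several labels.
example = pd.DataFrame({
    "reviews": ["Livraison rapide mais produit abîmé",   # made-up review
                "Commande en ligne simple et claire"],    # made-up review
    "Buying Experience": [0, 1],
    "Product": [1, 0],
    "Delivery Mode": [1, 0],
    "After Sales": [0, 0],
})
print(example)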
Let us first prepare our environment by importing the necessary libraries:
import pandas as pd
from google.colab import drive
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud,STOPWORDS
import numpy as np
Then we read our annotated CSV file. Note that we use Google Colaboratory in this work and put the CSV inside a folder created in our Drive. As of today, Google Colab can only access the Drive of the person running the code, so by mounting Google Drive in Colab, only the current user’s Drive is visible. Therefore, to execute the code, the folder that contains both the code and the data should be placed directly in one’s Drive. You should also authorize Colab to access your Drive.
drive.mount('/content/drive')
# specifying the working directory
main_Folder="/content/drive/My Drive/notebooks/"
data_Folder = main_Folder+'data/'
# folder of training
csv_data=data_Folder+'20230220_selected_df.csv'
#csv_indices = data_Folder+ 'with_embed.csv'
df_data = pd.read_csv(csv_data)
Let’s get the number of reviews we have:
print("Number of labelled reviews is ",df_data.shape[0])
This shows that we have 276 labelled reviews. When labelling our data, we named the classes as follows:
Hence, we will go back to the original naming for a better visualization. Note that during the labelling process, we labelled our reviews not only with the four classes, but also with sub-classes for each class. In this tutorial series, we will only use the four super-classes.
print("The fourth column are the reviews text whilst the last four columns are the labels (categories) of the reviews. Note that one review can belong to several categories")
df_data = df_data.rename(columns={'clean_BE': 'Buying Experience',
                                  'clean_PD': 'Product',
                                  'clean_DM': 'Delivery Mode',
                                  'clean_AS': 'Customer Service',
                                  'Product': 'Product super_class',
                                  'Buying experience': 'Buying experience super_class',
                                  'Delivery Mode': 'Delivery Mode super_class',
                                  'After Sales': 'After Sales super_class'})
Recall that our reviews are multi-label, so let us look at some statistics about how many categories a review generally covers:
rowSums = df_data.iloc[:,-4:].sum(axis=1)
multiLabel_counts = rowSums.value_counts()
multiLabel_counts = multiLabel_counts.iloc[1:]
sns.set(font_scale = 2)
plt.figure(figsize=(10,4))
ax = sns.barplot(x=multiLabel_counts.index, y=multiLabel_counts.values)
plt.title("Reviews having multiple labels per review \n")
plt.ylabel('Number of reviews', fontsize=14)
plt.xlabel('Number of labels per review', fontsize=14)
#adding the text labels
rects = ax.patches
labels = multiLabel_counts.values
for rect, label in zip(rects, labels):
    height = rect.get_height()
    ax.text(rect.get_x() + rect.get_width()/2, height + 5, label, ha='center', va='bottom')
plt.show()
Note that we use the seaborn library, a statistical data visualization tool.
The previous histogram shows that most reviews belong to two categories. Let us see the number of reviews we have in each category (the last four columns in our data frame):
categories = list(df_data.columns.values[-4:])
sns.set(font_scale = 1)
plt.figure(figsize=(10,4))
ax = sns.barplot(x=categories, y=df_data.iloc[:,-4:].sum().values)
plt.title("Number of reviews in each category", fontsize=14)
plt.ylabel('Number of reviews', fontsize=14)
plt.xlabel('Review categories ', fontsize=14)
#adding the text labels
rects = ax.patches
labels = df_data.iloc[:,-4:].sum().values
for rect, label in zip(rects, labels):
    height = rect.get_height()
    ax.text(rect.get_x() + rect.get_width()/2, height + 5, label, ha='center', va='bottom', fontsize=18)
plt.show()
The result is as follows:
We note that most of Carrefour’s clients leave reviews about the buying experience. Hence, we would like to check the most common words in reviews of the category Buying Experience. The first step is to concatenate all of the reviews that are labelled Buying Experience and have a negative rating into a single text, which we then pass to the WordCloud method. We do the same for positive and neutral ratings, and repeat the process for all the other categories. This might not be the best way to do it, but it is simple, intuitive and fast.
As in most sentiment analysis work, we group the reviews with a rating of 1 or 2 as negative, 4 or 5 as positive and 3 as neutral.
# First prepare the texts for each of the 12 analyses we want to make
Positive_BuyingExperience = ""
Positive_Product = ""
Positive_DeliveryMode = ""
Positive_CustomerService = ""
Negative_BuyingExperience = ""
Negative_Product = ""
Negative_DeliveryMode = ""
Negative_CustomerService = ""
Neutral_BuyingExperience = ""
Neutral_Product = ""
Neutral_DeliveryMode = ""
Neutral_CustomerService = ""
# Then, concatenate the texts for each of the categories and ratings (sentiments)
for i in range(len(df_data)):
    if not pd.isnull(df_data["reviews"].iloc[i]):
        if df_data["ratings"].iloc[i] < 3:
            if df_data["Buying Experience"].iloc[i] == 1:
                Negative_BuyingExperience += df_data["reviews"].iloc[i]
            if df_data["Product"].iloc[i] == 1:
                Negative_Product += df_data["reviews"].iloc[i]
            if df_data["Delivery Mode"].iloc[i] == 1:
                Negative_DeliveryMode += df_data["reviews"].iloc[i]
            if df_data["Customer Service"].iloc[i] == 1:
                Negative_CustomerService += df_data["reviews"].iloc[i]
        elif df_data["ratings"].iloc[i] >= 4:
            if df_data["Buying Experience"].iloc[i] == 1:
                Positive_BuyingExperience += df_data["reviews"].iloc[i]
            if df_data["Product"].iloc[i] == 1:
                Positive_Product += df_data["reviews"].iloc[i]
            if df_data["Delivery Mode"].iloc[i] == 1:
                Positive_DeliveryMode += df_data["reviews"].iloc[i]
            if df_data["Customer Service"].iloc[i] == 1:
                Positive_CustomerService += df_data["reviews"].iloc[i]
        else:
            if df_data["Buying Experience"].iloc[i] == 1:
                Neutral_BuyingExperience += df_data["reviews"].iloc[i]
            if df_data["Product"].iloc[i] == 1:
                Neutral_Product += df_data["reviews"].iloc[i]
            if df_data["Delivery Mode"].iloc[i] == 1:
                Neutral_DeliveryMode += df_data["reviews"].iloc[i]
            if df_data["Customer Service"].iloc[i] == 1:
                Neutral_CustomerService += df_data["reviews"].iloc[i]
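As a side note, the same twelve strings can be built more compactly with pandas. Here is a minimal sketch, assuming the same column names as above (reviews, ratings and the four category columns):
def rating_to_sentiment(r):
    # 1-2 -> negative, 3 -> neutral, 4-5 -> positive (same grouping as above)
    if r < 3:
        return "Negative"
    if r >= 4:
        return "Positive"
    return "Neutral"

texts = {}
valid = df_data[df_data["reviews"].notnull()].copy()
valid["sentiment"] = valid["ratings"].apply(rating_to_sentiment)
for category in ["Buying Experience", "Product", "Delivery Mode", "Customer Service"]:
    for sentiment in ["Positive", "Negative", "Neutral"]:
        mask = (valid["sentiment"] == sentiment) & (valid[category] == 1)
        texts[(sentiment, category)] = " ".join(valid.loc[mask, "reviews"])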
Once the text is prepared for each sentiment and each category, let us display the most common words. This gives us an idea of what the customers are happy or unhappy about regarding their buying experience at Carrefour. Note, however, that some French stop words should be removed. Hence, we put all the French stop words in a text file that we will use when preparing the word cloud.
excluded_words=[]
with open('stop_words.txt', 'r') as f:
    txt_content = f.read()
    excluded_words = txt_content.split("\n")
# can use https://www.ranks.nl/stopwords/french for french stop words
Pos_BE = WordCloud(background_color='white', stopwords=excluded_words, max_words=200).generate(Positive_BuyingExperience)
plt.figure(figsize=(10,4))
plt.imshow(Pos_BE)
plt.axis("off")
plt.show();
The result is:
The same process can be applied to any sentiment/category pair. We notice that the reviews are generally about time (jour), requests (demande), shopping (course), delivery (livraison) and so on. Note that some words, such as Carrefour, should be removed. Once we see the resulting word cloud, we can remove any words we think are not relevant by adding them to the stop words text file.
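For completeness, here is a short sketch that loops over all twelve sentiment/category texts built earlier and draws a word cloud for each non-empty one; it only reuses variables already defined above.
all_texts = {
    "Positive / Buying Experience": Positive_BuyingExperience,
    "Positive / Product": Positive_Product,
    "Positive / Delivery Mode": Positive_DeliveryMode,
    "Positive / Customer Service": Positive_CustomerService,
    "Negative / Buying Experience": Negative_BuyingExperience,
    "Negative / Product": Negative_Product,
    "Negative / Delivery Mode": Negative_DeliveryMode,
    "Negative / Customer Service": Negative_CustomerService,
    "Neutral / Buying Experience": Neutral_BuyingExperience,
    "Neutral / Product": Neutral_Product,
    "Neutral / Delivery Mode": Neutral_DeliveryMode,
    "Neutral / Customer Service": Neutral_CustomerService,
}
for title, text in all_texts.items():
    if not text:  # skip empty combinations
        continue
    wc = WordCloud(background_color='white', stopwords=excluded_words, max_words=200).generate(text)
    plt.figure(figsize=(10, 4))
    plt.title(title)
    plt.imshow(wc)
    plt.axis("off")
    plt.show()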
2- Train the head of an LLM to classify the reviews
Our task now is to automatically classify our reviews. This can be done with several machine learning techniques, but for NLP the trend is to use LLMs, or Large Language Models. This, in turn, can follow two strategies: training the whole model (which we will call body training), or only extracting the feature vector representing the text of the review (which we will call the embedding) and using those features to train a classification ML model such as a logistic regression, a k-nearest-neighbours classifier, a Bayes classifier or a neural network.
In this section we will use the head-only approach and check its results. We will also use the Hugging Face library and the distilbert-base-uncased model. Our code is greatly inspired by the notebooks of Lewis Tunstall, Leandro von Werra and Thomas Wolf.
In Google Colab we need to install the Hugging Face libraries we use, namely transformers and datasets:
!pip install -q transformers datasets
and import the libraries we need:
import datasets
import huggingface_hub
import matplotlib.font_manager as font_manager
import torch
import transformers
from IPython.display import set_matplotlib_formats
from sklearn.model_selection import train_test_split
There are many strategies that allow us to do multi-label classification. To keep things simple in this tutorial and to get a basic understanding of transformers, the Hugging Face library and LLM training, we will use a simple one-versus-all classification strategy, i.e. we train a classifier for each of the four categories. The label is hence 1 for the reviews that belong to the category of that particular classifier and 0 for the rest of the reviews. We do it manually in this tutorial to better understand how it works; in the next tutorials we will use libraries that generate the labels automatically. So, let us start with the Buying Experience category. Hence, we will only take the Buying Experience column of the dataframe:
l1 = []
l2 = []
l3 = []
for i in range(len(df_data)):
    if not pd.isnull(df_data["reviews"].iloc[i]):
        if df_data["Buying Experience"].iloc[i] == 1:
            l1.append(df_data["reviews"].iloc[i])
            l2.append(1)
            l3.append("Buying Experience")
        else:
            l1.append(df_data["reviews"].iloc[i])
            l2.append(0)
            l3.append("Not Buying experience")
df_review=pd.DataFrame()
df_review["text"]=l1
df_review["label"]=l2
#df_review["label_name"]=l3
df_review.head()
This data frame looks like this:
The second step is to prepare a dataset in the format used by the transformers methods we will be calling:
from datasets import Dataset, DatasetDict
X_train, X_test = train_test_split(df_review, test_size=0.33, random_state=42)
train = Dataset.from_pandas(X_train)
test = Dataset.from_pandas(X_test)
dataset = DatasetDict()
dataset['train'] = train
dataset['test'] = test
reviews = dataset.remove_columns(['__index_level_0__'])
The dataset looks like this:
Note how we divided our data frame into training and test, then converted it to a dataset.
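To see this structure programmatically (a minimal sketch), printing the DatasetDict displays its splits and the features of each:
print(reviews)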
The next step is to tokenize our reviews (as a batch) so that we can later feed their embeddings (feature vectors) to the classifier. We will be using distilbert-base-uncased.
from transformers import AutoTokenizer
model_ckpt = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
def tokenize(batch):
    return tokenizer(batch["text"], padding=True, truncation=True)
reviews_encoded = reviews.map(tokenize, batched=True, batch_size=None)
The result of the encoding is that the token ids (input_ids) and the attention_mask are added to the dataset:
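To check this directly in code (a quick sketch), we can print the column names and peek at the tokens of the first training review:
print(reviews_encoded["train"].column_names)
# peek at the first tokens of the first training review
print(tokenizer.convert_ids_to_tokens(reviews_encoded["train"]["input_ids"][0][:15]))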
One important step is to convert these columns into tensors. Here we are using PyTorch, hence we set the format of the encoded reviews to torch:
reviews_encoded.set_format("torch",
columns=["input_ids", "attention_mask", "label"])
This tokenization does not give the final features; it only maps the text of the review to token ids in the model’s vocabulary. The next step is therefore to load the model and extract the features: we do a forward pass through the model and take the content of the last layer (the last hidden states) as the feature vector:
from transformers import AutoModel
import time
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModel.from_pretrained(model_ckpt).to(device)
def extract_hidden_states(batch):
    # Place model inputs on the GPU
    inputs = {k: v.to(device) for k, v in batch.items()
              if k in tokenizer.model_input_names}
    # Extract last hidden states
    with torch.no_grad():
        last_hidden_state = model(**inputs).last_hidden_state
    # Return vector for [CLS] token
    return {"hidden_state": last_hidden_state[:, 0].cpu().numpy()}
start_time = time.time()
reviews_hidden = reviews_encoded.map(extract_hidden_states, batched=True)
print('the embedding took :', round(time.time() - start_time), " seconds")
Since we are in Google Colaboratory, a GPU may be available, but access to it can be withdrawn depending on usage, hence we handle both cases. We then load the model, define a function that extracts the hidden states (the feature vectors, or embeddings), and use the map function from the Hugging Face datasets library to speed up the embedding.
To summarize the process we just did:
- The reviews are texts that we transform into a dataset.
- Each review text is tokenized using the DistilBERT tokenizer. The tokenization transforms the text into numbers.
- Each tokenized review is embedded using the DistilBERT model in order to obtain the hidden states, a set of float values that represent the feature vector (a quick shape check is sketched below).
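As a quick sanity check (a sketch, assuming the hidden size of distilbert-base-uncased, 768), we can verify that each review now has a fixed-size feature vector:
import numpy as np

print(np.array(reviews_hidden["train"]["hidden_state"]).shape)  # (number of training reviews, 768)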
Once we have the embeddings, we can do many tasks such as classification, summarization, masking and so on. In our case, we will classify our reviews. Let us first put our data in the format expected by the classifier we chose, a logistic regression:
import numpy as np
from sklearn.linear_model import LogisticRegression
X_train = np.array(reviews_hidden["train"]["hidden_state"])
X_valid = np.array(reviews_hidden["test"]["hidden_state"])
y_train = np.array(reviews_hidden["train"]["label"])
y_valid = np.array(reviews_hidden["test"]["label"])
lr_clf = LogisticRegression(max_iter=3000)
lr_clf.fit(X_train, y_train)
print(lr_clf.score(X_valid, y_valid))
The accuracy of this first classifier is 65%, which is not high:
The result is not high, and a deeper look shows that most of the reviews are predicted as the class Buying Experience, which suggests that the model did not learn correctly. We can suspect this is because, as we have seen earlier, the data is unbalanced and most of the reviews belong to the Buying Experience class. Another possible reason is the model used for both tokenization and embedding, which is an English model while our reviews are in French. We will see in the next tutorials how to use a French model and check the results.
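One way to take this deeper look is a confusion matrix on the validation embeddings; here is a minimal sketch using scikit-learn:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Predict the validation labels with the trained logistic regression
y_pred = lr_clf.predict(X_valid)
cm = confusion_matrix(y_valid, y_pred)
ConfusionMatrixDisplay(cm, display_labels=["Not Buying Experience", "Buying Experience"]).plot()
plt.show()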
3- Train the whole LLM (body) to classify the reviews
First, we define the task we are performing, in our case a classification:
from transformers import AutoModelForSequenceClassification
num_labels = 2
model = (AutoModelForSequenceClassification
.from_pretrained(model_ckpt, num_labels=num_labels)
.to(device))
As we will be training a model from Hugging Face and pushing it back as a checkpoint, we need to create an account on Hugging Face, get a token, and then use that token in our code to be able to push our model to the Hugging Face platform.
from huggingface_hub import notebook_login
notebook_login()
Once executed, we are expected to paste the token (obtained from the settings in the Hugging Face interface) into the field that appears in our notebook:
In the following code we set the model name. This name should match the model repository we create on the Hugging Face platform, because after training our trained model will be pushed to that repository.
from transformers import Trainer, TrainingArguments

batch_size = 32
logging_steps = len(reviews_encoded["train"]) // batch_size
model_name = f"{model_ckpt}-finetuned-reviews"
training_args = TrainingArguments(output_dir=model_name,
                                  num_train_epochs=10,
                                  learning_rate=2e-5,
                                  per_device_train_batch_size=batch_size,
                                  per_device_eval_batch_size=batch_size,
                                  weight_decay=0.01,
                                  evaluation_strategy="epoch",
                                  disable_tqdm=False,
                                  logging_steps=logging_steps,
                                  push_to_hub=True,
                                  log_level="error")
Once the parameters are set, we can train our model (note that the compute_metrics function passed to the Trainer is defined two steps below and must be run before this cell):
from transformers import Trainer

trainer = Trainer(model=model, args=training_args,
                  compute_metrics=compute_metrics,
                  train_dataset=reviews_encoded["train"],
                  eval_dataset=reviews_encoded["test"],
                  tokenizer=tokenizer)
trainer.train();
The result of training looks like this:
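The same loss values can also be retrieved programmatically from the Trainer’s log history; a minimal sketch (trainer.state.log_history is the list of dictionaries the Trainer logs during training):
import matplotlib.pyplot as plt

history = trainer.state.log_history
# entries with "loss" are training logs, entries with "eval_loss" are per-epoch evaluations
train_logs = [(h["epoch"], h["loss"]) for h in history if "loss" in h]
eval_logs = [(h["epoch"], h["eval_loss"]) for h in history if "eval_loss" in h]
plt.plot(*zip(*train_logs), label="training loss")
plt.plot(*zip(*eval_logs), label="validation loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.show()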
We can clearly see the training loss decreasing, but the validation loss does not follow the same decrease. Let us now predict the labels for the test set:
preds_output = trainer.predict(reviews_encoded["test"])
Once we have the predictions, we can compute the classification metrics. The compute_metrics function below computes the accuracy and the weighted F1 score; as noted above, it is the function passed to the Trainer and has to be defined before the Trainer is created:
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    f1 = f1_score(labels, preds, average="weighted")
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc, "f1": f1}
preds_output.metrics
The classification results still seem low, but recall that we only trained for 10 epochs. We could add epochs, change the model, or rebalance the data, which we have seen is unbalanced. We will see all of these techniques in the next tutorials of this series.
The last step is to push our model to the hugging face hub:
trainer.push_to_hub(commit_message="Training completed!")
Conclusion
This tutorial focused on the second step of an NLP system for French review analysis. By the end of this tutorial we have trained two classifiers, both based on the LLM distilbert-base-uncased.
- The first approach trains only the head: we extract the embeddings from the model’s hidden states with a forward pass and train a logistic regression on them.
- The second approach trains the whole model, which we call body training, and pushes the model to the Hugging Face hub.
The whole code of this second part can be found on GitHub.
Thanks for reading and see you in the next tutorial.