Text Summarization and Keyword Extraction from customer reviews in French (Part 2/ 3)

Kanawut Kaewnoparat
7 min read · Feb 24, 2023

After seeing how to use a pre-trained model to tackle the summarization task on French reviews in the first part of this series, we are going to dive deeper into extracting only the key information from each review.

In this keyword extraction task, we aim to extract:

  1. Top 3 Key Nouns: only the most relevant nouns that represent the review
  2. Proper Nouns: in case reviews talk specifically about city names, branches, or product brands, to gain more granular information for the analysis
Top 3 key nouns and proper nouns extracted from the review (highlighted in light green), by author

Part of our extraction pipeline is inspired by this great article on Keyword Extraction with BERT. Our whole pipeline is composed of 5 main steps, as shown in the graphic below, with the method on the right side and the output on the left side.

Keyword extraction pipeline by author

STEP 0: Prepare one review for the pipeline showcase

We are going to stick with the same review that we used for the previous summarization task, which is also shown in the graphic at the beginning of the blog.

To recap for non-French speakers like me, the following review is about a lady who intended to buy an iPhone 14, but 1) it was out of stock, so the store had it delivered to a pickup point, only for her to find that 2) the package was empty, and 3) no matter how hard she tried, she could not reach customer service.

Review 
"J'ai commandé un Iphone 14 en magasin chez Carrefour Bonneveine à Marseille. L'appareil n'était pas en stock et le magasin m'a proposé de le faire livrer en relais colis. Lorsque je suis allée le chercher, quelle ne fut pas ma surprise à la maison de constater que celui ci était vide. Je l'ai immédiatement signalé à Carrefour, fourni tous les justificatifs demandés, porté plainte et là, depuis presque un mois rien ne se passe. C'est en cours, c'est en cours.... Mais nous avons plus de 1000 EUR de sorti et pas de téléphone. Le SAV est injoignable, il n'y a aucune réponse à nos multiples sollicitations par mail c'est au mieux de l'incompétence au pire de l'escroquerie. Pour des achats autres qu'alimentaires ce magasin est à fuir"

STEP 1: Word Tokenizer

One of the simplest ways to tokenize phrases or sentences is to use the CountVectorizer from scikit-learn. We fit this vectorizer on input_text (wrapped in a list, since CountVectorizer expects an iterable of documents) and then print the number of output tokens and the token candidates themselves.

from sklearn.feature_extraction.text import CountVectorizer

# CountVectorizer expects an iterable of documents, so we wrap the single review in a list
vectorizer = CountVectorizer().fit([input_text])
candidates = vectorizer.get_feature_names_out()
print(len(candidates))
print(candidates)

STEP 2: French Stop Words

The candidate tokens we have so far are full of stop words, i.e. words that are very common in a given language. Here, we see multiple French stop words such as “au”, “de”, “en”, “ce”, “un”, “la”.

It is recommended to remove these stop words before going further. You can find out more about why and how to remove stop words in this article. The basic intuition is that we want to filter the tokens so that only the most meaningful ones remain for keyword extraction.

We do not want to define all the stop words ourselves. So I searched and found this wonderful GitHub repo that contains French stop words. We will use the requests library to download the file to our local directory.

# We request this file content from the GitHub repo

import requests
from pathlib import Path

if Path('french_stopword.txt').is_file():
    print('already exists')
else:
    print('not existed yet')
    request = requests.get("https://raw.githubusercontent.com/stopwords-iso/stopwords-fr/master/stopwords-fr.txt")
    with open('french_stopword.txt', "wb") as f:
        f.write(request.content)


# opening the file in read mode
my_file = open("french_stopword.txt", "r")

# reading the file
data = my_file.read()
my_file.close()

# replace end-of-line ('\n') with ' ' and split on spaces
# to get one stop word per list entry
french_stopwords_list = data.replace('\n', ' ').split(" ")
Example of French stop words from Github
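As a text alternative to the screenshot above, a quick check of the loaded list (the exact contents depend on the file at the time of download) could be:

# peek at how many stop words we loaded and what the first few look like
print(len(french_stopwords_list))
print(french_stopwords_list[:10])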

Now that we have read these French stop words into a list variable `french_stopwords_list`, we can pass this list to the CountVectorizer's stop_words parameter to skip these stop word tokens.

vectorizer = CountVectorizer(stop_words=french_stopwords_list).fit([input_text])
candidates = vectorizer.get_feature_names_out()

Great! Now we can see that the number of candidate tokens has shrunk from 89 to only 46. But we can make this candidate list even better by filtering out other parts of speech and keeping only NOUN and PROPN tokens. Let’s move on to the next step, introducing `spaCy`.

STEP 3: spaCy for Named Entity Recognition

To achieve the goal of extracting only noun and proper noun tokens, we need to install spaCy, one of the most useful NLP libraries for tasks such as Named Entity Recognition and part-of-speech tagging.

On Google Colab, we can easily install it by running pip install. I selected fr_core_news_md, which is the medium-size model for the French language.

!python -m spacy download fr_core_news_md

import spacy
nlp = spacy.load("fr_core_news_md")
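Before writing the extraction function, it can help to see the part-of-speech tags spaCy assigns to a couple of tokens; the example words below are picked just for illustration:

# quick sanity check: print the POS tag spaCy assigns to a few example tokens
for word in ["téléphone", "marseille", "magasin"]:
    for token in nlp(word):
        print(token.text, token.pos_)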

Now we declare a function that receives the spaCy model and a list of candidate tokens, and returns two lists: a list of NOUN tokens and another list of PROPN tokens.

def slice_only_noun_token(ner, token_list):

    noun_slice_list = []
    proper_noun_slice_list = []
    for word_idx in range(len(token_list)):
        doc = ner(token_list[word_idx])  # we loop through each token in the candidate list

        for token in doc:  # then we check if each token falls into NOUN / PROPN parts of speech
            if token.pos_ == 'NOUN':
                noun_slice_list.append(token.text)
            elif token.pos_ == 'PROPN':
                proper_noun_slice_list.append(token.text)

    return noun_slice_list, proper_noun_slice_list


noun_list, proper_list = slice_only_noun_token(ner=nlp,
                                               token_list=candidates)
print(len(noun_list), len(proper_list))

With this part-of-speech filtering, we are one step closer to our final goal of keyword extraction: we have managed to filter out numerical and non-noun tokens, leaving 24 candidates, well under a third of our original 89 tokens!

As for the proper nouns, we extracted one token, “Marseille”, a city name. Interestingly, the result obtained on Colab at the time of writing and the result shown when the demo is hosted on Hugging Face are not the same: with the hosted demo web interface, the proper noun list also includes “iphone”, “bonneveine”, and “stock”.

extracted Noun and Proper Noun from spaCy NER by author

STEP 4: Word Embedding for similarity computation

Even after step 3, we still have a lot of candidate noun tokens left, 24 to be exact in our example. But we only want the top 3 keywords from this remaining list. To do that, we can compute how close each token is to the original text and then select the 3 closest ones.

But first we need to transform these tokens into numerical features. We will use a pre-trained BERT model to embed them. We perform step 4 and the final step only on the noun list, because the proper noun list contains only a few words, as shown in the previous step. But if you would like to do the same for the proper noun list, just follow the same code.

!pip install sentence_transformers

from sentence_transformers import SentenceTransformer
model = SentenceTransformer("dangvantuan/sentence-camembert-large")

# then we use the BERT model to encode both the original text and the remaining candidate noun list
text_encoded = model.encode([input_text])
noun_candidate_list_encoded = model.encode(noun_list)
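As a quick sanity check before computing similarities, we can inspect the shapes of the two embedding matrices (the embedding dimension depends on the chosen model):

# text_encoded has shape (1, dim); noun_candidate_list_encoded has shape (num_candidates, dim)
print(text_encoded.shape, noun_candidate_list_encoded.shape)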

STEP 5: Distance computation with cosine similarity

Lastly, once we have the numerical embeddings for the original text and the remaining noun candidate tokens, we compute the pairwise cosine similarity between each token and the original text. The higher the similarity, the more relevant that token is to the text.

from sklearn.metrics.pairwise import cosine_similarity

distances = cosine_similarity(text_encoded, noun_candidate_list_encoded)

We can combine steps 4 and 5 into one function that takes the transformer model, input_text, noun_list and the number of keywords top_n as parameters:

def top_n_extractor(model, input_text, noun_list, top_n=3):
    # encode the full review (wrapped in a list to get a 2D array) and the noun candidates
    text_encoded = model.encode([input_text])
    noun_candidate_list_encoded = model.encode(noun_list)
    distances = cosine_similarity(text_encoded, noun_candidate_list_encoded)
    # argsort orders indices from least to most similar, so the last top_n entries are the keywords
    keywords = [noun_list[index] for index in distances.argsort()[0][-top_n:]]

    return keywords
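As a usage sketch, assuming the model, input_text and noun_list from the previous steps, we can call the function like this:

# extract the 3 noun tokens most similar to the full review
top_keywords = top_n_extractor(model, input_text, noun_list, top_n=3)
print(top_keywords)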

And here are our final top 3 keywords extracted from the original review input…

Analysis of the result

From the initial review, we eventually ended up with the 3 most relevant keywords: “téléphone” > phone, “magasin” > shop/store, and “sav” > Service Après-Vente (after-sales service), along with one proper noun, “Marseille”.

Although not every semantic element is preserved, we can still infer from these keywords that this review has something to do with “after-sales service with the shop in Marseille”, which is not a bad keyword extraction at all.

Coming up next in the last blog of this series, we will build an ML application using the Gradio library to showcase a demo that performs these two tasks, 1) summarization and 2) keyword extraction, given any customer review text in French.

This blog was created as part of the author’s contribution during his internship at Hexamind.

Hexamind is a startup located in Paris that aims to optimize organizations’ potential in the domain of customer relationships using NLP technologies.


Kanawut Kaewnoparat

Blogging to store and reflect on my learning journey in data science.