Unleashing the Power of Natural Language Processing: Restaurant Review Summarizer

Salim Kılınç
İstanbul Data Science Academy
8 min read · Dec 5, 2023

Hello everybody.

For our fourth project at Istanbul Data Science Academy’s Data Science Bootcamp, each member of Team-2 chose to work independently. In this article, I present my project, in which I went through all the steps of a topic modeling workflow.

In this project, which I simply call ‘Restaurant Review Summarizer’, a splendid restaurant aims to thoroughly examine customer reviews on its online platform, sorting them into distinct themes for a more detailed analysis. They sought my assistance in this effort, and I initiated the project by examining sample restaurant reviews on tripadvisor.com.

Web Scraping

During the initial phase, I scraped tripadvisor.com using BeautifulSoup, a skill acquired in the second project. Since the methods were the same as those detailed in my second-project article, I won’t reiterate them here. In this phase, I collected 210 links to restaurant pages from 7 search-result pages, and then scraped a total of 3,424 customer reviews from those restaurants’ pages.
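For reference, a minimal sketch of the general scraping pattern is below; the URL variables and CSS selectors are placeholders rather than the exact ones from my notebook.

import requests
from bs4 import BeautifulSoup

# Sketch only: collect restaurant links from a search-result page, then pull
# the review texts from each restaurant page. Selectors are placeholders.
def get_restaurant_links(search_url):
    soup = BeautifulSoup(requests.get(search_url).text, 'html.parser')
    return [a['href'] for a in soup.select('a.restaurant-link')]

def get_reviews(restaurant_url):
    soup = BeautifulSoup(requests.get(restaurant_url).text, 'html.parser')
    return [p.get_text(strip=True) for p in soup.select('p.review-text')]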

Data Storage

In recent projects, there has been a shift towards storing data in NoSQL databases instead of on local machines. To underscore the significance of NoSQL technology, we were tasked with storing our data in MongoDB for this project. I saved the documents acquired through web scraping as a JSON file and loaded them into MongoDB using the MongoDB Compass tool.
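I did the import through the Compass GUI, but the same thing could be done programmatically with pymongo; a rough sketch, where the database and collection names are my assumptions rather than the ones I actually used:

import json
from pymongo import MongoClient

# Sketch of a programmatic alternative to the Compass import.
# Database and collection names are assumptions.
client = MongoClient('mongodb://localhost:27017/')
collection = client['restaurant_db']['reviews']

with open('reviews.json', 'r', encoding='utf-8') as f:
    documents = json.load(f)  # a list of review documents

collection.insert_many(documents)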

Data Preprocessing

After retrieving the documents stored in MongoDB into the Jupyter notebook environment using the pymongo library, I commenced the pivotal stage: pre-processing. The raw reviews at this point were messy and varied, and a large number of diverse documents required cleaning up.
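The retrieval step looked roughly like this, again with assumed connection details and collection names:

import pandas as pd
from pymongo import MongoClient

# Pull the stored reviews back out of MongoDB into a DataFrame.
client = MongoClient('mongodb://localhost:27017/')
collection = client['restaurant_db']['reviews']
df = pd.DataFrame(list(collection.find({}, {'_id': 0})))  # drop MongoDB's ObjectId field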

Removal of HTML Tags, Alphanumeric, Punctuation, and Extra Spaces

Initially, I eliminated HTML tags, punctuation marks, and non-English alphabet characters, except for specific cases, which I’ll address shortly. Additionally, I converted all the documents to lowercase, resulting in somewhat cleaner documents.

df['review'] = df['review'].str.replace("<br />","")
df['review'] = df['review'].str.replace("\n\n","")
df['review'] = df['review'].str.replace("…"," ")
alphanumeric = lambda x: re.sub('\w*\d\w*', ' ', x)
punc_lower = lambda x: re.sub('[%s]' % re.escape(string.punctuation), ' ', x.lower())
single_spaces = lambda x: re.sub(' +', ' ', x)
df['review'] = df.review.map(alphanumeric).map(punc_lower).map(single_spaces)
df

Removal of Repeating Letters

Since these documents are restaurant reviews, they contained exaggerated words with runs of repeated letters (for example, ‘soooo good’). As such words are not part of the stopword lists, they wouldn’t be removed when eliminating stopwords. Therefore, I devised a regex-based function to collapse those runs to a single letter. I’ll address them further in the upcoming steps, though there is a more accurate solution that I’ll demonstrate later.

def remove_repeating_letters(s):
    words = s.split()
    result = []
    for word in words:
        # Collapse any run of three or more identical characters into a single one
        new_word = re.sub(r'(.)\1{2,}', r'\1', word)
        result.append(new_word)
    return ' '.join(result)

df['review'] = df['review'].map(remove_repeating_letters)
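For example, applied to a typically enthusiastic review phrase:

remove_repeating_letters("the food was sooooo good")
# -> 'the food was so good'  ("good" is untouched: only runs of three or more letters are collapsed)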

Removal of Emojis and Dropping Rows which Contain Non-English Text

Next, I employed the emoji library to eliminate emojis from the lines containing these symbols. Initially, when I removed punctuation, emojis were not included in the removal process. Therefore, I’m handling the removal of emojis in this step.

df['review'] = df['review'].apply(lambda x: emoji.replace_emoji(x, replace=''))
df['review'] = df.review.map(single_spaces)

I utilized the langid library to identify lines with non-English text and promptly excluded them from my dataset. There were only 7 such lines, resulting in minimal data loss.

df['language'] = df['review'].apply(lambda x: langid.classify(x)[0])
df = df[df['language'] == 'en']

Lemmatizing and Filtering Only English Words

In the subsequent step, I employed lemmatization to convert all plural words into their singular form. This is crucial, because otherwise the singular and plural forms of the same word would be treated as entirely different tokens by the model.

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

def lemmatize_word(word):
    return lemmatizer.lemmatize(word, pos='n')

df['review'] = df['review'].apply(lambda x: ' '.join([lemmatize_word(word) for word in x.split()]))

Subsequently, I utilized the word set from the nltk library to eliminate all non-English words from each document.

nltk.download('words')
english_words = set(nltk.corpus.words.words())  # build the vocabulary once, not per word

def lemmatize_and_filter(word):
    lemma = lemmatizer.lemmatize(word, pos='n')
    return lemma if lemma in english_words else ''

df['lem_review'] = df['review'].apply(lambda x: ' '.join([lemmatize_and_filter(word) for word in x.split()]))

The goal here is not to exclude foreign languages; we’ve already addressed that. Instead, the objective is to filter out misspelled or nonsensical words. In fact, once this is accomplished, many of the preceding steps become unnecessary. However, I proceeded with those steps for experimental purposes.

Removal of Stopwords and Unimportant Words

Now it’s time to eliminate stopwords. In this step, I utilized nltk’s stopwords set and identified additional stopwords from the internet that were relevant to my project. Removing these stopwords proved highly beneficial during the modeling phase. For instance, words like “restaurant” appear in every document but don’t contribute meaningful information to the sentences. Since each sentence is about a restaurant, such words were removed in addition to the nltk stopwords set.

from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = stopwords.words('english') + ["a", "lot", "of", "different", "stopwords"]  # plus project-specific words such as "restaurant"
stop_words_set = set(stop_words)

def remove_stopwords(text):
    return ' '.join([word for word in text.split() if word.lower() not in stop_words_set])

df['sw_review'] = df['lem_review'].apply(remove_stopwords)

Removal of Two-Letter Combinations

In this step, I generated all two-letter combinations of the 26 letters of the English alphabet using a for loop and then removed them from my documents, since they hold no significance for my analysis. This required special attention because, apart from a few exceptions, most of these combinations are not part of the stopword sets.

The question that arises here is, if I eliminate all non-English words, won’t they disappear naturally with each transformation step? Firstly, despite removing non-English words, the alterations I make in each step can introduce changes and new words that I may want to filter out. Secondly, considering this is an educational project rather than a real-world application, I aimed to explore various techniques that came to mind, leading me to implement this step.
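The loop itself isn’t shown above, but a sketch of the idea looks like this. Note that I write the result to a new_review column; that column name is an assumption on my part so that it matches the column used in the next step.

from itertools import product

# Generate every two-letter combination ('aa', 'ab', ..., 'zz') and drop them
# as standalone tokens. Writing to 'new_review' is an assumption to match the
# column name used in the next step.
letters = 'abcdefghijklmnopqrstuvwxyz'
two_letter_combos = {a + b for a, b in product(letters, repeat=2)}

def remove_two_letter_combos(text):
    return ' '.join(word for word in text.split() if word not in two_letter_combos)

df['new_review'] = df['sw_review'].apply(remove_two_letter_combos)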

Lemmatizing and Removal of Adverbs

In this stage, I employ lemmatization on all words and additionally eliminate adverbs. Instead of stemming, which applies a stringent transformation and may cause some words to lose their meaning, I opted for lemmatization. Lemmatization provides a more meaningful reduction of words to their roots, aligning better with the requirements of my project.

import spacy

nlp = spacy.load('en_core_web_sm')  # assuming the small English spaCy model

def lemmatize_and_remove_adverbs(sentence):
    doc = nlp(sentence)
    # Lemmatize every token, dropping adverbs entirely
    lemmatized_tokens = [token.lemma_ if token.pos_ != 'ADV' else '' for token in doc]
    lemmatized_tokens = [token for token in lemmatized_tokens if token]
    return ' '.join(lemmatized_tokens)

df['new_lem_review'] = df['new_review'].apply(lemmatize_and_remove_adverbs)

A question might arise as to why I am removing adverbs. Aren’t words like “incredibly”, “consistently” meaningful? Given that I’m not conducting sentiment analysis and instead focusing on concrete words related to topics such as price, flavor, service, etc., I find that removing adverbs aligns better with the specific goals of my project.

Modeling

Training the Model

After completing sufficient pre-processing steps, I proceed to the modeling phase. I experimented with three algorithms, LSA, LDA, and NMF, and found that the NMF algorithm provided the best results, so I continued with it. I trained the model with the objective of organizing my documents into 10 topics.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import NMF

review = df.review.to_list()
vectorizer = CountVectorizer()
doc_word = vectorizer.fit_transform(review)    # document-term matrix
nmf_model = NMF(n_components=10)               # factorize into 10 topics
doc_topic = nmf_model.fit_transform(doc_word)  # document-topic weights

Displaying Top 10 Keywords for Each Topic

Below, you can observe the 10 most significant words associated with each topic.

def display_topics(model, feature_names, no_top_words, topic_names=None):
    for ix, topic in enumerate(model.components_):
        if not topic_names or not topic_names[ix]:
            print("Topic ", ix)
        else:
            print("Topic: '", topic_names[ix], "'")
        print(", ".join([feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]))

Assigning Topic Labels

Of course, I can’t simply tell the end user that their text falls under “Topic 1” or “Topic 2,” so I assigned meaningful labels to each identified topic.
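One detail not shown in my snippets is where the topic_num column comes from. Assuming the usual approach of assigning each document to its highest-weighted topic, it would be derived from the document-topic matrix like this:

# Assumption: each review is assigned to its dominant topic
df['topic_num'] = doc_topic.argmax(axis=1)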

topic_labels = {
    0: 'Efficient Dining Service',
    1: 'Quality Food and Atmosphere',
    2: 'Wine and Culinary Delights',
    3: 'Ordering and Waiting',
    4: 'Attentive Dining Experience',
    5: 'Delicious Dinner Atmosphere',
    6: 'Timely Dining Experience',
    7: 'Flavorful Menu Options',
    8: 'Overall Dining Experience',
    9: 'Hotel Breakfast Delights'
}
df['topic'] = df['topic_num'].map(topic_labels)

Web Interface Development

Encapsulating all Preprocessing Steps in a Single Function

During the development of the web interface, I considered the following: I’ve built a model, but it relies on various pre-processing steps. If I directly wrap the trained model in a web interface, the user’s input text will reach the model without undergoing those pre-processing steps. I therefore concluded that the pre-processing had to be built into the web interface as well, so I consolidated all the pre-processing steps into a single function. In essence, I wrapped the individual preprocessing functions inside one umbrella function.
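A condensed sketch of that umbrella function is below. The helper names mirror the steps described earlier (remove_two_letter_combos is the hypothetical helper from the sketch above), and the exact ordering is an assumption rather than a copy of my notebook.

def preprocess(text):
    # Apply the same cleaning pipeline used on the training data to one raw review.
    text = re.sub(r'\w*\d\w*', ' ', text)                                     # drop tokens containing digits
    text = re.sub('[%s]' % re.escape(string.punctuation), ' ', text.lower())  # punctuation, lowercase
    text = re.sub(' +', ' ', text)                                            # collapse extra spaces
    text = remove_repeating_letters(text)
    text = emoji.replace_emoji(text, replace='')
    text = ' '.join(lemmatize_and_filter(word) for word in text.split())
    text = remove_stopwords(text)
    text = remove_two_letter_combos(text)
    text = lemmatize_and_remove_adverbs(text)
    return text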

Prediction of New Document

Afterward, I established a prediction function and deployed my Streamlit application.
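A sketch of what that prediction function might look like, assuming the vectorizer, the NMF model, the topic_labels dictionary and the preprocess function from the steps above are available:

def predict_topic(raw_text):
    # Clean the input exactly like the training data, project it into the
    # topic space, and return the label of the dominant topic.
    cleaned = preprocess(raw_text)
    vec = vectorizer.transform([cleaned])
    topic_num = nmf_model.transform(vec).argmax(axis=1)[0]
    return topic_labels[topic_num]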

Streamlit Application

You can test my model by clicking on this link.
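For context, the core of such a Streamlit app needs only a few lines; a sketch, assuming the predict_topic function above:

import streamlit as st

st.title('Restaurant Review Summarizer')
user_review = st.text_area('Paste a restaurant review:')
if st.button('Summarize'):
    st.write(f'Predicted topic: {predict_topic(user_review)}')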

Conclusion

I am delighted to have delved into the extensive realms of Natural Language Processing through an engaging project like topic modeling. What thrilled me the most in this endeavor was the understanding that I can preprocess any text as needed and apply it for various purposes.

Thanks to Everybody

Thank you all for sparing your valuable time to read my article.

Please visit my GitHub repository for additional sources related to my project: github.com/salimkilinc/istdsa_project04
