Magic Mirror on the wall, where is the best restaurant of all?

Restaurant reviews analysis for B2B and B2C

Selecting a project

Hi! We are Tereza and Marketa. We both attended Czechitas course Digital Academy and together we created this data analysis project . We have met before and had similar idea what our project could look like so it felt like a good match.

In the beginning we tried to think of our own project that would ideally help some small company or non-profit organization. We got in touch with my friends from the Church of Brethren and tried to think of a data analysis project that could help them. (Un)Fortunatelly, we soon found out that they did not have enough data for a project that would let us apply the skills we were learning.

After this experience we decided to choose a project the opposite way — based on what we would like to learn. Since the first day of Digital Academy we had been interested in text analysis. Before DA we both worked with language (Tereza was a copywriter, I studied linguistics and worked in language education). Therefore our first choice for a mentor was Geneea. We prepared some ideas of our own, however, after a discussion with Geneea we decided to choose the topic they suggested — text analysis of restaurant reviews.

Our client

We decided that not having a specific client would be our advantage. We could make them up! The first client was obvious. We go to restaurants and want to eat well. What would be interesting for us to know? To analyse the other point of view we imagined a restaurant owner/manager and thought about questions they could ask.

👩👩 Two girls who like going out

  • We want to find a nice café where we could meet and work on the project. Where should we go?
  • What restaurant should we avoid?

👨‍⚖️ Restaurant owner/manager

  • ️What do people say about us?
  • I changed a manager six months ago. How is he doing?

Plan

We tried to plan the stages we needed to go through. We needed to get data, clean them and transform them, prepare texts for text analysis, prepare tables for visualizations and then come up with some cool conclusions.

The journey was obviously not direct. Many times we had to take a step back when we found out we needed more data or needed to organize the data in a different way.

Getting data

To be able to start, we first needed to get data. Geneea got us in touch with Apify, that enabled us to use their platform to scrape data. There are several websites that collect restaurant reviews. We decided to scrape two platforms, TripAdvisor and Restu, as working with two sources brings more accurate results and more possibilities in data analysis. Why these two? Apify already had a crawler for scraping TripAdvisor. However, in Tripadvisor there are reviews mostly in English and we wanted to analyse mainly Czech reviews. There are more free tools for analysing English than Czech, so focusing on Czech as well seemed to be an attractive challenge. Therefore we also chose Restu with mainly Czech reviews.

To get the best results we needed individual restaurants to have as many reviews as possible and for this reason we decided to scrape only restaurants in Prague. (In smaller cities the restaurants do not have enough reviews to get valid results.) However, the scripts can be used for analysing restaurants in any other city.

Scraping

This part we need to start by giving big thanks to people in Apify. They helped us get oriented in how the platform works and set us an account with increased limit of webpages we were allowed to scrape for free.

For scraping TripAdvisor we used a crawler that is already available on Apify. As mentioned above, Apify already has a crawler that scrapes reviews from TripAdvisor. However, the crawler still needed some adjusting. Code.

scraping TripAdvisor

At the end we had five tables. For each review we got these columns:

TripAdvisor data from web scraping

For scraping Restu we first needed to learn a bit of JavaScript. Here we would like to thank a lot to Ondřej Šlejtr who helped us write the basic structure of the crawler and then answered thousands of questions when we were embedding all the features we needed in the crawler. You can find the final script here. The data in the resulting table were structued in a different way than in TripAdvisor — We got a row of data for each page (not a review) the crawler visited. The table looked this way:

Restu data from web scraping

Preparing data for text analysis and visualizations

After getting data by scraping, the next steps were cleaning the data and processing them in a way that would result in two tables — one with restaurants and another one with reviews.

At this point we had a chance to work in Keboola Connection, which was a great advantage. Especially when cleaning data and joining datasets, we often came to a situation when we needed to take a step back and embed some changes before moving on.

TripAdvisor dataset

First we reduced five tables to two. We joined TripAdvisor Dessert, Coffee & Tea, Bakeries and Bars & Pubs into TA_MIX_PRAHA, the other table was TripAdvisor restaurants as TA_REST. (For technical reasons — time limit for a job in Keboola, we had to work with them separately.) Then we used Geneea app in Keboola to determine language distribution and joined the results with our data. Code.

After this we were finally able to join “mix” and “restaurants” together and dig into more cleaning. Here we removed duplicities and prepared the data so we could later join them with Restu data. Code.

The result was two tables: one with reviews and one with restaurants. The tables were not normalised (we left some info about restaurants in reviews table) as we did not know exactly what type of information we would need for text analysis, where the reviews table would be sent.

TripAdvisor tables

Restu dataset

Restu table needed reorganization so that each review would be on a separate row. Code.

Then we went through a similar process of cleaning and adapting data as with TripAdvisor dataset (code SQL, code Python) and ended up with two tables:

After processing both data sources and analysing language distribution, we decided to work with reviews in two languages — English and Czech. Although in the beginning we thought we would analyze only Czech reviews, adding English largely extended the dataset and allowed more exact analysis. In both sets we had alltogether 244,000 reviews, 161,000 in English and 79,000 in Czech.

Joining TripAdvisor and Restu

The next step was joining data from the two platforms. It included for example merging and adding columns. Code (restaurants). Code (reviews).

A lot of restaurants are in both platforms and need to be mapped. This appeared to be quite a challenge. First we needed to get geo coordinates for Restu. In Keboola there is an extractor where you upload addresses and it returns coordinates. As TripAdvisor coordinates are from OpenStreetMap (which is also a free provider in Keboola), an easy solution seemed to be to send Restu addresses there and then join the data. Due to some technical issues on OpenStreetMap side, this provider does not work at this time. Another free provider was Yandex. However, after successful processing we found out that Yandex and OpenStreetMap coordinates for the same address are different. So we send there distinct addresses from both Restu and Tripadvisor. Unvortunatelly due to specific spelling in TripAdvisor addresses Yandex could not find a large amount of them. Third time lucky, we sent the addresses to Google maps and were successful. Code.

Having coordinates from the same source is the first step to successfully map the restaurants. However, we cannot say that restaurants with the same coordinates are the same. There are a lot of cases when restaurants share the same address (e.g. in shopping centres) or when a restaurant was closed and a new one was opened on the same spot. Therefore we needed to write a script (in Python) that would match the same coordinates from different platforms and based on Levenshtein distance or fuzzy matching decide which restaurants are the same.

Unfortunatelly, this is not finished yet. If you are interested in this part of our project, please, come back later.

Text analysis

With reviews table being ready we could proceed to text analysis. For this part of project we used Geneea’s NLP platform The Interpretor. How does it work? The software processes a given text and based on embedded algoritms it is table to identify elements such as keywords, phrases, locations or persons, detect language and sentiment and incorporate metadata (id, date).

In the picture above you do not see the first version. Every set of text is specific and therefore a software, now matter how well it is set up, cannot distinguish everything correctly. However, you can tell the Interpretor how to read your data through giving it a “training file”. For example we knew that in our data Einstein and Anděla are not persons but a restaurant (Pizzeria Einstein) and a location (Anděl). Still we did not correct all entities (which would be a neverending job), just the ones we planned to use in our analysis. As we wanted to work with English and Czech reviews together, we had to map all Czech entities we wanted to the English ones. We also added some new entities and categorized them.

Entity categorization example. Favourite brand missing? Let us know!

Translating, categorizing and mapping might not seem to be the most exciting job, however, it allowed us to enjoy the suprising diversity of Czech and English language.

Visualizations

At this point we could start visualizing data to get to some conclusions. We worked in Tableau with our tables “restaurants” and “reviews” plus, thanks to text analysis, we had tables with frequent phrases, sentiment, subjects and their attributes and other.

Development of average score and number of reviews in time. Each dot represents a restaurant. Colour tells a score, size number of records. Time frame 5 years (10/2013–9/2018) by quarters.

Use case — Reviews analysis for a restaurant chain

Let’s go back to our imaginary restaurant owner/manager. We decided we would try to analyse data for restaurant Chain Potrefená husa. We created an interactive dashboard with review-related information about their restaurants. They can find there:

  • how people rate individual restaurants
  • what they criticise most
  • development in time
Interactive dashboard for Potrefená husa chain
We can select a specific restaurant and highlight all related data

When we study the data more closely we can see that restaurants Potrefená husa Vinohrady and Potrefená husa Dejvice have score related to waiters highly below the average. From business point of view it might make sense to investigate this issue as solving it could lead to higher customer satisfaction and therefore higher profit.

How do people describe Potrefená husa chain? We made a list of important subjects and words that attribute them:

Potrefená husa restaurants: subject — attribute relations

The distribution of subjects and attributes here is not surprising and general impression is positive. However, when we look at a restaurant with extremely low score, we can see that especially attribute distribution is rather different.

Švejk restaurant: subject — attribute relations

What do people think about us?

What does it say when people mention Potrefená husa in a different restaurant’s review? Why do they do it? Is it good or bad? We looked into individual reviews and tried to find out what brand Potrefená husa means to customers. Apart from individual recommendations we noticed these characteristics:

  • Potrefená husa is a benchmark

… hektičtější atmosféru připomínající spíše Potrefené husy … chutí někde mezi Potrefenou husu a Goddie burger … klasická restaurace typu Potrefená husa … lidovější varianta franšízy Potrefená Husa … místech alá Potrefená husa … připomínající koncept Potrefené husy či Kozlovny … O něco lepší Potrefená husa … připomíná skromnější verzi Potrefené husy … připomíná spíše restaurace typu Potrefená husa … Restaurace ve stylu Potrefené husy …

  • sometimes perceived positively

… snaží se kopírovat Potrefenou husu, ale mají na ni tak z 10 % … they are very far from their competitors (Potrefena, etc) … Used to be a decent alternative to Potrefená husa … Vyrovná (se) všem Potrefeným husám …

  • however, sometimes negatively

… má nakročeno stát se řetězcem typu Potrefená husa, kde snad i jídlo vaří roboti … neustále se opakující koncept Kozloven, Potrefených hus a podobně … po rekonstrukci přeměnila v tuctový podnik typu Potrefená husa … standardní chain, vypadá stejně jako Husa a PUBy a všechny tyto ostatní, nuda … big-brand ‘pubs’ such as Kolkovna and Potrefena husa with their big brand prices …

B2C

Where should we meet?

When working on a project we often met a favourite café. We were curious, if data analysis could suggest other places where we could meet. In Tableau we filtered restaurants with features café and free wifi and a score higher than 4,5. Then we looked at them on a map and selected an area that we would be willing to travel to to meet. We got about ten cafés, which we sorted by score and number of records and chose our five finalists. From text analysis table we chose six subjects we were interested in: place, staff, price, coffee, food and cake. Then we analysed how often people mention the subject (compared to other cafés from the list) and what attributes they associate them with.

“Staff” and “place” are generally positive and do not say anything specific about the café. The others are more interesting. Our impression currently is: Nota Bene and Mezi zrnky are cheaper than the other three. If you are hungry, go to Nota Bene. Coffee and cake are most appreciated in La Boheme Cafe.

The worst restaurant in Prague

In the beginning we were curious to know which restaurant to avoid. We filtered restaurants with the highest number of reviews with negative sentiment. From them we chose the one that had the highest ratio of reviews with negative sentiment. Then we sent most frequents phrases used in the reviews to Geneea’s Interpretor to determine their sentiment.

Most frequent phrases in Švejk’s reviews and their sentiment

Actually the most frequent phrase — “tourist trap”, is quite accurate. None of the reviews is in English.

Conclusion

Data analysis can help people make conscious decisions, both business and personal. It is an extremely powerful tool and we were happy we could learn a bit of it. What did we like most about our project? We could be creative. We could think about data from different perspectives (B2B, B2C). We used all the tools we were learning in Digital Academy on real data. We went through all steps of data processing. We had a chance to explore the beauty of text analysis. A lot of the above was possible thanks to people in Geneea, especially Petr Hamerník, Víťa Stejskal and Jirka Hana. Thank you for helping and supporting us.

There are many other questions that could be answered with our data and we still have a few of them on our minds. Unfortunatelly we did not manage to complete all our plans as Tereza could not, due to personal reasons, work on the project till the end.