Scraping All Restaurant Listings from TripAdvisor for a Given City Using Python

Anshika Nigam
9 min read · Mar 16, 2023


Using BeautifulSoup, Requests, Seaborn, and Pandas

Web Scraping using Python

Using web scraping, we can extract data like product prices, ratings, and other information from websites. We can then use this data for purposes like data analysis, research, business intelligence, and data science. In Python, web scraping is often done with libraries like Beautiful Soup, Scrapy, and Requests, which make it easy to retrieve and parse data from web pages.
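As a minimal illustration of the idea (a sketch, with example.com standing in as a placeholder target):

import requests
from bs4 import BeautifulSoup

# Fetch a page and parse its HTML; example.com is just a placeholder.
response = requests.get("https://example.com")
soup = BeautifulSoup(response.content, "html.parser")
print(soup.title.text)  # prints the page's <title> text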

But wait, this is not just another scraping project!

This project aims to scrape restaurant details from any city in any country, along with a tadka of Exploratory Data Analysis from the scraped data. If this doesn’t make you the pundit of web scraping using Python, what will? ;)

What would the scraped CSV look like?

The scraped restaurant data would contain:

  • Name of the Restaurant
  • Total Reviews
  • Star/Bubble Rating
  • Cuisines

Other useful information that can be used for debugging includes page number, restaurant serial number, and data offset.

The scraped DataFrame shown above is for Berlin, Germany.

Control Variables/Input Parameters

In this project, we have selected Berlin, Germany. As another example, if we wished to scrape all restaurants in Bengaluru, Karnataka, we would apply that filter on TripAdvisor and get a link that looks like this:
https://www.tripadvisor.in/Restaurants-g297628-Bengaluru_Bangalore_District_Karnataka.html
In this link, “297628” is the geo-code and “Bengaluru_Bangalore_District_Karnataka” is the city name. Note that TripAdvisor shows around 11,127 restaurants for Bengaluru. So our input parameters (sketched after the list below) would be:

  • Geo Code
  • City Name
  • Upper Data Offset
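For instance, the control variables for the Bengaluru link above might look like the sketch below. The upper limit of 11,100 is an assumption derived from ~11,127 listings at 30 per page (371 pages), so verify it against the site's last page before running:

# Hypothetical control variables for Bengaluru (not used later in this
# post, which scrapes Berlin). The upper limit is an estimate.
scraping_control_variables = {
    'city_name': 'Bengaluru_Bangalore_District_Karnataka',
    'geo_code': '297628',
    'data_offset_lower_limit': 0,
    'data_offset_upper_limit': 11100,  # assumed offset of the last page
    'page_num': 0,
    'page_size': 30
}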

Now that we have selected a city of our choice along with its geo-code, let’s proceed with the script. The first step is to import the required libraries (install them first if needed).
The next step is to define the control variables. Since we will be scraping restaurant data for Berlin, Germany, we define the variables accordingly. TripAdvisor lists 30 restaurants per page, which is our page size, and the last page has a data offset of 6330, which is our data offset upper limit.
These control variables change according to the city being scraped.

pip install "requests_html"
pip install "bs4"
# Import Libraries
import functools
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup
from requests_html import HTMLSession



# store the control variables in a dictionary
scraping_control_variables = {
    'city_name': 'Berlin',
    'geo_code': '187323',
    'data_offset_lower_limit': 0,
    'data_offset_upper_limit': 6330,
    'page_num': 0,
    'page_size': 30
}

There are a total of 10 functions that will be used in this script.

  • get_url
  • get_soup_content
  • get_card
  • parse_tripadvisor
  • get_restaurant_data_from_card
  • scrape_star_ratings
  • scrape_reviews
  • scrape_cuisines
  • scrape_title
  • save_to_csv

Let’s go through each function one by one.

get_url

get_url takes the geo-code, data offset, and city name as inputs and creates a different URL for every page to be scraped. The URLs follow a pattern: the data offset is a multiple of 30, so the first page has no offset, the second has offset 30, the third 60, and so on.

# Function to get URL for every page

def get_url(gc, do, city):
    data_offset_var = '-oa' + str(do)
    if do == 0:
        data_offset_var = ''
    url = f"https://www.tripadvisor.in/RestaurantSearch-g{gc}{data_offset_var}-a_date.2023__2D__03__2D__05-a_people.2-a_time.20%3A00%3A00-a_zur.2023__5F__03__5F__05-{city}.html#EATERY_LIST_CONTENTS"
    print("URL to be scraped:", "\n", url, "\n")
    return url
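
A quick sanity check of the pattern is to print the URLs for the first few pages:

# Print the URLs for the first three Berlin pages (offsets 0, 30, 60).
for offset in (0, 30, 60):
    get_url('187323', offset, 'Berlin')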

get_soup_content

get_soup_content takes the geo-code, data offset, and city name as inputs and calls the get_url function. It then creates a response object from the URL. Once the HTML is accessible, we parse it and load it into a BS4 structure. This soup object is very handy and lets us access useful pieces of information such as the title, cuisines, ratings, etc.

# Function to get soup content

def get_soup_content(gc, do, city):
    url = get_url(gc, do, city)
    # start the HTML session and fetch the page
    print("HTML session started")
    session = HTMLSession()
    response_obj = session.get(url, verify=False)
    soup_content = BeautifulSoup(response_obj.content, "html.parser")
    return soup_content
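
The time and functools imports from earlier come in handy here: network calls to a live site occasionally fail or get throttled, so a simple retry-with-delay wrapper around get_soup_content is worth considering. A minimal sketch, not part of the original flow; the attempt and delay values are illustrative:

# Sketch: retry a flaky network call a few times with a polite delay.
def retry_with_delay(attempts=3, delay_seconds=5):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    print(f"Attempt {attempt} failed: {exc}")
                    if attempt == attempts:
                        raise
                    time.sleep(delay_seconds)
        return wrapper
    return decorator

# Usage: get_soup_content = retry_with_delay()(get_soup_content)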

get_card

The get_card function retrieves an individual restaurant card by its restaurant serial number (the restaurant count). The card tags follow this pattern: 1_list_item, 2_list_item, 3_list_item, etc.

*Screenshot for reference: the restaurant card tag*
# Function to get individual restaurant cards

def get_card(rest_cnt, soup_content):
    card_tag = f"{rest_cnt}_list_item"
    print(f"Scraping item number: {card_tag}")
    card = soup_content.find("div", {"data-test": card_tag})
    return card

parse_tripadvisor

The parse_tripadvisor function takes the control variables defined earlier as input. This is one of the most important functions in the script. The variables data_offset_lower_limit, data_offset_upper_limit, page_num, page_size, geo_code, and city_name take their values from the scraping_control_variables dictionary. data_offset_current starts at data_offset_lower_limit and is incremented by 30 for each page in the loop below.
The while loop runs until the last page is scraped (around 212 pages). page_start_offset and page_end_offset take values like (1, 31), (31, 61), (61, 91), etc., since each page generally lists 30 restaurants. Because we can't be sure a page actually contains all 30, the loop also checks whether a card is missing and breaks out early. The get_restaurant_data_from_card function scrapes each restaurant's details, which are appended to an initially empty list, restaurants_scraped.

# parse each restaurant card
def parse_tripadvisor(scraping_control_variables):
    restaurants_scraped = []
    data_offset_lower_limit = scraping_control_variables['data_offset_lower_limit']
    data_offset_upper_limit = scraping_control_variables['data_offset_upper_limit']
    page_num = scraping_control_variables['page_num']
    page_size = scraping_control_variables['page_size']
    geo_code = scraping_control_variables['geo_code']
    city_name = scraping_control_variables['city_name']

    data_offset_current = data_offset_lower_limit

    while data_offset_current <= data_offset_upper_limit:
        print("Scraping Page Number: ", page_num)
        print("Scraping Data Offset: ", data_offset_current)
        page_start_offset = (page_num * page_size) + 1
        page_end_offset = (page_num * page_size) + page_size + 1
        soup_content = get_soup_content(geo_code, data_offset_current, city_name)
        for rest_cnt in range(page_start_offset, page_end_offset):
            card = get_card(rest_cnt, soup_content)
            if card is None:
                break
            restaurant_data = get_restaurant_data_from_card(rest_cnt, data_offset_current, page_num, card)
            restaurants_scraped.append(restaurant_data)
        print("Scraping Completed for Page Number: ", page_num, "\n")
        print("Data Offset: ", data_offset_current)
        page_num = page_num + 1
        data_offset_current = data_offset_current + 30
    return restaurants_scraped
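
Once all of the functions below are defined, it can help to dry-run the loop on just a couple of pages before launching the full ~212-page run, by shrinking the upper limit (the values below are illustrative):

# Dry run: scrape only the first two pages (offsets 0 and 30).
test_variables = dict(scraping_control_variables, data_offset_upper_limit=30)
sample = parse_tripadvisor(test_variables)
print("Restaurants in dry run:", len(sample))  # expect up to 60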

get_restaurant_data_from_card

The get_restaurant_data_from_card function takes the restaurant count, current data offset, page number, and the card itself as inputs and calls the individual scrape functions to get the restaurant's details.

# call scrape functions and store it in a dictionary

def get_restaurant_data_from_card(rest_cnt, data_offset_current, page_num, card):
    restaurant_data = {
        'title': scrape_title(card),
        'cuisines': scrape_cuisines(card),
        'reviews': scrape_reviews(card),
        'star rating': scrape_star_ratings(card),
        'page number': page_num,
        'data offset': data_offset_current,
        'restaurant serial number': rest_cnt
    }
    return restaurant_data

Scraping Functions to get Restaurant Details

All the functions below take the card as input, which contains all the information pertaining to a particular restaurant.

  • scrape_star_ratings (gets the star/customer rating of the restaurant)
  • scrape_reviews (gets the total reviews of the restaurant)
  • scrape_cuisines (gets all the cuisines offered by the restaurant)
  • scrape_title (gets the name of the restaurant)

*Screenshots for reference: the title, rating, reviews, and cuisines tags*
# Scraping Functions

def scrape_star_ratings(card):
    star_rating = card.find_all('svg', class_="UctUV d H0")
    scraped_star_ratings = star_rating[0]['aria-label'] if len(star_rating) >= 1 else None
    return scraped_star_ratings


def scrape_reviews(card):
    reviews = card.find_all('span', class_="IiChw")
    scraped_reviews = reviews[-1].text if len(reviews) >= 1 else None
    return scraped_reviews


def scrape_cuisines(card):
    cu_1 = card.find('div', class_='hBcUX XFrjQ mIBqD')
    try:
        scraped_cuisines = cu_1.find('span', class_='SUszq').get_text()
    except AttributeError:
        scraped_cuisines = None
    return scraped_cuisines


def scrape_title(card):
    title = card.find_all('div', class_='RfBGI')
    scraped_title = None if len(title) < 1 else title[0].text
    return scraped_title
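
These class names are specific to TripAdvisor's markup at the time of writing and change over time. To sanity-check the selectors without hitting the live site, you can run them against a tiny hand-written HTML fixture that mimics the tags above (the snippet below is an illustrative assumption, not real TripAdvisor markup):

# A minimal fake card for offline testing of the scrape functions.
sample_html = """
<div data-test="1_list_item">
  <div class="RfBGI"><span>1. Example Restaurant</span></div>
  <svg class="UctUV d H0" aria-label="4.5 of 5 bubbles"></svg>
  <span class="IiChw">1,234 reviews</span>
  <div class="hBcUX XFrjQ mIBqD"><span class="SUszq">German, European</span></div>
</div>
"""
test_card = BeautifulSoup(sample_html, "html.parser").find("div", {"data-test": "1_list_item"})
print(scrape_title(test_card))         # 1. Example Restaurant
print(scrape_star_ratings(test_card))  # 4.5 of 5 bubbles
print(scrape_reviews(test_card))       # 1,234 reviews
print(scrape_cuisines(test_card))      # German, European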

Saving the scraped file as a CSV

Finally, let’s save the DataFrame as a CSV in our local directory. This CSV can then be used for any data analysis or data science project.

def save_to_csv(restaurants_scraped):
    # finally, store the output in a csv file
    print("storing the data in csv")
    output_df = pd.DataFrame(restaurants_scraped)
    output_df.drop_duplicates(inplace=True)
    output_df.to_csv("ta_berlin_restaurants_scraped.csv", index=False)
    print("csv stored")


def scrape_and_save(scraping_control_variables):
    restaurants_scraped = parse_tripadvisor(scraping_control_variables)
    save_to_csv(restaurants_scraped)
    return restaurants_scraped


# disable insecure-request (certificate) warnings
requests.packages.urllib3.disable_warnings()
restaurants_scraped = scrape_and_save(scraping_control_variables)

# print the DataFrame
scraped_df = pd.DataFrame(restaurants_scraped)
print("Restaurant Data Scraped:\n", scraped_df.head(20))
*Script output*
# Let's check the total restaurants scraped

print("Total Restaurants scraped:\t", len(scraped_df))

We are not done yet!

Let’s do some Exploratory Data Analysis on the scraped data. We will plot the following using Seaborn:

  • Top 10 most popular cuisines in Berlin, Germany
  • Number of reviews vs. star rating of a restaurant in Berlin, Germany

The clean_dataframe function cleans the scraped output DataFrame: it splits the serial number from the restaurant name, drops unnecessary columns, splits the cuisines (which are concatenated in one column, separated by commas), and removes noise from some columns.

The scatter_plot_viz function creates a scatter plot of star rating vs. number of reviews using Seaborn. It helps identify the best places to eat in Berlin: as the plot shows, we would prefer restaurants with both a high rating and a high number of reviews.

The popular_cuisines function creates a bar plot of the most popular cuisines by aggregating the dataset on cuisine counts. To get the counts, we first split the comma-separated cuisines and push each one into its own row.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings("ignore")
plt.rcParams['figure.figsize'] = [12, 8]

# Take the input as the scraped output
ta_restaurants = scraped_df

def clean_dataframe(df):
    # clean title: split the serial number from the restaurant name
    df[['sr_no', 'restaurant_name']] = df["title"].str.split(" ", n=1, expand=True)
    df["restaurant_name"] = df["restaurant_name"].str.strip(" ")

    # drop unnecessary columns
    df = df.drop('sr_no', axis=1)
    df = df.drop('page number', axis=1)
    df = df.drop('data offset', axis=1)

    # split cuisines into two columns, then melt them into rows
    df[['cuisine_1', 'cuisine_2']] = df["cuisines"].str.split(",", n=1, expand=True)
    df = df.melt(id_vars=["title", "cuisines", "reviews", "star rating",
                          "restaurant serial number", "restaurant_name"],
                 var_name="cuisines_melt",
                 value_name="cuisines_all")

    # clean columns (replace the longer price string '₹₹ - ₹₹₹' before
    # the single '₹', otherwise it would never match)
    df["reviews"] = df["reviews"].str.replace('reviews', '').str.replace('review', '').str.replace(',', '').str.strip(" ")
    df["star rating"] = df["star rating"].str.replace(' of 5 bubbles', '').str.strip(" ")
    df["cuisines_all"] = df["cuisines_all"].str.replace('₹₹ - ₹₹₹', '').str.replace('₹', '').str.replace('-', '').str.strip(" ")
    return df
*Cleaned DataFrame*
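As a side note, the two-column cuisine split above assumes at most two comma-separated values per restaurant. Here is a sketch of an alternative using pandas' explode, which handles any number of cuisines per row (it assumes the same column names as above):

# Alternative sketch: split on commas and explode into rows instead
# of melting exactly two cuisine columns.
def explode_cuisines(df):
    df = df.copy()
    df["cuisines_all"] = df["cuisines"].str.split(",")
    df = df.explode("cuisines_all")
    df["cuisines_all"] = df["cuisines_all"].str.strip()
    return df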
def popular_cuisines(df):
    # alternative one-liner:
    # df_popular_cuisines = df.where(df['cuisines_all'] != '').groupby(['cuisines_all'])['restaurant_name'].nunique().sort_values(ascending=False).head(20)
    df_popular_cuisines = df.where(df['cuisines_all'] != '').groupby("cuisines_all").agg(
        total_restaurants_offering_cuisines=('restaurant_name', 'nunique'))
    df_popular_cuisines = df_popular_cuisines.sort_values(
        by=["total_restaurants_offering_cuisines"], ascending=False).head(10)
    df_popular_cuisines['cuisines'] = df_popular_cuisines.index
    df_popular_cuisines = df_popular_cuisines.reset_index(drop=True)
    print(df_popular_cuisines.head(10))
    ax_1 = sns.barplot(data=df_popular_cuisines, x="cuisines", y="total_restaurants_offering_cuisines")
    ax_1.set_title('Popular Cuisines in Berlin, Germany', size=14, font='sans', fontweight='bold')
    ax_1.set_xlabel("Cuisines", size=10, font='sans', fontweight='bold')
    ax_1.set_ylabel("Total Restaurants Offering Cuisines", size=10, font='sans', fontweight='bold')
    for i in ax_1.containers:
        ax_1.bar_label(i)
    plt.show()
    return df_popular_cuisines
*Popular cuisines DataFrame*
*Popular cuisines visualization*
def scatter_plot_viz(df):
    df_subset = df[["restaurant_name", "reviews", "star rating"]]
    df_subset = df_subset.drop_duplicates()
    df_subset['reviews'] = df_subset['reviews'].fillna(0).astype(int)
    df_subset['star rating'] = df_subset['star rating'].fillna(0).astype(float)
    df_subset = df_subset.where(df_subset['star rating'] != -1.0)
    print(df_subset.head(10))
    sns.set(style="whitegrid")
    ax = sns.scatterplot(x="reviews",
                         y="star rating",
                         data=df_subset.sort_values("reviews", ascending=False),
                         style="star rating",
                         hue="star rating",
                         palette='tab10')
    ax.set_title('Restaurant Rating v/s No. of Reviews', size=14, font='sans', fontweight='bold')
    ax.set_xlabel("Reviews", size=10, font='sans', fontweight='bold')
    ax.set_ylabel("Ratings", size=10, font='sans', fontweight='bold')
    plt.grid(color='grey', linestyle='--', linewidth=0.5)
    ax.set(xscale="linear")
    ax.set(xlim=(100, 8000))
    ax.set(ylim=(1, 6))
    plt.show()
*Scatter plot DataFrame*
*Ratings vs. reviews visualization*
ta_restaurants_clean = clean_dataframe(ta_restaurants)
popular_cuisines_df = popular_cuisines(ta_restaurants_clean)
scatter_plot_viz(ta_restaurants_clean)

Future Work:

  • Automate the scraping process to collect data periodically for monitoring changes in restaurant information.
  • Use natural language processing techniques to analyze restaurant reviews and ratings to generate insights about customer sentiment.
  • Develop a recommendation system for restaurants based on user preferences and restaurant attributes.

Thank you for taking the time to read my blog post. I hope it has helped you understand web scraping with Python.
I invite you to continue following my work and engaging with me as we explore Data Science together.
Thank you for being a part of this journey; I'm grateful for all feedback.
