Let's build a web scraper for Google Maps

Anand Satheesh
5 min read · May 23, 2024


Web scrapers are tools or programs used to extract data from websites. They can be written in various programming languages like Python, Java, or JavaScript, and typically involve making HTTP requests to a website, parsing the HTML or other content, and then extracting the desired information.

Why Scrape Google Maps

Google Maps offers a wealth of business profile data, including addresses, ratings, phone numbers, and website addresses. Scraping Google Maps can yield a comprehensive data directory for business intelligence and market analysis. Additionally, it can be used for lead generation by providing access to business contact details.

As a proof of concept, this project also has immediate practical uses. The other day I wanted to get my bike serviced and wasn't sure which service centre would suit me best. With this project we can enter a shop or establishment category and a location, and get back a list of businesses in and around that location, along with details such as average ratings and contact information.

There are a couple of ways this can be implemented.

  1. The Blue Pill: There are third-party REST APIs that sit on top of Google Maps and serve the required data, in which case most of our work is just parsing the JSON responses and creating a dataframe, or converting them directly to a CSV or Excel tabular format for better readability (a rough sketch of this approach follows this list). You choose the blue pill, the story ends. You wake up in your bed and believe whatever you want to. Or
  2. Choose the Red Pill: you stay with this article, and I show you how deep the rabbit hole goes. Let's create a web scraping tool from scratch, exploring a bit of web browser automation and parsing the good old HTML document to find what we are looking for.
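For the curious, the blue-pill route looks roughly like the sketch below. It is only an illustration, assuming a hypothetical third-party places API: the endpoint, parameters, and response fields are placeholders for whichever provider you pick, and requests is not otherwise used in this project.

import requests
import pandas as pd

# Placeholder endpoint and key for a hypothetical third-party places API
response = requests.get(
    "https://api.example-places.com/v1/search",
    params={"query": "yamaha service center trivandrum", "key": "YOUR_API_KEY"},
)
response.raise_for_status()

# Such APIs typically return a JSON list of places; flatten it into a table
results = response.json().get("results", [])
df = pd.json_normalize(results)
df.to_csv("places_via_api.csv", index=False)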

Tools we need

  1. A Python installation, 3.9 or newer (the code below uses built-in generic type hints such as list[Shop])
  2. Any IDE of your choice (I have used PyCharm Community Edition)
  3. A cup of Tea or Coffee

Python packages used

  1. argparse: The argparse module provides a way to handle command-line arguments, making it easier to write scripts that can accept user input directly from the command line. This can be incredibly useful for creating tools, utilities, and applications that require user configuration or input.
  2. playwright: Playwright is a powerful tool for automating web browsers. Developed by Microsoft, it is designed to be reliable, fast, and capable of handling modern web applications. Playwright supports multiple programming languages, including JavaScript, Python, C#, and Java, and can interact with all major browser engines: Chromium, Firefox, and WebKit. This makes it an excellent choice for cross-browser testing and automation. I have used the Chromium browser for this project (a minimal smoke-test snippet follows this list).
  3. dataclasses: This module provides a decorator and functions for automatically adding special methods to user-defined classes. These special methods include __init__(), __repr__(), __eq__(), and more. The primary purpose of dataclasses is to simplify the creation of classes that are primarily used to store data.
  4. pandas: Pandas is a powerful and widely-used open-source library in Python for data manipulation and analysis. It provides data structures and functions needed to work on structured data seamlessly. Built on top of NumPy, Pandas is particularly well-suited for handling tabular data, such as data stored in spreadsheets or databases.
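Before wiring everything together, a minimal Playwright smoke test helps confirm the setup works (install it with pip install playwright, then playwright install chromium to download the browser binaries). The URL here is arbitrary; these are the same sync API calls used later in main.py.

from playwright.sync_api import sync_playwright

# Launch Chromium, open a page, and print its title to verify the install
with sync_playwright() as pw:
    browser = pw.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://www.google.com/maps")
    print(page.title())
    browser.close()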

Let's look at the files and classes to be created:

shop.py

from dataclasses import dataclass


@dataclass
class Shop:
    shop_name: str = None
    shop_location: str = None
    contact_number: str = None
    website: str = None
    average_review_count: str = None
    average_review_points: str = None
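As a quick sanity check (not one of the project files), a Shop instance converts cleanly to a dict with dataclasses.asdict, which is exactly what ShopData relies on below; the sample values are made up.

from dataclasses import asdict
from shop import Shop

shop = Shop(shop_name="Sample Garage", shop_location="Trivandrum")
print(asdict(shop))
# {'shop_name': 'Sample Garage', 'shop_location': 'Trivandrum',
#  'contact_number': None, 'website': None,
#  'average_review_count': None, 'average_review_points': None}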

shop_data.py

from shop import Shop
from dataclasses import asdict
import pandas as pd
import os


class ShopData:
    """Holds the list of scraped establishments and saves them to CSV."""

    business_list: list[Shop] = []
    save_at = 'output'

    def dataframe(self):
        # Flatten each Shop dataclass into one row of the resulting DataFrame
        return pd.json_normalize(
            (asdict(shop) for shop in self.business_list), sep="_"
        )

    def save_to_csv(self, filename):
        if not os.path.exists(self.save_at):
            os.makedirs(self.save_at)
        print(self.dataframe())
        self.dataframe().to_csv(f"{self.save_at}/{filename}.csv", index=False)
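To see what ShopData produces without touching a browser, you can feed it a couple of hand-made Shop records; the values here are invented purely for illustration.

from shop import Shop
from shop_data import ShopData

data = ShopData()
data.business_list.append(Shop(shop_name="Shop A", average_review_points="4.5"))
data.business_list.append(Shop(shop_name="Shop B", contact_number="0471 0000000"))

print(data.dataframe())         # one row per shop, one column per dataclass field
data.save_to_csv("sample_run")  # writes output/sample_run.csv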

Below is the code for the main.py file, which contains the bulk of the browser automation and scraping logic:

from shop import Shop
from shop_data import ShopData
import argparse
from playwright.sync_api import sync_playwright


def main():
    # Read the search parameters from the command line
    command_line_args = argparse.ArgumentParser()
    command_line_args.add_argument("--category", type=str)
    command_line_args.add_argument("--location", type=str)
    command_line_args.add_argument("--number", type=int)
    search_parameters = command_line_args.parse_args()
    category = search_parameters.category
    location = search_parameters.location
    number_of_items = search_parameters.number

    with sync_playwright() as pw:
        browser = pw.chromium.launch(headless=False)
        page = browser.new_page()

        page.goto("https://www.google.com/maps", timeout=10000)
        page.wait_for_timeout(1000)

        # Type the search query into the Maps search box and submit it
        page.locator('//input[@id="searchboxinput"]').fill(category + " " + location)
        page.wait_for_timeout(1000)

        page.keyboard.press("Enter")
        page.wait_for_timeout(1000)

        # Hover over the first result so mouse-wheel scrolling targets the results pane
        page.hover('//a[contains(@href, "https://www.google.com/maps/place")]')

        previously_counted = 0
        while True:
            # Scroll the results pane to load more listings
            page.mouse.wheel(0, 10000)
            page.wait_for_timeout(3000)

            if (
                page.locator(
                    '//a[contains(@href, "https://www.google.com/maps/place")]'
                ).count()
                >= number_of_items
            ):
                listings = page.locator(
                    '//a[contains(@href, "https://www.google.com/maps/place")]'
                ).all()[:number_of_items]

                print(f"Total Scraped: {len(listings)}")
                break
            else:
                # logic to break from loop to not run infinitely
                # in case arrived at all available listings
                if (
                    page.locator(
                        '//a[contains(@href, "https://www.google.com/maps/place")]'
                    ).count()
                    == previously_counted
                ):
                    listings = page.locator(
                        '//a[contains(@href, "https://www.google.com/maps/place")]'
                    ).all()
                    print(f"Arrived at all available\nTotal Scraped: {len(listings)}")
                    break
                else:
                    previously_counted = page.locator(
                        '//a[contains(@href, "https://www.google.com/maps/place")]'
                    ).count()
                    print(
                        "Currently Scraped: ",
                        page.locator(
                            '//a[contains(@href, "https://www.google.com/maps/place")]'
                        ).count(),
                    )

        shop_data_obj = ShopData()

        for listing in listings:
            # Open the listing's detail pane and give it time to render
            listing.click()
            page.wait_for_timeout(5000)

            name_attribute = 'aria-label'
            location_xpath = '//button[@data-item-id="address"]//div[contains(@class, "fontBodyMedium")]'
            website_xpath = '//a[@data-item-id="authority"]//div[contains(@class, "fontBodyMedium")]'
            contact_number_xpath = '//button[contains(@data-item-id, "phone:tel:")]//div[contains(@class, "fontBodyMedium")]'
            average_review_count_xpath = '//div[@jsaction="pane.reviewChart.moreReviews"]//span'
            average_review_points_xpath = '//div[@jsaction="pane.reviewChart.moreReviews"]//div[@role="img"]'

            shop_obj = Shop()

            if len(listing.get_attribute(name_attribute)) >= 1:
                shop_obj.shop_name = listing.get_attribute(name_attribute)
            else:
                shop_obj.shop_name = ""

            if page.locator(location_xpath).count() > 0:
                shop_obj.shop_location = page.locator(location_xpath).all()[0].inner_text()
            else:
                shop_obj.shop_location = ""

            if page.locator(website_xpath).count() > 0:
                shop_obj.website = page.locator(website_xpath).all()[0].inner_text()
            else:
                shop_obj.website = ""

            if page.locator(contact_number_xpath).count() > 0:
                shop_obj.contact_number = page.locator(contact_number_xpath).all()[0].inner_text()
            else:
                shop_obj.contact_number = ""

            if page.locator(average_review_count_xpath).count() > 0:
                shop_obj.average_review_count = int(
                    page.locator(average_review_count_xpath).inner_text()
                    .split()[0]
                    .replace(',', '')
                    .strip()
                )
            else:
                shop_obj.average_review_count = ""

            if page.locator(average_review_points_xpath).count() > 0:
                shop_obj.average_review_points = float(
                    page.locator(average_review_points_xpath).get_attribute(name_attribute)
                    .split()[0]
                    .replace(',', '.')
                    .strip()
                )
            else:
                shop_obj.average_review_points = ""

            shop_data_obj.business_list.append(shop_obj)

        shop_data_obj.save_to_csv(f"{category}_in_{location}_google_maps")
        browser.close()


if __name__ == "__main__":
    main()

To execute this, just enter the following command in your terminal:

python main.py --category "yamaha service center" --location trivandrum --number 5

Since we have used the Playwright package with headless mode disabled, and added wait_for_timeout calls between steps, you can watch the automation run on your machine: Chromium opens Google Maps and fetches the required details. The final result:

A CSV file is created locally for the chosen search category and location.
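If you prefer to inspect the result programmatically rather than in a spreadsheet, pandas can read it straight back in. The filename below simply follows the "{category}_in_{location}_google_maps" pattern from save_to_csv, applied to the example command above.

import pandas as pd

df = pd.read_csv("output/yamaha service center_in_trivandrum_google_maps.csv")
print(df[["shop_name", "average_review_points", "contact_number"]])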

Hope you enjoyed the post. Thanks, and happy coding!

The full code and details are available in my GitHub repo:

https://github.com/anands282/webscrapper
