Web Scraping Practice: Michelin Taipei (1)

Emily Chen
Jan 16, 2020

In most situations, we are unlikely to get a clean CSV or Excel file at the start of the data collection process. Web scraping is therefore an essential skill for companies that want to retrieve data automatically.

In this practice, I’ll demonstrate how to scrape basic data from the Michelin website. To keep things simple, I focus only on restaurants in Taipei. The website lists 20 restaurants per page.

Restaurants 1–20 listed on the first page

The first 20 restaurants are listed on the first page. The homepage URL (https://guide.michelin.com/tw/zh_TW/taipei-region/taipei/restaurants?lat=24.1506212&lon=120.6433008) and the first-page URL (https://guide.michelin.com/tw/zh_TW/taipei-region/taipei/restaurants/page/1?lat=24.1506212&lon=120.6433008) therefore show the same content.

To see restaurants 21–40, we simply change the page number in the URL to 2 (https://guide.michelin.com/tw/zh_TW/taipei-region/taipei/restaurants/page/2?lat=24.1506212&lon=120.6433008).
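The page-number pattern can be sketched in Python as below. The base URL and query string are copied from the URLs above; the helper names `page_url` and `restaurant_range` are my own and purely illustrative.

```python
# Base listing URL and query string, copied from the example URLs in this post.
BASE_URL = "https://guide.michelin.com/tw/zh_TW/taipei-region/taipei/restaurants"
QUERY = "lat=24.1506212&lon=120.6433008"


def page_url(page):
    """Return the listing URL for a 1-based page number (hypothetical helper)."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return f"{BASE_URL}/page/{page}?{QUERY}"


def restaurant_range(page, per_page=20):
    """Return the (first, last) restaurant positions shown on a given page."""
    first = (page - 1) * per_page + 1
    return first, first + per_page - 1


if __name__ == "__main__":
    for page in (1, 2, 3):
        print(restaurant_range(page), page_url(page))
```

Looping over page numbers this way gives one URL per batch of 20 restaurants, which is all we need to walk through the whole listing later.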

Before scraping, we should check and follow the website's robots.txt file. According to ScrapeHero, robots.txt lays out rules for good behavior, such as how frequently you can scrape and which pages you may or may not scrape. In Michelin's robots.txt file (https://guide.michelin.com/robots.txt), we can see a couple of paths that we are not allowed to access.

Michelin robots.txt file
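Instead of reading the file by eye, the same check can be done with Python's standard `urllib.robotparser`. This is only a rough sketch: the rules in `SAMPLE_RULES` are made up for illustration and are not the contents of Michelin's file, and the user-agent name `my-scraper` and the `example.com` URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only; these are NOT the contents of Michelin's robots.txt.
SAMPLE_RULES = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
""".splitlines()


def build_parser(lines):
    """Parse robots.txt rules that are already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(lines)
    return parser


parser = build_parser(SAMPLE_RULES)

# "my-scraper" is a placeholder name; it falls under the "User-agent: *" rules.
print(parser.can_fetch("my-scraper", "https://example.com/restaurants/page/1"))
print(parser.can_fetch("my-scraper", "https://example.com/private/data"))
print(parser.crawl_delay("my-scraper"))

# For the live file, one would instead call:
#   parser = RobotFileParser("https://guide.michelin.com/robots.txt")
#   parser.read()
```

Calling `can_fetch` before each request is a cheap way to make sure the scraper stays within the rules the site publishes.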

There are also a few ways to avoid being blocked by the website's administrator. In this exercise, I chose to spoof the user-agent, one of the methods listed on the ScrapeHero website.

Setting the user-agent to Googlebot, Google's web crawler
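Setting the header might look roughly like the sketch below, which uses only the standard library. The exact string in `USER_AGENT` is an assumption on my part (it is the commonly published Googlebot identifier), and `build_request` and `fetch_html` are hypothetical helper names.

```python
import urllib.request

# Assumed value: the commonly published Googlebot user-agent string.
USER_AGENT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def build_request(url, user_agent=USER_AGENT):
    """Create a request object that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, user_agent=USER_AGENT, timeout=10):
    """Download a page and return its HTML as text (performs a network call)."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    request = build_request("https://guide.michelin.com/robots.txt")
    # urllib stores header names capitalized as "User-agent".
    print(request.get_header("User-agent"))
```

Keeping request construction separate from the download makes it easy to inspect the headers without hitting the website.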

The basic rule of scraping is to find the HTML tag that holds the data we want. For example, to find out which class the first restaurant's listing falls under, we can hover over the restaurant, right-click it, and choose Inspect (Ctrl+Shift+I); the browser's developer tools then highlight the corresponding HTML tag.
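Once the class is known, pulling out every element with that class could be sketched as follows. The HTML in `SAMPLE_HTML` and the class name `restaurant-card` are invented for illustration and are not Michelin's actual markup. Libraries such as BeautifulSoup are often used for this step; the sketch sticks to the standard library's `html.parser`.

```python
from html.parser import HTMLParser

# Made-up markup for illustration; the real page uses different tags and classes.
SAMPLE_HTML = """
<div class="restaurant-card"><h3>Restaurant A</h3></div>
<div class="restaurant-card"><h3>Restaurant B</h3></div>
<div class="footer">About</div>
"""

# Tags without a closing tag; they must not affect the nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ClassTextCollector(HTMLParser):
    """Collect the text inside every element that has a given CSS class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0      # nesting depth inside a matching element
        self.results = []   # one text entry per matching element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1
        elif self.target_class in classes:
            self.depth = 1
            self.results.append("")

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


def texts_by_class(html, target_class):
    """Return the stripped text of each element carrying target_class."""
    collector = ClassTextCollector(target_class)
    collector.feed(html)
    collector.close()
    return [text.strip() for text in collector.results]


if __name__ == "__main__":
    print(texts_by_class(SAMPLE_HTML, "restaurant-card"))
```

The class name found through Inspect is what goes into `target_class`; everything else stays the same.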

In the next part, let’s start retrieving data from the HTML tags.