Web Scraping Practice - Michelin Taipei (2)

Emily Chen
Jan 16, 2020

First of all, we’d like to retrieve data from the first store. To find the right HTML tag, we first need to figure out which block contains the stores. Inspecting the page, we can see that all 20 stores sit under the div with class “col-lg-12 search-results__column”, and each individual store uses the same class name “col-md-6 col-lg-6 col-xl-3”. In other words, you can think of the stores as items in a list: to get data from any particular store, just specify its index.
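For example, here is a minimal sketch of locating the store blocks, using the same listing URL as the main code at the end of this post:

import requests
from bs4 import BeautifulSoup

url = 'https://guide.michelin.com/tw/zh_TW/taipei-region/taipei/restaurants/page/2?lat=24.1506212&lon=120.6433008'
html = requests.get(url).text
soup = BeautifulSoup(html, 'html.parser')

#the outer block that holds all 20 stores on the page
rest = soup.find(class_="col-lg-12 search-results__column")
#each store card shares the same class name
stores = rest.find_all(class_="col-md-6 col-lg-6 col-xl-3")

print(len(stores))  #20 stores per page
print(stores[0])    #index 0 is the first store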

However, we only want certain pieces of data, such as the store name, store category, and picture URL, so we need a little more code to pull those fields out, as sketched below.
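Continuing from the sketch above, the category and link selectors come straight from the full code later in this post; the image lookup, however, is my assumption about the card markup, so check the attribute name against the actual page:

#category sits under the ".card__menu-footer--price" class
category = stores[0].select(".card__menu-footer--price")[0].text.strip()

#the <a class="link"> tag carries both the store URL and its name
link = stores[0].find("a", class_="link")
store_url = 'https://guide.michelin.com' + link['href']
store_name = link['aria-label'][5:]  #drop the leading label text

#picture URL: assuming the card holds an <img> tag whose address is
#in 'data-src' or 'src' (this is a guess about the markup)
img = stores[0].find("img")
pic_url = (img.get("data-src") or img.get("src")) if img else None

print(store_name, category, store_url, pic_url)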

We can also collect the data we received into a data frame. The same approach works for the picture URL, store URL, and store name.
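A minimal sketch, looping over the store cards collected above and using the same list names as the final code:

import pandas as pd

storeList = []
siteList = []
for store in stores:
    link = store.find("a", class_="link")
    storeList.append(link['aria-label'][5:])
    siteList.append('https://guide.michelin.com' + link['href'])

table = pd.DataFrame({"store": storeList, "url": siteList})
print(table)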

Finally, let’s combine all the code we tried above into functions.

#show detail information in each store
def showsite(siteurl):
    html = requests.get(siteurl, headers=headers).text
    soup = BeautifulSoup(html, 'html.parser')
    #category is under the ".card__menu-footer--price" class
    kind = soup.select(".card__menu-footer--price")[0].text.strip()  #strip leading/trailing spaces
    print("category:", kind)
    print("url:", siteurl)

#get page url function
def getpageurl(page, url):
    global n
    html = requests.get(url, headers=headers).text
    soup = BeautifulSoup(html, 'html.parser')
    rest = soup.find(class_="col-lg-12 search-results__column")
    items = rest.find_all(class_="col-md-6 col-lg-6 col-xl-3")
    #print out the current page and the total stores on this page
    print("This is page " + str(page) + ", total " + str(len(items)) + " stores")
    link = soup.find_all("a", {"class": "link"})
    leng = len(link)
    urlList = []
    storeList = []
    siteList = []
    for x in range(leng):
        n += 1
        print("n=", n)
        urlList.append(link[x])
        itemurl = urlList[x]['href']
        siteurl = rooturl + itemurl
        showsite(siteurl)
        siteList.append(siteurl)
        name = urlList[x]['aria-label'][5:]
        print("Store:" + name)
        storeList.append(name)
    pd.set_option('display.max_colwidth', None)  #-1 is deprecated in newer pandas
    table = pd.DataFrame({
        "store": storeList,
        "url": siteList
    })
    print(table)
    #if you want to export a csv, use the code below
    #table.to_csv(r'C:\Users\user\Desktop\Jupyter\michelin_webscrap.csv', encoding='utf_8_sig')

#main code
import requests
import pandas as pd
from bs4 import BeautifulSoup

#set up user-agent headers (passed to every request above)
headers = requests.utils.default_headers()
headers.update({'User-Agent': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'})

n = 0  #count items
homeurl = 'https://guide.michelin.com/tw/zh_TW/taipei-region/taipei/restaurants/page/2?lat=24.1506212&lon=120.6433008'
rooturl = 'https://guide.michelin.com'

#test with the second page
getpageurl(2, homeurl)
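If you want to walk through every result page rather than just page 2, one possible extension is a simple loop. This assumes the guide keeps the same /page/N URL pattern, and the page count here is arbitrary:

#loop over the first few result pages; adjust the range to however
#many pages the guide actually shows
for page in range(1, 4):
    pageurl = rooturl + '/tw/zh_TW/taipei-region/taipei/restaurants/page/' + str(page)
    getpageurl(page, pageurl)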

This is a basic web-scraping practice, and it still has plenty of room for improvement; I’ll keep working on it. Please feel free to share any ideas, opinions, or suggestions.
