Web Scraping Practice - Michelin Taipei (2)

Emily Chen
Jan 16, 2020

First of all, we’d like to retrieve data from the first store. To find the right HTML tag, we first need to figure out which block contains the stores. Inspecting the page, we can see that all 20 stores sit under the div with class “col-lg-12 search-results__column”, and each individual store uses the same class name “col-md-6 col-lg-6 col-xl-3”. In other words, you can think of the stores as items in a list: to get data from any particular store, just specify its index.
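For example, here is a minimal sketch of locating the store blocks, using the same listing URL as the main code at the end of this post:

import requests
from bs4 import BeautifulSoup

url = 'https://guide.michelin.com/tw/zh_TW/taipei-region/taipei/restaurants/page/2?lat=24.1506212&lon=120.6433008'
html = requests.get(url).text
soup = BeautifulSoup(html, 'html.parser')

#the outer block that holds all 20 stores on the page
rest = soup.find(class_="col-lg-12 search-results__column")
#each store card shares the same class name
stores = rest.find_all(class_="col-md-6 col-lg-6 col-xl-3")

print(len(stores))  #20 stores per page
print(stores[0])    #index 0 is the first store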

However, we only want certain pieces of data, such as the store name, store category, and picture URL, so we need a little more code to pull those fields out, as sketched below.
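Continuing from the sketch above, the category and link selectors come straight from the full code later in this post; the image lookup, however, is my assumption about the card markup, so check the attribute name against the actual page:

#category sits under the ".card__menu-footer--price" class
category = stores[0].select(".card__menu-footer--price")[0].text.strip()

#the <a class="link"> tag carries both the store URL and its name
link = stores[0].find("a", class_="link")
store_url = 'https://guide.michelin.com' + link['href']
store_name = link['aria-label'][5:]  #drop the leading label text

#picture URL: assuming the card holds an <img> tag whose address is
#in 'data-src' or 'src' (this is a guess about the markup)
img = stores[0].find("img")
pic_url = (img.get("data-src") or img.get("src")) if img else None

print(store_name, category, store_url, pic_url)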

We can also collect the data we received into a data frame. The same approach works for the picture URL, store URL, and store name.
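A minimal sketch, looping over the store cards collected above and using the same list names as the final code:

import pandas as pd

storeList = []
siteList = []
for store in stores:
    link = store.find("a", class_="link")
    storeList.append(link['aria-label'][5:])
    siteList.append('https://guide.michelin.com' + link['href'])

table = pd.DataFrame({"store": storeList, "url": siteList})
print(table)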

Finally, let’s combine all the code we tried above into functions.

#show detail information in each store
def showsite(siteurl):
    html = requests.get(siteurl, headers=headers).text
    soup = BeautifulSoup(html, 'html.parser')
    #category is under the ".card__menu-footer--price" class
    kind = soup.select(".card__menu-footer--price")[0].text.strip()  #strip leading/trailing spaces
    print("category:", kind)
    print("url:", siteurl)

#get page url function
def getpageurl(page, url):
    global n
    html = requests.get(url, headers=headers).text
    soup = BeautifulSoup(html, 'html.parser')
    rest = soup.find(class_="col-lg-12 search-results__column")
    items = rest.find_all(class_="col-md-6 col-lg-6 col-xl-3")
    #print out the current page and the total stores on this page
    print("This is page " + str(page) + ", total " + str(len(items)) + " stores")
    link = soup.find_all("a", {"class": "link"})
    leng = len(link)
    urlList = []
    storeList = []
    siteList = []
    for x in range(leng):
        n += 1
        print("n=", n)
        urlList.append(link[x])
        itemurl = urlList[x]['href']
        siteurl = rooturl + itemurl
        showsite(siteurl)
        siteList.append(siteurl)
        name = urlList[x]['aria-label'][5:]
        print("Store:" + name)
        storeList.append(name)
    pd.set_option('display.max_colwidth', None)  #-1 is deprecated in newer pandas
    table = pd.DataFrame({
        "store": storeList,
        "url": siteList
    })
    print(table)
    #if you want to export a csv, use the code below
    #table.to_csv(r'C:\Users\user\Desktop\Jupyter\michelin_webscrap.csv', encoding='utf_8_sig')

#main code
import requests
import pandas as pd
from bs4 import BeautifulSoup

#set up user-agent headers (passed to every request above)
headers = requests.utils.default_headers()
headers.update({'User-Agent': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'})

n = 0  #count items
homeurl = 'https://guide.michelin.com/tw/zh_TW/taipei-region/taipei/restaurants/page/2?lat=24.1506212&lon=120.6433008'
rooturl = 'https://guide.michelin.com'

#test with the second page
getpageurl(2, homeurl)
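If you want to walk through every result page rather than just page 2, one possible extension is a simple loop. This assumes the guide keeps the same /page/N URL pattern, and the page count here is arbitrary:

#loop over the first few result pages; adjust the range to however
#many pages the guide actually shows
for page in range(1, 4):
    pageurl = rooturl + '/tw/zh_TW/taipei-region/taipei/restaurants/page/' + str(page)
    getpageurl(page, pageurl)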

This is a basic web-scraping practice, and it still has plenty of room for improvement; I’ll keep working on it. Please feel free to share any ideas, opinions, or suggestions.
