Web Scraping using Selenium and BeautifulSoup

Yue Weng Mak · Published in Analytics Vidhya · Apr 7, 2020 · 6 min read

As a data scientist, gathering data is part of the job.

There are a few ways to obtain data. If you can find a dataset on Kaggle, great! Sometimes, however, the data you need is not available as a ready-made dataset. It only exists on a website, and you want to extract it directly from the page. Introducing Web Scraping!

Web Scraping is the process of extracting content from an HTML page. The tool we are using is Selenium, which opens a browser and can simulate JavaScript-driven behavior such as click events and lazy loading.

Setup

  1. Install Selenium

pip3 install selenium

or, if you use Anaconda (e.g. in a Jupyter Notebook):

conda install selenium

2. Download Web Driver

Depending on which browser you would like to use, download the appropriate web driver. For this tutorial I will be using Chrome, so I downloaded the Chrome webdriver (chromedriver).

3. Import libraries from Selenium

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait as wait
from selenium.common.exceptions import TimeoutException

4. Setup driver to open browser

In this example, I will be scraping from Vivino, a popular wine reviews site.

url = 'https://www.vivino.com/explore?e=eJwdyjkKgDAURdHdvFKMQ_k6dyBWIvKNMQSMShIcdq_Y3NNcH1hmNbzbqHJ4uVnl0A-7FvpLg4MKduEpwZkkK_aJQZLbbBzlNEGswc7ZRI0r9cM3_xSI1PICdrAezQ=='
path = r'path of where the driver is located'
driver = webdriver.Chrome(executable_path = path)
driver.get(url)

Running this will open a browser using the web driver. Now you can scrape the data by looking at the HTML on the page.

BeautifulSoup

BeautifulSoup is a Python library that parses HTML data.
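Before applying it to the Vivino page, here is a tiny self-contained example of how find and find_all behave (using the built-in html.parser here; the article's code uses lxml, which needs a separate install):

```python
from bs4 import BeautifulSoup

snippet = """
<div class="card"><span class="title">Pinot Noir</span></div>
<div class="card"><span class="title">Syrah</span></div>
"""

soup = BeautifulSoup(snippet, 'html.parser')

# find returns the first match; find_all returns a list of every match
first = soup.find("span", {"class": "title"})
cards = soup.find_all("div", {"class": "card"})

print(first.text)   # Pinot Noir
print(len(cards))   # 2
```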

import copy
from bs4 import BeautifulSoup
html = BeautifulSoup(driver.page_source, 'lxml')
div = html.find("div", {"class": "explorerPage__results--3wqLw"})
rows = html.find_all("div", {"class": "explorerCard__explorerCard--3Q7_0"})
all_rows = []
# Let's store each row as a dictionary
empty_row = {
    "title": None, "location": None, "price": 0.0, "type": None,
    "ratings": None, "num_ratings": None, "reviews": None, "url": None
}
for row in rows:
    new_row = copy.copy(empty_row)
    # Fill in the entries for this row.
    new_row['title'] = row.find("span", {"class": "vintageTitle__wine--U7t9G"}).text
    location = row.find("div", {"class": "vintageLocation__vintageLocation--1DF0p"})
    new_row['location'] = location.findChildren()[-1].text
    price_button = row.find("button", {"class": "addToCartButton__addToCartButton--qZv9F"})
    if price_button:
        new_row['price'] = float(price_button.find("span").text.replace("$", ""))
    new_row['type'] = 'Rosé'  # wine type is hard-coded for this example page
    new_row['ratings'] = row.find("div", {"class": "vivinoRatingWide__averageValue--1zL_5"}).text
    new_row['num_ratings'] = int(row.find("div", {"class": "vivinoRatingWide__basedOn--s6y0t"}).text.split()[0])
    review_div = row.find("div", {"class": "review__note--2b2DB"})
    if review_div:
        new_row['reviews'] = review_div.text
    clean_div = row.find("div", {"class": "cleanWineCard__cleanWineCard--tzKxV cleanWineCard__row--CBPRR"})
    if clean_div:
        new_row['url'] = 'https://www.vivino.com' + clean_div.find("a")['href']
    all_rows.append(new_row)

Here is a subset of the results:

[{'title': 'Estate Pinot Noir 2016',
'location': 'Willamette Valley',
'price': 38.95,
'type': 'Rosé',
'ratings': '4.6',
'num_ratings': 37,
'reviews': None,
'url': 'https://www.vivino.com/shea-wine-cellars-estate-pinot-noir/w/12513?year=2016&price_id=20572438'},
{'title': 'Petite Sirah 2016',
'location': 'Paso Robles',
'price': 39.95,
'type': 'Rosé',
'ratings': '4.6',
'num_ratings': 25,
'reviews': '“So smooth and a little brighter than I was expecting, but nonetheless wonderful. Not overpowering and pretty much made for food.”',
'url': 'https://www.vivino.com/aaron-petite-sirah/w/1207985?year=2016&price_id=20183451'},
{'title': 'Las Alturas Vineyard Pinot Noir 2014',
'location': 'Santa Lucia Highlands',
'price': 36.99,
'type': 'Rosé',
'ratings': '4.5',
'num_ratings': 4516,
'reviews': '“Very very good. Almost like an Orrin swift take on Pinot. Intense sweet fruit. Quite dark for a Pinot and absolutely delicious ”',
'url': 'https://www.vivino.com/belle-glos-las-alturas-vineyard-pinot-noir/w/15239?year=2014&price_id=20092711'},
{'title': 'Clio 2017',
'location': 'Jumilla',
'price': 36.85,
'type': 'Rosé',
'ratings': '4.5',
'num_ratings': 569,
'reviews': '“A savory Spanish blend of 70% Monastrell and 30% Cab that’s spectacular!\nIntoxicating nose of blueberry, forest floor, plum and menthol. Flavor packed mouth of ripe berries, cherry pie, black licorice, and juicy plum.\nSmooth tannins for a youngster with a delicate brambly finish after a 30 minute decant.\nPairs great with Manchego cheese.\nA 4.3 rating from me but this will be special for years to come.\nEs el mejor mis amigos!!!”',
'url': 'https://www.vivino.com/el-nido-clio/w/1219218?year=2017&price_id=20240187'},
{'title': 'Cabernet Sauvignon 2018',
'location': 'Paso Robles',
'price': 39.89,
'type': 'Rosé',
'ratings': '4.5',
'num_ratings': 539,
'reviews': None,
'url': 'https://www.vivino.com/austin-hope-cabernet-sauvignon-paso-robles/w/5866389?year=2018&price_id=20557366'},
{'title': 'Las Alturas Vineyard Pinot Noir 2018',
'location': 'Santa Lucia Highlands',
'price': 34.99,
'type': 'Rosé',
'ratings': '4.5',
'num_ratings': 444,
'reviews': None,
'url': 'https://www.vivino.com/belle-glos-las-alturas-vineyard-pinot-noir/w/15239?year=2018&price_id=20711583'},
{'title': 'Dairyman Vineyard Pinot Noir 2018',
'location': 'Russian River Valley',
'price': 34.99,
'type': 'Rosé',
'ratings': '4.5',
'num_ratings': 203,
'reviews': None,
'url': 'https://www.vivino.com/belle-glos-dairyman-vineyard-pinot-noir/w/1561411?year=2018&price_id=20468469'},
{'title': "Zinfandel (Michael's Estate Vineyard) 2017",
'location': 'Paso Robles',
'price': 29.99,
'type': 'Rosé',
'ratings': '4.5',
'num_ratings': 39,
'reviews': None,
'url': 'https://www.vivino.com/adelaida-cellars-zinfandel-michael-s-estate-vineyard/w/2601471?year=2017&price_id=20646768'}]

Let’s try to break it down:

import copy
from bs4 import BeautifulSoup
  • Import the copy module (used later to copy the row template) and BeautifulSoup.
html = BeautifulSoup(driver.page_source, 'lxml')
  • Parse the page source into a BeautifulSoup object so we can search it.
div = html.find("div", {"class": "explorerPage__results--3wqLw"})
rows = html.find_all("div", {"class": "explorerCard__explorerCard--3Q7_0"})
  • Grab the relevant divs from the parsed HTML and save them into variables.
rows = html.find_all("div", {"class": "explorerCard__explorerCard--3Q7_0"})

In this snippet, I am looking for all divs with the class `explorerCard__explorerCard--3Q7_0` and saving them into the variable rows. find_all returns a list of every div matching that class.

all_rows = []
# Let's store each row as a dictionary
empty_row = {
    "title": None, "location": None, "price": 0.0, "type": None,
    "ratings": None, "num_ratings": None, "reviews": None, "url": None
}
  • Initialize an empty list all_rows and a template dictionary empty_row, whose string keys map to default values.
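The template-and-copy pattern matters: if we appended empty_row itself (or mutated it directly), every row would end up sharing one dictionary. A small illustration:

```python
import copy

template = {"title": None, "price": 0.0}

# Shallow-copying the template gives each row its own dictionary
row_a = copy.copy(template)
row_a["title"] = "Estate Pinot Noir 2016"

row_b = copy.copy(template)

print(row_a["title"])  # Estate Pinot Noir 2016
print(row_b["title"])  # None (row_b is unaffected by changes to row_a)
print(template)        # {'title': None, 'price': 0.0} (template is untouched)
```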
for row in rows:
    new_row = copy.copy(empty_row)
    # Fill in the entries for this row.
    new_row['title'] = row.find("span", {"class": "vintageTitle__wine--U7t9G"}).text
    location = row.find("div", {"class": "vintageLocation__vintageLocation--1DF0p"})
    new_row['location'] = location.findChildren()[-1].text
    price_button = row.find("button", {"class": "addToCartButton__addToCartButton--qZv9F"})
    if price_button:
        new_row['price'] = float(price_button.find("span").text.replace("$", ""))
    new_row['type'] = 'Rosé'  # wine type is hard-coded for this example page
    new_row['ratings'] = row.find("div", {"class": "vivinoRatingWide__averageValue--1zL_5"}).text
    new_row['num_ratings'] = int(row.find("div", {"class": "vivinoRatingWide__basedOn--s6y0t"}).text.split()[0])
    review_div = row.find("div", {"class": "review__note--2b2DB"})
    if review_div:
        new_row['reviews'] = review_div.text
    clean_div = row.find("div", {"class": "cleanWineCard__cleanWineCard--tzKxV cleanWineCard__row--CBPRR"})
    if clean_div:
        new_row['url'] = 'https://www.vivino.com' + clean_div.find("a")['href']
    all_rows.append(new_row)

In this snippet, we iterate through the rows list. For each row, we make a copy of the template dictionary, then look up each relevant div, grab its text, and store it under the corresponding key. Finally, we append the filled-in dictionary to our main list all_rows.
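One caveat: find returns None when nothing matches, so chaining .text straight onto it raises an AttributeError the moment Vivino renames a class. A small helper makes the extraction more forgiving (safe_text is my own hypothetical name, not part of BeautifulSoup):

```python
from bs4 import BeautifulSoup

def safe_text(node, tag, attrs):
    """Return the stripped text of the first match, or None if absent."""
    found = node.find(tag, attrs) if node else None
    return found.text.strip() if found else None

card = BeautifulSoup(
    '<div><span class="vintageTitle__wine--U7t9G">Clio 2017</span></div>',
    'html.parser')

print(safe_text(card, "span", {"class": "vintageTitle__wine--U7t9G"}))  # Clio 2017
print(safe_text(card, "div", {"class": "no-such-class"}))               # None
```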

Here is the code combined:

# Install Selenium first: pip3 install selenium (or conda install selenium)
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait as wait
from selenium.common.exceptions import TimeoutException
import copy
from bs4 import BeautifulSoup

url = 'https://www.vivino.com/explore?e=eJwdyjkKgDAURdHdvFKMQ_k6dyBWIvKNMQSMShIcdq_Y3NNcH1hmNbzbqHJ4uVnl0A-7FvpLg4MKduEpwZkkK_aJQZLbbBzlNEGswc7ZRI0r9cM3_xSI1PICdrAezQ=='
path = r'/Users/#{user}/Desktop/chromedriver'  # where the driver is located

driver = webdriver.Chrome(executable_path = path)
driver.get(url)

html = BeautifulSoup(driver.page_source, 'lxml')
div = html.find("div", {"class": "explorerPage__results--3wqLw"})
rows = html.find_all("div", {"class": "explorerCard__explorerCard--3Q7_0"})

all_rows = []
# Let's store each row as a dictionary
empty_row = {
    "title": None, "location": None, "price": 0.0, "type": None,
    "ratings": None, "num_ratings": None, "reviews": None, "url": None
}
for row in rows:
    new_row = copy.copy(empty_row)
    # Fill in the entries for this row.
    new_row['title'] = row.find("span", {"class": "vintageTitle__wine--U7t9G"}).text
    location = row.find("div", {"class": "vintageLocation__vintageLocation--1DF0p"})
    new_row['location'] = location.findChildren()[-1].text
    price_button = row.find("button", {"class": "addToCartButton__addToCartButton--qZv9F"})
    if price_button:
        new_row['price'] = float(price_button.find("span").text.replace("$", ""))
    new_row['type'] = 'Rosé'  # wine type is hard-coded for this example page
    new_row['ratings'] = row.find("div", {"class": "vivinoRatingWide__averageValue--1zL_5"}).text
    new_row['num_ratings'] = int(row.find("div", {"class": "vivinoRatingWide__basedOn--s6y0t"}).text.split()[0])
    review_div = row.find("div", {"class": "review__note--2b2DB"})
    if review_div:
        new_row['reviews'] = review_div.text
    clean_div = row.find("div", {"class": "cleanWineCard__cleanWineCard--tzKxV cleanWineCard__row--CBPRR"})
    if clean_div:
        new_row['url'] = 'https://www.vivino.com' + clean_div.find("a")['href']
    all_rows.append(new_row)

Note:

Most sites are wary of bots scraping their content, so be careful when using Selenium. If you are not careful, your IP address may get banned.
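One simple courtesy that also reduces the risk of a ban is to pause between page requests instead of hammering the server. A minimal throttling sketch (the delay bounds below are arbitrary placeholders; tune them for the site you are scraping):

```python
import random
import time

def polite_pause(min_s=2.0, max_s=5.0):
    """Sleep for a random interval so requests don't arrive at a fixed, bot-like rate."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay

# Between driver.get(...) calls you would do something like:
# for url in urls:
#     driver.get(url)
#     polite_pause()

d = polite_pause(0.01, 0.02)  # tiny bounds just to demonstrate
print(round(d, 3))
```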

There you go! You are able to scrape a site using Selenium!
