Introduction to Web Scraping using Selenium

In this tutorial you’ll learn how to scrape websites with Selenium and ChromeDriver.

Roger Taracha
Sep 4, 2017 · 6 min read

What is Web Scraping?

Use Cases of Web Scraping:

What is Selenium?

Point To Note

What will we build?

What will we require?

Project Setup

Screenshot of project folder structure.
$ virtualenv webscraping_example
$ source webscraping_example/bin/activate
(webscraping_example) $ pip install selenium

Import Required Modules

from selenium import webdriver 
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

Create new instance of Chrome in Incognito mode

option = webdriver.ChromeOptions()
option.add_argument("--incognito")
browser = webdriver.Chrome(executable_path='/Library/Application Support/Google/chromedriver', chrome_options=option)

Make The Request

browser.get(“https://github.com/TheDancerCodes")
# Wait 20 seconds for page to load
timeout = 20
try:
WebDriverWait(browser, timeout).until(EC.visibility_of_element_located((By.XPATH, “//img[@class=’avatar width-full rounded-2']”)))
except TimeoutException:
print(“Timed out waiting for page to load”)
browser.quit()
We pass in the <img> tag and its class, as an XPath expression, to the WebDriverWait() call in the code snippet above.
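For illustration, here is how that XPath string is composed from the tag name and class attribute. The variables `tag` and `css_class` are illustrative names for this sketch, not part of the Selenium API:

```python
# Build the XPath used in the wait above from its two parts.
tag = "img"
css_class = "avatar width-full rounded-2"

xpath = "//{}[@class='{}']".format(tag, css_class)
print(xpath)  # //img[@class='avatar width-full rounded-2']
```

The same pattern produces the `<a>` and `<p>` selectors used in the next section.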

Get The Response

# find_elements_by_xpath returns an array of selenium objects.
titles_element = browser.find_elements_by_xpath("//a[@class='text-bold']")

# Use a list comprehension to get the actual repo titles and not the selenium objects.
titles = [x.text for x in titles_element]

# Print out all the titles.
print('titles:')
print(titles, '\n')
We pass in the <a> tag and its class to the find_elements_by_xpath() function in the code snippet above.
language_element = browser.find_elements_by_xpath("//p[@class='mb-0 f6 text-gray']")

# Same concept as the list comprehension above.
languages = [x.text for x in language_element]

print("languages:")
print(languages, '\n')
We pass in the <p> tag and its class to the find_elements_by_xpath() function in the code snippet above.
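The `.text` extraction pattern can be sketched without a live browser. `FakeElement` below is a stand-in for the WebElement objects Selenium returns, not a Selenium class:

```python
# Stand-in for the WebElement objects that find_elements_by_xpath returns.
class FakeElement:
    def __init__(self, text):
        self.text = text

language_element = [FakeElement("Java"), FakeElement("Python")]

# Same list comprehension as above: pull the text out of each element.
languages = [x.text for x in language_element]
print(languages)  # ['Java', 'Python']
```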

Combine the responses using the zip function

for title, language in zip(titles, languages):
    print("RepoName : Language")
    print(title + ": " + language, '\n')

Run the program

A new Chrome window opens with the banner "Chrome is being controlled by automated test software," and the program prints the scraped results:
TITLES:
['Github-Api-Challenge', 'python-unit-tests-tutorial', 'KenyawebApp', 'filamu-app']
LANGUAGES:
['Java', 'Python 1 1', 'Java', 'Java']
RepoName : Language
Github-Api-Challenge: Java
RepoName : Language
python-unit-tests-tutorial: Python 1 1
RepoName : Language
KenyawebApp: Java
RepoName : Language
filamu-app: Java

You now have the power to scrape! 💪


The Andela Way

A pool of thoughts from the brilliant people at Andela