Web crawling by using Selenium + Python 3

PJ Wang
Published in CS Note
5 min read · Nov 14, 2017

What is a web crawler?

A Web crawler, sometimes called a spider, is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing (web spidering).

Web search engines and some other sites use Web crawling or spidering software to update their own web content or their indices of other sites' web content. Web crawlers copy pages for processing by a search engine, which indexes the downloaded pages so that users can search more efficiently.

What is Selenium?

Selenium automates browsers. That’s it! What you do with that power is entirely up to you. Primarily, it is for automating web applications for testing purposes, but is certainly not limited to just that. Boring web-based administration tasks can (and should!) be automated as well.

Selenium has the support of some of the largest browser vendors who have taken (or are taking) steps to make Selenium a native part of their browser. It is also the core technology in countless other browser automation tools, APIs and frameworks.

Website: http://www.seleniumhq.org/

Install

Step 1: Software

Mozilla Firefox 54: Download (Selenium does not work on version 56)

Selenium IDE: Download (use the Firefox browser)

Step 2: Environment

Python

Python 2.7 or 3 : Download

Selenium

Python 2.7:

$ pip install selenium

Python 3:

$ pip3 install selenium

Jupyter notebook

Python 2.7:

$ pip install jupyter

Python 3:

$ pip3 install jupyter

Tutorial

Suppose we want to crawl NBA game data from the following website.

Step 1: Inspect element

1–1: Right click and choose Inspect Element

1–2: Perform the following action

1–3: Select the element you want to crawl.

In the end, the HTML elements we want to crawl are the ones with the class names time-stamp, game-details, and combined-score.

Step 2: Setup environment

2–1: Import selenium

First, open Selenium IDE and export the test case as Python 2.

Then copy the exported code into Jupyter.

Open the Jupyter notebook:

$ jupyter notebook

Create a new notebook and paste the copied code.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import NoAlertPresentException
import unittest, time, re

2–2: Setup driver

driver = webdriver.Firefox()

If this raises an error, point Selenium at the geckodriver executable explicitly:

driver = webdriver.Firefox(executable_path='./geckodriver.exe')

geckodriver.exe: Download

2–3: Load the page in the driver

driver.get("http://www.espn.com/nba/playbyplay?gameId=400974939")
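The executable_path keyword used above matches the Selenium 3 API. As a rough sketch, assuming Selenium 4 or later, where the driver path is passed through a Service object, the setup could look like the following. The function name open_page and its parameters are illustrative and not part of Selenium.

```python
def open_page(url, driver_path=None):
    """Start Firefox, load `url`, and return the driver.

    Assumes Selenium 4+. `driver_path` is optional; when it is omitted,
    Selenium tries to locate geckodriver by itself (for example on PATH).
    """
    # Imported inside the function so the sketch can be pasted into a
    # notebook cell and defined before a browser is needed.
    from selenium import webdriver
    from selenium.webdriver.firefox.service import Service

    service = Service(executable_path=driver_path) if driver_path else Service()
    driver = webdriver.Firefox(service=service)
    driver.get(url)
    return driver
```

A typical call would be driver = open_page("http://www.espn.com/nba/playbyplay?gameId=400974939"), followed by driver.quit() once the page source has been read, so that the browser process does not stay open.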

Step 3: Select element

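We know that we want the elements with the classes time-stamp, game-details, and combined-score, so we select them with a regular expression.

# load the complete HTML source
result = driver.page_source
# regular expression capturing the contents of each time-stamp cell
re1 = re.compile('<td class="time-stamp">(.+?)</td>')
# search the HTML source (a hyphen is not allowed in a Python name, so we use time_stamp)
time_stamp = re1.findall(result)

Then print the result with print(time_stamp).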
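The same approach extends to the other two classes. The sketch below runs against a small hand-written HTML fragment, which is an assumption standing in for driver.page_source (the real ESPN markup may differ), so it can be tried without a browser. The helper name extract_cells is hypothetical.

```python
import re

# Hypothetical fragment standing in for driver.page_source.
sample_html = """
<tr><td class="time-stamp">12:00</td>
<td class="game-details">Jump ball won by the home team</td>
<td class="combined-score">0 - 0</td></tr>
<tr><td class="time-stamp">11:42</td>
<td class="game-details">Two point shot made</td>
<td class="combined-score">0 - 2</td></tr>
"""


def extract_cells(html, class_name):
    """Return the inner text of every <td> whose class is exactly class_name."""
    pattern = re.compile('<td class="' + re.escape(class_name) + '">(.+?)</td>')
    return pattern.findall(html)


time_stamps = extract_cells(sample_html, "time-stamp")
details = extract_cells(sample_html, "game-details")
scores = extract_cells(sample_html, "combined-score")

# Combine the three columns row by row.
for row in zip(time_stamps, details, scores):
    print(row)
```

Regular expressions are enough for flat, predictable cells like these; for nested or irregular markup, the Selenium selectors in the appendix are usually the safer choice.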

Appendix

Selenium package

1. Selector

1–1: Select element by id, name, tag, and class name

HTML:

<div id = "content" class="h3" name="contentBox"> ... </div>

Python:

element = driver.find_element_by_id("content")
element = driver.find_element_by_tag_name("div")
element = driver.find_element_by_name("contentBox")
element = driver.find_element_by_class_name("h3")

To select multiple elements, change find_element to find_elements in the method name (for example, find_elements_by_class_name); the call then returns a list.

1–2: Select element by other attributes

Select element by link text or partial link text

HTML:

<a href="http://www.google.com/">Click</a>

Python:

element = driver.find_element_by_link_text("Click")
element = driver.find_element_by_partial_link_text("Cli")

1–3: Select element by CSS

HTML:

<div id="content"><span class="blue underline">Hello</span></div>

Python:

element = driver.find_element_by_css_selector("#content span.blue.underline")
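The find_element_by_* helpers above belong to the Selenium 3 API and were removed in later Selenium 4 releases. As a sketch, the same lookups can be written with find_element(By.<strategy>, value), which is available in both versions. The function name locate_examples is hypothetical, and it assumes the page loaded in driver contains all of the sample HTML shown in sections 1–1 to 1–3.

```python
def locate_examples(driver):
    """Look up the appendix's sample elements using By locators.

    Raises NoSuchElementException if any single-element lookup finds nothing.
    """
    # Imported inside the function so the sketch can be defined in a notebook
    # cell before a driver exists.
    from selenium.webdriver.common.by import By

    return {
        "by_id": driver.find_element(By.ID, "content"),
        "by_tag": driver.find_element(By.TAG_NAME, "div"),
        "by_name": driver.find_element(By.NAME, "contentBox"),
        "by_class": driver.find_element(By.CLASS_NAME, "h3"),
        "by_link_text": driver.find_element(By.LINK_TEXT, "Click"),
        "by_partial_link_text": driver.find_element(By.PARTIAL_LINK_TEXT, "Cli"),
        "by_css": driver.find_element(By.CSS_SELECTOR, "#content span.blue.underline"),
        # find_elements (plural) returns a list, possibly empty, and never raises.
        "all_divs": driver.find_elements(By.TAG_NAME, "div"),
    }
```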

2. Methods and properties of an element

element.clear()
element.click()
element.get_attribute(name)
element.id
element.is_displayed()
element.is_enabled()
element.is_selected()
element.location
element.location_once_scrolled_into_view
element.parent
element.rect
element.size
element.submit()
element.tag_name
element.text
element.value_of_css_property(property_name)
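To show how a few of these members combine in practice, here is a small sketch. The helper describe_element is hypothetical, and fake_cell is a stand-in object that only mimics the attribute names of a WebElement so the example can run without a browser.

```python
from types import SimpleNamespace


def describe_element(element):
    """Collect a few commonly used fields of a WebElement into a dict."""
    return {
        "tag": element.tag_name,
        "text": element.text,
        "css_class": element.get_attribute("class"),
        "visible": element.is_displayed(),
    }


# Stand-in with the same member names as a WebElement (not a real one).
fake_cell = SimpleNamespace(
    tag_name="td",
    text="12:00",
    get_attribute=lambda name: {"class": "time-stamp"}.get(name),
    is_displayed=lambda: True,
)

summary = describe_element(fake_cell)
print(summary)
```

With a real driver, the same helper could be called on the result of any of the selectors from section 1.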

3. Others

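Sometimes an element is visible in the browser but does not yet appear in driver.page_source, because the browser renders the page with a delay. In that case we need to wait until the element is present. WebDriverWait polls a condition until it returns a result or the timeout (in seconds) expires.

from selenium.webdriver.support.ui import WebDriverWait

WebDriverWait(driver, timeout, poll_frequency=0.5, ignored_exceptions=None)

element = WebDriverWait(driver, 10).until(lambda x: x.find_element_by_id("Id"))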
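A variant of the wait above, sketched with Selenium's expected_conditions module instead of a lambda. The helper name wait_for_element and the choice to return None on timeout are assumptions made for illustration; WebDriverWait itself raises TimeoutException when the condition is never met.

```python
def wait_for_element(driver, element_id, timeout=10):
    """Return the element with the given id, or None if it never appears."""
    # Imported inside the function so the sketch can be defined in a notebook
    # cell before a driver exists.
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    try:
        # Polls every 0.5 s (the default) until the element is in the DOM.
        return WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.ID, element_id))
        )
    except TimeoutException:
        return None
```

Reading driver.page_source only after such a wait succeeds makes it more likely that the delayed content is actually in the source.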

re package

Previously, we used

# regular expression
re1 = re.compile('<td class="time-stamp">(.+?)</td>')
# search HTML file
time_stamp = re1.findall(result)

The regular expression is the string passed to re.compile(). So how do we write a regular expression?

1. Regular expression

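Select characters

[abc] matches exactly one character, which may be a, b, or c.

Positive
Number: [0-9] or '\d'
Lower case: [a-z]
Upper case: [A-Z]
Word character: [a-zA-Z0-9_] or '\w'
Whitespace: [ \t\n\r\f\v] or '\s'
Any character except \n: '.'
Negative
Not number: [^0-9] or '\D'
Not lower case: [^a-z]
Not upper case: [^A-Z]
Not word character: [^a-zA-Z0-9_] or '\W'
Not whitespace: [^ \t\n\r\f\v] or '\S'

For str patterns in Python 3, \d, \w, and \s also match Unicode digits, letters, and whitespace; the bracketed sets above are exact equivalents only when the re.ASCII flag is used.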
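A short runnable sketch of the character classes listed above. The sample string is made up for illustration.

```python
import re

sample = "Q4 0:42\tLAL 99"

digits = re.findall(r"\d", sample)         # every digit
uppercase = re.findall(r"[A-Z]", sample)   # every upper-case ASCII letter
whitespace = re.findall(r"\s", sample)     # spaces and the tab
non_word = re.findall(r"\W", sample)       # anything that is not a letter, digit, or _

print(digits, uppercase, whitespace, non_word)

# With str patterns, \d also matches non-ASCII digits unless re.ASCII is given.
arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE
unicode_digits = re.findall(r"\d", arabic_three)
ascii_digits = re.findall(r"\d", arabic_three, flags=re.ASCII)
```

Writing patterns as raw strings (r"...") keeps Python from interpreting the backslashes before the re module sees them.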

Choose quantity (using [0-9] as an example)

str1 = 'pwd0123end'
# Outputs below are from Python 3.7 or later; older versions treat empty matches differently.
# Zero or one time:
re.sub('[0-9]?', '!', str1)
# !p!w!d!!!!!e!n!d!
# Zero or one time (non-greedy): each digit gives an empty match plus a one-digit match
re.sub('[0-9]??', '!', str1)
# !p!w!d!!!!!!!!!e!n!d!
# Any number of times:
re.sub('[0-9]*', '!', str1)
# !p!w!d!!e!n!d!
# Any number of times (non-greedy):
re.sub('[0-9]*?', '!', str1)
# !p!w!d!!!!!!!!!e!n!d!
# At least one time:
re.sub('[0-9]+', '!', str1)
# pwd!end
# At least one time (non-greedy):
re.sub('[0-9]+?', '!', str1)
# pwd!!!!end
# Exactly m times:
re.sub('[0-9]{2}', '!', str1)
# pwd!!end
# At least m times and at most n times:
re.sub('[0-9]{2,3}', '!', str1)
# pwd!3end
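As a more practical use of quantifiers, the sketch below pulls game-clock values out of a line of text. The input string is invented for illustration and is not taken from the ESPN page.

```python
import re

play_by_play = "12:00 Jump ball | 9:41 Made layup | 0:03 Timeout"

# One or two digits, a colon, then exactly two digits.
clock = re.compile(r"\d{1,2}:\d{2}")
times = clock.findall(play_by_play)

# \w+ takes one or more word characters: the first word after each clock value.
first_words = re.findall(r"\d{1,2}:\d{2} (\w+)", play_by_play)

print(times)
print(first_words)
```

Patterns that must match at least one character, such as these, avoid the empty-match behaviour that `?` and `*` show with re.sub.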

Greedy: Find the longest possible match.

Non-greedy: Find the shortest possible match.

str1 = 'aAoZbAoZc'
# Greedy:
re.sub('A.*Z', '!', str1)
# a!c
# Non-greedy:
re.sub('A.*?Z', '!', str1)
# a!b!c
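This difference is the reason the tutorial's pattern uses (.+?) rather than (.+). A minimal sketch, assuming two time-stamp cells sit on the same line of the HTML source (the row string is made up):

```python
import re

row = '<td class="time-stamp">12:00</td><td class="time-stamp">11:42</td>'

# Greedy: .+ runs on to the LAST </td> on the line, swallowing the tags in between.
greedy = re.findall('<td class="time-stamp">(.+)</td>', row)

# Non-greedy: .+? stops at the FIRST </td>, giving one result per cell.
lazy = re.findall('<td class="time-stamp">(.+?)</td>', row)

print(greedy)
print(lazy)
```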

Other notation

str1 = 'Hello, Gary and Henry'
# Start of the string ^:
re.sub('^H', '!', str1)
# !ello, Gary and Henry
# Negated character set [^...]:
re.sub('[^H]', '!', str1)
# H!!!!!!!!!!!!!!!H!!!!
# End of the string $:
re.sub('y$', '!', str1)
# Hello, Gary and Henr!
# Group ():
re.sub('(He)', '!!', str1)
# !!llo, Gary and !!nry
# Alternation (or) |:
re.sub('a|d', '!!', str1)
# Hello, G!!ry !!n!! Henry
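Groups become more useful when their contents are read back. The sketch below combines anchors, groups, and alternation; the score format "102 - 98" is an assumption modelled on the combined-score cells, and the variable names are illustrative.

```python
import re

score_line = "102 - 98"

# ^ and $ anchor the pattern to the whole string; each (...) captures one number.
match = re.search(r"^(\d+) - (\d+)$", score_line)
first_score = int(match.group(1))
second_score = int(match.group(2))

# \1 and \2 in the replacement refer back to the captured groups.
swapped = re.sub(r"^(\d+) - (\d+)$", r"\2 - \1", score_line)

# Alternation matches either whole word.
names = re.findall(r"Gary|Henry", "Hello, Gary and Henry")

print(first_score, second_score, swapped, names)
```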

Special characters (put \ before the character)

str1 = '\\What does $^\' mean?/'
re.sub(r'\$', '!', str1)
# \What does !^' mean?/
re.sub(r'\\', '!', str1)
# !What does $^' mean?/
re.sub(r'\?', '!', str1)
# \What does $^' mean!/

All special characters: \ ^ $ . | ? * + ( ) [ {
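When the text to search for comes from data rather than being typed by hand, escaping each special character manually is error-prone. A small sketch using re.escape to do it automatically; the sample strings are invented for illustration.

```python
import re

text = "Is $5.00 (approx.) a fair price?"
literal = "$5.00 (approx.)"

# Used as-is, $ means "end of string" and ( ) form a group, so nothing is found.
unescaped = re.search(literal, text)

# re.escape backslash-escapes every special character, so the text matches literally.
escaped = re.search(re.escape(literal), text)

print(unescaped)
print(escaped.group(0))
```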

2. RE module

str1 = '\\What is $^\' means?/'
# Outputs below are from Python 3.7 or later.
print(re.escape(str1))
# \\What\ is\ \$\^'\ means\?/
re.search('is', str1)
# <re.Match object; span=(6, 8), match='is'>
re.match('is', str1)
# None
re.fullmatch('is', str1)
# None
re.split(r'\s', str1)
# ['\\What', 'is', "$^'", 'means?/']
re.findall('[a-z]', str1)
# ['h', 'a', 't', 'i', 's', 'm', 'e', 'a', 'n', 's']
re.sub('[a-z]', '!', str1)
# \W!!! !! $^' !!!!!?/
re.subn('[a-z]', '!', str1)
# ("\\W!!! !! $^' !!!!!?/", 10)
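The functions above return either strings or a single match object. When the position of every match matters, finditer can be used on a compiled pattern, as in this sketch (same sample string as above; the variable names are illustrative):

```python
import re

text = "\\What is $^' means?/"
word = re.compile(r"[A-Za-z]+")

# finditer yields one match object per match, carrying text and position.
found = [(m.group(), m.start(), m.end()) for m in word.finditer(text)]

# match only tries at the start of the string; search scans forward.
at_start = word.match(text)
anywhere = word.search(text)

# fullmatch succeeds only if the whole string fits the pattern.
whole = re.fullmatch(r"\d{1,2}:\d{2}", "12:00")

print(found)
```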

