What is a web crawler?
A web crawler, sometimes called a spider, is an Internet bot that systematically browses the World Wide Web, typically for the purpose of web indexing (web spidering).
Web search engines and some other sites use web crawling or spidering software to update their own web content or their indices of other sites’ web content. Web crawlers copy pages for processing by a search engine, which indexes the downloaded pages so that users can search more efficiently.
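The copy-then-index loop described above can be sketched offline. This is only an illustration under stated assumptions: PAGES is a made-up in-memory "web" standing in for real downloads, and the function name crawl is hypothetical.

```python
import re
from collections import deque

# Hypothetical in-memory "web": URL -> HTML. A real crawler would download these.
PAGES = {
    "/home": '<a href="/nba">NBA</a> <a href="/nfl">NFL</a> scores today',
    "/nba": '<a href="/home">Home</a> nba play by play scores',
    "/nfl": '<a href="/home">Home</a> nfl scores',
}


def crawl(start):
    """Visit every page reachable from start; return a word -> set of URLs index."""
    index = {}
    seen = {start}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        html = PAGES[url]
        # Index the visible words (tags stripped with a simple regex).
        text = re.sub(r"<[^>]+>", " ", html)
        for word in re.findall(r"\w+", text.lower()):
            index.setdefault(word, set()).add(url)
        # Follow links that have not been visited yet.
        for link in re.findall(r'href="(.+?)"', html):
            if link in PAGES and link not in seen:
                seen.add(link)
                queue.append(link)
    return index


index = crawl("/home")
print(sorted(index["scores"]))
print(sorted(index["nba"]))
```

Looking up a word in index then returns the pages that contain it, which is the "search more efficiently" part.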
What is Selenium?
Selenium automates browsers. That’s it! What you do with that power is entirely up to you. Primarily, it is for automating web applications for testing purposes, but is certainly not limited to just that. Boring web-based administration tasks can (and should!) be automated as well.
Selenium has the support of some of the largest browser vendors who have taken (or are taking) steps to make Selenium a native part of their browser. It is also the core technology in countless other browser automation tools, APIs and frameworks.
Website: http://www.seleniumhq.org/
Install
Step 1: Software
Mozilla Firefox 54: Download (Selenium does not work with version 56)
Selenium IDE: Download (Use Firefox Browser)
Step 2: Environment
Python
Python 2.7 or 3 : Download
Selenium
Python 2.7:
$ pip install selenium
Python 3:
$ pip3 install selenium
Jupyter notebook
Python 2.7:
$ pip install jupyter
Python 3:
$ pip3 install jupyter
Tutorial
Suppose we want to crawl play-by-play data for an NBA game from the following website.
Step 1: Inspect element
1–1: Right-click and choose Inspect Element
1–2: Perform the following action
1–3: Select the element you want to crawl.
Finally, the HTML classes we want to crawl are time-stamp, game-details, and combined-score.
Step 2: Setup environment
2–1: Import selenium
First, open Selenium IDE and export the test case as Python 2.
Then copy the code into Jupyter.
Open the Jupyter notebook
$ jupyter notebook
Create a new file and paste the copied code.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import NoAlertPresentException
import unittest, time, re
2–2: Setup driver
driver = webdriver.Firefox()
If this raises an error, pass the path to geckodriver explicitly:
driver = webdriver.Firefox(executable_path='./geckodriver.exe')
geckodriver.exe: Download
2–3: Get the HTML code from driver
driver.get("http://www.espn.com/nba/playbyplay?gameId=400974939")
Step 3: Select element
We know that we want the cells with the classes time-stamp, game-details, and combined-score, so we select them with a regular expression.
# load the complete HTML source
result = driver.page_source
# regular expression
re1 = re.compile('<td class="time-stamp">(.+?)</td>')
# search the HTML source
time_stamp = re1.findall(result)
Then, print(time_stamp)
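The same pattern works for the other two classes. The sketch below is a minimal offline illustration: sample_html is a made-up fragment standing in for driver.page_source, and the helper name extract_cells is hypothetical. It assumes each value sits in a td whose class attribute is exactly the class name; the real ESPN markup may differ.

```python
import re

# Made-up fragment standing in for driver.page_source (assumption).
sample_html = (
    '<tr><td class="time-stamp">11:42</td>'
    '<td class="game-details">Jump ball</td>'
    '<td class="combined-score">0 - 0</td></tr>'
    '<tr><td class="time-stamp">11:20</td>'
    '<td class="game-details">Makes two point shot</td>'
    '<td class="combined-score">2 - 0</td></tr>'
)


def extract_cells(html, class_name):
    """Return the inner text of every <td class="class_name"> cell."""
    pattern = re.compile('<td class="' + re.escape(class_name) + '">(.+?)</td>')
    return pattern.findall(html)


time_stamp = extract_cells(sample_html, "time-stamp")
game_details = extract_cells(sample_html, "game-details")
combined_score = extract_cells(sample_html, "combined-score")

# Pair the three columns back into rows.
for row in zip(time_stamp, game_details, combined_score):
    print(row)
```

In the notebook, passing driver.page_source instead of sample_html would apply the same extraction to the live page.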
Appendix
Selenium package
1. Selector
1–1: Select element by id, name, tag, and class name
HTML:
<div id = "content" class="h3" name="contentBox"> ... </div>
Python:
element = driver.find_element_by_id("content")
element = driver.find_element_by_tag_name("div")
element = driver.find_element_by_name("contentBox")
element = driver.find_element_by_class_name("h3")
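To select multiple elements, change find_element to find_elements (for example, driver.find_elements_by_class_name("h3")); the result is a list. Note that Selenium 4.3 and later removed the find_element_by_* helpers; there the equivalent call is driver.find_element(By.ID, "content").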
1–2: Select element by other attributes
Select element by link text or partial link text
HTML:
<a href="http://www.google.com/">Click</a>
Python:
element = driver.find_element_by_link_text("Click")
element = driver.find_element_by_partial_link_text("Cli")
1–3: Select element by CSS
HTML:
<div id="content"><span class="blue underline">Hello</span></div>
Python:
element = driver.find_element_by_css_selector("#content span.blue.underline")
2. Methods and properties of an element
element.clear()
element.click()
element.get_attribute(name)
element.id
element.is_displayed()
element.is_enabled()
element.is_selected()
element.location
element.location_once_scrolled_into_view
element.parent
element.rect
element.size
element.submit()
element.tag_name
element.text
element.value_of_css_property(property_name)
3. Others
Sometimes a tag is visible in the browser but missing from driver.page_source because the page has not finished rendering. In that case, wait until the element appears before reading it. The call below waits up to 10 seconds, polling every 0.5 seconds by default.
from selenium.webdriver.support.ui import WebDriverWait
# Signature: WebDriverWait(driver, timeout, poll_frequency=0.5, ignored_exceptions=None)
element = WebDriverWait(driver, 10).until(lambda x: x.find_element_by_id("Id"))
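To make the polling idea concrete, here is a rough pure-Python sketch of what such a wait does. It is not Selenium's implementation: wait_until and element_ready are hypothetical names, and unlike WebDriverWait this sketch does not swallow NoSuchElementException while polling. It assumes Python 3.

```python
import time


def wait_until(condition, timeout, poll_frequency=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Raises TimeoutError if nothing truthy is returned within timeout seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %s seconds" % timeout)
        time.sleep(poll_frequency)


# Hypothetical stand-in for a page that needs three polls to finish loading.
attempts = {"count": 0}


def element_ready():
    attempts["count"] += 1
    return "element" if attempts["count"] >= 3 else None


found = wait_until(element_ready, timeout=2, poll_frequency=0.01)
print(found, attempts["count"])
```

The timeout bounds the total wait, while poll_frequency trades responsiveness against how often the page is queried.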
re package
Previously, we used
# regular expression
re1 = re.compile('<td class="time-stamp">(.+?)</td>')
# search the HTML source
time_stamp = re1.findall(result)
The regular expression is the argument passed to re.compile(). So how do we write a regular expression?
1. Regular expression
Select characters
[abc] matches exactly one character: a, b, or c.
Positive
Number: [0-9] or '\d'
Lower case: [a-z]
Upper case: [A-Z]
Word character: [a-zA-Z0-9_] or '\w'
Whitespace: [ \t\n\r\f\v] or '\s'
Any character except \n: '.'
Negative
Not number: [^0-9] or '\D'
Not lower case: [^a-z]
Not upper case: [^A-Z]
Not word character: [^a-zA-Z0-9_] or '\W'
Not whitespace: [^ \t\n\r\f\v] or '\S'
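A quick way to get a feel for these classes is to run them against a short string. The sample text below is made up for illustration.

```python
import re

# Made-up sample; the \t inside the normal string literal is a real tab.
sample = "Tip-off at 7:30 PM\tGame_1"

digits = re.findall(r"\d", sample)      # single digits
words = re.findall(r"\w+", sample)      # runs of word characters
spaces = re.findall(r"\s", sample)      # spaces and the tab
non_words = re.findall(r"\W", sample)   # everything that is not a word character

print(digits)
print(words)
print(spaces)
print(non_words)
```

Note that the underscore counts as a word character, so Game_1 stays in one piece, while the hyphen and the colon do not.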
Choose quantity (using [0-9] as an example)
The outputs below are from Python 3.6 or earlier. Since Python 3.7, re.sub also replaces empty matches that are adjacent to a previous non-empty match, so the results of the first four calls (whose patterns can match an empty string) differ on newer versions.
str1 = 'pwd0123end'
Zero or one time: re.sub('[0-9]?','!',str1)
# !p!w!d!!!!e!n!d!
Zero or one time (non-greedy): re.sub('[0-9]??','!',str1)
# !p!w!d!0!1!2!3!e!n!d!
Any number of times: re.sub('[0-9]*','!',str1)
# !p!w!d!e!n!d!
Any number of times (non-greedy): re.sub('[0-9]*?','!',str1)
# !p!w!d!0!1!2!3!e!n!d!
At least one time: re.sub('[0-9]+','!',str1)
# pwd!end
At least one time (non-greedy): re.sub('[0-9]+?','!',str1)
# pwd!!!!end
Exactly m times: re.sub('[0-9]{2}','!',str1)
# pwd!!end
At least m times and at most n times: re.sub('[0-9]{2,3}','!',str1)
# pwd!3end
Greedy: match the longest possible string.
Non-greedy: match the shortest possible string.
str1 = 'aAoZbAoZc'
Greedy: re.sub('A.*Z','!',str1)
# a!c
Non-greedy: re.sub('A.*?Z','!',str1)
# a!b!c
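This difference is why the time-stamp pattern earlier uses (.+?) rather than (.+). A small sketch on a made-up row of HTML shows what goes wrong with the greedy form:

```python
import re

# Made-up fragment with two cells on one line.
row = '<td class="time-stamp">11:42</td><td class="time-stamp">11:20</td>'

greedy = re.findall('<td class="time-stamp">(.+)</td>', row)
lazy = re.findall('<td class="time-stamp">(.+?)</td>', row)

print(greedy)  # one match that swallows the markup between the two cells
print(lazy)    # one match per cell
```

The greedy version runs to the last closing tag on the line, so it returns a single string with markup in the middle; the non-greedy version stops at the first closing tag each time.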
Other notation
str1 = 'Hello, Gary and Henry'
Start of the string ^: re.sub('^H','!',str1)
# !ello, Gary and Henry
Negated character set [^ ]: re.sub('[^H]','!',str1)
# H!!!!!!!!!!!!!!!H!!!!
End of the string $: re.sub('y$','!',str1)
# Hello, Gary and Henr!
Group (): re.sub('(He)','!!',str1)
# !!llo, Gary and !!nry
Or |: re.sub('a|d','!!',str1)
# Hello, G!!ry !!n!! Henry
Special characters (put \ before the character)
str1 = '\\What does $^\' mean?/'
# the text is: \What does $^' mean?/
re.sub(r'\$','!', str1)
# \What does !^' mean?/
re.sub(r'\\','!', str1)
# !What does $^' mean?/
re.sub(r'\?','!',str1)
# \What does $^' mean!/
All special characters: \ ^ $ . | ? * + ( ) [ {
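When the text to search for comes from data rather than being typed by hand, escaping each special character manually is error-prone. A minimal sketch using re.escape, with a made-up sample string:

```python
import re

text = "Price: $5.00 (was $6.00)?"
literal = "$5.00"

# Unescaped, "$" means end of string, so nothing can match.
unescaped_hits = re.findall(literal, text)

# re.escape adds the backslashes so the text is matched literally.
escaped_hits = re.findall(re.escape(literal), text)

# The same idea written by hand for the parenthesised part.
manual_hits = re.findall(r"\(was \$6\.00\)", text)

print(unescaped_hits, escaped_hits, manual_hits)
```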
2. RE module
str1 = '\\What is $^\' means?/'
# the text is: \What is $^' means?/
re.escape(str1)
# \\What\ is\ \$\^\'\ means\?\/   (Python 3.7+ leaves ' and / unescaped)
re.search('is',str1)
# <_sre.SRE_Match object; span=(6, 8), match='is'>   (shown as re.Match in Python 3.7+)
re.match('is',str1)
# None
re.fullmatch('is',str1)
# None
re.split(r'\s',str1)
# ['\\What', 'is', "$^'", 'means?/']
re.findall('[a-z]',str1)
# ['h', 'a', 't', 'i', 's', 'm', 'e', 'a', 'n', 's']
re.sub('[a-z]','!',str1)
# \W!!! !! $^' !!!!!?/
re.subn('[a-z]','!',str1)
# ("\\W!!! !! $^' !!!!!?/", 10)
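The difference between match, search, and fullmatch is easiest to see on one string. The sketch below uses a made-up score cell; the names score_pattern, left, and right are illustrative only, and fullmatch assumes Python 3.4 or later.

```python
import re

score_pattern = re.compile(r"(\d+) - (\d+)")
cell = "Final: 102 - 99"

# match only looks at the start of the string, so it fails here.
at_start = score_pattern.match(cell)

# search scans forward until the pattern fits somewhere.
found = score_pattern.search(cell)
left, right = (int(group) for group in found.groups())

# fullmatch requires the whole string to fit the pattern.
whole_cell = score_pattern.fullmatch(cell)
whole_score = score_pattern.fullmatch("102 - 99")

print(at_start, found.span(), left, right)
```

Groups written with () in the pattern are what found.groups() returns, which is also why findall on the earlier time-stamp pattern returned only the text inside the parentheses.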