Python: Selenium Speed Scraping

dmitriiweb · 2 min read · Oct 15, 2017

In my work I sometimes have to use Selenium for scraping different websites, and this tool can be painfully slow.

The life hack below is hardly a secret, yet over the last two years I have worked with many scrapers written by other people, and I have never seen anyone use it.

As you may know, the main reason Selenium works slowly is its parsing: every element lookup goes through the browser. So the first thing that comes to mind is to replace the parser.

To show you how it works, I will use Selenium with chromedriver, beautifulsoup4, and this Wikipedia page, which contains a table with some information about U.S. states.
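(Both libraries can be installed from PyPI, e.g. with pip install selenium beautifulsoup4.)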

To begin with, here is a script that uses only Selenium to extract data from the table:

from datetime import datetime
from selenium import webdriver


def get_states():
    url = 'https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States'
    driver = webdriver.Chrome()
    driver.get(url)

    # Each find_element* call and each .text access below is a separate
    # WebDriver request to the browser (in Selenium 4 these helpers
    # became find_element(By.XPATH, ...))
    rows = driver.find_element_by_xpath('//*[@id="mw-content-text"]/div/table[1]/tbody')\
        .find_elements_by_tag_name('tr')

    states = []

    for row in rows:
        cells = row.find_elements_by_tag_name('td')
        name = row.find_element_by_tag_name('th').text
        abbr = cells[0].text
        established = cells[-9].text
        population = cells[-8].text
        total_area_km = cells[-6].text
        land_area_km = cells[-4].text
        water_area_km = cells[-2].text

        states.append([
            name, abbr, established, population, total_area_km, land_area_km,
            water_area_km
        ])
    driver.quit()
    return states


if __name__ == '__main__':
    start = datetime.now()
    states = get_states()
    finish = datetime.now() - start
    print(finish)

If you run this script, you will see that it takes 27.177188 seconds.
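Where does the time go? Not into the browser itself: every find_element* call and every .text access is a separate request over the WebDriver protocol, so the script above makes hundreds of round-trips to chromedriver. A minimal sketch to see the difference (same URL and Selenium 3 style helpers as above; exact timings will vary):

from datetime import datetime
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States')

# one round-trip: the whole page source comes back in a single request
start = datetime.now()
html = driver.page_source
print('page_source:', datetime.now() - start)

# hundreds of round-trips: one request per element, plus one per .text
start = datetime.now()
texts = [cell.text for cell in driver.find_elements_by_tag_name('td')]
print('per-cell .text:', datetime.now() - start)

driver.quit()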

Now let’s try to replace the standard Selenium parsing with BeautifulSoup. Simply extract the HTML markup from the webdriver and hand it to BeautifulSoup:

bs_obj = BeautifulSoup(driver.page_source, 'html.parser')

As a result we will get this code:

from datetime import datetime
from selenium import webdriver
from bs4 import BeautifulSoup as BSoup


def get_states():
    url = 'https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States'
    driver = webdriver.Chrome()
    driver.get(url)

    # one WebDriver round-trip: grab the full page source and
    # parse it locally with BeautifulSoup
    bs_obj = BSoup(driver.page_source, 'html.parser')
    rows = bs_obj.find_all('table')[0].find('tbody').find_all('tr')

    states = []

    for row in rows:
        cells = row.find_all('td')
        name = row.find('th').get_text()
        abbr = cells[0].get_text()
        established = cells[-9].get_text()
        population = cells[-8].get_text()
        total_area_km = cells[-6].get_text()
        land_area_km = cells[-4].get_text()
        water_area_km = cells[-2].get_text()

        states.append([
            name, abbr, established, population, total_area_km, land_area_km,
            water_area_km
        ])
    driver.quit()
    return states


if __name__ == '__main__':
    start = datetime.now()
    states = get_states()
    finish = datetime.now() - start
    print(finish)

And after running it we see… 7.664240 seconds! More than three times faster!

So Selenium is a great tool when you need to simulate user actions, but if you are pulling large amounts of data out of a webpage, the best approach is to delegate that work to a faster parser.
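If you want to squeeze out even more speed, note that html.parser is pure Python; BeautifulSoup can also use the C-based lxml parser (assuming the lxml package is installed, e.g. via pip install lxml), which is typically faster still:

from bs4 import BeautifulSoup

# requires the lxml package; BeautifulSoup will use its C parser
# instead of the pure-Python html.parser
bs_obj = BeautifulSoup(driver.page_source, 'lxml')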
