Pro tips for Selenium

Igor Zabukovec
4 min read · Jan 22, 2019

Learn how to execute javascript, use proxies, limit your bandwidth, go headless and much more…

In this last tutorial on Selenium, we will cover more advanced concepts that are very useful when crawling at large scale.

This tutorial is aimed at experienced readers: if you have never used Selenium before, I recommend checking the first two tutorials in this series first.

Execute Javascript

One of Selenium’s strengths is that it can execute JavaScript in the context of the loaded page.

Click on buttons

There are several ways to click on an element:

  • element.click()
  • execute the following script: “arguments[0].click();”

from selenium.webdriver.common.by import By

el = driver.find_element(By.CSS_SELECTOR, "css_selector")
# option 1: regular click
el.click()
# option 2: click through injected JavaScript
driver.execute_script("arguments[0].click();", el)

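In practice, the second option (executing JS) is more robust for crawling and you should prioritise it: it still works when the element is covered by an overlay or sits outside the viewport, cases where element.click() raises an exception. Keep in mind that it skips the visibility checks a real user click is subject to.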

Scroll down

Scrolling down is quite useful when crawling sites with infinite scroll — like a feed for instance. Doing this with Selenium is simple — and uses javascript injection.

driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
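
On a feed, a single scroll only loads one more batch of items. The sketch below repeats the scroll until the page height stops growing. The function name `scroll_to_bottom` and the `pause` and `max_rounds` parameters are my own, and the fixed sleep is a crude assumption about how long the site needs to append new content.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=20):
    """Scroll until the page height stops growing or max_rounds is reached.

    `driver` is any Selenium WebDriver; only execute_script() is used.
    Returns the number of scrolls that loaded new content.
    (Hypothetical helper, not part of Selenium.)
    """
    get_height = "return document.body.scrollHeight;"
    last_height = driver.execute_script(get_height)
    loaded = 0
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # crude wait for the feed to append new items
        new_height = driver.execute_script(get_height)
        if new_height == last_height:
            break  # nothing new was loaded, assume we reached the end
        last_height = new_height
        loaded += 1
    return loaded


# Usage with a live driver (not run here):
# scroll_to_bottom(driver, pause=2.0, max_rounds=10)
```

The `max_rounds` cap matters on truly infinite feeds, where the height never stops growing.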

Access links on page

Finding links on the page and clicking on random ones can sometimes be a good crawling strategy.

Here we will find all the links on the page and click on the first one. Obviously you can adapt the script to click on the links that contain certain keywords.

from selenium.webdriver.common.by import By

# find all the links
links = driver.find_elements(By.PARTIAL_LINK_TEXT, '')
# get the first one
first_link = links[0]
# click on it
driver.execute_script("arguments[0].click();", first_link)
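
To click only on links that contain certain keywords, one possible adaptation is to filter the `links` list before clicking. This is a minimal sketch: `links_matching` is a hypothetical helper, and it assumes that matching on the link text or the href, case-insensitively, is what you want.

```python
def links_matching(links, keywords):
    """Return the link elements whose text or href contains any keyword.

    `links` is a list of WebElements, such as the `links` list built above.
    Matching is case-insensitive. (Hypothetical helper, not part of Selenium.)
    """
    wanted = [k.lower() for k in keywords]
    matches = []
    for link in links:
        text = (link.text or "").lower()
        href = (link.get_attribute("href") or "").lower()
        if any(k in text or k in href for k in wanted):
            matches.append(link)
    return matches


# Usage with a live driver (not run here):
# candidates = links_matching(links, ["pricing", "blog"])
# if candidates:
#     driver.execute_script("arguments[0].click();", candidates[0])
```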

Limit your bandwidth

Crawling with Selenium can be resource intensive, especially in terms of traffic: loading images, caching, etc.

However, using some simple tricks you will be able to crawl faster without killing your bandwidth.

Disable showing images

If you are not interested in images, you can simply tell Chrome not to load them by specifying the option below:

from selenium.webdriver.chrome.options import Options

chrome_options = Options()
# 2 = block images
prefs = {"profile.managed_default_content_settings.images": 2}
chrome_options.add_experimental_option('prefs', prefs)

Add a disk-cache

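Setting a disk cache size is useful, especially when crawling a single domain, because assets that were already downloaded do not have to be fetched again. Chrome takes the limit through the --disk-cache-size command-line switch, expressed in bytes. In this example we set it to 4096, which is very small; in practice you will likely want a much larger value.

from selenium.webdriver.chrome.options import Options

chrome_options = Options()
# the size is given in bytes
chrome_options.add_argument("--disk-cache-size=4096")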

Go headless

Headless browsing means that the browser runs without a visible window. It is necessary when running on a server, which usually has no display. It is also recommended to test in headless mode on your local machine.

To run headless, you just have to add this argument.

from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--headless")
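
The image, cache and headless settings can be applied to the same Options object. The sketch below bundles them in one function; `configure_crawler_options` and its parameters are hypothetical names. It passes the cache size as the `--disk-cache-size` command-line switch (in bytes), which is my assumption about the most reliable way to set it.

```python
def configure_crawler_options(options, headless=True, load_images=False,
                              disk_cache_bytes=None):
    """Apply crawl-friendly settings to a Chrome Options object in place.

    (Hypothetical helper; `options` is a selenium Options instance.)
    """
    if headless:
        options.add_argument("--headless")
    if disk_cache_bytes is not None:
        # Chromium command-line switch; the value is in bytes
        options.add_argument("--disk-cache-size=%d" % disk_cache_bytes)
    if not load_images:
        # 2 means "block" for this content setting
        prefs = {"profile.managed_default_content_settings.images": 2}
        options.add_experimental_option("prefs", prefs)
    return options


# Usage with a real browser (not run here):
# from selenium import webdriver
# from selenium.webdriver.chrome.options import Options
# opts = configure_crawler_options(Options(), disk_cache_bytes=50 * 1024 * 1024)
# driver = webdriver.Chrome(options=opts)
```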

Take a screenshot

The problem when running headless is that it is harder to understand how the page loads and what the results of your actions are. One common fix is to take screenshots and save them to disk.

from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://twitter.com/')
driver.save_screenshot("screenshot.png")
# quit() ends the whole browser session, not just the current window
driver.quit()
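
When debugging a longer headless run, it can help to save one screenshot per step instead of overwriting a single file. A minimal sketch, assuming a timestamp plus a short label is enough to tell the files apart; `save_step_screenshot` and the `screenshots` directory are hypothetical names.

```python
import os
import time


def save_step_screenshot(driver, label, directory="screenshots"):
    """Save a screenshot named after the current step and return its path.

    (Hypothetical helper; only driver.save_screenshot() is used.)
    """
    os.makedirs(directory, exist_ok=True)
    filename = "%s_%s.png" % (time.strftime("%Y%m%d-%H%M%S"), label)
    path = os.path.join(directory, filename)
    driver.save_screenshot(path)
    return path


# Usage with a live driver (not run here):
# driver.get('https://twitter.com/')
# save_step_screenshot(driver, "home")
```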

Download images

One easy way to download images using Selenium is to use the urllib package:

  • load the page with Selenium
  • find the image element and its “src” attribute
  • download the file using urllib

import urllib.request
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://twitter.com/')

# get the image source
img = driver.find_element(By.CSS_SELECTOR, '.Icon--bird')
src = img.get_attribute('src')

# download the image
urllib.request.urlretrieve(src, "twitter_icon.png")
driver.quit()
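
The three steps above can be wrapped in a small function that also handles an element without a "src" attribute. This is a sketch under assumptions: `download_image` is a hypothetical helper, and resolving the source against the page URL is a precaution in case the attribute comes back as a relative path.

```python
import urllib.request
from urllib.parse import urljoin


def download_image(img, page_url, dest):
    """Download the image behind a WebElement's src attribute to `dest`.

    Returns the absolute URL that was fetched, or None if there is no src.
    (Hypothetical helper, not part of Selenium.)
    """
    src = img.get_attribute("src")
    if not src:
        return None  # e.g. an icon drawn with CSS rather than an <img>
    url = urljoin(page_url, src)  # leaves an already absolute src unchanged
    urllib.request.urlretrieve(url, dest)
    return url


# Usage with a live driver (not run here):
# download_image(img, driver.current_url, "twitter_icon.png")
```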

Action Chains

Action Chains are used to automate low level interactions such as mouse movements, mouse button actions, key press, and context menu interactions.

When you call methods for actions on the ActionChains object, the actions are stored in a queue in the ActionChains object. When you call perform(), the events are fired in the order they are queued up.

This is particularly useful to click on buttons that are not visible yet (like submenus), to hover, or to drag and drop. In this example we will click on a submenu:

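from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By

menu = driver.find_element(By.CSS_SELECTOR, ".nav")
hidden_submenu = driver.find_element(By.CSS_SELECTOR, ".nav #submenu1")

# hover over the menu, then click the submenu it reveals
ActionChains(driver).move_to_element(menu).click(hidden_submenu).perform()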

Proxies

Proxies can be useful to avoid being banned by a website or to access geo-restricted content. For instance, here we will use the fake proxy "99.99.99.99" on port "1111".

from selenium import webdriver

PROXY = "99.99.99.99:1111" # IP:PORT or HOST:PORT

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--proxy-server=%s' % PROXY)
driver = webdriver.Chrome(options=chrome_options)
driver.get("http://whatismyipaddress.com")
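
If the proxy address comes from a config file or a list, building the argument in one place avoids formatting mistakes. A minimal sketch: `proxy_argument` is a hypothetical helper, and the optional scheme prefix (for example `socks5`) assumes your proxy speaks that protocol.

```python
def proxy_argument(host, port, scheme=None):
    """Build the --proxy-server argument for Chrome.

    (Hypothetical helper; it only formats and validates the string.)
    """
    port = int(port)
    if not 0 < port < 65536:
        raise ValueError("port out of range: %r" % (port,))
    address = "%s:%d" % (host, port)
    if scheme:
        address = "%s://%s" % (scheme, address)  # e.g. socks5://host:port
    return "--proxy-server=%s" % address


# Usage (not run here):
# chrome_options.add_argument(proxy_argument("99.99.99.99", 1111))
```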

Scrolling items in a drop down using Select

If you are trying to select an option from a select element with a huge number of options, for example selecting the country in a form, you will not be able to do it the ordinary way, because the option you want may not be visible. You will have instead to:

  1. Locate the select element
  2. Find all options
  3. Go through each option to find the option you are looking for.
  4. Make it visible and click on it

Selenium has a built-in class, Select, that lets you achieve this in a few lines of code:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select

value = "optionValue"
element = Select(driver.find_element(By.TAG_NAME, "select"))
for option in element.options:
    if option.get_attribute("value") == value:
        element.select_by_visible_text(option.text)
        break
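
When you know only part of the label rather than the exact value, the same loop can match on the visible text. This is a sketch with a hypothetical helper, `select_option_containing`; it assumes a case-insensitive substring match and picks the first hit.

```python
def select_option_containing(select, fragment):
    """Select the first option whose visible text contains `fragment`.

    `select` is a selenium.webdriver.support.ui.Select instance.
    Returns the chosen text, or None when nothing matches.
    (Hypothetical helper, not part of Selenium.)
    """
    needle = fragment.strip().lower()
    for option in select.options:
        text = option.text
        if needle in text.lower():
            select.select_by_visible_text(text)
            return text
    return None


# Usage with a live driver (not run here):
# select_option_containing(element, "united")
```

Returning None instead of raising lets the crawler decide what to do when the form does not offer the expected option.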

Thanks for reading. Do not hesitate to leave your comments, questions or feedback below.

Igor Zabukovec

CTO / Data Scientist — Specialising in scalable intelligent systems and high-value products for business and consumer.