5 Simple Tips for Efficient Web Crawling using Selenium Python

ওয়াসী (Wasi)
Dreamcatcher IT’s Blog
5 min read · Jan 5, 2018


In this article, I will share 5 simple tips that will help you improve the automation of a web scraping bot or crawler written with Python and Selenium. But first, let me briefly introduce Python's selenium module, in case you are not familiar with it:

It is a Python binding for the Selenium WebDriver API. It gives you convenient access to the web drivers for browsers such as Firefox, Chrome, PhantomJS, etc. Using this module, you can simulate all sorts of actions that you would perform in a typical web browser, e.g. click buttons on websites, scroll and navigate through pages, type into input boxes, submit forms, use proxies, even execute custom JavaScript on pages, and much more! All of this using just a Python script! Pretty cool, right?
Now, let's jump straight to the first tip:

Tip No: 1 Crawl websites without loading Images

Since we are talking about automated scripts, they will run hundreds or thousands of times, so every second (perhaps every millisecond?) counts. Most modern dynamic websites have lots of images. When a page loads, the browser driven by Selenium loads all the elements in it, including those images!

Even though we rarely interact with those images when testing website functionality, Selenium still loads them! The good thing is that there are ways to load pages without loading images in Selenium! I will show the code for the PhantomJS web driver and chromedriver below:

PhantomJS Web Driver Load Web Page without Images
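
A minimal sketch of this setup, under two assumptions: Selenium 3.x or earlier (webdriver.PhantomJS was removed in Selenium 4) and a phantomjs binary on your PATH. The function and constant names are hypothetical, chosen only for illustration; the --load-images switch itself is a PhantomJS command-line option.

```python
# Assumption: Selenium 3.x or earlier and a phantomjs binary on PATH.

# PhantomJS command-line switch that turns image loading off.
PHANTOMJS_NO_IMAGES_ARGS = ["--load-images=false"]


def make_phantomjs_without_images():
    """Start a PhantomJS driver that does not download images."""
    # Imported lazily so the constant above can be reused without Selenium.
    from selenium import webdriver

    return webdriver.PhantomJS(service_args=PHANTOMJS_NO_IMAGES_ARGS)
```

You would call make_phantomjs_without_images() once and then use the returned driver as usual, for example driver.get(url).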

Chrome Web Driver Load Pages without Images:
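
One common way to do this, sketched below, is to set a Chrome content-settings preference through ChromeOptions. This assumes chromedriver is installed and a Selenium release recent enough to accept the options= keyword; the helper names are hypothetical.

```python
# Chrome content-settings preference: the value 2 means "block" for images.
NO_IMAGES_PREFS = {"profile.managed_default_content_settings.images": 2}


def make_chrome_without_images():
    """Start a Chrome driver configured to skip image downloads."""
    # Imported lazily so the preference dict can be reused without Selenium.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_experimental_option("prefs", NO_IMAGES_PREFS)
    return webdriver.Chrome(options=options)
```

The preference only affects the browser session started with these options, so your everyday Chrome profile is left untouched.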

Tip No: 2 Scrape Websites using Disk Cache

Caching assets often leads to faster page loads. In a modern web browser, disk caching reduces page loading time noticeably. You can take advantage of this with the Selenium web driver as well! All you have to do is set the configuration before initializing the web driver. This stores the website's assets, such as CSS and JS files, on disk for faster loading. It is helpful when you load multiple pages of the same website.
Note: Obviously, when automating your tests, you can't (and shouldn't) cache the assets that affect the data you want to test! For example, if you are testing a dynamic website where the data is loaded through assets, say JavaScript files, then a stale disk cache might even make your test results unreliable!

Chrome Web Driver Load Pages using Disk Cache:
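
A minimal sketch, assuming the Chromium switches --disk-cache-dir and --disk-cache-size and a Selenium release that accepts options=. The helper names, the default directory name, and the 100 MB limit are hypothetical choices for illustration. As far as I know, chromedriver starts Chrome with a fresh temporary profile by default, so pointing the cache at a fixed directory is what lets it survive between runs.

```python
import os
import tempfile


def chrome_cache_args(cache_dir, max_bytes=100 * 1024 * 1024):
    """Return Chrome switches for a disk cache in cache_dir capped at max_bytes."""
    return [
        f"--disk-cache-dir={cache_dir}",
        f"--disk-cache-size={max_bytes}",
    ]


def make_chrome_with_disk_cache(cache_dir=None):
    """Start a Chrome driver that keeps its cache in a fixed directory."""
    # Imported lazily so chrome_cache_args() works without Selenium installed.
    from selenium import webdriver

    if cache_dir is None:
        # Hypothetical default location under the system temp directory.
        cache_dir = os.path.join(tempfile.gettempdir(), "selenium-chrome-cache")
    options = webdriver.ChromeOptions()
    for arg in chrome_cache_args(cache_dir):
        options.add_argument(arg)
    return webdriver.Chrome(options=options)
```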

PhantomJS Web Driver Load Web Page with Disk Cache
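
For PhantomJS, the disk cache is off by default and can be enabled with a command-line switch. The sketch below assumes Selenium 3.x or earlier (PhantomJS support was removed in Selenium 4); the helper names are hypothetical, and the optional size limit is given in KB, as PhantomJS expects.

```python
def phantomjs_cache_args(max_kb=None):
    """Return PhantomJS switches that enable the disk cache."""
    args = ["--disk-cache=true"]
    if max_kb is not None:
        # Optional upper bound on the cache size, in kilobytes.
        args.append(f"--max-disk-cache-size={max_kb}")
    return args


def make_phantomjs_with_disk_cache(max_kb=None):
    """Start a PhantomJS driver with disk caching enabled."""
    # Imported lazily so phantomjs_cache_args() works without Selenium.
    from selenium import webdriver

    return webdriver.PhantomJS(service_args=phantomjs_cache_args(max_kb))
```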

Tip No: 3 Use Javascript for scrolling

When you interact with page elements, especially when clicking buttons, Selenium raises an exception saying that the element is not visible if the element you are looking for is outside the viewport. I prefer to use JavaScript to scroll the element into view, wait a moment with time.sleep() so that the scroll animation finishes, and then trigger the click. Simple!
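
The idea can be sketched as a small helper. The function name and the default pause of half a second are assumptions for illustration; execute_script() and click() are the standard Selenium calls.

```python
import time


def scroll_into_view_and_click(driver, element, pause=0.5):
    """Scroll element into the viewport with JavaScript, wait, then click it."""
    # arguments[0] refers to the element passed as the second parameter.
    driver.execute_script("arguments[0].scrollIntoView(true);", element)
    time.sleep(pause)  # give any scroll animation time to settle
    element.click()
```

A fixed sleep is the simplest option; how long to pause depends on the page, so treat the default as a starting point.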

Tip No: 4 Scrolling items in a drop-down using Select

Let's say you are trying to select an option from a select element with a large number of options, say 20 or more. In this case, you can't select the item in the ordinary way. You would first have to locate the select element using the driver, then find all of its options, filter through them to find the one you are looking for, make it visible, and finally click on it! Fortunately, Selenium has a class called 'Select' that makes this task a lot easier! Take a look at the following script. It's pretty self-explanatory, but if you have any questions about it, feel free to comment!
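
A minimal sketch using Selenium's Select helper. The wrapper function name is hypothetical; Select and select_by_visible_text() come from selenium.webdriver.support.ui.

```python
def select_option_by_text(select_element, visible_text):
    """Pick the option whose visible label equals visible_text."""
    # Imported lazily so this module loads even without Selenium installed.
    from selenium.webdriver.support.ui import Select

    # Select wraps a <select> WebElement and handles locating the option.
    Select(select_element).select_by_visible_text(visible_text)
```

For example, with a hypothetical drop-down whose id is "country", you could call select_option_by_text(driver.find_element("id", "country"), "Bangladesh"). Select also offers select_by_value() and select_by_index() if the visible text is not convenient.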

Tip No: 5 Properly close the driver using quit

It is important to properly close the driver after finishing the automation, especially if you run your scripts periodically! When you invoke your Python automation script, it starts additional processes for the Selenium web driver and the browser. After the Python script finishes execution, those resources are not released unless you tell it to do so! One way to do it is to use driver.close().

But driver.close() doesn't always stop the web driver that's running in the background. And if you are working with multiple tabs of the Selenium-controlled browser, it closes only the current one, leaving the others open.
You can use driver.quit() instead, as it closes all the tabs of the browser. The Selenium web driver instance also gets killed after this!
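
One way to make sure quit() always runs, even when the script fails halfway, is to wrap the driver in a context manager. This is only a sketch; managed_driver and its factory parameter are hypothetical names.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(factory):
    """Yield a driver created by factory() and always call quit() afterwards."""
    driver = factory()
    try:
        yield driver
    finally:
        # Runs on success and on exceptions alike: closes every tab and
        # stops the driver process.
        driver.quit()
```

With Selenium installed, usage would look like "with managed_driver(webdriver.Chrome) as driver:" followed by your crawling code in the indented block.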

Additionally, I usually append an os.system('killall phantomjs') or os.system('killall chrome') command to the code when I am running only a single task on my computer. It is just a little hack to make sure all the resources are freed up.
Warning: This might not be a good idea if you are running multiple scripts at once, or, say, surfing the Internet using Google Chrome on the same computer. The command killall chrome will definitely kill your browser as well!

I will wrap it up by saying one last thing:

Always minimize the number of requests you make to the web server. Try to reduce it as much as possible.

That is all for now! I hope these tips will help you automate your web testing or write more efficient web crawlers from now on.

I would love to hear from you! Please feel free to leave your feedback in the comments or share your preferred way of doing the above things, and don't forget to clap for it. ;)

EDIT: PhantomJS is deprecated now! So, use chromedriver or geckodriver instead!

If you are looking for python selenium alternatives, you might want to check these modules:

requests-html

mechanize

Scrapy

Marionette

Author:

Wasi Mohammed Abdullah

Thinker, Day Dreamer, Python Enthusiast, JavaScript Admirer, Introvert

CEO, Founder
Dreamcatcher IT
twitter: twitter.com/wasi0013
github: github.com/wasi0013
facebook: fb.me/wasi0013

P.S. I have switched from Medium to my personal blog. Visit my blog to get the latest posts! You can also suggest a topic of your choice by contacting me; I will try my best to write about it. :)

Personal Blog: https://wasi0013.com/blog
