5 Simple Tips for Efficient Web Crawling using Selenium Python

In this article, I will share 5 simple tips that will help you improve the automation of a web scraping bot or crawler that you wrote using Python Selenium. But first, let me briefly introduce you to Python's selenium module in case you are not familiar with it:

It is actually a Python binding for the Selenium WebDriver API. With it, you can conveniently drive web browsers like Firefox, Chrome, PhantomJS, etc. Using this module, you can use the web driver API to simulate all sorts of actions that you would perform in a typical web browser: click buttons on websites, scroll and navigate through pages, type into input boxes, submit forms, use proxies, even execute custom JavaScript on pages, and much more! All of this from just a Python script! Pretty cool, right?
Now, let's jump straight to the first tip:

Tip No: 1 Crawl Websites without Loading Images

Since we are talking about automated scripts, these scripts will run hundreds or thousands of times. So, every second (perhaps every millisecond?) counts. Most modern dynamic websites have lots of images, and when a page loads, Selenium loads all the elements in it, including those images!

Hence, even though we don't interact much with those images when we are testing website functionality, Selenium still loads them! The good thing is that there are ways to load pages without loading images in Selenium. I will show the code for the PhantomJS web driver and chromedriver below:

PhantomJS Web Driver Load Web Page without Images:
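A minimal sketch (keep in mind PhantomJS is deprecated, as noted in the edit at the end of this post). PhantomJS accepts `--load-images=false` as a command-line switch, which Selenium forwards through `service_args`; the driver lines are commented out since they need the phantomjs binary on your PATH and an older Selenium:

```python
# PhantomJS reads "--load-images=false" as a command-line switch, which
# Selenium (3.x) forwards through service_args. webdriver.PhantomJS was
# removed in Selenium 4, so this needs an older Selenium plus the
# phantomjs binary on PATH.
service_args = ["--load-images=false"]

# from selenium import webdriver
# driver = webdriver.PhantomJS(service_args=service_args)
# driver.get("https://example.com")
```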

Chrome Web Driver Load Pages without Images:

Tip No: 2 Scrape Websites using Disk Cache

Caching assets often leads to faster page loads. In modern web browsers, disk caching reduces page load time impressively, and you can take advantage of this in Selenium as well! All you have to do is set the configuration before initializing the web driver. This basically stores website assets like CSS and JS on disk for faster loading, which is helpful when you load multiple pages of the same website.
Note: Obviously, when automating your tests, you can't (and shouldn't) cache assets that affect the data you want to test! For example, if you are testing a dynamic website where the data is loaded by assets, say JavaScript, then the disk cache might even make your tests obsolete!

Chrome Web Driver Load Pages using Disk Cache:

PhantomJS Web Driver Load Web Page with Disk Cache
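For PhantomJS (again, now deprecated), disk caching is exposed through command-line switches forwarded via `service_args`; the max cache size is given in KB. The driver lines are commented out for the same reason as before:

```python
# PhantomJS turns on disk caching via "--disk-cache=true"; the maximum
# cache size ("--max-disk-cache-size") is specified in KB.
service_args = ["--disk-cache=true", "--max-disk-cache-size=102400"]

# from selenium import webdriver  # PhantomJS support was removed in Selenium 4
# driver = webdriver.PhantomJS(service_args=service_args)
```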

Tip No: 3 Use Javascript for scrolling

When interacting with page elements, especially clicking buttons, if the element you are looking for is not visible in the viewport, Selenium raises an exception notifying you that the element is not visible. I prefer using JavaScript to scroll the element into view, then waiting a bit with time.sleep() so the scroll effect finishes, and then triggering the click... simple!

Tip No: 4 Scrolling items in a drop down using Select

Let's say you are trying to select an option from a select element with a huge number of options, say 20+. In this case, you can't select the item in the ordinary way. You would have to first locate the select element using the driver, then find all the options, filter through each of them to find the one you are looking for, then make it visible and click it! Fortunately, Selenium has a class called 'Select' which helps you do all of the above in a much easier way! Take a look at the following script. It's pretty self-explanatory, but if you have any questions about it, feel free to comment!

Tip No: 5 Properly close the driver using quit

It is important to properly close the driver after finishing the automation, especially if you run your scripts periodically! When you invoke your Python automation script, it spawns additional resources/processes for the Selenium web driver. After the Python script finishes execution, it doesn't release those additional resources unless you tell it to! One way to do so is driver.close().

But driver.close() doesn't always stop the web driver running in the background. And if you are working with multiple tabs of the Selenium browser, it closes only the current one, leaving the others open.
Use driver.quit() instead: it closes all the tabs of the Selenium browser, and the web driver instance gets killed as well!
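A small sketch of the pattern: wrapping the task in try/finally guarantees driver.quit() runs even when the automation raises an exception midway. `make_driver` is a hypothetical factory, e.g. `lambda: webdriver.Chrome()`:

```python
def run_task(make_driver, url):
    """Run an automation task and always release the driver afterwards.
    try/finally guarantees quit() runs even if the task raises."""
    driver = make_driver()  # e.g. lambda: webdriver.Chrome()
    try:
        driver.get(url)
        return driver.title
    finally:
        driver.quit()  # closes every tab and ends the driver process
```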

Additionally, I usually append an os.system('killall phantomjs') or os.system('killall chrome') command to the code when I am only running a single task on my computer. Just a little hack to make sure all the resources are freed up.
Warning: it might not be a good thing to do if you are running multiple scripts at once! Or, say, surfing the Internet using Google Chrome on the same computer. The command killall chrome will definitely kill your browser as well!

I will wrap it up by saying one last thing:

Always minimize the number of requests you make to the web server. Try to reduce it as much as possible.

That is all for now! I hope these tips will help you automate your web testing or write more efficient web crawlers from now on.

I would love to hear from you! Please feel free to comment with your feedback, or share your preferred way to do the above things, and don't forget to clap for it.
;)

EDIT: PhantomJS is deprecated now! So, use chromedriver or geckodriver instead!

If you are looking for python selenium alternatives, you might want to check these modules:

requests-html

mechanize

Scrapy

Marionette

Author:

Wasi Mohammed Abdullah

Thinker, Day Dreamer, Python Enthusiast, Javascript Admirer An Introvert with Exception!

CEO, Founder 
Dreamcatcher IT
twitter: twitter.com/wasi0013
github:
github.com/wasi0013
facebook: fb.me/wasi0013