5 Simple Tips for Efficient Web Crawling using Selenium Python
In this article, I will share 5 simple tips that will help you improve the web scraping bot or crawler that you wrote using Python Selenium. But first, in case you are not familiar with it: Python's selenium module is the set of Python bindings for Selenium WebDriver, which lets a script drive a real web browser.
Now, let's jump straight to the first tip:
Tip No 1: Crawl websites without loading Images
Since we are talking about automated scripts, these scripts will run hundreds or thousands of times, so every second (perhaps every millisecond) counts. Most modern dynamic websites have lots of images. When a page loads, the browser that Selenium drives loads every element on it, including those images!
Even though we rarely interact with those images when we are testing a website's functionality, Selenium still loads them! The good thing is that there are ways to load pages in Selenium without loading images. I will show the code for the PhantomJS web driver and chromedriver below:
PhantomJS Web Driver Load Web Page without Images
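A minimal sketch of how this might look, assuming Selenium 3.x or older (the `webdriver.PhantomJS` class was removed in Selenium 4) and a `phantomjs` binary on your PATH. PhantomJS accepts a `--load-images` command-line switch, which can be passed through `service_args`. The helper names below are illustrative and not part of Selenium.

```python
def phantomjs_no_image_args():
    """Command-line switches that tell PhantomJS to skip image downloads."""
    return ["--load-images=false"]


def make_phantomjs_without_images():
    # Imported inside the function so the helper above works without Selenium.
    # Assumption: Selenium 3.x or older; webdriver.PhantomJS is gone in Selenium 4.
    from selenium import webdriver

    return webdriver.PhantomJS(service_args=phantomjs_no_image_args())
```

Calling `make_phantomjs_without_images()` should return a driver whose pages load without their images.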
Chrome Web Driver Load Pages without Images:
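One common way to do this, sketched here under the assumption that chromedriver is installed and that you are on a Selenium version that accepts `options=`, is to set a Chrome content-settings preference before the driver starts. The function names are illustrative.

```python
def chrome_no_image_prefs():
    """Chrome preferences that block image loading (1 = allow, 2 = block)."""
    return {"profile.managed_default_content_settings.images": 2}


def make_chrome_without_images():
    # Imported inside the function so the helper above works without Selenium.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    # Preferences must be set before the browser is launched.
    options.add_experimental_option("prefs", chrome_no_image_prefs())
    return webdriver.Chrome(options=options)
```

The preference only affects the browser started with these options, so your regular Chrome profile is left alone.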
Tip No 2: Scrape Websites using Disk Cache
Caching assets often leads to faster page loads. In modern web browsers, disk caching reduces page load time considerably, and you can take advantage of this in the Selenium web driver as well! All you have to do is set the configuration before initializing the web driver. This stores the website's assets, such as CSS and JS files, on disk for faster loading, which is helpful when you load multiple pages of the same website.
Chrome Web Driver Load Pages using Disk Cache:
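A possible sketch, assuming chromedriver is installed. It uses the Chromium command-line switches `--disk-cache-dir` and `--disk-cache-size` (in bytes) to keep the cache in a fixed directory, so it can be reused between runs. The function names and the cache path you pass in are illustrative choices, not something Selenium prescribes.

```python
def chrome_cache_args(cache_dir, max_bytes=None):
    """Build Chromium switches that point the disk cache at `cache_dir`."""
    args = ["--disk-cache-dir={}".format(cache_dir)]
    if max_bytes is not None:
        # Optional upper limit for the cache size, in bytes.
        args.append("--disk-cache-size={}".format(max_bytes))
    return args


def make_chrome_with_cache(cache_dir, max_bytes=None):
    # Imported inside the function so the helper above works without Selenium.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for arg in chrome_cache_args(cache_dir, max_bytes):
        options.add_argument(arg)
    return webdriver.Chrome(options=options)
```

For example, `make_chrome_with_cache("/tmp/crawler-cache", 100 * 1024 * 1024)` would ask Chrome to keep up to roughly 100 MB of cached assets in that (hypothetical) directory.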
PhantomJS Web Driver Load Web Page with Disk Cache
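A minimal sketch for PhantomJS, again assuming Selenium 3.x or older, since `webdriver.PhantomJS` no longer exists in Selenium 4. PhantomJS has a `--disk-cache` switch, plus `--max-disk-cache-size` to limit the cache size in KB. The helper names are illustrative.

```python
def phantomjs_cache_args(max_kb=None):
    """Build PhantomJS switches that turn on the disk cache."""
    args = ["--disk-cache=true"]
    if max_kb is not None:
        # Optional upper limit for the cache size, in kilobytes.
        args.append("--max-disk-cache-size={}".format(max_kb))
    return args


def make_phantomjs_with_cache(max_kb=None):
    # Imported inside the function so the helper above works without Selenium.
    # Assumption: Selenium 3.x or older; webdriver.PhantomJS is gone in Selenium 4.
    from selenium import webdriver

    return webdriver.PhantomJS(service_args=phantomjs_cache_args(max_kb))
```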
Tip No 4: Scrolling items in a drop down using Select
Let's say you are trying to select an option from a select element with a large number of options, say 20 or more. In this case, you can't select the item in the ordinary way. You first have to locate the select element using the driver, then find all of its options and filter through them to find the one you are looking for. After that, you make it visible and click on it! Fortunately, Selenium has a class called ‘Select’ which makes this task a lot easier! Take a look at the following script. It's pretty self-explanatory, but if you have any questions about it, feel free to comment!
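As a rough sketch of the idea, the two helpers below wrap a select element in Selenium's `Select` class and pick an option by its visible text. The helper names and the element id used in the usage note are hypothetical. `Select`, `select_by_visible_text` and `first_selected_option` are part of Selenium's `selenium.webdriver.support.ui` module.

```python
def find_dropdown(driver, select_id):
    """Wrap the <select> element with the given id in Selenium's Select helper."""
    # Imported inside the function so choose_option() works without Selenium.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import Select

    return Select(driver.find_element(By.ID, select_id))


def choose_option(dropdown, label):
    """Select the option whose visible text is `label`; return the selected text."""
    # Raises NoSuchElementException when no option has that visible text.
    dropdown.select_by_visible_text(label)
    return dropdown.first_selected_option.text
```

With a running driver, `choose_option(find_dropdown(driver, "country"), "Canada")` would select "Canada" in a drop down whose id is `country`, with no manual scrolling or filtering. `Select` also offers `select_by_value` and `select_by_index` if the visible text is not convenient.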
Tip No 5: Properly close the driver using quit
It is important to properly close the driver after finishing the automation, especially if you run your scripts periodically! When you invoke your Python automation script, it uses additional resources and processes for the Selenium web driver. After the Python script finishes execution, it does not release those additional resources unless you tell it to! One way to do that is to use driver.close().
But driver.close() doesn't always stop the web driver that is running in the background. And if you are working with multiple tabs of the Selenium web browser, it closes only the current one, leaving the others open.
You can use driver.quit() instead, as it closes all the tabs of the Selenium web browser. The Selenium web driver instance also gets killed after this!
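One way to make sure driver.quit() always runs, even when the scraping code raises an exception, is to wrap the driver in a small context manager. This is only a sketch: `managed_driver` is an illustrative name, and `create_driver` is any callable that returns a driver, for example `webdriver.Chrome`.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(create_driver):
    """Yield a driver from `create_driver()` and always call quit() afterwards."""
    driver = create_driver()
    try:
        yield driver
    finally:
        # quit() closes every window and tab and ends the WebDriver session,
        # even if the code inside the `with` block raised an exception.
        driver.quit()
```

You would then write `with managed_driver(webdriver.Chrome) as driver:` and do all the crawling inside that block. A plain try/finally around your script achieves the same thing.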
Additionally, I usually append an os.system('killall phantomjs') or os.system('killall chrome') command to the code when I am running only a single task on my computer. It is just a little hack to make sure all the resources are freed up.
Warning: This might not be a good thing to do if you are running multiple scripts at once, or, say, surfing the Internet using Google Chrome on the same computer. The command killall chrome will definitely kill your own browser as well!
I will wrap it up by saying one last thing:
Always minimize the number of requests you make to the web server. Try to reduce it as much as possible.
That is all for now! I hope these tips will help you automate your web testing or write more efficient web crawlers from now on.
I would love to hear from you! Please feel free to comment with your feedback or share your preferred way of doing the above things, and don't forget to clap for it.
EDIT: PhantomJS is deprecated now! So, use chromedriver or geckodriver instead!
If you are looking for python selenium alternatives, you might want to check these modules: