5 Simple Tips for Improving Automated Web Testing and Efficient Web Crawling Using Python Selenium
In this article, I will share 5 simple tips that will help you improve your website testing scripts, web scraping bots, or crawlers written with Python Selenium. But first, in case you are not familiar with it, a quick introduction: Selenium automates real web browsers, and its Python bindings (the selenium module) let your script open pages, click, type, and read content in browsers such as Chrome, Firefox, or PhantomJS.
Now, let's jump straight to the first tip:
Tip No 1: Crawl websites without loading Images
Since we are talking about automated scripts, these scripts will run hundreds or thousands of times, so every second (perhaps every millisecond?) counts. Most modern dynamic websites have lots of images, and when a page loads, Selenium loads every element on it, including those images!
Hence, even though we barely interact with those images while testing website functionality, Selenium still loads them! The good news is that there are ways to load pages without loading images in Selenium. I will show the code for the PhantomJS WebDriver and ChromeDriver below:
PhantomJS WebDriver: Load Pages without Images
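A minimal sketch, assuming Selenium 3 with the PhantomJS binary on your PATH. Image loading is switched off through PhantomJS's page settings, passed as desired capabilities (the --load-images=no service argument works as well):

```python
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

# Copy the default PhantomJS capabilities and switch off image loading
caps = DesiredCapabilities.PHANTOMJS.copy()
caps['phantomjs.page.settings.loadImages'] = False

driver = webdriver.PhantomJS(desired_capabilities=caps)
driver.get('http://example.com')
print(driver.title)
driver.quit()
```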
Chrome WebDriver: Load Pages without Images
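And the equivalent sketch for ChromeDriver, which blocks images through Chrome's content-settings preference (the value 2 means "block"):

```python
from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
# 2 = block images on every site the browser visits
prefs = {'profile.managed_default_content_settings.images': 2}
chrome_options.add_experimental_option('prefs', prefs)

driver = webdriver.Chrome(options=chrome_options)  # use chrome_options= on older Selenium 3 releases
driver.get('http://example.com')
print(driver.title)
driver.quit()
```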
Tip No 2: Scrape Websites using Disk Cache
Caching assets often leads to faster page loads. In modern web browsers, disk caching reduces page load time impressively, and you can take advantage of it in Selenium WebDriver as well! All you have to do is set the configuration before initializing the web driver. This stores website assets like CSS and JS on disk for faster loading, which is helpful when you load multiple pages of the same website.
Chrome WebDriver: Load Pages using Disk Cache
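A minimal sketch: Chrome keeps its disk cache inside the profile directory, so pointing --user-data-dir at a fixed path (the path below is just an example) lets the cache survive between runs, and --disk-cache-size caps it in bytes:

```python
from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
# Reuse a persistent profile so the disk cache survives between runs
chrome_options.add_argument('--user-data-dir=/tmp/selenium-profile')
# Allow up to ~100 MB of cached assets (value is in bytes)
chrome_options.add_argument('--disk-cache-size=104857600')

driver = webdriver.Chrome(options=chrome_options)
driver.get('http://example.com')
driver.quit()
```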
PhantomJS WebDriver: Load Pages with Disk Cache
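For PhantomJS, a sketch using its command-line switches passed as service_args (--max-disk-cache-size is in KB):

```python
from selenium import webdriver

# --disk-cache=true enables the cache; --max-disk-cache-size is in KB
driver = webdriver.PhantomJS(
    service_args=['--disk-cache=true', '--max-disk-cache-size=102400']
)
driver.get('http://example.com')
driver.quit()
```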
Tip No 4: Selecting an item in a drop-down using Select
Let's say you are trying to select an option from a select element with a huge number of options, say 20+. In this case, you can't select the item the ordinary way: you would have to locate the select element with the driver, find all its options, filter through each of them to find the one you are looking for, and then make it visible and click on it! Fortunately, Selenium has a class called 'Select' that makes this task a lot easier. Take a look at the following script. It's pretty self-explanatory, but if you have any questions about it, feel free to comment!
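A minimal sketch; the URL and the 'country' id are made-up placeholders for your own page:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select

driver = webdriver.Chrome()
driver.get('http://example.com/form')  # placeholder URL

# Wrap the <select> element in Selenium's Select helper
select = Select(driver.find_element(By.ID, 'country'))  # 'country' is a made-up id

# Pick an option by visible text, by value attribute, or by index
select.select_by_visible_text('Bangladesh')
# select.select_by_value('BD')
# select.select_by_index(10)

driver.quit()
```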
Tip No 5: Properly close the driver using quit
It is important to properly close the driver after the automation finishes, especially if you run your scripts periodically! When you invoke your Python automation script, it spawns additional processes for the Selenium web driver, and after the script finishes execution those resources are not released unless you tell it to do so. One way to do this is driver.close().
But driver.close() doesn't always stop the web driver running in the background. And if you are working with multiple tabs of the Selenium-controlled browser, it closes only the current one, leaving the others open!
You can use driver.quit() instead, as it closes all the tabs of the Selenium-controlled browser. The WebDriver instance itself also gets killed after this!
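A common pattern is to put quit() in a finally block so the driver process is cleaned up even if the script raises an exception; a minimal sketch:

```python
from selenium import webdriver

driver = webdriver.Chrome()
try:
    driver.get('http://example.com')
    # ... your automation steps ...
finally:
    # quit() closes every tab AND shuts down the driver process;
    # close() would only close the current tab
    driver.quit()
```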
Additionally, I usually append an os.system('killall phantomjs') or os.system('killall chrome') command to the code when I am running only a single task on my computer. Just a little hack to make sure all the resources are freed up.
Warning: this might not be a good idea if you are running multiple scripts at once, or, say, browsing the Internet with Google Chrome on the same computer. The command killall chrome will definitely kill your browser as well!
Bonus: Some words of wisdom
First, let's discuss a scenario: assume there is a website with thousands of products across different categories. You want to automate a task that requires you to crawl through all the product pages and fetch product data by visiting each individual product description page.
You can do this in two simple steps. First, crawl through all the category pages and collect the product description page links by navigating through the pagination. Second, after collecting all these product description links, crawl through them to get the product data for analysis. So far so good…
However, suppose you have to do this frequently, say every 6 hours, the product links stay constant for every product, and the website doesn't update existing products very often. Then you can store the product links after the first step and perform the second step using the stored links! This simple trick not only reduces the requests your crawler makes to the server but also saves you a lot of resources, e.g. RAM, CPU power, and time. Instead of fetching all the product links by visiting every category and paginating through it, the script uses the stored links and crawls through them directly… Pretty neat!
You can always refresh the stored product description links by re-running the first step periodically at a lower frequency. This will make sure you don't miss any newly added products.
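Here is a minimal sketch of that idea, with a hypothetical product_links.json cache file and a placeholder collect_product_links() standing in for the site-specific crawling of step one:

```python
import json
import os

LINKS_FILE = 'product_links.json'  # hypothetical cache file

def collect_product_links(driver):
    """Step 1 (expensive): walk every category page and its pagination,
    collecting product description links. Selectors are site-specific, e.g.:
        for a in driver.find_elements(By.CSS_SELECTOR, '.product a'):
            links.append(a.get_attribute('href'))
    """
    links = []
    # ... site-specific crawling goes here ...
    return links

def get_product_links(driver):
    # Reuse the stored links when the cache exists; rebuild it otherwise
    if os.path.exists(LINKS_FILE):
        with open(LINKS_FILE) as f:
            return json.load(f)
    links = collect_product_links(driver)
    with open(LINKS_FILE, 'w') as f:
        json.dump(links, f)
    return links

# Step 2: visit only the cached links
# for url in get_product_links(driver):
#     driver.get(url)
#     ... scrape the product data ...
```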
That is all for the moment! I hope these tips help you automate your web testing or write more efficient web crawlers from now on.
I would love to hear from you! Please feel free to comment with your feedback or share your preferred way of doing these things, and don't forget to clap for it.