Python — Automate operations using web scraping

Amjad Hussain Syed · Published in tajawal · 3 min read · Apr 14, 2018

If you’re like tajawal, always striving to run on the latest software versions, this article will help you and your DevOps team spot new releases of the software you use.

To find a new version, you normally have to browse each and every release page to see whether a new version has been published and then plan the upgrade.

We can automate this entire process using a technique called web scraping; I’m using Python to achieve it. The following are some of the modules you can use for web scraping:

  • BeautifulSoup
  • Lxml
  • Selenium

Web scraping?

Web scraping is the process of extracting data from websites. First, we send a GET request to the URL and download the HTML content. Once that’s done, the data is ready to be searched and formatted.

Getting started

We are going to use requests to download the page content and lxml to query the downloaded content using XPath.

To install requests and lxml we use pip, the package management tool for Python.

Launch the terminal and type the following commands to install the libraries:

sudo pip install lxml
sudo pip install requests

Now that we have the required modules in place, we need to analyze the page we want to extract the data from. For this example, we are going to fetch the Elasticsearch version. To do this, I use Chrome developer tools to inspect the page and copy the XPath of the element that holds the version.

We have successfully created the XPath to get the version; now it’s time to code.

Code:
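Here is a minimal sketch of the scraper. The URL and the XPath below are placeholders for illustration; replace the XPath with the expression you copied from Chrome developer tools.

import requests
from lxml import html

URL = "https://www.elastic.co/downloads/elasticsearch"
# Placeholder XPath; replace with the expression copied from developer tools.
VERSION_XPATH = '//*[@id="download-header"]//h1/text()'

def fetch_version():
    # Download the page content with a GET request.
    response = requests.get(URL, timeout=30)
    response.raise_for_status()
    # Parse the HTML and query it using XPath.
    tree = html.fromstring(response.content)
    matches = tree.xpath(VERSION_XPATH)
    if not matches:
        raise RuntimeError("Version not found; the page layout may have changed")
    return matches[0].strip()

if __name__ == "__main__":
    print(fetch_version())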

Now that we have fetched the version, we can store it in a CSV/YAML file, schedule the script to run daily or weekly, and compare the stored version with the fetched one. If a new version is found, send an email or a Slack notification to the channel.
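A rough sketch of that compare-and-notify step, assuming the fetch_version() helper above; the state file name and the Slack incoming-webhook URL are placeholders.

import json
import requests

STATE_FILE = "elasticsearch_version.txt"  # placeholder state file
SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def check_for_new_version():
    current = fetch_version()
    try:
        with open(STATE_FILE) as f:
            previous = f.read().strip()
    except FileNotFoundError:
        previous = None
    if current != previous:
        # Record the new version so the next run compares against it.
        with open(STATE_FILE, "w") as f:
            f.write(current)
        # Post a message to the channel via a Slack incoming webhook.
        requests.post(
            SLACK_WEBHOOK,
            data=json.dumps({"text": "New Elasticsearch version: " + current}),
            headers={"Content-Type": "application/json"},
        )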

Things to take care of

  1. Do not run the program too frequently, as it may impact the site’s performance and get your IP blocked.
  2. This process will not work if the website you’re requesting changes its layout; you will have to revisit the website and update the code.
