Pentest Magazine: Web Scraping with Python

heavenraiza
9 min read · Sep 29, 2016


Hello again folks! Here I am sharing my knowledge on Python. Hopefully most of you have heard of Python by now. If not, where have you been hiding? Like seriously! If you’re into infosec, programming, hacking, etc., you should have heard of Python. If you don’t know it, what are you waiting for? As for me, I love coding. It’s nice having a vast tool set to choose from. If I am on a Windows box, I’ll code in PowerShell. If I am on a Linux or OS X box, Python is my choice. Nothing against Ruby, but Python has gained more popularity in the community. Ruby enthusiasts will beg to differ. But seriously, it’s even the language of choice for Elliot (Mr. Robot). ;-)

I know there are plenty of articles regarding Python in this edition, but just in case, what is Python? As noted on https://www.python.org, “Python is a programming language that lets you work quickly and integrate systems more effectively”. “Python is powerful… and fast; plays well with others; runs everywhere; is friendly & easy to learn; is Open”, also from python.org. Python is also typically a first language for new programmers; it is used to teach many computer science courses and concepts. There are plenty of resources for learning the basics of Python. From an infosec perspective, Pentest Magazine has a Python course. :-)

If you’re running a Linux distro or OS X, then most likely Python is already installed. Just launch a terminal session and type python. If it’s installed, you should see output similar to the one below.

In the screenshot above you can see I am running Python 2.7.12 within this Ubuntu virtual machine. To exit the interactive prompt, type quit() and hit Enter.

Another way to verify that you have Python installed is by typing the following: python --version.

If you’re running Windows, you can install the Windows binary. Since Python 3 has not yet gained widespread adoption, it is recommended you install Python 2.7.x. The latest version is Python 2.7.12. You can download it from https://www.python.org/downloads/.

Since this article is not about introducing you to Python programming concepts, general syntax, etc., we’ll dive right into the subject of web scraping. What is web scraping? Web scraping is a computer software technique for extracting information from websites. The technique is also known as web harvesting or web data extraction, according to Wikipedia. Python is a good language for web scraping. In this article we’ll use the Pentest Magazine website.

So for this technique you will need some extra modules. The first I’ll mention is urllib. The urllib module provides a high-level interface for fetching data across the World Wide Web. In particular, the urlopen() function is similar to the built-in function open(), but accepts Universal Resource Locators (URLs) instead of filenames. Some restrictions apply — it can only open URLs for reading, and no seek operations are available, as noted on python.org.

So typically the first step will be to check if the website is up and fetch it. First let’s import urllib in an interactive prompt and see what’s available in this module.
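A quick way to explore the module at the interactive prompt is the built-in dir() function (the exact listing varies by Python version, so none is shown here):

```python
import urllib

# dir() lists the names (functions, classes, constants) the module exposes
print(dir(urllib))
```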

As we can see, there are a good number of functions within the urllib package. In order to fetch a web page we’ll use urlopen. At this point I’ll start a text editor and begin writing my Python script.
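A minimal sketch of such a script might look like the following. It assumes Python 2’s urllib as used in the article, with a Python 3 urllib.request fallback; the fetch is wrapped in a function because Python 3’s urlopen raises HTTPError on 4xx/5xx codes instead of returning them.

```python
try:
    from urllib import urlopen           # Python 2, as in the article
    from urllib2 import HTTPError
except ImportError:
    from urllib.request import urlopen   # Python 3 equivalents
    from urllib.error import HTTPError

def http_code(url):
    """Return the HTTP status code for a URL."""
    try:
        return urlopen(url).getcode()
    except HTTPError as err:
        # Python 3's urlopen raises on 4xx/5xx; the error carries the code
        return err.code

# Requires network access; the article's run printed 403 (Forbidden):
# print(http_code('https://pentestmag.com/'))
```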

The code above does the following:

  • import the urllib package/module
  • use the urlopen function and store the result in a variable called webpage
  • print the HTTP code (200, 403, 404, etc.) to the console

This will tell us whether the website is up or not. Based on the output, the site is returning a code of 403, which means Forbidden.

To overcome this hurdle, and many others that you might encounter in your web scraping adventures, we’ll add some header information to our script using urllib2. “urllib2” is a Python module for fetching URLs (Uniform Resource Locators). It offers a very simple interface, in the form of the urlopen function. This is capable of fetching URLs using a variety of different protocols. It also offers a slightly more complex interface for handling common situations — like basic authentication, cookies, proxies and so on. These are provided by objects called handlers and openers, from python.org.

Below is the rewritten script, adding header information to trick the website into thinking GoogleBot is crawling it.
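A sketch of that rewrite, again with a Python 3 fallback for urllib2. The exact Googlebot User-Agent string here is an example, and the fetch itself is left commented out since it needs network access:

```python
try:
    from urllib2 import Request, urlopen           # Python 2, as in the article
except ImportError:
    from urllib.request import Request, urlopen    # Python 3 equivalent

# Claim to be Googlebot (an example User-Agent string)
headers = {'User-Agent': 'Googlebot/2.1 (+http://www.google.com/bot.html)'}
request = Request('https://pentestmag.com/', headers=headers)

# Requires network access; with the header set, the article's run got 200:
# webpage = urlopen(request)
# print(webpage.getcode())
```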

Now we can run the script and see that the HTTP code is different.

Now let’s go back to our original script. Even though the return code was 403, can we still read the source? That is what we need for scraping: the source code. So let’s run that script, with a slight modification, and see the output from the read() function (which was added).

Updated code:
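A sketch of the modified script. As before, it assumes Python 2’s urllib with a Python 3 fallback; under Python 3 the 403 arrives as an HTTPError, which is itself a readable response, so the body can still be pulled out:

```python
try:
    from urllib import urlopen           # Python 2, as in the article
    from urllib2 import HTTPError
except ImportError:
    from urllib.request import urlopen   # Python 3 equivalents
    from urllib.error import HTTPError

def fetch_source(url):
    """Return (HTTP code, raw page source), even for error responses."""
    try:
        response = urlopen(url)
    except HTTPError as err:
        response = err   # the error object is still a readable response
    return response.getcode(), response.read()

# Requires network access:
# code, source = fetch_source('https://pentestmag.com/')
# print(code)
# print(source)
```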

Output:

As we can see, we don’t get the page source. Now let’s see the output with the headers set as GoogleBot.

Output (Python script with header info was saved as scraper2.py):

OK, so at this point we know we can obtain the source code with Python. Now begins the meat of it all: retrieving data and parsing it. One of the challenges with HTML parsing is that whatever website you’re looking to scrape might not adhere to HTML standards and/or will have broken HTML tags. So you need to spend some time manually inspecting the source code in order to build your scraper. You will also need to choose which route you will take to help you in this task. Which parser will you use? You might use lxml, HTMLParser, or BeautifulSoup.

For this article I’ll cover BeautifulSoup. You can read more information about BeautifulSoup at https://www.crummy.com/software/BeautifulSoup/. An excerpt from the site regarding BeautifulSoup:

Beautiful Soup is a Python library designed for quick turnaround projects like screen-scraping. Three features make it powerful:

  • Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. It doesn’t take much code to write an application.
  • Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don’t have to think about encodings, unless the document doesn’t specify an encoding and Beautiful Soup can’t detect one. Then you just have to specify the original encoding.
  • Beautiful Soup sits on top of popular Python parsers like lxml and html5lib, allowing you to try out different parsing strategies or trade speed for flexibility.

If you’re running Ubuntu, you can download/install BeautifulSoup by running the following commands:

  • If you don’t have pip installed: sudo apt-get install python-pip
  • Then install the package: pip install beautifulsoup4

Within this article I will go through a real case of web scraping, to give you an idea of how meticulous the process can be. Meaning, I will not show you only the success case but rather the trial and error. Now, by investigating the website, we have to choose what we want to scrape. So let’s say I want to learn more about pentesting with Python. On the home page of pentestmag.com there is a course titled ‘Automate Your Pentests with Python’. I would like to retrieve this through Python.

If we inspect the source, let’s see what we need to target in order to achieve our task.

So we can see the following:

  • <li> tag
  • 2 <div> tags
  • <a> tag
  • <img> tag

At this point we can begin to update our script. We will import BeautifulSoup and use it for our web scraping.

Updated script:
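A sketch of the updated script, combining the earlier Googlebot header trick with BeautifulSoup. The function name and the ‘html.parser’ choice (Python’s built-in parser) are this sketch’s assumptions, not necessarily what the original script used:

```python
from bs4 import BeautifulSoup

try:
    from urllib2 import Request, urlopen           # Python 2, as in the article
except ImportError:
    from urllib.request import Request, urlopen    # Python 3 equivalent

def get_soup(url):
    """Fetch a page while claiming to be Googlebot, return a parse tree."""
    headers = {'User-Agent': 'Googlebot/2.1 (+http://www.google.com/bot.html)'}
    source = urlopen(Request(url, headers=headers)).read()
    return BeautifulSoup(source, 'html.parser')

# Requires network access:
# soup = get_soup('https://pentestmag.com/')
# print(soup.title)   # quick sanity check that parsing worked
```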

After verifying everything is working so far, we’ll now begin using ‘BS’ (BeautifulSoup) to get elements from the webpage. Now let’s think this through. We want to retrieve information on Python courses from the web page. The information we’re seeking is a link, but the link contains only an <img> tag, no text. This link is within multiple <div> tags within a <li> tag. On top of that, the <div> tags are not using ID attributes; instead they’re using CLASS attributes. So given that an <img> tag is used for the Python course, maybe we can retrieve all the <img> tags? Now look at the source again. There isn’t anything distinguishable within the <img> tag that will let us know we retrieved what we’re seeking. The next viable option is to seek out all <a> tags.

Updated code:
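In the article the soup comes from the live page; to keep this sketch self-contained and its counts deterministic, it runs the same find_all call against a small placeholder sample standing in for the page structure (tags and URLs here are made up; the live run found 91 links):

```python
from bs4 import BeautifulSoup

# Placeholder stand-in for the page source (made-up tags and URLs)
SAMPLE = """<ul>
<li><div class="course"><div class="block_media images_only">
<a href="https://pentestmag.com/python-course/"><img src="python.jpg"></a>
</div></div></li>
<li><a class="menu-toggle">Menu</a></li>
<li><a href="https://pentestmag.com/">Home</a></li>
</ul>"""

soup = BeautifulSoup(SAMPLE, 'html.parser')
links = soup.find_all('a')   # older scripts may use the deprecated findAll()
print(len(links))            # 3 in this sample; the article saw 91 on the live page
```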

Output:

Based on the output there are 91 links. Now, do we want to loop through 91 links to retrieve the info we’re seeking? Let’s say we’re OK with that. Let’s loop through the links and pull the location each link points to using the HREF attribute.

Updated code:
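A sketch of that loop, run against a small placeholder sample standing in for the page (made-up tags and URLs). It reproduces the problem the article runs into: indexing ['href'] on a tag that has no HREF raises a KeyError, so the loop dies partway through.

```python
from bs4 import BeautifulSoup

# Placeholder stand-in for the page source (made-up tags and URLs)
SAMPLE = """<ul>
<li><div class="course"><div class="block_media images_only">
<a href="https://pentestmag.com/python-course/"><img src="python.jpg"></a>
</div></div></li>
<li><a class="menu-toggle">Menu</a></li>
<li><a href="https://pentestmag.com/">Home</a></li>
</ul>"""

soup = BeautifulSoup(SAMPLE, 'html.parser')
try:
    for link in soup.find_all('a'):
        print(link['href'])        # dies on a tag without an HREF attribute
except KeyError as err:
    print('KeyError: %s' % err)
```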

Output:

So we have an issue, but why? We can either dig through the source or use BS to find out. Let’s update the script and have it output all the links in general.

Updated code:
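Printing the tags themselves, rather than their HREFs, makes the culprit visible. Again a self-contained sketch against a placeholder sample (made-up tags and URLs):

```python
from bs4 import BeautifulSoup

# Placeholder stand-in for the page source (made-up tags and URLs)
SAMPLE = """<ul>
<li><div class="course"><div class="block_media images_only">
<a href="https://pentestmag.com/python-course/"><img src="python.jpg"></a>
</div></div></li>
<li><a class="menu-toggle">Menu</a></li>
<li><a href="https://pentestmag.com/">Home</a></li>
</ul>"""

soup = BeautifulSoup(SAMPLE, 'html.parser')
for link in soup.find_all('a'):
    print(link)   # the problem tag shows a CLASS attribute but no HREF
```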

Output:

We see from the output above that in the last line of the screenshot the <a> tag uses CLASS instead of HREF. We could now update our script to catch that error, or to ignore links that are not using HREF, but honestly we can narrow down our search to look at all the <div>’s, or a particular <div>, and retrieve the links from there instead. By doing so we will have a smaller result set to work with. The above case is an example of something I mentioned earlier in the article: scraping is not straightforward, and you shouldn’t expect the HTML to be coded a particular way. Here it was assumed that HREF is used with every <a> tag, and that wasn’t the case.

Now we’ll update the script to look for <div>’s instead. How many do we get in our result set?

Updated code:
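The same sketch pattern with find_all('div'), against the placeholder sample (made-up tags and URLs; the counts on the live page were of course much larger):

```python
from bs4 import BeautifulSoup

# Placeholder stand-in for the page source (made-up tags and URLs)
SAMPLE = """<ul>
<li><div class="course"><div class="block_media images_only">
<a href="https://pentestmag.com/python-course/"><img src="python.jpg"></a>
</div></div></li>
<li><a class="menu-toggle">Menu</a></li>
<li><a href="https://pentestmag.com/">Home</a></li>
</ul>"""

soup = BeautifulSoup(SAMPLE, 'html.parser')
divs = soup.find_all('div')
print(len(divs))   # 2 in this sample; the live page returned far more
```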

Output:

So our result set is larger with <div>’s and that makes sense. How can we narrow down our result set? By looking at the source code we see that we can use CLASS=”block_media images_only” to narrow our search down.

Updated code:
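Filtering by CLASS can be sketched like this (placeholder sample again; BeautifulSoup matches the exact string value of a multi-valued class attribute, and soup.select('div.block_media.images_only') would be an equivalent CSS-selector route):

```python
from bs4 import BeautifulSoup

# Placeholder stand-in for the page source (made-up tags and URLs)
SAMPLE = """<ul>
<li><div class="course"><div class="block_media images_only">
<a href="https://pentestmag.com/python-course/"><img src="python.jpg"></a>
</div></div></li>
<li><a class="menu-toggle">Menu</a></li>
<li><a href="https://pentestmag.com/">Home</a></li>
</ul>"""

soup = BeautifulSoup(SAMPLE, 'html.parser')
blocks = soup.find_all('div', class_='block_media images_only')
print(len(blocks))   # 1 in this sample; the article narrowed the live page to 15
```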

Output:

Great! We’re down to 15 results. Now we’ll need to pull the HREF data from all the <a> tags within these 15 <div>’s. Hopefully our code won’t break.

Updated code:
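The final step can be sketched against the placeholder sample (made-up tags and URLs): loop over the narrowed-down <div>’s and pull the HREF from each <a> inside. Using .get('href') instead of ['href'] returns None rather than raising if an HREF is missing, which sidesteps the earlier crash:

```python
from bs4 import BeautifulSoup

# Placeholder stand-in for the page source (made-up tags and URLs)
SAMPLE = """<ul>
<li><div class="course"><div class="block_media images_only">
<a href="https://pentestmag.com/python-course/"><img src="python.jpg"></a>
</div></div></li>
<li><a class="menu-toggle">Menu</a></li>
<li><a href="https://pentestmag.com/">Home</a></li>
</ul>"""

soup = BeautifulSoup(SAMPLE, 'html.parser')
for div in soup.find_all('div', class_='block_media images_only'):
    for link in div.find_all('a'):
        print(link.get('href'))   # .get() returns None if HREF is missing
```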

Output:

There we go! You can always add extra code so the script only outputs the information you’re seeking (the one Python link), but I’ll leave that up to you. :-)

Hopefully after reading this article, and the other articles within this edition of Pentest Magazine, you’ll see the value of adding Python to your tool set. It’s a powerful language yet super easy to learn and use.

If you want to continue learning about web scraping, you can check out the full documentation for BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/. You can also look into a package called mechanize: http://wwwsearch.sourceforge.net/mechanize/. There is also a web scraping framework called Scrapy: http://doc.scrapy.org/en/1.1/index.html.

Thanks for reading!

Sam Vega (@heavenraiza)
