Web Scraping — An Introduction

Hugh Gallagher · Published in Analytics Vidhya · Dec 18, 2019

Web scraping is, typically, an automated process that involves scanning a list of web pages looking for data, and then saving that data for later use. This is something that could be done manually, but is labour intensive and unnecessary when an easier alternative exists.


As a tool for automating the scraping process, I’ll look at the Selenium library for Python. This is a very straightforward and functional library that can drive an array of different browsers. To begin a script you’ll have to create a “driver” object; this is where you choose which browser you are going to be using. I will be skimming over this point in this article, but will refer to it as ‘driver’ where necessary.
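For reference, here’s a minimal sketch of creating that driver object, assuming Chrome with chromedriver installed; any other supported browser works the same way:

from selenium import webdriver

# Create the driver object; this assumes Chrome and that chromedriver
# is available on your PATH. webdriver.Firefox() works the same way.
driver = webdriver.Chrome()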

A good portion of web scraping is having some basic knowledge of HTML: enough to read through a web page and pinpoint what you’re scraping for. I am going to assume you know at least the structure of a basic, blank web page, and the names of the basic elements.

Snippet of HTML from google.com

Even just looking at the first dozen or so lines of HTML from google.com (above) shows that the number of cascading and nested elements to traverse adds up quickly. I’ll explain navigating through these nested elements (and how the structure of some pages can make it easier).

First, let’s take an example of a simpler web page, so that we can discuss how to pick out the data we want. Below you’ll find the HTML (with some internal CSS) that makes up a simple, yet very unappealing, web page. I’ll use this as our example.

<!DOCTYPE html>
<html>
  <head>
    <style>
      .a_div {background-color: blue; padding: 10px;}
      .a_span {background-color: red; padding: 10px;}
      #the_only_span {border-color: white; border-width: 5px; border-style: solid;}
      #relevant_data {background-color: pink; padding: 10px;}
    </style>
    <title></title>
  </head>
  <body>
    <div class="a_div">
      <div class="a_div">
        <span class="a_span" id="the_only_span">
          <div id="relevant_data">
            Web Scraping is a useful tool.
          </div>
        </span>
      </div>
    </div>
  </body>
</html>

The div with the id “relevant_data” is the one we want to access. There are two main methods of doing so — both require us to be able to read the HTML structure, with one also needing us to understand which element is nested in which.

Let’s start with the easier of the two. This <div> has an id, a unique identifier within the page, which makes it singularly identifiable. So, to find the text held in the <div> by locating it by id, we would use:

driver.get([WEB-PAGE NAME HERE])  # TO NAVIGATE TO THE PAGE
relevant_div = driver.find_element_by_id('relevant_data')
print(relevant_div.get_attribute('innerHTML'))  # PRINT TEXT

As you can see, this is a very straightforward process, and I wish more sites used such an easy-to-interact-with structure. Alas, most use potentially inconsistent classes rather than ids, although despite the awkwardly named classes they do tend to follow a consistent nesting system.
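To see why classes are a weaker handle, here’s a quick sketch (assuming the example page above is already loaded in ‘driver’): class names are rarely unique, so Selenium’s plural find_elements_by_class_name returns a list rather than a single element.

# Both nested <div>s share the 'a_div' class, so we get a list back.
divs = driver.find_elements_by_class_name('a_div')
print(len(divs))  # prints 2 on our example page

That consistent nesting leads us to the second main method of finding the sought-after element: the XPath.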

Let’s remove the id from our element of interest now. How would we go about uniquely describing it? Given the simplicity of the example page, it wouldn’t be too tricky. I’ll make it a little more complex, but remove the CSS.

<!DOCTYPE html>
<html>
  <head>
    <title></title>
  </head>
  <body>
    <div class="a_div">
      <div class="a_div">
        <span class="a_span" id="the_only_important_span">
          <div>
            Web Scraping is a useful tool.
          </div>
        </span>
      </div>
    </div>
    <div class="a_div">
      <span class="a_span">
        <span class="a_span" id="the_only_span">
          <div>
            Ignore me.
          </div>
        </span>
      </span>
    </div>
  </body>
</html>

So I’ve doubled up on the nested elements here, but you can still see the <span> that contains the text we’re looking for. From here I’ll show two ways the same method can be used to find it. The first goes through the span with the “the_only_important_span” id.

relevant_div = driver.find_element_by_xpath(
    "//span[@id='the_only_important_span']/div[1]")
print(relevant_div.get_attribute('innerHTML'))  # PRINT TEXT

So let’s walk through what this XPath is telling us. The ‘//’ says to look for a matching element anywhere in the document. As we have ‘//span’, we’ll be looking at all the <span> elements present. But we put a further restriction on it with ‘[@id="the_only_important_span"]’, which means we only look at <span>s with that id, of which there happens to be only one. The ‘/div[1]’ tells us to choose the first <div> child of our chosen element. In this instance that is exactly the <div> we are looking for!

As a quick aside: XPath is not zero-indexed; that is to say, counting begins at ‘1’ and not ‘0’.

Now let’s use XPath in a slightly different way, with the same example. This time we won’t use the id, but instead will be very verbose and specific in describing how the elements are nested:

relevant_div = driver.find_element_by_xpath("//div/div/span/div[1]")
print(relevant_div.get_attribute('innerHTML'))  # PRINT TEXT

Following on from the previous example, we’re again starting from any (‘//’) <div>. In this <div> we look for a <div> that contains a <span>. In that <span> we look for the first <div>. This again uniquely describes our important <div>, in a slightly different way.

In both of the above examples the ‘[1]’ could be omitted, as there is only one matching <div> to be found, but it is good practice to be explicit, to account for any changes to the page.
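On that note, here’s a quick sketch (again assuming the example page is loaded): the plural find_elements_by_xpath returns every match, which is a handy way to check whether an index like ‘[1]’ is actually needed.

matches = driver.find_elements_by_xpath("//div/div/span/div")
print(len(matches))  # prints 1 here, so the '[1]' is redundant but safe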

That covers the main methods of locating elements and retrieving the information they contain. For the most part, the remainder of web scraping is shaping your search paths, storing your data, and deciding what to do with it!
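To tie the pieces together, here’s a minimal, hypothetical sketch of that remainder: visiting a list of pages, pulling one element from each, and saving the results to a CSV file. The URLs and the element id below are placeholders, not a real site.

import csv
from selenium import webdriver

driver = webdriver.Chrome()
pages = ['https://example.com/page1', 'https://example.com/page2']  # placeholder URLs

rows = []
for url in pages:
    driver.get(url)  # navigate to the page
    element = driver.find_element_by_id('relevant_data')  # hypothetical id
    rows.append([url, element.get_attribute('innerHTML')])

with open('scraped_data.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['url', 'content'])
    writer.writerows(rows)

driver.quit()  # close the browser when finished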

You might be wondering: why web scraping? What are some applications of it? Well, part of my current position involves logging stats about my LinkedIn posts, typically views and reactions (just a simple number, not a breakdown of which reactions). This is something I had expected would be available through the LinkedIn API, but alas, it was one of the many features removed between their version 1 and version 2 APIs.

So I went about setting up a couple of virtual machines to carry out the web scraping instead. If you’d like to see how I went about that, check out my next article!

In the meantime, consider signing up for an Oracle free cloud trial and see if you can make something cool with it too!

Questions, comments? You can find me over on LinkedIn.

* All views are my own and not those of Oracle *
