How to scrape a website

Guillaume Odier

Published in Captain Data · 4 min read · Nov 6, 2018

We often hear about how much data is on the web and how it’s growing exponentially from year to year.

That leads to discussions of Big Data, and Machine Learning, and so on. But in the end, what do YOU do with web data?

The answer is probably nothing, because most websites don’t let you access their data easily. To use that information, you need a way to get at it that scales.

Luckily, there’s web scraping to the rescue!

Web scraping allows you to automatically extract content from almost any website. You can scrape virtually anything, from e-commerce shops to GitHub repositories.

How it works

First, you have to understand how a web page is created, and particularly how HTML works.

A web browser renders HTML documents. These documents describe the structure of the page semantically.

Think of it as a tree with branches. In fact, to render a web page, web browsers organize the HTML document into a tree structure called the DOM (Document Object Model).

<!DOCTYPE html> 
<html>
    <head>
        <title>This is a title</title>
    </head>
    <body>
        <h1>Heading</h1>
        <p>Hello world!</p>
    </body>
</html>

What you need to keep in mind is that everything is nested.

This is a very basic structure. You could nest it the way you want:

... 
<body>
    <h1>Heading</h1> 
    <p> 
        <span> I'm nested <b>I'm nested and bold!</b>
            <span> Wow, too much nesting for me, I'm getting lost
                <span>Wait... can you actually do that?</span> 
            </span>
        </span>
    </p>
</body>
...

There are some rules to respect, but that’s not the topic of this article.
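To make the nesting visible, here is a minimal sketch that walks the first HTML document above and records one indented line per tag or text node. It uses Python's standard `html.parser` module; the class name `OutlinePrinter` is made up for this illustration, and the sketch assumes every tag is explicitly closed, as in the example.

```python
from html.parser import HTMLParser


class OutlinePrinter(HTMLParser):
    """Hypothetical helper: records one indented line per tag or text node."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # how deeply nested the parser currently is
        self.lines = []  # the outline, one entry per node

    def handle_starttag(self, tag, attrs):
        self.lines.append("    " * self.depth + tag)
        self.depth += 1  # everything until the closing tag is a child

    def handle_endtag(self, tag):
        self.depth -= 1  # assumes each opening tag has a matching closing tag

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace between tags
            self.lines.append("    " * self.depth + "#text: " + text)


HTML = """<!DOCTYPE html>
<html>
    <head>
        <title>This is a title</title>
    </head>
    <body>
        <h1>Heading</h1>
        <p>Hello world!</p>
    </body>
</html>"""

printer = OutlinePrinter()
printer.feed(HTML)
printer.close()
print("\n".join(printer.lines))
```

Each level of indentation in the printed outline corresponds to one level of nesting in the document.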

The elements “h1” and “p” are tags. They can be described by attributes:

<h1 class="nice-heading" id="main-heading">Heading</h1>

Attributes further describe the tags (nodes). They are very, very useful, mostly because they let you describe a path to the data.

Indeed, when you say “I want to extract data” from the single line of code above, what you’re referring to is the “Heading” value, which is a text value.
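As a small illustration of the difference between a tag, its attributes, and its text value, the sketch below parses that heading with Python's standard `xml.etree.ElementTree` module. It assumes the snippet is well-formed, which is what lets an XML parser read it; real-world HTML often is not.

```python
import xml.etree.ElementTree as ET

# The heading element from the article, parsed as a tiny XML document.
heading = ET.fromstring('<h1 class="nice-heading" id="main-heading">Heading</h1>')

print(heading.tag)     # the tag name
print(heading.attrib)  # the attributes, as a dictionary
print(heading.text)    # the text value, i.e. the data you usually want
```

The attributes help you locate the node; the text is what you extract once you have found it.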

But how do you access this data?

Accessing Data

Okay, so let’s say we have the following code:

...
<body>
    <div class="container"> 
    ...
        <div class="card"> 
            <h3 class="use-case">Repositories</h3> 
            <p>Enrich your business database or find new leads to feed your CRM.</p> 
        </div>
    ...
    </div>
</body>
...

The previous (simplified) code renders a card with the heading “Repositories” and the paragraph below it.

In this case, how do you access the text of the first card described by our HTML code?

Easy! You need to use the XPath language.

The XPath language is based on a tree representation of the XML document, and provides the ability to navigate around the tree, selecting nodes by a variety of criteria.

Remember how I said that an HTML document is like a tree with branches? Well, it’s the same for XML. Both are markup languages.

XPath gives you the ability to navigate the DOM (remember, HTML is organized into a tree structure with branches!).

In the end, it’s very simple to access data, because what you get is the following structure:

div.container
    -- div.card
        -- h3.use-case 
          #text
        -- p
          #text

Now, if you want to access the text inside the <p> tag by using XPath:

document.xpath('//div[@class="container"]//div[@class="card"]//p/text()')

Basically, what this code says is: “Take the div container then go to the div card and extract the text inside the p tag”.

This way, you’re able to extract the text “Enrich your business database or find new leads to feed your CRM.”
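Here is a runnable sketch of the same idea in Python, using only the standard library. A few assumptions to flag: `xml.etree.ElementTree` supports only a subset of XPath (it has no `text()` step, so the text is read from the matched node's `.text` instead), the snippet has to be well-formed for an XML parser to accept it, and the elided `...` parts of the example are simply left out.

```python
import xml.etree.ElementTree as ET

# The card markup from the article, without the elided "..." parts.
HTML = """
<body>
    <div class="container">
        <div class="card">
            <h3 class="use-case">Repositories</h3>
            <p>Enrich your business database or find new leads to feed your CRM.</p>
        </div>
    </div>
</body>
"""

root = ET.fromstring(HTML.strip())

# "Take the div container, then go to the div card, then find the p tag."
paragraphs = root.findall(".//div[@class='container']//div[@class='card']//p")

# ElementTree has no text() step, so read the text from each matched node.
texts = [p.text for p in paragraphs]
print(texts)
```

A full XPath engine, such as the one in the third-party `lxml` package, would let you keep the trailing `/text()` step.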

Amazing, isn’t it?

The Next Level

Now that you understand the basics, you need to dive a bit deeper into programming.

For most scraping use cases, I generally recommend using Python.

Here is an example of how to scrape websites with Python and BeautifulSoup.
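To give a feel for what that looks like, here is a minimal sketch of a scraper written with nothing but Python's standard library. It collects the title of every card from a page held in a string. The class name `UseCaseCollector`, the second card, and the stray `<br>` are invented for this illustration; in practice you would download the page first and would probably reach for a dedicated scraping library.

```python
from html.parser import HTMLParser


class UseCaseCollector(HTMLParser):
    """Hypothetical helper: collects the text of every <h3 class="use-case">."""

    def __init__(self):
        super().__init__()
        self.inside = False  # True while the parser is inside a matching <h3>
        self.titles = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h3" and "use-case" in classes:
            self.inside = True
            self.titles.append("")  # start a new title

    def handle_endtag(self, tag):
        if tag == "h3":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.titles[-1] += data


# Sample page: the first card is from the article, the second one is made up.
# The unclosed <br> shows that html.parser tolerates HTML that is not valid XML.
PAGE = """
<div class="container">
    <div class="card">
        <h3 class="use-case">Repositories</h3>
        <p>Enrich your business database or find new leads to feed your CRM.</p>
    </div>
    <br>
    <div class="card">
        <h3 class="use-case big">Profiles</h3>
        <p>A made-up second card.</p>
    </div>
</div>
"""

collector = UseCaseCollector()
collector.feed(PAGE)
collector.close()
titles = [title.strip() for title in collector.titles]
print(titles)
```

The event-based parser never builds a tree, so it copes with sloppy markup, at the cost of tracking state by hand.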

There’s an amazing community and tons of packages and libraries that you can use to scrape web data.

Among others:

We’ve only been talking about basic HTML pages, but you probably know that websites nowadays use more and more JavaScript to build very cool stuff.

Unfortunately, JS does not simplify web scraping. But there’s a solution to every problem 🙂

Some examples of useful libraries:

Puppeteer (which is maintained by Google itself!)
NightmareJS

To help you a bit, here’s a great XPath Cheatsheet to use whenever you want to access complicated nested data.

If you need help with web scraping, be sure to get in touch.

Be sure to check out our blog to get a sense of what you can do with web scraping.

Originally published at captaindata.co on November 6, 2018.
