Mastering Web Scraping with Python — Introduction part 1

Tobiloba Adejumo · Published in Dataly.ai · 13 min read · Mar 30, 2019

Web scraping, also known as screen scraping, data mining, web data extraction, or web harvesting, is a method of extracting large amounts of data from a website. The extracted data can then be analysed to draw deductions and inferences from it.

Preface

I watched a Udemy course, “Web Scraping with Python: BeautifulSoup, Requests and Selenium” by Waqar Ahmed. I also read three great books: “Website Scraping with Python” by Gábor László Hajba, “Web Scraping with Python” by Ryan Mitchell, and “Python Requests Essentials” by Rakesh Vidya Chandra and Bala Subrahmanyam Varanasi. They inspired me to try web scraping out, and that is why I am sharing the knowledge with you.

Series Installments

  • Mastering Web Scraping with Python — Introduction part 1 (you are here)
  • Mastering Web Scraping with Python — Introduction part 2
  • Mastering Web Scraping with Python — Intermediate part 1
  • Mastering Web Scraping with Python — Intermediate part 2
  • Mastering Web Scraping with Python — Advanced part 1
  • Mastering Web Scraping with Python — Advanced part 2

Prerequisite

I assume that you have basic programming skills and experience in at least one language. I also assume that you have successfully set up a Python environment (Jupyter Notebook, PyCharm, etc.) on your PC. If not, download the Anaconda Distribution. Anaconda is the easiest way to do Python/R data science and machine learning on Linux, Windows, and macOS, and it comes with many Python packages pre-installed, including Jupyter Notebook :).

In this installment, I mainly cite Vik Paruchuri’s post, “Python Web Scraping Tutorial using BeautifulSoup”.

Libraries Needed in This Installment

The Python libraries listed below will be used throughout this series. I will explain the function of each library as we proceed.

  • Beautiful Soup
  • lxml
  • Requests
  • Pandas

Beautiful Soup

From the official documentation of Beautiful Soup:

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

Put simply, Beautiful Soup extracts data out of HTML or XML files. The raw markup isn’t arranged for easy querying, so Beautiful Soup works together with a parser. The parser reads the data and arranges it into a tree based on the nesting rules of the markup. Once the data is in a tree-like structure, it becomes very easy to perform the necessary operations on it.

To install Beautiful Soup:

pip install beautifulsoup4
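As a quick taste, here is a minimal sketch (the HTML string below is made up purely for illustration):

import the library and parse a tiny piece of markup:

from bs4 import BeautifulSoup

# Parse a small snippet of HTML and pull out the text inside the b tag
soup = BeautifulSoup("<p>Hello, <b>world</b>!</p>", "html.parser")
print(soup.find("b").get_text())  # prints: world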

lxml

From the official documentation of lxml:

lxml is a Pythonic, mature binding for the libxml2 and libxslt libraries. It provides safe and convenient access to these libraries using the ElementTree API. It extends the ElementTree API significantly to offer support for XPath, RelaxNG, XML Schema, XSLT, C14N and much more.

Do you understand? No? Me neither 😃 … I didn’t understand it at first either. But remember that Beautiful Soup needs a parser, and lxml is a parser. It reads the data and arranges it into a tree-like structure based on the ElementTree API.

There are other parsers available, such as html.parser (which ships with Python) and html5lib, but the lxml parser is considerably faster than the others.

To install lxml:

pip install lxml
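Once installed, you can tell Beautiful Soup to use lxml simply by naming it as the parser. A minimal sketch (the markup is made up for illustration):

from bs4 import BeautifulSoup

# The same kind of parse as before, but built with the faster lxml parser
soup = BeautifulSoup("<p>Hello</p>", "lxml")
print(soup.p.get_text())  # prints: Hello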

Requests

From the official documentation of requests:

Requests allows you to send organic, grass-fed HTTP/1.1 requests, without the need for manual labor. There’s no need to manually add query strings to your URLs, or to form-encode your POST data. Keep-alive and HTTP connection pooling are 100% automatic, thanks to urllib3.

Requests is an Apache2-licensed library. Its main function is to interact with web pages: rather than using a web browser to send messages to a web server, the requests library sends them for you, mimicking what a browser does. The interaction with the server can therefore be carried out programmatically simply by importing the requests library.

To install requests:

pip install requests
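To preview what this looks like in practice, here is a minimal sketch (the URL is a stand-in; we’ll download a real sample page later in this installment):

import requests

# Fetch a page and inspect the response
response = requests.get("https://example.com")
print(response.status_code)  # e.g. 200 on success
print(response.text[:100])   # the first 100 characters of the response body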

Pandas

From the documentation:

pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

The name pandas is derived from “panel data”, though the project also describes itself as the Python Data Analysis Library. Pandas converts data from different formats into a Python object known as a DataFrame. Representing our data in a DataFrame makes it easy to analyse and manipulate.

To install pandas:

pip install pandas
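For a quick feel of what that means, here is a minimal sketch, with made-up values standing in for scraped data:

import pandas as pd

# Build a DataFrame from a plain Python dictionary
df = pd.DataFrame({
    "city": ["Lagos", "Abuja"],
    "temperature": [31, 28],
})
print(df)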

Installment index

  • Components of a web page
  • HTML
  • The requests library
  • Parsing a page with Beautiful Soup
  • Finding all instances of a tag at once
  • Searching for tags by class and id
  • Using CSS selectors

The components of a web page

When we visit a web page, our web browser makes a request to a web server. This request is called a GET request, since we’re getting files from the server. The server then sends back files that tell our browser how to render the page for us. The files fall into a few main types:

  • HTML — contains the main content of the page.
  • CSS — adds styling to make the page look nicer.
  • JS — JavaScript files add interactivity to web pages.
  • Images — image formats, such as JPG and PNG, allow web pages to show pictures.

After our browser receives all the files, it renders the page and displays it to us. There’s a lot that happens behind the scenes to render a page nicely, but we don’t need to worry about most of it when we’re web scraping. When we perform web scraping, we’re interested in the main content of the web page, so we look at the HTML.

HTML

HyperText Markup Language (HTML) is the language that web pages are created in. HTML isn’t a programming language, like Python — instead, it’s a markup language that tells a browser how to lay out content. HTML allows you to do similar things to what you do in a word processor like Microsoft Word — make text bold, create paragraphs, and so on. Because HTML isn’t a programming language, it isn’t nearly as complex as Python.

Let’s take a quick tour through HTML so we know enough to scrape effectively. HTML consists of elements called tags. The most basic tag is the <html> tag. This tag tells the web browser that everything inside of it is HTML. We can make a simple HTML document just using this tag:
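<html>
</html>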

We haven’t added any content to our page yet, so if we viewed our HTML document in a web browser, we wouldn’t see anything.

Directly inside the html tag, we put two other tags: the head tag and the body tag. The main content of the web page goes into the body tag. The head tag contains data about the title of the page, and other information that generally isn’t useful in web scraping:
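<html>
<head>
</head>
<body>
</body>
</html>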

We still haven’t added any content to our page (that goes inside the body tag), so we again won’t see anything.

You may have noticed above that we put the head and body tags inside the html tag. In HTML, tags are nested, and can go inside other tags.

We’ll now add our first content to the page, in the form of the p tag. The p tag defines a paragraph, and any text inside the tag is shown as a separate paragraph:
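<html>
<head>
</head>
<body>
<p>
Here's a paragraph of text!
</p>
<p>
Here's a second paragraph of text!
</p>
</body>
</html>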

In a browser, this renders as two separate paragraphs of plain text.

Tags have commonly used names that depend on their position in relation to other tags:

  • child — a child is a tag inside another tag. So the two p tags above are both children of the body tag.
  • parent — a parent is the tag another tag is inside. Above, the html tag is the parent of the body tag.
  • sibling — a sibling is a tag that is nested inside the same parent as another tag. For example, head and body are siblings, since they’re both inside html. Both p tags are siblings, since they’re both inside body.

We can also add properties to HTML tags that change their behavior:
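<html>
<head>
</head>
<body>
<p>
Here's a paragraph of text!
<a href="https://www.dataquest.io">Learn Data Science Online</a>
</p>
<p>
Here's a second paragraph of text!
<a href="https://www.python.org">Python</a>
</p>
</body>
</html>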

In a browser, each paragraph now renders with a clickable link at the end.

In the above example, we added two a tags. a tags are links, and tell the browser to render a link to another web page. The href property of the tag determines where the link goes.

a and p are extremely common HTML tags. Here are a few others:

  • div — indicates a division, or area, of the page.
  • b — bolds any text inside.
  • i — italicizes any text inside.
  • table — creates a table.
  • form — creates an input form.

For a full list of tags, look here.

Before we move into actual web scraping, let’s learn about the class and id properties. These special properties give HTML elements names, and make them easier to interact with when we’re scraping. One element can have multiple classes, and a class can be shared between elements. Each element can only have one id, and an id can only be used once on a page. Classes and ids are optional, and not all elements will have them.

We can add classes and ids to our example:

<html>
<head>
</head>
<body>
<p class="bold-paragraph">
Here's a paragraph of text!
<a href="https://www.dataquest.io" id="learn-link">Learn Data Science Online</a>
</p>
<p class="bold-paragraph extra-large">
Here's a second paragraph of text!
<a href="https://www.python.org" class="extra-large">Python</a>
</p>
</body>
</html>

As you can see if you render this page, adding classes and ids doesn’t change how the tags are displayed at all.

The requests library

The first thing we’ll need to do to scrape a web page is to download the page. We can download pages using the Python requests library. The requests library will make a GET request to a web server, which will download the HTML contents of a given web page for us. There are several different types of requests we can make using requests, of which GET is just one. If you want to learn more, check out Dataquest’s API tutorial.

Let’s try downloading a simple sample website, http://dataquestio.github.io/web-scraping-pages/simple.html. We’ll need to first download it using the requests.get method.
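import requests

page = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html")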

After running our request, we get a Response object. This object has a status_code property, which indicates if the page was downloaded successfully:
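page.status_code
200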

A status_code of 200 means that the page downloaded successfully. We won’t fully dive into status codes here, but a status code starting with a 2 generally indicates success, and a code starting with a 4 or a 5 indicates an error.

We can print out the HTML content of the page using the content property:
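page.content
b'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <p>Here is some simple content for this page.</p>\n    </body>\n</html>'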

Parsing a page with BeautifulSoup

As you can see above, we now have downloaded an HTML document.

We can use the BeautifulSoup library to parse this document, and extract the text from the p tag. We first have to import the library, and create an instance of the BeautifulSoup class to parse our document:
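from bs4 import BeautifulSoup

soup = BeautifulSoup(page.content, 'html.parser')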

We can now print out the HTML content of the page, formatted nicely, using the prettify method on the BeautifulSoup object:
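print(soup.prettify())
<!DOCTYPE html>
<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <p>
   Here is some simple content for this page.
  </p>
 </body>
</html>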

As all the tags are nested, we can move through the structure one level at a time. We can first select all the elements at the top level of the page using the children property of soup. Note that children returns an iterator, so we need to call the list function on it:
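list(soup.children)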

['html', '\n', <html> <head> <title>A simple example page</title> </head> <body> <p>Here is some simple content for this page.</p> </body> </html>]

The above tells us that there are two tags at the top level of the page — the initial <!DOCTYPE html> tag, and the <html> tag. There is a newline character (\n) in the list as well. Let’s see what the type of each element in the list is:
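[type(item) for item in list(soup.children)]
[<class 'bs4.element.Doctype'>, <class 'bs4.element.NavigableString'>, <class 'bs4.element.Tag'>]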

As you can see, all of the items are BeautifulSoup objects. The first is a Doctype object, which contains information about the type of the document. The second is a NavigableString, which represents text found in the HTML document. The final item is a Tag object, which contains other nested tags. The most important object type, and the one we’ll deal with most often, is the Tag object.

The Tag object allows us to navigate through an HTML document, and extract other tags and text. You can learn more about the various BeautifulSoup objects here.

We can now select the html tag and its children by taking the third item in the list:

html = list(soup.children)[2]

Each item in the list returned by the children property is also a BeautifulSoup object, so we can use the children property on html as well.

Now, we can find the children inside the html tag:
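list(html.children)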

['\n', <head> <title>A simple example page</title> </head>, '\n', <body> <p>Here is some simple content for this page.</p> </body>, '\n']

As you can see above, there are two tags here: head and body. We want to extract the text inside the p tag, so we’ll dive into the body:

body = list(html.children)[3]

Now, we can get the p tag by finding the children of the body tag:

list(body.children)
['\n', <p>Here is some simple content for this page.</p>, '\n']

We can now isolate the p tag:

p = list(body.children)[1]

Once we’ve isolated the tag, we can use the get_text method to extract all of the text inside the tag:

p.get_text()
'Here is some simple content for this page.'

Finding all instances of a tag at once

What we did above was useful for figuring out how to navigate a page, but it took a lot of commands to do something fairly simple. If we want to extract a single tag, we can instead use the find_all method, which will find all the instances of a tag on a page.

soup = BeautifulSoup(page.content, 'html.parser')
soup.find_all('p')
[<p>Here is some simple content for this page.</p>]

Note that find_all returns a list, so we’ll have to loop through it, or use list indexing, to extract the text:

soup.find_all('p')[0].get_text()
'Here is some simple content for this page.'

If you instead only want to find the first instance of a tag, you can use the find method, which will return a single BeautifulSoup object:

soup.find('p')
<p>Here is some simple content for this page.</p>

Searching for tags by class and id

We introduced classes and ids earlier, but it probably wasn’t clear why they were useful. Classes and ids are used by CSS to determine which HTML elements to apply certain styles to. We can also use them when scraping to specify specific elements we want to scrape. To illustrate this principle, we’ll work with the following page:
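<html>
<head>
<title>A simple example page</title>
</head>
<body>
<div>
<p class="inner-text first-item" id="first">
First paragraph.
</p>
<p class="inner-text">
Second paragraph.
</p>
</div>
<p class="outer-text first-item" id="second">
<b>
First outer paragraph.
</b>
</p>
<p class="outer-text">
<b>
Second outer paragraph.
</b>
</p>
</body>
</html>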

We can access the above document at the URL http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html. Let’s first download the page and create a BeautifulSoup object:

page = requests.get("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
soup = BeautifulSoup(page.content, 'html.parser')
soup

Now, we can use the find_all method to search for items by class or by id. In the below example, we’ll search for any p tag that has the class outer-text:

soup.find_all('p', class_='outer-text')
[<p class="outer-text first-item" id="second">
<b>
First outer paragraph.
</b>
</p>, <p class="outer-text">
<b>
Second outer paragraph.
</b>
</p>]

In the below example, we’ll look for any tag that has the class outer-text:

soup.find_all(class_="outer-text")
[<p class="outer-text first-item" id="second">
<b>
First outer paragraph.
</b>
</p>, <p class="outer-text">
<b>
Second outer paragraph.
</b>
</p>]

We can also search for elements by id:

soup.find_all(id="first")
[<p class="inner-text first-item" id="first">
First paragraph.
</p>]

Using CSS Selectors

You can also search for items using CSS selectors. These selectors are how the CSS language allows developers to specify HTML tags to style. Here are some examples:

  • p a — finds all a tags inside of a p tag.
  • body p a — finds all a tags inside of a p tag inside of a body tag.
  • html body — finds all body tags inside of an html tag.
  • p.outer-text — finds all p tags with a class of outer-text.
  • p#first — finds all p tags with an id of first.
  • body p.outer-text — finds any p tags with a class of outer-text inside of a body tag.

You can learn more about CSS selectors here.

BeautifulSoup objects support searching a page via CSS selectors using the select method. We can use CSS selectors to find all the p tags in our page that are inside of a div like this:
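soup.select("div p")
[<p class="inner-text first-item" id="first">
First paragraph.
</p>, <p class="inner-text">
Second paragraph.
</p>]

Note that the select method returns a list of elements, just like find_all.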

What’s Next?

In the next installment, we’ll be scraping weather forecasts from the National Weather Service, and then analyzing them using the pandas library.

Additional Information

Check out the complete source code at the following link:

https://github.com/themavencoder/web-scraping-tutorial

