Mastering Web Scraping with Python — Introduction part 1
Web scraping, also known as screen scraping, data mining, web data extraction or web harvesting, is a method of extracting large amounts of data from a website. The extracted data is then analysed to draw deductions and inferences.
Preface
I watched a Udemy course, “Web Scraping with Python: BeautifulSoup, Requests and Selenium” by Waqar Ahmed. I also read three great books: “Website Scraping with Python” by Gábor László Hajba, “Web Scraping with Python” by Ryan Mitchell and “Python Requests Essentials” by Rakesh Vidya Chandra and Bala Subrahmanyam Varanasi. They inspired me to try it out, and that is why I am sharing the knowledge with you.
Series Intermission
- Mastering Web Scraping with Python — Introduction part 1 (you are here)
- Mastering Web Scraping with Python — Introduction part 2
- Mastering Web Scraping with Python — Intermediate part 1
- Mastering Web Scraping with Python — Intermediate part 2
- Mastering Web Scraping with Python — Advanced part 1
- Mastering Web Scraping with Python — Advanced part 2
Prerequisite
I assume that you have basic programming skills and experience in any language, and that you have successfully set up a Python environment (Jupyter Notebook, PyCharm, etc.) on your PC. If not, download the Anaconda Distribution. Anaconda is the easiest way to do Python/R data science and machine learning on Linux, Windows, and macOS. It also comes with many Python packages pre-installed, and of course includes Jupyter Notebook :).
In this intermission, I mainly cited Vik Paruchuri’s post, “Python Web Scraping Tutorial using BeautifulSoup”.
Libraries Needed In This Intermission
The Python libraries listed below will be used throughout this series. I will explain the function of each as we proceed.
- Beautiful Soup
- lxml
- Requests
- Pandas
Beautiful Soup
From the official documentation of Beautiful Soup:
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.
Put simply, Beautiful Soup extracts data out of HTML or XML files. The raw markup isn’t arranged in a usable form on its own, so Beautiful Soup works together with a parser. The parser arranges and sorts the data into a tree based on the rules of the markup language. Once the data is in a tree-like structure, it becomes very easy to perform the necessary operations on it.
To install beautiful soup:
pip install beautifulsoup4
lxml
From the official documentation of lxml:
lxml is a Pythonic, mature binding for the libxml2 and libxslt libraries. It provides safe and convenient access to these libraries using the ElementTree API. It extends the ElementTree API significantly to offer support for XPath, RelaxNG, XML Schema, XSLT, C14N and much more.
Do you understand? No? Me too 😃 .. I didn’t understand it at first either. But remember that Beautiful Soup needs a parser; lxml is a parser. It reads the markup and arranges it into a tree-like structure based on the ElementTree API.
There are other parsers available, such as html.parser and html5lib, but the lxml parser is considerably faster than the others.
pip install lxml
Requests
From the official documentation of requests:
Requests allows you to send organic, grass-fed HTTP/1.1 requests, without the need for manual labor. There’s no need to manually add query strings to your URLs, or to form-encode your POST data. Keep-alive and HTTP connection pooling are 100% automatic, thanks to urllib3.
Requests is an Apache2-licensed library. Its main function is to interact with web pages: rather than using a web browser to send messages to a web server, the requests library mimics a web browser, so the interaction with the server can be carried out programmatically.
pip install requests
Pandas
From the documentation:
pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
Pandas is commonly expanded as “Python Data Analysis Library”, although the name actually derives from “panel data”. It converts data from different formats into a Python object known as a DataFrame. Representing our data as a DataFrame makes it easy to analyse and manipulate.
pip install pandas
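As a minimal sketch of what a DataFrame looks like (the sample data below is made up purely for illustration):

```python
import pandas as pd

# A dict of equal-length lists becomes a DataFrame with one column per key.
data = {"name": ["Ada", "Grace"], "year": [1815, 1906]}
df = pd.DataFrame(data)

print(df.shape)          # (2, 2): two rows, two columns
print(list(df.columns))  # ['name', 'year']
```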
Intermission index
- Components of a web page.
- HTML
- The request library.
- Parsing a page with Beautiful Soup.
- Finding all instances of a tag at once.
- Searching for tags by class and id
- Using CSS selectors
The Components of a web page
When we visit a web page, our web browser makes a request to a web server. This is called a GET request, since we’re getting files from the server. The server then sends back files that tell our browser how to render the page for us. The files fall into a few main types:
- HTML — contains the main content of the page.
- CSS — adds styling to make the page look nicer.
- JS — JavaScript files add interactivity to web pages.
- Images — image formats such as JPG and PNG allow web pages to show pictures.
After our browser receives all the files, it renders the page and displays it to us. There’s a lot that happens behind the scenes to render a page nicely, but we don’t need to worry about most of it when we’re web scraping. When we perform web scraping, we’re interested in the main content of the web page, so we look at the HTML.
HTML
HyperText Markup Language (HTML) is the language that web pages are created in. HTML isn’t a programming language, like Python — instead, it’s a markup language that tells a browser how to lay out content. HTML allows you to do similar things to what you do in a word processor like Microsoft Word — make text bold, create paragraphs, and so on. Because HTML isn’t a programming language, it isn’t nearly as complex as Python.
Let’s take a quick tour through HTML so we know enough to scrape effectively. HTML consists of elements called tags. The most basic tag is the <html> tag. This tag tells the web browser that everything inside of it is HTML. We can make a simple HTML document just using this tag:
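A minimal document using just this tag:

```html
<html>
</html>
```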
We haven’t added any content to our page yet, so if we viewed our HTML document in a web browser, we wouldn’t see anything.
Right inside an html tag, we put two other tags: the head tag and the body tag. The main content of the web page goes into the body tag. The head tag contains data about the title of the page, and other information that generally isn’t useful in web scraping:
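For example:

```html
<html>
    <head>
    </head>
    <body>
    </body>
</html>
```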
We still haven’t added any content to our page (that goes inside the body tag), so we again won’t see anything.
You may have noticed above that we put the head and body tags inside the html tag. In HTML, tags are nested, and can go inside other tags.
We’ll now add our first content to the page, in the form of the p tag. The p tag defines a paragraph, and any text inside the tag is shown as a separate paragraph:
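For example, with two paragraphs:

```html
<html>
    <head>
    </head>
    <body>
        <p>
            Here's a paragraph of text!
        </p>
        <p>
            Here's a second paragraph of text!
        </p>
    </body>
</html>
```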
Rendered in a browser, each paragraph appears as its own block of text.
Tags have commonly used names that depend on their position in relation to other tags:
- child — a child is a tag inside another tag. So the two p tags above are both children of the body tag.
- parent — a parent is the tag another tag is inside. Above, the html tag is the parent of the body tag.
- sibling — a sibling is a tag that is nested inside the same parent as another tag. For example, head and body are siblings, since they’re both inside html. Both p tags are siblings, since they’re both inside body.
We can also add properties to HTML tags that change their behavior:
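For example, we can add a link to each paragraph with the a tag and its href property:

```html
<html>
    <head>
    </head>
    <body>
        <p>
            Here's a paragraph of text!
            <a href="https://www.dataquest.io">Learn Data Science Online</a>
        </p>
        <p>
            Here's a second paragraph of text!
            <a href="https://www.python.org">Python</a>
        </p>
    </body>
</html>
```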
Rendered in a browser, each paragraph now ends with a clickable link.
In the above example, we added two a tags. a tags are links, and tell the browser to render a link to another web page. The href property of the tag determines where the link goes.
a and p are extremely common HTML tags. Here are a few others:
- div — indicates a division, or area, of the page.
- b — bolds any text inside.
- i — italicizes any text inside.
- table — creates a table.
- form — creates an input form.
For a full list of tags, look here.
Before we move into actual web scraping, let’s learn about the class and id properties. These special properties give HTML elements names, and make them easier to interact with when we’re scraping. One element can have multiple classes, and a class can be shared between elements. Each element can only have one id, and an id can only be used once on a page. Classes and ids are optional, and not all elements will have them.
We can add classes and ids to our example:
<html>
    <head>
    </head>
    <body>
        <p class="bold-paragraph">
            Here's a paragraph of text!
            <a href="https://www.dataquest.io" id="learn-link">Learn Data Science Online</a>
        </p>
        <p class="bold-paragraph extra-large">
            Here's a second paragraph of text!
            <a href="https://www.python.org" class="extra-large">Python</a>
        </p>
    </body>
</html>
If you render the page, you will see that adding classes and ids doesn’t change how the tags are rendered at all.
The requests library
The first thing we’ll need to do to scrape a web page is to download the page. We can download pages using the Python requests library. The requests library will make a GET
request to a web server, which will download the HTML contents of a given web page for us. There are several different types of requests we can make using requests
, of which GET
is just one. If you want to learn more, check out our API tutorial.
Let’s try downloading a simple sample website, http://dataquestio.github.io/web-scraping-pages/simple.html. We’ll need to first download it using the requests.get method.
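A minimal sketch of that step (note it performs a real network request, so it needs an internet connection):

```python
import requests

# requests.get sends a GET request and returns a Response object.
page = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html")
print(page.status_code)
```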
After running our request, we get a Response object. This object has a status_code property, which indicates if the page was downloaded successfully. A status_code of 200 means that the page downloaded successfully. We won’t fully dive into status codes here, but a status code starting with 2 generally indicates success, and a code starting with 4 or 5 indicates an error.
We can print out the HTML content of the page using the content property:
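Continuing the sketch (the page is fetched again here so the snippet stands on its own):

```python
import requests

page = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html")

# .content holds the raw response body as bytes; use .text for a decoded str.
print(page.content)
```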
Parsing a page with BeautifulSoup
As you can see above, we have now downloaded an HTML document.
We can use the BeautifulSoup library to parse this document, and extract the text from the p tag. We first have to import the library, and create an instance of the BeautifulSoup class to parse our document:
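A sketch of that step; the sample page’s markup is inlined here so the snippet runs offline (in the article’s flow you would pass page.content instead):

```python
from bs4 import BeautifulSoup

# Markup of http://dataquestio.github.io/web-scraping-pages/simple.html
html_doc = """<!DOCTYPE html>
<html>
<head>
<title>A simple example page</title>
</head>
<body>
<p>Here is some simple content for this page.</p>
</body>
</html>"""

soup = BeautifulSoup(html_doc, "html.parser")
print(soup.title.get_text())  # A simple example page
```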
We can now print out the HTML content of the page, formatted nicely, using the prettify method on the BeautifulSoup object:
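For example, on a soup object built from the sample page’s markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>A simple example page</title></head>"
    "<body><p>Here is some simple content for this page.</p></body></html>",
    "html.parser",
)

# prettify() returns the document as a string with one tag per line, indented.
print(soup.prettify())
```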
As all the tags are nested, we can move through the structure one level at a time. We can first select all the elements at the top level of the page using the children property of soup. Note that children returns a generator, so we need to call the list function on it:
list(soup.children)
['html', '\n', <html> <head> <title>A simple example page</title> </head> <body> <p>Here is some simple content for this page.</p> </body> </html>]
The above tells us that there are two tags at the top level of the page — the initial <!DOCTYPE html> tag, and the <html> tag. There is a newline character (\n) in the list as well. Let’s see what the type of each element in the list is:
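A self-contained check (sample page markup inlined again):

```python
from bs4 import BeautifulSoup

html_doc = (
    "<!DOCTYPE html>\n"
    "<html><head><title>A simple example page</title></head>"
    "<body><p>Here is some simple content for this page.</p></body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")

# Each top-level element has a different BeautifulSoup type.
print([type(item).__name__ for item in soup.children])
# ['Doctype', 'NavigableString', 'Tag']
```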
As you can see, all of the items are BeautifulSoup objects. The first is a Doctype object, which contains information about the type of the document. The second is a NavigableString, which represents text found in the HTML document. The final item is a Tag object, which contains other nested tags. The most important object type, and the one we’ll deal with most often, is the Tag object.
The Tag object allows us to navigate through an HTML document, and extract other tags and text. You can learn more about the various BeautifulSoup objects here.
We can now select the html tag and its children by taking the third item in the list:
html = list(soup.children)[2]
Each item in the list returned by the children property is also a BeautifulSoup object, so we can also use the children property on html.
Now, we can find the children inside the html tag:
list(html.children)
['\n', <head> <title>A simple example page</title> </head>, '\n', <body> <p>Here is some simple content for this page.</p> </body>, '\n']
As you can see above, there are two tags here, head and body. We want to extract the text inside the p tag, so we’ll dive into the body:
body = list(html.children)[3]
Now, we can get the p tag by finding the children of the body tag:
list(body.children)
['\n', <p>Here is some simple content for this page.</p>, '\n']
We can now isolate the p tag:
p = list(body.children)[1]
Once we’ve isolated the tag, we can use the get_text method to extract all of the text inside the tag:
p.get_text()
'Here is some simple content for this page.'
Finding all instances of a tag at once
What we did above was useful for figuring out how to navigate a page, but it took a lot of commands to do something fairly simple. If we want to extract a single tag, we can instead use the find_all method, which will find all the instances of a tag on a page.
soup = BeautifulSoup(page.content, 'html.parser')
soup.find_all('p')
[<p>Here is some simple content for this page.</p>]
Note that find_all returns a list, so we’ll have to loop through it, or use list indexing, to extract text:
soup.find_all('p')[0].get_text()
'Here is some simple content for this page.'
If you instead only want to find the first instance of a tag, you can use the find method, which will return a single BeautifulSoup object:
soup.find('p')
<p>Here is some simple content for this page.</p>
Searching for tags by class and id
We introduced classes and ids earlier, but it probably wasn’t clear why they were useful. Classes and ids are used by CSS to determine which HTML elements to apply certain styles to. We can also use them when scraping to specify the exact elements we want. To illustrate this principle, we’ll work with the following page:
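Reconstructed from the search results shown below (the head contents are assumed), the page looks like this:

```html
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <div>
            <p class="inner-text first-item" id="first">
                First paragraph.
            </p>
            <p class="inner-text">
                Second paragraph.
            </p>
        </div>
        <p class="outer-text first-item" id="second">
            <b>
                First outer paragraph.
            </b>
        </p>
        <p class="outer-text">
            <b>
                Second outer paragraph.
            </b>
        </p>
    </body>
</html>
```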
We can access the above document at the URL http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html. Let’s first download the page and create a BeautifulSoup object:
page = requests.get("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
soup = BeautifulSoup(page.content, 'html.parser')
soup
Now, we can use the find_all method to search for items by class or by id. In the below example, we’ll search for any p tag that has the class outer-text:
soup.find_all('p', class_='outer-text')
[<p class="outer-text first-item" id="second"> <b> First outer paragraph. </b> </p>, <p class="outer-text"> <b> Second outer paragraph. </b> </p>]
In the below example, we’ll look for any tag that has the class outer-text:
soup.find_all(class_="outer-text")
[<p class="outer-text first-item" id="second">
<b>
First outer paragraph.
</b>
</p>, <p class="outer-text">
<b>
Second outer paragraph.
</b>
</p>]
We can also search for elements by id:
soup.find_all(id="first")
[<p class="inner-text first-item" id="first">
First paragraph.
</p>]
Using CSS Selectors
You can also search for items using CSS selectors. These selectors are how the CSS language allows developers to specify HTML tags to style. Here are some examples:
- p a — finds all a tags inside of a p tag.
- body p a — finds all a tags inside of a p tag inside of a body tag.
- html body — finds all body tags inside of an html tag.
- p.outer-text — finds all p tags with a class of outer-text.
- p#first — finds all p tags with an id of first.
- body p.outer-text — finds any p tags with a class of outer-text inside of a body tag.
You can learn more about CSS selectors here.
BeautifulSoup objects support searching a page via CSS selectors using the select method. We can use CSS selectors to find all the p tags in our page that are inside of a div like this:
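A self-contained sketch, with the ids_and_classes page’s markup inlined so it runs offline:

```python
from bs4 import BeautifulSoup

html_doc = """<html>
<head><title>A simple example page</title></head>
<body>
<div>
<p class="inner-text first-item" id="first">First paragraph.</p>
<p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
</body>
</html>"""

soup = BeautifulSoup(html_doc, "html.parser")

# select() takes a CSS selector string and returns a list of matching Tags.
for p in soup.select("div p"):
    print(p.get_text())
# First paragraph.
# Second paragraph.
```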
What’s Next?
On the next intermission, we’ll be scraping weather forecasts from the National Weather Service, and then analysing them using the Pandas library.
Additional Information
Check out the complete source code in the following link: