Mastering Web Scraping with Python — Introduction part 1
Web scraping, also known as screen scraping, data mining, web data extraction, or web harvesting, is a method of extracting large amounts of data from a website. The extracted data can then be analysed to draw deductions and inferences.
I watched a Udemy course, “Web Scraping with Python: BeautifulSoup, Requests and Selenium” by Waqar Ahmed. I also read three great books: “Website Scraping with Python” by Gábor László Hajba, “Web Scraping with Python” by Ryan Mitchell, and “Python Requests Essentials” by Rakesh Vidya Chandra & Bala Subrahmanyam Varanasi. They inspired me to try it out, which is why I am sharing the knowledge with you.
- Mastering Web Scraping with Python — Introduction part 1 (you are here)
- Mastering Web Scraping with Python — Introduction part 2
- Mastering Web Scraping with Python — Intermediate part 1
- Mastering Web Scraping with Python — Intermediate part 2
- Mastering Web Scraping with Python — Advanced part 1
- Mastering Web Scraping with Python — Advanced part 2
I assume that you have basic programming skills and experience in any language. I also assume that you have successfully set up a Python environment (Jupyter Notebook, PyCharm, etc.) on your PC. If not, download the Anaconda Distribution. It is the easiest way to do Python/R data science and machine learning on Linux, Windows, and macOS, it comes with many Python packages pre-installed, and of course it includes Jupyter Notebook :).
In this intermission, I mainly cited Vik Paruchuri’s post, Python Web Scraping Tutorial using BeautifulSoup.
Libraries Needed In This Intermission
The various Python libraries listed below will be used in this series. I will explain the function of each library as we proceed in the series.
- Beautiful Soup
From the official documentation of Beautiful Soup:
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.
In simpler terms, Beautiful Soup extracts data out of HTML or XML files. The raw markup isn’t arranged in a usable way, so Beautiful Soup works together with a parser. The parser arranges the data into a tree based on certain rules and conditions. Once the data is in a tree-like structure, it becomes very easy to perform the necessary operations on it.
To install beautiful soup:
pip install beautifulsoup4
- lxml
From the official documentation of lxml:
lxml is a Pythonic, mature binding for the libxml2 and libxslt libraries. It provides safe and convenient access to these libraries using the ElementTree API. It extends the ElementTree API significantly to offer support for XPath, RelaxNG, XML Schema, XSLT, C14N and much more.
Do you understand? No? Me neither 😃 I didn’t understand it at first either. But remember that Beautiful Soup needs a parser, and lxml is a parser: it reads the data and arranges it into a tree-like structure based on the ElementTree standard.
There are other parsers available, such as html.parser and html5lib, but the lxml parser is faster than the others.
pip install lxml
- Requests
From the official documentation of requests:
Requests allows you to send organic, grass-fed HTTP/1.1 requests, without the need for manual labor. There’s no need to manually add query strings to your URLs, or to form-encode your POST data. Keep-alive and HTTP connection pooling are 100% automatic, thanks to urllib3.
Requests is an Apache2-licensed library. Its main function is to interact with web pages for you: rather than you driving a web browser to send messages to a web server, the requests library mimics a browser, so the interaction with the server can be carried out programmatically.
pip install requests
- Pandas
From the official documentation of pandas:
pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
Pandas is short for Python Data Analysis Library. It converts data from different formats into a Python object known as a DataFrame. Representing our data in a DataFrame makes analysis and manipulation of the data easy.
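As a quick, hypothetical sketch (the column name and values here are made up for illustration), this is how a list of scraped texts might be loaded into a DataFrame:

```python
import pandas as pd

# Hypothetical scraped results: a list of dicts, one per paragraph
rows = [{"paragraph": "First paragraph."}, {"paragraph": "Second paragraph."}]

# pd.DataFrame turns the list of dicts into a two-row, one-column table
df = pd.DataFrame(rows)
print(df)
```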
pip install pandas
In this part, we will cover:
- The components of a web page.
- The requests library.
- Parsing a page with Beautiful Soup.
- Finding all instances of a tag at once.
- Searching for tags by class and id
- Using CSS selectors
The Components of a web page
When we visit a web page, our web browser makes a request to a web server. This request is called a GET request, since we’re getting files from the server. The server then sends back files that tell our browser how to render the page for us. The files fall into a few main types:
- HTML — contains the main content of the page.
- CSS — adds styling to make the page look nicer.
- Images — formats such as JPG and PNG allow web pages to show pictures.
After our browser receives all the files, it renders the page and displays it to us. There’s a lot that happens behind the scenes to render a page nicely, but we don’t need to worry about most of it when we’re web scraping. When we perform web scraping, we’re interested in the main content of the web page, so we look at the HTML.
HyperText Markup Language (HTML) is the language that web pages are created in. HTML isn’t a programming language like Python; instead, it’s a markup language that tells a browser how to lay out content. HTML lets you do things similar to what you do in a word processor like Microsoft Word: make text bold, create paragraphs, and so on. Because HTML isn’t a programming language, it isn’t nearly as complex as Python.
Let’s take a quick tour through HTML so we know enough to scrape effectively. HTML consists of elements called tags. The most basic tag is the <html> tag. This tag tells the web browser that everything inside of it is HTML. We can make a simple HTML document just using this tag:
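A sketch of such a minimal document, just the opening and closing html tags:

```html
<html>
</html>
```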
We haven’t added any content to our page yet, so if we viewed our HTML document in a web browser, we wouldn’t see anything.
Right inside the html tag, we put two other tags: the head tag and the body tag. The main content of the web page goes into the body tag. The head tag contains data about the title of the page, and other information that generally isn’t useful in web scraping:
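With head and body added, the (still empty) document looks like this:

```html
<html>
    <head>
    </head>
    <body>
    </body>
</html>
```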
We still haven’t added any content to our page (that goes inside the body tag), so we again won’t see anything.
You may have noticed above that we put the head and body tags inside the html tag. In HTML, tags are nested, and can go inside other tags.
We’ll now add our first content to the page, in the form of the p tag. The p tag defines a paragraph, and any text inside the tag is shown as a separate paragraph:
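For example, a document with two paragraphs (the exact wording here is illustrative) might look like:

```html
<html>
    <head>
    </head>
    <body>
        <p>
            Here's a paragraph of text!
        </p>
        <p>
            Here's a second paragraph of text!
        </p>
    </body>
</html>
```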
Here’s how this will look when rendered.
Tags have commonly used names that depend on their position in relation to other tags:
- child — a child is a tag inside another tag. So the two p tags above are both children of the body tag.
- parent — a parent is the tag another tag is inside. Above, the html tag is the parent of the body tag.
- sibling — a sibling is a tag that is nested inside the same parent as another tag. For example, head and body are siblings, since they’re both inside html. The two p tags are siblings too, since they’re both inside body.
We can also add properties to HTML tags that change their behavior:
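For example, we can turn text inside the paragraphs into links by adding a tags (the URLs reappear in the class/id example later in this section):

```html
<html>
    <head>
    </head>
    <body>
        <p>
            Here's a paragraph of text!
            <a href="https://www.dataquest.io">Learn Data Science Online</a>
        </p>
        <p>
            Here's a second paragraph of text!
            <a href="https://www.python.org">Python</a>
        </p>
    </body>
</html>
```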
Here’s how this will look when rendered.
In the above example, we added two a tags. a tags are links, and tell the browser to render a link to another web page. The href property of the tag determines where the link goes. a and p are extremely common HTML tags. Here are a few others:
- div — indicates a division, or area, of the page.
- b — bolds any text inside.
- i — italicizes any text inside.
- table — creates a table.
- form — creates an input form.
For a full list of tags, look here.
Before we move into actual web scraping, let’s learn about the class and id properties. These special properties give HTML elements names, and make them easier to interact with when we’re scraping. One element can have multiple classes, and a class can be shared between elements. Each element can have only one id, and an id can only be used once on a page. Classes and ids are optional, and not all elements will have them.
We can add classes and ids to our example:

<p class="bold-paragraph">
    Here's a paragraph of text!
    <a href="https://www.dataquest.io" id="learn-link">Learn Data Science Online</a>
</p>
<p class="bold-paragraph extra-large">
    Here's a second paragraph of text!
    <a href="https://www.python.org" class="extra-large">Python</a>
</p>
Here’s how this will look when rendered.
As you can see, adding classes and ids doesn’t change how the tags are rendered at all.
The requests library
The first thing we’ll need to do to scrape a web page is to download the page. We can download pages using the Python requests library. The requests library will make a GET request to a web server, which will download the HTML contents of a given web page for us. There are several different types of requests we can make using requests, of which GET is just one. If you want to learn more, check out the Dataquest API tutorial.

Let’s try downloading a simple sample website, http://dataquestio.github.io/web-scraping-pages/simple.html. We’ll first need to download it using the requests.get method.
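A minimal sketch of the download step (this fetches a live page, so it needs network access):

```python
import requests

# requests.get sends a GET request and returns a Response object
page = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html")
print(page)  # a successful request prints something like <Response [200]>
```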
After running our request, we get a Response object. This object has a status_code property, which indicates whether the page was downloaded successfully. A status code of 200 means that the page downloaded successfully. We won’t fully dive into status codes here, but a status code starting with a 2 generally indicates success, and a code starting with a 4 or a 5 indicates an error.

We can print out the HTML content of the page using the content property:
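A short sketch of checking the status code and printing the content (assuming network access):

```python
import requests

page = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html")
print(page.status_code)  # 200 when the download succeeded
print(page.content)      # the raw HTML bytes of the page
```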
Parsing a page with BeautifulSoup
As you can see above, we now have downloaded an HTML document.
We can use the BeautifulSoup library to parse this document and extract the text from the p tag. We first have to import the library and create an instance of the BeautifulSoup class to parse our document:
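A sketch of that step, again assuming network access for the download:

```python
import requests
from bs4 import BeautifulSoup

# Download the page, then hand its HTML to BeautifulSoup with the built-in parser
page = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html")
soup = BeautifulSoup(page.content, "html.parser")
```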
We can now print out the HTML content of the page, formatted nicely, using the prettify method on the BeautifulSoup object:

print(soup.prettify())
As all the tags are nested, we can move through the structure one level at a time. We can first select all the elements at the top level of the page using the children property of soup. Note that children returns a list generator, so we need to call the list function on it:

list(soup.children)
['html', '\n', <html> <head> <title>A simple example page</title> </head> <body> <p>Here is some simple content for this page.</p> </body> </html>]
The above tells us that there are two tags at the top level of the page: the initial <!DOCTYPE html> tag, and the <html> tag. There is a newline character (\n) in the list as well. Let’s see what the type of each element in the list is:

[type(item) for item in list(soup.children)]
[<class 'bs4.element.Doctype'>, <class 'bs4.element.NavigableString'>, <class 'bs4.element.Tag'>]
As you can see, all of the items are BeautifulSoup objects. The first is a Doctype object, which contains information about the type of the document. The second is a NavigableString, which represents text found in the HTML document. The final item is a Tag object, which contains other nested tags. The most important object type, and the one we’ll deal with most often, is the Tag object. The Tag object allows us to navigate through an HTML document and extract other tags and text. You can learn more about the various BeautifulSoup objects here.
We can now select the html tag and its children by taking the third item in the list:

html = list(soup.children)[2]
Each item in the list returned by the children property is also a BeautifulSoup object, so we can also call the children method on html. Now, we can find the children inside the html tag:

list(html.children)
['\n', <head> <title>A simple example page</title> </head>, '\n', <body> <p>Here is some simple content for this page.</p> </body>, '\n']
As you can see above, there are two tags here, head and body. We want to extract the text inside the p tag, so we’ll dive into the body:

body = list(html.children)[3]
Now, we can get the p tag by finding the children of the body tag:

list(body.children)
['\n', <p>Here is some simple content for this page.</p>, '\n']
We can now isolate the p tag:

p = list(body.children)[1]
Once we’ve isolated the tag, we can use the get_text method to extract all of the text inside the tag:

p.get_text()
'Here is some simple content for this page.'
Finding all instances of a tag at once
What we did above was useful for figuring out how to navigate a page, but it took a lot of commands to do something fairly simple. If we want to extract all instances of a tag, we can instead use the find_all method, which will find every instance of a tag on a page:

soup = BeautifulSoup(page.content, 'html.parser')
soup.find_all('p')
[<p>Here is some simple content for this page.</p>]

Note that find_all returns a list, so we’ll have to loop through it, or use list indexing, to extract the text:

soup.find_all('p')[0].get_text()
'Here is some simple content for this page.'
If you instead only want to find the first instance of a tag, you can use the find method, which will return a single BeautifulSoup object:

soup.find('p')
<p>Here is some simple content for this page.</p>
Searching for tags by class and id
We introduced classes and ids earlier, but it probably wasn’t clear why they were useful. Classes and ids are used by CSS to determine which HTML elements to apply certain styles to. We can also use them when scraping to specify specific elements we want to scrape. To illustrate this principle, we’ll work with the following page:
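That page is hosted at the URL given below; reconstructed from the outputs in this section, it looks roughly like this (the exact wording of the inner paragraphs is an assumption):

```html
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <div>
            <p class="inner-text first-item" id="first">
                First paragraph.
            </p>
            <p class="inner-text">
                Second paragraph.
            </p>
        </div>
        <p class="outer-text first-item" id="second">
            <b>
                First outer paragraph.
            </b>
        </p>
        <p class="outer-text">
            <b>
                Second outer paragraph.
            </b>
        </p>
    </body>
</html>
```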
We can access the above document at the URL http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html. Let’s first download the page and create a BeautifulSoup object:

page = requests.get("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
soup = BeautifulSoup(page.content, 'html.parser')
Now, we can use the find_all method to search for items by class or by id. In the below example, we’ll search for any p tag that has the class outer-text:

soup.find_all('p', class_='outer-text')
[<p class="outer-text first-item" id="second"> <b> First outer paragraph. </b> </p>, <p class="outer-text"> <b> Second outer paragraph. </b> </p>]
In the below example, we’ll look for any tag (not just p tags) that has the class outer-text:

soup.find_all(class_="outer-text")
[<p class="outer-text first-item" id="second"> <b> First outer paragraph. </b> </p>, <p class="outer-text"> <b> Second outer paragraph. </b> </p>]
We can also search for elements by id:

soup.find_all(id="first")
[<p class="inner-text first-item" id="first"> First paragraph. </p>]
Using CSS Selectors
You can also search for items using CSS selectors. These selectors are how the CSS language allows developers to specify HTML tags to style. Here are some examples:
- p a — finds all a tags inside of a p tag.
- body p a — finds all a tags inside of a p tag inside of a body tag.
- html body — finds all body tags inside of an html tag.
- p.outer-text — finds all p tags with a class of outer-text.
- p#first — finds all p tags with an id of first.
- body p.outer-text — finds any p tags with a class of outer-text inside of a body tag.
You can learn more about CSS selectors here.
BeautifulSoup objects support searching a page via CSS selectors using the select method. We can use CSS selectors to find all the p tags in our page that are inside of a div like this:
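A self-contained sketch (the page HTML is inlined here, reconstructed from the outputs above, so the snippet runs offline):

```python
from bs4 import BeautifulSoup

# A reconstruction of the ids_and_classes example page, inlined for offline use
html_doc = """
<html>
<head><title>A simple example page</title></head>
<body>
<div>
<p class="inner-text first-item" id="first">First paragraph.</p>
<p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# "div p" selects p tags that are descendants of a div
print(soup.select("div p"))
```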
References:
- Python Web Scraping Tutorial using BeautifulSoup - Dataquest
- Installation - pandas 0.24.2 documentation
- Python XML processing with lxml
Check out the complete source code in the following link: