Web Scraping with Python and Beautiful Soup

Charles Rajendran
Ascentic Technology
4 min read · May 7, 2020

For any machine learning task, the first thing we need to do is collect data. There are a number of ways to do this; let's look at some of them.

1. APIs: The preferred way to collect data is to consume APIs, because APIs are well structured and consuming one is very simple.

2. Public data sets: This is the second-best option: people have collected data and made it available to the public. Sites like Kaggle and the UCI Machine Learning Repository are good places to find public data sets.
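To illustrate why consuming an API is the easiest option, here is a minimal sketch using only the standard library. It fetches a JSON response from GitHub's public REST API (chosen purely as an illustration; any JSON API would work the same way):

```python
import json
from urllib.request import urlopen

# Fetch a JSON response and parse it straight into a Python dictionary.
# No HTML parsing needed - the data arrives already structured.
with urlopen('https://api.github.com/repos/python/cpython') as response:
    data = json.loads(response.read())

print(data['full_name'])  # e.g. python/cpython
```

Compare this with scraping, where the same information would have to be dug out of presentation markup.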

If you cannot find data through either of the above methods, the last option is web scraping: choose a website that provides the data you need, scrape its content, and prepare your own data set. In this post, I'm going to show you an example of scraping blog posts from the cricbuzz website.

Let's understand the scraping procedure. It's very simple: Python libraries let us fetch the HTML source of a page; once we have the HTML, we locate where the required data sits among all the other markup, then extract the content with regular expressions or some other technique.

As I said before, we could use regular expressions for the extraction part, but that would make our task a little hard, so in this example I'm going to use a third-party library called Beautiful Soup to extract the data from the source HTML. That's enough theory; let's do something practical.

We will start simply by scraping a single page and then extend the code to automate the scraping procedure. The post I am going to scrape: http://www.cricbuzz.com/cricket-news/100707/destinys-child-zimbabwes-middle-order-batsman-sikandar-raza-treats-triumphs-and-failures-the-same.

Let’s start,

Step 1 — Read the source HTML of the page.

from urllib.request import urlopen

# read the page
url = urlopen('http://www.cricbuzz.com/cricket-news/100707/destinys-child-zimbabwes-middle-order-batsman-sikandar-raza-treats-triumphs-and-failures-the-same')
html = url.read()
url.close()

I used urlopen to get the HTML; if you print the html variable, you will see the HTML source of the page.

Step 2 — Create a Beautiful Soup object from the HTML source, so we can use Beautiful Soup's functions and attributes.

# create a Beautiful Soup object and parse the HTML so we can use bs4 methods
import bs4
bs = bs4.BeautifulSoup(html, 'html.parser')

As you can see in the above code, the BeautifulSoup constructor takes the source HTML and a parser type as arguments. Beautiful Soup builds a tree representation of the source HTML, and different parsers will build different trees.
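The difference between parsers shows up most clearly on malformed markup. A small sketch, using a deliberately broken fragment:

```python
import bs4

# A malformed fragment: the closing </p> has no opening <p>.
broken = "<a></p>"

# Python's built-in html.parser simply drops the stray tag.
print(bs4.BeautifulSoup(broken, 'html.parser'))  # <a></a>

# Other parsers, if installed, repair the markup differently:
# 'lxml' wraps the result in <html><body>...</body></html>, and
# 'html5lib' also adds a <head>, mimicking what a real browser does.
```

For well-formed pages the choice matters far less, and html.parser has the advantage of requiring no extra installation.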

Before continuing to the next step, it's better to know some basics of Beautiful Soup.

1. Accessing an element with Beautiful Soup (it's as simple as BeautifulSoupObject.elementName)

title = bs.title
# output - <title itemprop="name">Destiny's child, Raza treats triumphs and failures the same | Cricbuzz.com</title>

2. To get the text inside an element (BeautifulSoupObject.Element.getText())

title_text = title.getText()
# output - Destiny's child, Raza treats triumphs and failures the same | Cricbuzz.com

3. To get a list of all the elements of an element type (BSObject.find_all(ElementType))

p_list = bs.find_all('p')
# output - [<p class="cb-nws-para"> .....

4. Filtering the list by attributes (BSObject.find_all('element', attribute_name='attribute_value'))

img_list = bs.find_all('img', height='30')
'''
output - [<img alt="Cricbuzz Logo" height="30" itemprop="image" src="//i.cricketcb.com/statics/site/images/cbz-logo.png"
style="bottom: -4px; position: relative;" title="Cricbuzz Logo" width="101"/>]
'''

# since class is a reserved keyword in Python, to filter elements
# by their class attribute we need to use class_ instead
p_list = bs.find_all('p', class_='cb-nws-para')

5. Get the value of an attribute (BSObject.Element.get('attributeName'))

bs.a.get('href')
# output - https://plus.google.com/104502282508811467249 (the href of the first a tag)

That's all we need to know about Beautiful Soup. Let's continue with web scraping. Before moving on to the code, we should determine how to extract the required data from the web page; for that we need to find the HTML element that contains the data.

First, let's see how we can extract the title of the post. We need to find the element that contains it (your browser's developer tools, via Inspect Element, will help).

The title of the post here is in an element like this,

<h1 class="nws-dtl-hdln" itemprop="headline"> ...... </h1>

Then what about the content? Under the title, each paragraph sits in a section element, like this,

<section class="cb-nws-dtl-itms" itemprop="articleBody"></section>

Step 3 — Scraping the content

# extracting the heading text and assigning it to the content variable
head_line = bs.find_all('h1', class_='nws-dtl-hdln')
content = head_line[0].getText() + '\n\n'

# extracting the paragraphs and concatenating them into content
for p in bs.find_all('section', class_='cb-nws-dtl-itms', itemprop='articleBody'):
    content = content + p.getText().strip() + '\n\n'

That's it. If you print content, you will see the scraped data.
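To automate the procedure beyond a single page, the same logic can be wrapped in a function and driven by links collected from a listing page. A sketch of that idea (the listing URL and the `/cricket-news/` link pattern are my assumptions about the site's structure; verify them in your browser before running):

```python
from urllib.request import urlopen
import bs4


def scrape_post(post_url):
    """Scrape one cricbuzz blog post: the headline plus its paragraphs."""
    with urlopen(post_url) as page:
        bs = bs4.BeautifulSoup(page.read(), 'html.parser')
    content = bs.find('h1', class_='nws-dtl-hdln').getText().strip() + '\n\n'
    for section in bs.find_all('section', class_='cb-nws-dtl-itms',
                               itemprop='articleBody'):
        content += section.getText().strip() + '\n\n'
    return content


# Collect post links from the news listing page and scrape each one.
listing = urlopen('http://www.cricbuzz.com/cricket-news')
soup = bs4.BeautifulSoup(listing.read(), 'html.parser')
listing.close()

for link in soup.find_all('a', href=True):
    if '/cricket-news/' in link['href']:
        print(scrape_post('http://www.cricbuzz.com' + link['href']))
```

Be polite when automating: add a short delay between requests and check the site's robots.txt.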

Note: All my code is available on GitHub; feel free to follow me there 😉.
