Introduction to web scraping

What is web scraping?

Web scraping is a technique for gathering data from web pages. A scraper is a script that fetches a page and parses its HTML. Because a scraper depends on the structure of the page, it is likely to break when the site is redesigned.
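
To make the redesign problem concrete, here is a minimal sketch that uses only the html.parser module from Python’s standard library (not the library used later in this tutorial). The two HTML snippets and the class names headline and story-title are made up for illustration.

```python
from html.parser import HTMLParser


class HeadlineParser(HTMLParser):
    """Collect the text of every <h2 class="headline"> element."""

    def __init__(self):
        super().__init__()
        self.in_headline = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "h2" and dict(attrs).get("class") == "headline":
            self.in_headline = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_headline = False

    def handle_data(self, data):
        if self.in_headline:
            self.headlines.append(data.strip())


def scrape_headlines(html):
    parser = HeadlineParser()
    parser.feed(html)
    return parser.headlines


# Hypothetical markup before and after a redesign
old_page = '<h2 class="headline">Rain expected</h2>'
new_page = '<h2 class="story-title">Rain expected</h2>'

print(scrape_headlines(old_page))  # the headline is found
print(scrape_headlines(new_page))  # nothing matches the old class name
```

The scraper itself did not change between the two calls. Only the class name in the page did, and that was enough for it to return nothing.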

Some of the languages that support scraping and the libraries involved:

  • Python - Scrapy and Beautiful Soup
  • Java - Jaunt and Jsoup
  • Node.js - Noodle and Osmosis (read more here)

Although many libraries support web scraping, this tutorial focuses on web scraping with Python libraries.

Web scraping in Python

Why Python?

Python is one of the most popular languages for web scraping. Two widely used Python libraries for this purpose are Beautiful Soup and Scrapy.

In this tutorial, I will use Beautiful Soup since it is easy to learn and well suited to small scraping tasks. The tutorial is broken into manageable steps that should get you up to speed with scraping.

Installation

This tutorial assumes that the reader is an absolute beginner or has never coded in Python.


  • Download Python. Some versions of macOS ship with Python preinstalled; you can still download Python 3 if you prefer it. Windows users need to add Python to the PATH in the system environment variables. Check this.
  • Set up a virtual environment. A virtual environment keeps the dependencies required by different projects in separate places by creating an isolated Python environment for each of them. The virtualenv tool creates these isolated environments.
  • Use virtualenvwrapper. It is a set of extensions to the virtualenv tool. The extensions include wrappers for creating and deleting virtual environments and otherwise managing your development workflow, making it easier to work on more than one project at a time without introducing conflicts in their dependencies. Check this.
  • Install your favourite text editor, for example Sublime Text or Atom.
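
Once the steps above are done, a quick check can confirm which Python you are running and whether a virtual environment is active. This is a minimal sketch: the helper name in_virtualenv is made up for illustration, and the check assumes Python 3 with an environment created by venv or a recent version of virtualenv.

```python
import sys


def in_virtualenv():
    """Return True when Python is running inside a virtual environment."""
    # Inside a virtual environment, sys.prefix points at the environment,
    # while sys.base_prefix points at the Python installation it was
    # created from. Outside an environment the two are equal.
    return sys.prefix != sys.base_prefix


print("Python version:", sys.version_info.major, sys.version_info.minor)
print("Inside a virtual environment:", in_virtualenv())
```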

I presume you’ve gone through the initial steps and are ready to delve into scraping.

In your virtual environment, run the command:

pip install beautifulsoup4 requests
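
The script in this tutorial imports both bs4 (the import name of the beautifulsoup4 package) and requests, so it helps to confirm that both can be found before going further. A minimal sketch, where the helper name missing_packages is made up for illustration:

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# An empty list means both packages are installed in this environment
print(missing_packages(["bs4", "requests"]))
```

If a name appears in the printed list, install it with pip inside the virtual environment and run the check again.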

Open your text editor and create a file named scraper.py.

In this tutorial, we will scrape the website of one of Kenya’s media houses, the Standard Media, specifically its trending news section.

Copy the code below to your scraper file.

import json
import requests
from bs4 import BeautifulSoup


class Scraper(object):
    def __init__(self):
        # Site whose trending news section we want to scrape
        self.url = "https://www.standardmedia.co.ke"

    def scrape_site(self):
        # Fetch the page and parse the returned HTML
        res = requests.get(self.url)
        html = BeautifulSoup(res.content, 'html.parser')
        data = []
        # Container that holds the trending stories
        div = html.find("div", class_="col-xs-6 col-md-6 zero")
        if div is not None:
            # Each matching <ul> holds one story
            ul = div.find_all("ul", class_="sub-stories-2")
            for item in ul:
                img_url = item.find("img").get("src")
                text = item.find("img").get("alt")
                link = item.find("li").find("a").get("href")
                data.append({
                    'title': text,
                    'link': link,
                    'img': img_url
                })
        return json.dumps(data, indent=2)


scraper = Scraper()
print(scraper.scrape_site())

Let’s discuss the code above:

  • The first three lines are imports. These are the Python libraries that our code needs in order to run.
  • I decided to implement the functionality using OOP (Object Oriented Programming). Scraper is our class, which serves as the template for all scraping. In the __init__ method we set the URL that we will get data from.
  • The method scrape_site does the scraping.

res = requests.get(self.url)

This statement sends an HTTP GET request to the given URL and returns a Response object. The status code of the response is available as res.status_code; in this case it is 200, which means the request succeeded.
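
A scraper should confirm that the request succeeded before it parses the body. The following is a minimal sketch of one way to do that, not part of the script above: the helper name fetch_html and the 10-second timeout are assumptions chosen for illustration.

```python
import requests


def fetch_html(url, timeout=10):
    """Fetch `url` and return the response body as bytes."""
    res = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx status codes
    res.raise_for_status()
    return res.content


# Example use (needs network access):
# html_bytes = fetch_html("https://www.standardmedia.co.ke")
```

With this helper, a failed request stops the script with a clear error instead of handing an error page to the parser.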

html = BeautifulSoup(res.content, 'html.parser')

This statement creates a BeautifulSoup object using Python’s built-in HTML parser. It builds a parse tree from the page passed to it, which we can then search to extract data from the HTML.
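
To see what the parse tree offers, here is a small sketch that parses a made-up page from a string instead of a live response. The markup and its contents are invented for illustration.

```python
from bs4 import BeautifulSoup

# A tiny made-up page, used instead of a live request
page = """
<html>
  <head><title>Trending news</title></head>
  <body>
    <p class="intro">Top stories today</p>
    <a href="/story/1">First story</a>
  </body>
</html>
"""

soup = BeautifulSoup(page, "html.parser")

print(soup.title.string)           # text inside <title>
print(soup.find("p").get_text())   # text of the first <p>
print(soup.find("a").get("href"))  # value of the href attribute
```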

Note: It’s key to have some HTML knowledge, since it helps in unpacking data that is contained in HTML tags and identified by CSS classes. Find more information on HTML here.

Once the parser returns the HTML content, it’s time to unpack it and get the content we want.

div = html.find("div", class_="col-xs-6 col-md-6 zero")
ul = div.find_all("ul", class_="sub-stories-2")

The two statements above demonstrate how to extract data from the HTML content using HTML tags and CSS (Cascading Style Sheets) classes. The div variable holds the first div tag whose class attribute is col-xs-6 col-md-6 zero, or None if the page has no such tag.

The ul variable is a list-like result set of all the ul tags inside that div that have the CSS class sub-stories-2, together with the data they contain.
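
The difference between find and find_all is easier to see on a small example. This sketch uses made-up markup that mimics the structure the scraper expects. The class names other and no-such-class are invented for illustration.

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics the structure the scraper expects
page = """
<div class="col-xs-6 col-md-6 zero">
  <ul class="sub-stories-2"><li><a href="/a">A</a></li></ul>
  <ul class="sub-stories-2"><li><a href="/b">B</a></li></ul>
  <ul class="other"><li><a href="/c">C</a></li></ul>
</div>
"""

html = BeautifulSoup(page, "html.parser")

# find returns the first matching tag, or None when nothing matches
div = html.find("div", class_="col-xs-6 col-md-6 zero")
missing = html.find("div", class_="no-such-class")

# find_all returns a list-like ResultSet of every matching tag
ul = div.find_all("ul", class_="sub-stories-2")

print(missing is None)
print(len(ul))
print([u.find("a").get("href") for u in ul])
```

Because find can return None, calling a method on its result without a check raises an AttributeError when the page layout changes.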

for item in ul:
    img_url = item.find("img").get("src")
    text = item.find("img").get("alt")
    link = item.find("li").find("a").get("href")
    data.append({
        'title': text,
        'link': link,
        'img': img_url
    })

The statements above extract the data we want from the subsection we obtained from the page. In this case we want the title, link and image of each story. We store the results in a list named data.
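
The loop above assumes that every story contains an img tag and a link. As a hedged variation, the sketch below skips stories where one of them is missing instead of raising an error. The helper name extract_story and the sample markup are made up for illustration, and the helper looks up the a tag directly rather than going through li first.

```python
import json

from bs4 import BeautifulSoup


def extract_story(item):
    """Return a dict for one story, or None if an expected tag is missing."""
    img = item.find("img")
    anchor = item.find("a")
    if img is None or anchor is None:
        return None
    return {
        "title": img.get("alt"),
        "link": anchor.get("href"),
        "img": img.get("src"),
    }


# Made-up markup: the second <ul> has no <img>, so it is skipped
page = """
<ul class="sub-stories-2">
  <li><a href="/story/1"><img src="/1.jpg" alt="First story"></a></li>
</ul>
<ul class="sub-stories-2">
  <li><a href="/story/2">Second story</a></li>
</ul>
"""

html = BeautifulSoup(page, "html.parser")
stories = [extract_story(ul) for ul in html.find_all("ul", class_="sub-stories-2")]
data = [story for story in stories if story is not None]
print(json.dumps(data, indent=2))
```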


Web scraping is a useful skill, especially for data enthusiasts.

Thanks for reading, and I hope this tutorial is of help.