An Introduction to Web Scraping With Python

A starter’s guide on how to extract data from websites

Published in

LinkIT

2 min read · May 15, 2020

For many purposes, you may need to extract data from websites. This is commonly called “web scraping”. In this article, you’ll learn how to perform a basic web scraping task using Python.

First, you need to install Python on your machine.

You can download and install Python for any platform from https://www.python.org/downloads/

There are many ways to perform web scraping. I will mainly discuss web scraping using the “requests” and “BeautifulSoup” libraries.

Python does not include these two libraries by default, so you have to install them yourself.

Let’s see how to install the required libraries. Open your preferred terminal and run the following commands:

pip install requests
pip install beautifulsoup4
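
Before moving on, it can help to confirm that both libraries are visible to the Python interpreter you plan to use. The snippet below is a minimal sketch of such a check using only the standard library; the helper name `missing_packages` is my own and is not part of either library.

```python
from importlib.util import find_spec


def missing_packages(import_names):
    """Return the import names that cannot be found in this environment."""
    return [name for name in import_names if find_spec(name) is None]


# Note: "bs4" is the import name of the beautifulsoup4 package.
missing = missing_packages(["requests", "bs4"])
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("requests and BeautifulSoup are ready to use.")
```

If a name is reported as not installed, the usual cause is that pip installed the package for a different Python interpreter than the one running the script.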

Now we are ready to use Python for web scraping. Let’s write a little bit of code.

Step 1 — Import the requests and BeautifulSoup libraries

import requests
from bs4 import BeautifulSoup

Step 2 — Connect to the web page you want to scrape. (Here I have used the famous “lorem” site for this demonstration)

# Download the page
source = requests.get("https://www.lipsum.com/")
# Parse the returned HTML with Python's built-in parser
soup = BeautifulSoup(source.text, "html.parser")
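
A request can fail or hang, for example when the site is down or the URL is mistyped. As a hedged sketch, the download step could be wrapped in a small helper that sets a timeout and checks the HTTP status. The function name `fetch_html` and the 10-second default are assumptions of mine, not something the libraries prescribe.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page HTML as text, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        # Turn 4xx/5xx responses into exceptions instead of parsing an error page.
        response.raise_for_status()
    except requests.exceptions.RequestException as error:
        print(f"Request failed: {error}")
        return None
    return response.text


# Example usage (needs network access):
# html = fetch_html("https://www.lipsum.com/")
```

Returning None keeps the example short; in a larger script you may prefer to let the exception propagate so the caller can decide what to do.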

Step 3 — Examine the DOM structure of the page. (You can do this by inspecting the HTML in any web browser)

On the “lorem” site, titles are marked with <h2> tags, so we can get all <h2> tags with the following line of code:

results = soup.find_all("h2")
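
To see what find_all gives back without depending on a live site, you can try it on a small HTML string. The markup below is a made-up sample of mine, not the actual content of the “lorem” site.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used here instead of a live page.
SAMPLE_HTML = """
<html><body>
  <h1>Sample page</h1>
  <h2>What is it?</h2>
  <p>Some text.</p>
  <h2> Where does it come from? </h2>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all returns every matching tag, in document order.
headings = soup.find_all("h2")
print(len(headings))           # 2
print(headings[0].name)        # h2
print(headings[0].get_text())  # What is it?

# find returns only the first match, or None when nothing matches.
first_heading = soup.find("h2")
missing = soup.find("h3")
```

Each item in the result is a tag object rather than a plain string, which is why the next step calls get_text() on it.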

Step 4 — Loop through the result and extract what you want

topics = []
for topic in results:
    # Get the tag's text and trim surrounding whitespace
    line = topic.get_text().strip()
    topics.append(line)
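
The same loop can also be written as a list comprehension. The sketch below wraps it in a helper that additionally skips headings with no text; the function name `extract_titles` and the sample markup are my own, for illustration only.

```python
from bs4 import BeautifulSoup


def extract_titles(html, tag_name="h2"):
    """Return the stripped text of every tag_name element, skipping empty ones."""
    soup = BeautifulSoup(html, "html.parser")
    titles = [tag.get_text().strip() for tag in soup.find_all(tag_name)]
    # Drop headings that contained nothing but whitespace.
    return [title for title in titles if title]


# Hypothetical sample markup: the second heading is empty on purpose.
sample = "<h2> First </h2><h2>  </h2><h2>Second</h2>"
print(extract_titles(sample))  # ['First', 'Second']
```

Taking the HTML as a string keeps the parsing step separate from the download, so it can be tried on saved pages or small samples.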

Step 5 — You can view the output by simply printing the “topics” list

print(topics)

Refer to the BeautifulSoup library documentation for more advanced use cases:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Here I have added the full code for your reference

import requests
from bs4 import BeautifulSoup

source = requests.get("https://www.lipsum.com/")
soup = BeautifulSoup(source.text, "html.parser")
results = soup.find_all("h2")
topics = []
for topic in results:
    line = topic.get_text().strip()
    topics.append(line)

print(topics)

Stay tuned for more articles!


Written by Yohan Kulasinghe