An Introduction to Web Scraping With Python
A starter’s guide to extracting data from websites
For many purposes, you may need to extract data from websites. This is known as “web scraping”. In this article, you’ll learn how to perform a basic web scraping task using Python.
First, you need to install Python on your machine.
You can download and install Python for any platform from https://www.python.org/downloads/
There are many ways to perform web scraping. I will mainly discuss web scraping using the “requests” and “BeautifulSoup” libraries.
Python does not include these libraries by default, so you have to install them yourself.
Let’s see how to install the required libraries. Open your preferred terminal and run the following commands:
pip install requests
pip install beautifulsoup4
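To confirm that both packages were installed, a quick check along the following lines may help. This is only a minimal sketch: the helper name installed_version is made up for illustration, and it assumes Python 3.8 or newer, where importlib.metadata is part of the standard library.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Report the status of the two libraries used in this article.
for name in ("requests", "beautifulsoup4"):
    print(name, installed_version(name) or "is not installed")
```

If either line reports that the package is not installed, re-run the matching pip command in the same environment you use to run Python.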
Now we are ready to use Python for web scraping. Let’s write some code.
Step 1 — Import the requests and BeautifulSoup libraries
import requests
from bs4 import BeautifulSoup
Step 2 — Fetch the page you want to scrape and parse its HTML. (Here I have used the well-known “lorem ipsum” site for this demonstration.)
source = requests.get("https://www.lipsum.com/")
# Parse the response body with Python's built-in HTML parser
soup = BeautifulSoup(source.text, "html.parser")
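In practice a request can fail or hang, so it may be worth wrapping the fetch in a small helper. The sketch below is one possible approach, not part of the original walkthrough: the function name fetch_soup and the 10-second timeout are assumptions chosen for illustration.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download a page and parse it, returning None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        # Turn HTTP error statuses (4xx, 5xx) into exceptions.
        response.raise_for_status()
    except requests.exceptions.RequestException as error:
        print(f"Could not fetch {url}: {error}")
        return None
    return BeautifulSoup(response.text, "html.parser")


# Usage (needs network access):
# soup = fetch_soup("https://www.lipsum.com/")
```

Catching requests.exceptions.RequestException covers connection errors, timeouts, invalid URLs, and the HTTP errors raised by raise_for_status, so the caller only has to check for None.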
Step 3 — Inspect the page and examine its DOM structure. (You can do this by inspecting the HTML in any web browser.)
The “lorem” site uses <h2> tags to mark titles, so we can get all <h2> tags with the following line of code:
results = soup.find_all("h2")
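To see what find_all returns without touching the network, you can try it on a small HTML string. The markup below is made up for illustration and is not taken from the lorem site.

```python
from bs4 import BeautifulSoup

# A small made-up page standing in for a real response body.
html = """
<html><body>
  <h1>Sample page</h1>
  <h2>First title</h2>
  <p>Some text.</p>
  <h2>Second title</h2>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns every matching tag, in document order.
headings = soup.find_all("h2")
print(len(headings))           # 2
print(headings[0].get_text())  # First title

# find returns only the first match, or None when nothing matches.
first_heading = soup.find("h2")
missing = soup.find("h3")
print(missing)                 # None
```

Each item in the result is a tag object, not a string, which is why the next step calls get_text() on it.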
Step 4 — Loop through the result and extract what you want
topics = []
for topic in results:
    line = topic.get_text().strip()
    topics.append(line)
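The same loop can also be written as a list comprehension. The sketch below is an alternative, not a replacement for the code above: it uses get_text(strip=True) to trim whitespace and additionally skips empty headings, which is an assumption about what you would want. The HTML string is made up for illustration.

```python
from bs4 import BeautifulSoup

html = "<h2>  What is Lorem Ipsum? </h2><h2></h2><h2>Why do we use it?</h2>"
soup = BeautifulSoup(html, "html.parser")

# get_text(strip=True) trims whitespace; the filter skips empty headings.
topics = [
    tag.get_text(strip=True)
    for tag in soup.find_all("h2")
    if tag.get_text(strip=True)
]
print(topics)  # ['What is Lorem Ipsum?', 'Why do we use it?']
```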
Step 5 — View the output by printing the “topics” list
print(topics)
Refer to the BeautifulSoup library documentation for more advanced use cases:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
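As a small taste of those more advanced features, the sketch below uses a CSS selector and reads tag attributes. The HTML, the #content id, and the link targets are all made up for illustration.

```python
from bs4 import BeautifulSoup

html = """
<div id="content">
  <h2 class="title">First title</h2>
  <a href="/about">About</a>
  <a href="https://example.com/docs">Docs</a>
</div>
<h2>Footer title</h2>
"""
soup = BeautifulSoup(html, "html.parser")

# select takes a CSS selector; here, only <h2> tags inside #content.
content_titles = [tag.get_text(strip=True) for tag in soup.select("#content h2")]

# Tag attributes can be read with get, which returns None if absent.
links = [a.get("href") for a in soup.find_all("a")]

print(content_titles)  # ['First title']
print(links)           # ['/about', 'https://example.com/docs']
```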
Here is the full code for your reference:
import requests
from bs4 import BeautifulSoup

source = requests.get("https://www.lipsum.com/")
soup = BeautifulSoup(source.text, "html.parser")
results = soup.find_all("h2")

topics = []
for topic in results:
    line = topic.get_text().strip()
    topics.append(line)
print(topics)
Stay tuned for more articles!