[Hands-on] Tutorial on Web Scraping with Python for Beginners
--
This is a hands-on introduction to web scraping, so I won’t talk about theory like what data/web scraping is or why we need it. Let’s get to the real thing!
In this project we will collect information from the website of CGTN. CGTN is an international media outlet based in China, and their servers are probably in China, which is why their website takes a long time to load. The website also has a lot of heavy animations, which makes the user experience not so nice. That’s the perfect reason to scrape their website, haha.
I’m going to assume that you already have basic Python knowledge and have Python installed. Let’s make a new directory for our project. I called mine ‘scraping-cgtn’. Inside the directory, open your terminal or command prompt and enter this command:
python -m virtualenv ./venv
This will create a new directory with a copy of Python from your system. That is called a virtual environment. The reason we use one is that we don’t want to mix the libraries we use for this project with the libraries of other projects.
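If the command above fails because virtualenv is not installed on your machine, Python 3 ships with a built-in venv module that does the same job; you can run this instead:
python -m venv ./venv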
After creating the virtual environment, we will activate it with this command:
# Mac OS or Linux
source venv/bin/activate

# Windows
venv\Scripts\activate
Notice that there is (venv) in front of the prompt now:
(venv) C:\Users\steve\Documents\scraping-cgtn>
It means we are now using the virtual environment for our project. Don’t close your terminal or command prompt yet. If you close it, you will need to activate the virtual environment again.
We need to install BeautifulSoup and requests (the library we will use to download the page). Run this in the terminal and wait:
pip install beautifulsoup4 requests
Let’s create a new file called top_news.py. Add these lines at the top of the file:
from bs4 import BeautifulSoup
import requests
We are using requests to make requests to the website and BeautifulSoup to parse the response.
Now we send our request and retrieve the response:
url = "https://www.cgtn.com/"
response = requests.get(url)
response_text = response.text
As you can see, we make our request with the get function. That means we are requesting the resource with the GET method.
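As an optional sanity check (my addition, not part of the original steps), you can confirm that the request succeeded before parsing it; raise_for_status will raise an error for a 4xx or 5xx response:
# Optional: stop early if the request failed (e.g. 404 or 500)
response.raise_for_status()
print(response.status_code)  # a successful request prints 200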
Create an instance of BeautifulSoup to parse the response_text:
soup = BeautifulSoup(response_text, 'html.parser')
Your top_news.py file should now look like this:
from bs4 import BeautifulSoup
import requests
url = "https://www.cgtn.com/"
response = requests.get(url)
response_text = response.text
soup = BeautifulSoup(response_text, 'html.parser')
The information we are trying to get is the list of top news from CGTN. Let’s analyze how this information is placed on the website.
Right-click on any blank space on the website, click ‘Inspect Element’, then click the element picker tool.
With this tool you can see how every element is placed in the HTML page. What we want to find now is the location of the title of the top news.
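If you prefer to inspect the HTML from Python instead of the browser (an optional aside, not part of the original flow), BeautifulSoup can print an indented version of the parsed page:
# Optional: print the first part of the parsed HTML to inspect its structure
print(soup.prettify()[:2000])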
This div with class top-news-item-content has the information we need: the title, the published time, and the category of the news.
On the website, there are 5 top news items. We need to verify that there are indeed 5 div elements with class top-news-item-content. Add this code:
top_news_parents = soup.find_all(attrs={'class': 'top-news-item-content'})
print(len(top_news_parents))
Run the top_news.py file with this command:
python top_news.py
It should print 5 as the output. Cool, we have the parent elements of all the top news items. Delete the print statement in the last line as we don’t need it anymore.
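A small optional guard (my addition): if CGTN changes its layout, find_all may return fewer elements than expected, and the find calls we write later could return None. A check like this makes that kind of failure obvious:
# Optional: warn if the page structure is not what we expect
if len(top_news_parents) != 5:
    print(f"Warning: expected 5 top news items, found {len(top_news_parents)}")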
We need to go over each parent element and navigate down to find the title, published time, and category of each top news item. That sounds exactly like a loop. Let’s get them!
Make an empty list to save the news.
news_list = []
Then we are going to start our loop:
for parent in top_news_parents:
Look closely and you can find that the title is in an anchor element:
Add this code inside your loop (don’t forget to add 4 spaces of indentation in front of each line):
news_anchor = parent.find('a')
news_title = news_anchor.text.strip()
print(news_title)
In the first line, we find the anchor element. Then we obtain its text and call the strip function to remove the newlines from the beginning and the end of the text.
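To make strip’s effect concrete, here is a tiny standalone example (the sample string is just for illustration):
# strip() removes leading and trailing whitespace, including newlines
raw = "\n  China, Singapore vow to enhance cooperation \n"
print(raw.strip())  # -> China, Singapore vow to enhance cooperation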
Run top_news.py again and you should see the 5 top news titles in the output:
China, Singapore vow to enhance cooperation
Carrie Lam confident in HK's integration into national development
Israeli PM visits Egypt in first official trip in a decade
Norway's leftist opposition wins election with coalition talks ahead
18th China-ASEAN Expo concludes with record deals
Remove the last print statement. Next, we will obtain the published time.
The published time is in a span element with class publishTime. Add this code inside your loop:
news_time_span = parent.find('span', attrs={'class': 'publishTime'})
news_time = news_time_span.text.strip()
print(news_time)
We call the find function with the attrs parameter, which is a dictionary specifying that the class should be publishTime. The output should be similar to this:
07:17, 14-Sep-2021
09:32, 14-Sep-2021
11:45, 14-Sep-2021
08:35, 14-Sep-2021
10:23, 14-Sep-2021
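As a side note, BeautifulSoup also accepts a class_ keyword argument (the trailing underscore avoids clashing with Python’s class keyword), which is equivalent to the attrs dictionary we used:
# Equivalent to parent.find('span', attrs={'class': 'publishTime'})
news_time_span = parent.find('span', class_='publishTime')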
Remove the last print statement. We are going to obtain the category now.
It is also inside a span with class property. Add this code inside your loop and run the file again:
category_span = parent.find('span', attrs={'class': 'property'})
category = category_span.text.strip()
print(category)
The output should be similar to this:
Asia
China
Middle East
Europe
Economy
Remove the print statement again as we don’t need it anymore. We are going to save every news item inside the news_list that we created before. Add these lines inside the loop:
news = {
'title': news_title,
'time': news_time,
'category': category,
}
news_list.append(news)
We are creating a dictionary for every news item and appending it to news_list.
Let’s see the final result of news_list. Now add this code outside of the loop:
print(news_list)
Your top_news.py should now look like this:
from bs4 import BeautifulSoup
import requests

url = "https://www.cgtn.com/"
response = requests.get(url)
response_text = response.text

soup = BeautifulSoup(response_text, 'html.parser')
top_news_parents = soup.find_all(attrs={'class': 'top-news-item-content'})

news_list = []
for parent in top_news_parents:
    news_anchor = parent.find('a')
    news_title = news_anchor.text.strip()
    news_time_span = parent.find('span', attrs={'class': 'publishTime'})
    news_time = news_time_span.text.strip()
    category_span = parent.find('span', attrs={'class': 'property'})
    category = category_span.text.strip()
    news = {
        'title': news_title,
        'time': news_time,
        'category': category,
    }
    news_list.append(news)

print(news_list)
The output should be like this:
[{'title': 'China, Singapore vow to enhance cooperation', 'time': '07:17, 14-Sep-2021', 'category': 'Asia'}, {'title': "Carrie Lam confident in HK's integration into national development", 'time': '09:32, 14-Sep-2021', 'category': 'China'}, {'title': 'Israeli PM visits Egypt in first official trip in a decade', 'time': '11:45, 14-Sep-2021', 'category': 'Middle East'}, {'title': 'Blinken defends Afghan withdrawal at testy U.S. congressional hearing', 'time': '12:39, 14-Sep-2021', 'category': 'World'}, {'title': '18th China-ASEAN Expo concludes with record deals', 'time': '10:23, 14-Sep-2021', 'category': 'Economy'}]
Congratulations, you have finished this hands-on tutorial. You have successfully scraped the top news from CGTN and the data is ready for further processing.
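As an optional next step (not part of the original walkthrough), you could save news_list to a JSON file so another script can pick it up later; the filename here is just an example:
import json

# Optional: persist the scraped data for later processing
with open('top_news.json', 'w', encoding='utf-8') as f:
    json.dump(news_list, f, ensure_ascii=False, indent=2)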