How to scrape a news website using Python, BeautifulSoup, and Selenium to build a Word Cloud
Better visualization with a word cloud
From early 2020 to the present, what has been the most frequent news topic in Hong Kong newspapers: the Hong Kong social movement, the 2019 coronavirus, or the China–United States trade war? I scraped a well-known newspaper in Hong Kong, hoping that a good visualization method, the word cloud, would give me the answer.
What is web scraping?
Web scraping tools are used to extract information from websites. They are also known as web harvesting tools or web data extraction tools.
Why do web scraping?
Web scraping tools can be used for countless purposes in a variety of scenarios. For example:
- Collect market research data
- Collect stock market information
- Collect contact information
- Collect data to download for offline reading or storage
- Track prices in multiple markets, etc.
How to scrape in Python?
Scraping a website in Python is very easy, especially with the help of the BeautifulSoup and Selenium libraries. Beautiful Soup is a Python library that lets developers quickly parse a page's HTML and extract useful data from it with a small amount of code, reducing development time and speeding up the scraper. Selenium is a tool for automated testing of web pages; through the methods it provides, it can drive the browser automatically and fully simulate the actions of a real user.
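As a quick illustration of what Beautiful Soup does (a minimal standalone sketch, not part of the scraper below), parsing a small HTML fragment and pulling out a link looks like this:

```python
from bs4 import BeautifulSoup

html = '<div class="story"><a href="/news/1">Headline</a></div>'
soup = BeautifulSoup(html, 'html.parser')

link = soup.find('a')         # first <a> tag in the fragment
print(link.attrs['href'])     # → /news/1
print(link.get_text())        # → Headline
```

The same `find` / `find_all` calls are what the scraper below uses on the real newspaper pages.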
Before scraping a website, some preparation is needed. Enter the following commands in a Jupyter notebook to install the libraries:
!pip install BeautifulSoup4
!pip install selenium
What is a word cloud and why do we need it?
WordCloud is a very good third-party word cloud visualization library for Python. A word cloud is a visual display of the keywords that appear most frequently in a text. WordCloud filters out low-frequency, low-value words so that the audience can grasp the main point of the text at a glance.
What is word segmentation?
Word segmentation is the process of splitting a continuous character sequence into a sequence of words according to certain rules. Jieba is one of the better Chinese word segmentation libraries; because Chinese is written without spaces between words, we need Jieba to do the segmentation for us.
Before generating a word cloud image, some preparation is needed. Enter the following commands in a Jupyter notebook to install the libraries:
!pip install wordcloud
!pip install jieba
# Load the libraries
#! /usr/bin/env python
# -*- encoding: UTF-8 -*-
import os
import sys
from importlib import reload
reload(sys)
if sys.version[0] == '2':
    # Python 2: force UTF-8 as the default encoding
    sys.setdefaultencoding("utf-8")
    import urlparse
    import urllib2
else:
    import urllib.parse
    from urllib.request import urlopen

import re
import jieba
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver import ChromeOptions
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from PIL import Image
import matplotlib.pyplot as plt
%matplotlib inline

jieba.enable_paddle()
# Clean the CSS, JavaScript and HTML tags
def cleanHTML(html):
    # remove all javascript and stylesheet code
    for script in html(["script", "style"]):
        script.extract()
    # get the visible text
    text = html.get_text()
    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)
    return text
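Applied to a small HTML fragment, the same technique (remove `script`/`style` tags, then keep only the visible text) looks like this standalone sketch:

```python
from bs4 import BeautifulSoup

html = BeautifulSoup(
    '<div><script>alert(1)</script><style>p{}</style><p> Hello </p><p>World</p></div>',
    'html.parser')

# remove all javascript and stylesheet code, as cleanHTML does
for script in html(["script", "style"]):
    script.extract()

# strip surrounding whitespace and drop blank lines
text = '\n'.join(line.strip() for line in html.get_text().splitlines() if line.strip())
print(text)   # → Hello World
```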
# Find the main focus news link of the day
def getNewsLink(news_date):
    try:
        options = ChromeOptions()
        options.add_argument('headless')
        driver = webdriver.Chrome(options=options)
        url = 'https://orientaldaily.on.cc/cnt/news/' + str(news_date) + '/mobile/index.html'
        driver.get(url)
        driver.implicitly_wait(30)
        html_source = driver.page_source.encode('utf-8')
        driver.quit()
        soup = BeautifulSoup(html_source, 'html.parser')
        news = soup.find('div', attrs={'id': 'swipe'})
        main_focus = soup.find('div', attrs={'class': 'main-focus-container'})
        main_focus_link = 'https://orientaldaily.on.cc' + \
            main_focus.find('a', href=re.compile(r'[/]([a-z]|[A-Z])\w+')).attrs['href']
        return main_focus_link
    except:
        return 0
# Data Collection
def getNews(news_url):
    options = ChromeOptions()
    options.add_argument('headless')
    driver = webdriver.Chrome(options=options)
    driver.get(news_url)
    driver.implicitly_wait(30)
    html_source = driver.page_source.encode('utf-8')
    driver.quit()
    soup = BeautifulSoup(html_source, 'html.parser')
    paragraph = soup.find_all('div', attrs={'class': 'paragraph'})
    paragraph_list = []
    for sub_paragraph in paragraph:
        clean_sub_paragraph = cleanHTML(sub_paragraph)
        paragraph_list.append(clean_sub_paragraph)
    # keep only the non-empty paragraphs
    full_paragraph_list = [e for e in paragraph_list if e]
    if len(full_paragraph_list) > 0:
        f = open("news.txt", "a+")
        for i in range(len(full_paragraph_list)):
            for line in full_paragraph_list[i].splitlines():
                # cleanText is a helper from the full project (see the GitHub repo)
                clean_paragraph = cleanText(line)
                f.write(clean_paragraph)
            f.write('\n\n')
        f.close()
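getNews calls a cleanText helper whose definition is not shown in this post; its exact behavior is an assumption here, but a minimal version that normalizes whitespace might look like:

```python
import re

def cleanText(text):
    # hypothetical helper: replace non-breaking spaces, then
    # collapse runs of whitespace into single spaces
    text = text.replace('\xa0', ' ')
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

print(cleanText('  Hello\xa0\xa0world \n'))   # → Hello world
```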
# Draw the word cloud
def draw_word_cloud(text, images_name, plt_title):
    images_path = images_name
    images = Image.open(images_path)
    # create a white mask and paste the image onto it
    images_mask = Image.new("RGB", images.size, (255, 255, 255))
    images_mask.paste(images, images)
    images_mask = np.array(images_mask)
    color = ImageColorGenerator(images_mask)
    # Chinese needs another font; download one from https://www.freechinesefont.com/
    font_path = 'HanyiSentyCandy.ttf'
    # create the word cloud
    wc = WordCloud(font_path=font_path, max_font_size=250, max_words=1000,
                   mask=images_mask, margin=5,
                   background_color="black").generate_from_text(text)
    wc.recolor(color_func=color, random_state=7)
    # save the image
    wc.to_file("news.png")
    plt.rcParams["figure.figsize"] = (16, 12)
    plt.title(plt_title)
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.show()
# Main logic:
if __name__ == "__main__":
    start_date = input("Enter the start date (yyyymmdd): ")
    end_date = input("Enter the end date (yyyymmdd): ")
    if (start_date != "") and (end_date != ""):
        daterange = pd.date_range(start_date, end_date)
        # Download the main focus news
        for news_date in daterange:
            print('Downloading ' + str(news_date) + ' news...')
            # dateConvert (from the full project) turns the Timestamp into a yyyymmdd string
            single_news_date = dateConvert(news_date)
            news_link = getNewsLink(single_news_date)
            if news_link != 0:   # getNewsLink returns 0 on failure
                getNews(news_link)
        source_text = open('news.txt', 'r', encoding='UTF-8').read()
        tokens = ' '.join(jieba.cut_for_search(source_text))
        title = str(start_date) + ' - ' + str(end_date) + ' Main Focus News on on.cc'
        draw_word_cloud(tokens, 'hongkongpng.png', title)
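Note that pd.date_range yields pandas Timestamp objects, and dateConvert is not shown in this post; a yyyymmdd string can be recovered with strftime (a sketch, assuming that is all dateConvert does):

```python
import pandas as pd

# each element of the range is a Timestamp, not a 'yyyymmdd' string
daterange = pd.date_range('20200101', '20200103')
converted = [d.strftime('%Y%m%d') for d in daterange]
print(converted)   # → ['20200101', '20200102', '20200103']
```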
Future Improvements
- Add STOPWORDS processing
- Try a different word segmentation library, such as thulac, FoolNLTK, HanLP, nlpir, or ltp.
Thanks for reading! If you enjoyed the post, please show your support by clapping (👏🏼) below or by sharing this article so others can find it.
Finally, I hope you have picked up some scraping techniques. You can also find the full project in the GitHub repository.
References