How to scrape a news website using Python, BeautifulSoup, and Selenium to build a Word Cloud
Better visualization with a word cloud
From early 2020 to the present, what has been the most frequent news topic in Hong Kong newspapers: the Hong Kong social movement, the 2019 coronavirus, or the China–United States trade war? I scraped a well-known newspaper in Hong Kong, hoping that a good visualization method, the word cloud, would give me the answer.
What is web scraping?
Web scraping tools are used to extract information from websites. They are also known as web harvesting tools or web data extraction tools.
Why do web scraping?
Web scraping tools can be used for countless purposes in a variety of scenarios. For example:
- Collect market research data
- Collect stock market information
- Collect contact information
- Collect data to download for offline reading or storage
- Track prices in multiple markets, etc.
How to scrape in Python?
Scraping a website in Python is very easy, especially with the help of the BeautifulSoup and Selenium libraries. Beautiful Soup is a Python library that lets developers quickly parse a page's HTML and extract useful data from it with a small amount of code, reducing development time and speeding up the scraper. Selenium is a tool for automated testing of web pages; through the methods it provides, it can drive the browser automatically and fully simulate the actions of a real user.
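As a quick illustration of what Beautiful Soup does (a minimal standalone sketch, not part of the scraper below), parsing a small HTML fragment and pulling out a link looks like this:

```python
from bs4 import BeautifulSoup

html = '<div class="story"><a href="/news/1">Headline</a></div>'
soup = BeautifulSoup(html, 'html.parser')

link = soup.find('a')         # first <a> tag in the fragment
print(link.attrs['href'])     # → /news/1
print(link.get_text())        # → Headline
```

The same `find` / `find_all` calls are what the scraper below uses on the real newspaper pages.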
Before scraping a website, some preparation is needed. Enter the following commands in a Jupyter notebook to install the libraries:
!pip install BeautifulSoup4
!pip install selenium
What is a word cloud and why do we need it?
WordCloud is a very good third-party word cloud visualization library for Python. A word cloud is a visual display of the keywords that appear most frequently in a text. WordCloud filters out low-frequency, low-value words so that the audience can grasp the main point of the text at a glance.
What is word segmentation?
Word segmentation is the process of splitting a continuous character sequence into a sequence of words according to certain rules. Jieba is one of the better Chinese word segmentation libraries; because Chinese is written without spaces between words, we need Jieba to do the segmentation for us.
Before generating a word cloud image, some preparation is needed. Enter the following commands in a Jupyter notebook to install the libraries:
!pip install wordcloud
!pip install jieba
# Load the libraries
#! /usr/bin/env python
# -*- encoding: UTF-8 -*-
import os
import sys
from importlib import reload
reload(sys)
if sys.version[0] == '2':
    # Python 2: force UTF-8 as the default encoding
    sys.setdefaultencoding("utf-8")
    import urlparse
    import urllib2
else:
    import urllib.parse
    from urllib.request import urlopen

import re
import jieba
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver import ChromeOptions
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from PIL import Image
import matplotlib.pyplot as plt
%matplotlib inline

jieba.enable_paddle()
# Clean the CSS, JavaScript and HTML tags
def cleanHTML(html):
    # remove all javascript and stylesheet code
    for script in html(["script", "style"]):
        script.extract()
    # get the visible text
    text = html.get_text()
    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)
    return text
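Applied to a small HTML fragment, the same technique (remove `script`/`style` tags, then keep only the visible text) looks like this standalone sketch:

```python
from bs4 import BeautifulSoup

html = BeautifulSoup(
    '<div><script>alert(1)</script><style>p{}</style><p> Hello </p><p>World</p></div>',
    'html.parser')

# remove all javascript and stylesheet code, as cleanHTML does
for script in html(["script", "style"]):
    script.extract()

# strip surrounding whitespace and drop blank lines
text = '\n'.join(line.strip() for line in html.get_text().splitlines() if line.strip())
print(text)   # → Hello World
```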
# Find the main focus news link of the day
def getNewsLink(news_date):
    try:
        options = ChromeOptions()
        options.add_argument('headless')
        driver = webdriver.Chrome(options=options)
        url = 'https://orientaldaily.on.cc/cnt/news/' + str(news_date) + '/mobile/index.html'
        driver.get(url)
        driver.implicitly_wait(30)
        html_source = driver.page_source.encode('utf-8')
        driver.quit()
        soup = BeautifulSoup(html_source, 'html.parser')
        news = soup.find('div', attrs={'id': 'swipe'})
        main_focus = soup.find('div', attrs={'class': 'main-focus-container'})
        main_focus_link = 'https://orientaldaily.on.cc' + \
            main_focus.find('a', href=re.compile(r'[/]([a-z]|[A-Z])\w+')).attrs['href']
        return main_focus_link
    except:
        return 0
# Data Collection
def getNews(news_url):
    options = ChromeOptions()
    options.add_argument('headless')
    driver = webdriver.Chrome(options=options)
    driver.get(news_url)
    driver.implicitly_wait(30)
    html_source = driver.page_source.encode('utf-8')
    driver.quit()
    soup = BeautifulSoup(html_source, 'html.parser')
    paragraph = soup.find_all('div', attrs={'class': 'paragraph'})
    paragraph_list = []
    for sub_paragraph in paragraph:
        clean_sub_paragraph = cleanHTML(sub_paragraph)
        paragraph_list.append(clean_sub_paragraph)
    # keep only the non-empty paragraphs
    full_paragraph_list = [e for e in paragraph_list if e]
    if len(full_paragraph_list) > 0:
        f = open("news.txt", "a+")
        for i in range(len(full_paragraph_list)):
            for line in full_paragraph_list[i].splitlines():
                # cleanText is a helper from the full project (see the GitHub repo)
                clean_paragraph = cleanText(line)
                f.write(clean_paragraph)
            f.write('\n\n')
        f.close()
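getNews calls a cleanText helper whose definition is not shown in this post; its exact behavior is an assumption here, but a minimal version that normalizes whitespace might look like:

```python
import re

def cleanText(text):
    # hypothetical helper: replace non-breaking spaces, then
    # collapse runs of whitespace into single spaces
    text = text.replace('\xa0', ' ')
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

print(cleanText('  Hello\xa0\xa0world \n'))   # → Hello world
```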
# Draw the word cloud
def draw_word_cloud(text, images_name, plt_title):
    images_path = images_name
    images = Image.open(images_path)
    # create a white mask and paste the image onto it
    images_mask = Image.new("RGB", images.size, (255, 255, 255))
    images_mask.paste(images, images)
    images_mask = np.array(images_mask)
    color = ImageColorGenerator(images_mask)
    # Chinese needs another font; download one from https://www.freechinesefont.com/
    font_path = 'HanyiSentyCandy.ttf'
    # create the word cloud
    wc = WordCloud(font_path=font_path, max_font_size=250, max_words=1000,
                   mask=images_mask, margin=5,
                   background_color="black").generate_from_text(text)
    wc.recolor(color_func=color, random_state=7)
    # save the image
    wc.to_file("news.png")
    plt.rcParams["figure.figsize"] = (16, 12)
    plt.title(plt_title)
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.show()
# Main logic:
if __name__ == "__main__":
    start_date = input("Enter the start date (yyyymmdd): ")
    end_date = input("Enter the end date (yyyymmdd): ")
    if (start_date != "") and (end_date != ""):
        daterange = pd.date_range(start_date, end_date)
        # Download the main focus news
        for news_date in daterange:
            print('Downloading ' + str(news_date) + ' news...')
            # dateConvert (from the full project) turns the Timestamp into a yyyymmdd string
            single_news_date = dateConvert(news_date)
            news_link = getNewsLink(single_news_date)
            if news_link != 0:   # getNewsLink returns 0 on failure
                getNews(news_link)
        source_text = open('news.txt', 'r', encoding='UTF-8').read()
        tokens = ' '.join(jieba.cut_for_search(source_text))
        title = str(start_date) + ' - ' + str(end_date) + ' Main Focus News on on.cc'
        draw_word_cloud(tokens, 'hongkongpng.png', title)
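Note that pd.date_range yields pandas Timestamp objects, and dateConvert is not shown in this post; a yyyymmdd string can be recovered with strftime (a sketch, assuming that is all dateConvert does):

```python
import pandas as pd

# each element of the range is a Timestamp, not a 'yyyymmdd' string
daterange = pd.date_range('20200101', '20200103')
converted = [d.strftime('%Y%m%d') for d in daterange]
print(converted)   # → ['20200101', '20200102', '20200103']
```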
Future Improvements
- Add STOPWORDS processing
- Try a different word segmentation library, such as thulac, FoolNLTK, HanLP, nlpir, or ltp.
Thanks for reading! If you enjoyed the post, please show your support by clapping (👏🏼) below or by sharing this article so others can find it.
Finally, I hope you have picked up some scraping techniques. You can also find the full project in the GitHub repository.
References