How to scrape a news website using Python, BeautifulSoup, and Selenium to build a Word Cloud

Better visualization with a word cloud

Kinder Sham
Analytics Vidhya
5 min read · May 25, 2020


From early 2020 to the present, what has been the most frequent news topic in Hong Kong newspapers? The Hong Kong social movement, the 2019 coronavirus, or the China–United States trade war? I scraped a well-known Hong Kong newspaper, hoping that a good visualization method, the word cloud, would give me the answer.

What is web scraping?

Web scraping tools are used to extract information from websites. They are also known as web harvesting tools or web data extraction tools.

Why do web scraping?

Web scraping tools can be used for countless purposes in various scenarios. For example:

  1. Collect market research data
  2. Collect stock markets information
  3. Collect contact information
  4. Collect data to download for offline reading or storage
  5. Track prices in multiple markets, etc.

How to scrape in Python?

Scraping a website in Python is very easy, especially with the help of the BeautifulSoup and Selenium libraries. Beautiful Soup is a Python library that lets developers quickly parse a web page's HTML and extract useful data from it with a small amount of code, reducing development time and speeding up the scraping work. Selenium is a tool for automated testing of web pages; through the methods it provides, it can drive the browser automatically and completely simulate the actions of a real user.
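As a quick illustration, a few lines of BeautifulSoup are enough to pull headlines out of raw HTML. The markup and class names below are invented for the example, not taken from the scraped newspaper:

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML snippet standing in for a downloaded news page.
html = """
<html><body>
  <h2 class="headline"><a href="/news/1">Headline one</a></h2>
  <h2 class="headline"><a href="/news/2">Headline two</a></h2>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
headlines = [a.get_text(strip=True) for a in soup.select("h2.headline a")]
print(headlines)  # ['Headline one', 'Headline two']
```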

Before scraping a website, some preparation is needed: run the following command in a Jupyter notebook (or a terminal) to install the required libraries.
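For instance, the scraping libraries discussed above can be installed with pip (the exact set of packages is an assumption based on the text):

```shell
pip install requests beautifulsoup4 selenium
```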

What is a word cloud and why do we need it?

WordCloud is a very good third-party word cloud visualization library in Python. A word cloud is a visual display of the keywords that appear most frequently in a text. It filters out low-frequency, low-value words so that the audience can grasp the main theme of the text at a glance.

What is word segmentation?

Word segmentation is the process of dividing continuous text into a sequence of individual words according to certain rules. Jieba is one of the better Chinese word segmentation libraries; because written Chinese has no spaces between words, we need Jieba's help to split sentences into words.

Before generating a word cloud image, some preparation is needed: run the following command in a Jupyter notebook (or a terminal) to install the required libraries.
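For instance, the word cloud and segmentation libraries can be installed with pip (matplotlib is assumed here for displaying the image inside a notebook):

```shell
pip install wordcloud jieba matplotlib
```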

# Load the libraries

# Clean the CSS, JavaScript and HTML tag
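One way to strip CSS, JavaScript, and HTML tags is to delete `<script>`/`<style>` nodes with BeautifulSoup and keep only the visible text. This is a sketch of the idea, not the author's exact code:

```python
from bs4 import BeautifulSoup

def clean_html(html: str) -> str:
    """Return the visible text of a page, with scripts and styles removed."""
    soup = BeautifulSoup(html, "html.parser")
    for node in soup(["script", "style"]):
        node.decompose()  # drop CSS and JavaScript blocks entirely
    # Collapse the runs of whitespace left behind by the removed tags.
    return " ".join(soup.get_text(separator=" ").split())

print(clean_html("<p>Hi</p><script>var x=1;</script><style>p{}</style>"))  # Hi
```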

# Find the main focus news link of the day
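Locating the day's top story usually comes down to a CSS selector on the front page. The `main-focus` class name below is an assumption for illustration; the real site's markup would dictate the selector:

```python
from typing import Optional
from bs4 import BeautifulSoup

def find_focus_link(html: str, base_url: str = "") -> Optional[str]:
    """Return the URL of the first 'main focus' headline, if any."""
    soup = BeautifulSoup(html, "html.parser")
    link = soup.select_one("div.main-focus a")  # selector is hypothetical
    return base_url + link["href"] if link else None

sample = '<div class="main-focus"><a href="/news/today">Top story</a></div>'
print(find_focus_link(sample, "https://example.com"))  # https://example.com/news/today
```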

# Data Collection

# Draw the word cloud

# Main logic

# Future Improvement

  1. Add stop-word processing (STOPWORDS)
  2. Try a different word segmentation library, such as thulac, FoolNLTK, HanLP, nlpir, or ltp

Thanks for reading! If you enjoyed the post, please show your support by applauding via the clap (👏🏼) button below or by sharing this article so others can find it.

In the end, I hope you have learned some scraping techniques. You can also find the full project in the GitHub repository.



Data scientist; cycling and gaming enthusiast. Focused on using data science to answer questions.