Extract RSS News Feeds using Python and Google Cloud Services

An Introduction to web scraping and to Serverless Cloud services.

Izzat Demnati
Analytics Vidhya

--

Photo by Roman Kraft on Unsplash

The purpose of this article is to present a systematic approach to reading an RSS news feed and processing its content in order to web scrape news articles. The challenge is to extract the text of articles published on different websites without any strong assumptions about the web page structure.

The overall solution is described in three steps:

  1. A message is published in Cloud Pub/Sub with a URL to a news RSS feed,
  2. A first Cloud Function is triggered by the previous message. It extracts each article within the RSS feed, stores it in Cloud Storage and publishes a message for each article in Cloud Pub/Sub for further usage,
  3. A second Cloud Function is triggered by the previous messages. It web scrapes the article page, stores the resulting text in Cloud Storage and publishes a message in Cloud Pub/Sub for further usage.
News RSS feed web scraping — Solution diagram

RSS Feed data processing

Let's say we would like to listen to an RSS news feed and loop over each link to extract the web page articles.

Fortunately, an RSS feed is written in XML and must conform to a specification, so dealing with one or many RSS feeds should be the same. Here is an example of the Google News RSS feed when I searched for articles related to “Apple”.

RSS Feed example — XML file

As we see, each item is an article described by a title, a link to the original web page, a publication date and a description summary. Let's see how to use Python to run a search and how to loop through each item to gather the article links.

First we need to install the feedparser library to parse RSS feed XML files:

pip install feedparser

We also need to install pandas so we can use pandas.io.json.json_normalize to flatten the JSON data into a pandas data table: pip install pandas

The output of the following code is a set of URLs that will be used in the next section to extract articles using a web scraping library.

RSS Feed data parser — Python code
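
Here is a minimal sketch of such a parser, assuming we keep only the title, link, publication date and summary of each item (pd.json_normalize is the current home of json_normalize):

import feedparser
import pandas as pd

def parse_rss_feed(url):
    """Parse an RSS feed and return its items as a flat pandas DataFrame."""
    feed = feedparser.parse(url)
    # Each feed entry exposes at least a title, a link, a published date and a summary.
    items = pd.json_normalize(feed.entries)
    return items[["title", "link", "published", "summary"]]

# Example: search Google News for articles related to "apple".
articles = parse_rss_feed("https://news.google.com/rss/search?q=apple")
print(articles["link"].tolist())  # the URLs we will web scrape in the next section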

Web scraping news articles

The objective is to extract articles from news websites without a strong dependency on any particular web page structure.

If we look at the page structure of news websites, we notice that the main article is surrounded by noise such as images, ads and links that we should ignore during the extraction process.

Here is an example of an article from the Guardian website:

Guardian example — Top Page
Guardian example — Bottom Page

If we look at the page source below, we notice that the <article> HTML element is used to self-contain the main text, which is split into multiple <p> paragraphs. This structure is common on news websites. However, there are some exceptions, so we don't want to make strong assumptions about the page structure if we want to be able to parse any news web page.

The only assumption we will work with is that an article is a concatenation of one or many paragraph blocks, and that each web page contains one main article that we want to extract.

Guardian example — source code

Here is another example, from the Reuters website, where the <article> element is not used to self-contain the article but the previous assumption still holds:

The article of interest is a concatenation of paragraph blocks and each web page contains one main article.

We notice from the page source that, instead of using the <article> element, the <div class="ArticlePage_container"> element is used to group the paragraph blocks of the main article.

Reuters example — source code

So, we might consider extracting all paragraph elements within a page and concatenating them into a single article. However, not all paragraph blocks in a web page are related to the main article, so we need a way to concatenate only the paragraphs of interest.

The proposed solution is easy to implement, not perfect, but worked well when tested on different websites:

  1. extract all paragraph elements inside the page body,
  2. for each paragraph, construct its parent elements hierarchy,
  3. concatenate paragraphs under the same parent hierarchy,
  4. select the longest paragraph as the main article.

The assumption is that the main article should account for most of the page's content, while the other web page components, like ads, images, links and promoted article summaries, should each be individually marginal.

To implement this solution, let's install some useful libraries:

  • The requests library will be used to request HTTP URLs: pip install requests
  • The beautifulsoup4 library will be used to extract data from HTML content: pip install beautifulsoup4
  • The pandas library will be used to collect articles in DataFrame format: pip install pandas

Here is a Python implementation that processes one article:

web scraping — Python code
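
As a sketch of those four steps, assuming requests and BeautifulSoup (the function and variable names below are illustrative):

import requests
from bs4 import BeautifulSoup

def scrape_article(url):
    # Extract the main article: collect every <p> inside <body>, group paragraphs
    # that share the same chain of parent elements, then keep the longest group.
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")
    body = soup.body or soup

    grouped = {}
    for p in body.find_all("p"):
        text = p.get_text(" ", strip=True)
        if not text:
            continue
        # The parent hierarchy is identified by the chain of enclosing elements.
        key = tuple(id(ancestor) for ancestor in p.parents)
        grouped[key] = (grouped.get(key, "") + " " + text).strip()

    # Assumption: the main article is the largest block of paragraph text on the page.
    return max(grouped.values(), key=len) if grouped else ""

Each link collected from the RSS feed can then be passed to scrape_article() and the resulting text stored, for example, as a row of a pandas DataFrame.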

So now we can loop over each item within the RSS Feed result to extract all articles.

Automation using Google Cloud Platform

In this section we will use Google Cloud Platform (GCP) to build a scalable solution for searching any news related to a keyword, extracting the news articles and storing them for future processing:

  1. Cloud Pub/Sub will be used to build an event-driven solution to start a search-and-extract process,
  2. Cloud Function will be used to make our Python code available as a service,
  3. Cloud Storage will be used to persist the resulting articles as JSON files.

You can start using GCP for free with your Gmail account. You get $300 in free credits that you can use for 12 months.

Cloud Pub/Sub

To start using Cloud Pub/Sub, we should create a topic. A topic is a resource that groups related messages exchanged between publishers and subscribers.

In our case, we will publish messages to three topics:

  1. news_search_to_process: messages published in this topic will trigger a Cloud Function to process an RSS feed,
  2. news_article_to_process: messages published in this topic will trigger a Cloud Function to web scrape a news web page,
  3. news_article_processed: messages published in this topic will notify subscribers for further processing.

We can use the Google Cloud console to create a topic, or we can use the command line with the Google Cloud SDK, which we can install and configure in our local environment:

gcloud pubsub topics create news_search_to_process
gcloud pubsub topics create news_article_to_process
gcloud pubsub topics create news_article_processed
Cloud Pub/Sub — Create a topic

We also need to define a message format so we can run the news search for different RSS feeds or different keywords.

To keep the example simple, let's publish a simple URL as a message attribute that a Cloud Function will use to start the process:

url = 'https://news.google.com/rss/search?q=apple'

We can use the console to manually publish a message to a topic:

Cloud Pub/Sub — Topic details
Cloud Pub/Sub — Publish message to a topic
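
The same message can also be published programmatically with the google-cloud-pubsub client; here is a minimal sketch, where the project ID is a placeholder:

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
# "my-gcp-project" is a placeholder; replace it with your own project ID.
topic_path = publisher.topic_path("my-gcp-project", "news_search_to_process")

# The RSS feed URL travels as a message attribute; the payload itself can stay empty.
future = publisher.publish(topic_path, data=b"", url="https://news.google.com/rss/search?q=apple")
print(future.result())  # message ID, returned once the publish is acknowledged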

Cloud Function

In order to deploy a Cloud Function for our solution, we need to create these two files:

  1. a main.py file that contains our Python functions,
  2. a requirements.txt file that lists the Python libraries we need to install. For our purposes, it will contain the following libraries:
google-cloud-storage
google-cloud-pubsub
pandas
feedparser
bs4

The main.py file contains:

  1. process_rss_feed_search(): This Cloud Function is triggered when we publish a message to the Cloud Pub/Sub topic news_search_to_process (a simplified sketch is shown after this list).
  2. process_rss_feed_article(): This Cloud Function is triggered when we publish a message to the Cloud Pub/Sub topic news_article_to_process.
  3. news_feed_parser: This class is used to parse RSS news feed URLs and to web scrape news web pages.
  4. gcp_cloud_util: This class is used to connect to Google Cloud Platform and to encapsulate the GCP services used here, namely Cloud Storage and Cloud Pub/Sub.
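
To give an idea of the structure, here is a stripped-down sketch of the first entry point, wired directly to the libraries instead of the helper classes above; the project ID and bucket name are placeholders, not the actual code behind the solution:

import json

import feedparser
from google.cloud import pubsub_v1, storage

PROJECT_ID = "my-gcp-project"   # placeholder
BUCKET_NAME = "news-articles"   # placeholder

def process_rss_feed_search(event, context):
    # Background Cloud Function triggered by a message on news_search_to_process.
    url = event["attributes"]["url"]   # the RSS feed URL published earlier
    feed = feedparser.parse(url)

    bucket = storage.Client().bucket(BUCKET_NAME)
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(PROJECT_ID, "news_article_to_process")

    for i, entry in enumerate(feed.entries):
        item = {"title": entry.title, "link": entry.link,
                "published": entry.published, "summary": entry.summary}
        # Persist the raw item and hand the article link over to the next function.
        bucket.blob(f"rss_items/item_{i}.json").upload_from_string(json.dumps(item))
        publisher.publish(topic_path, data=b"", url=entry.link)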

Now we need to deploy the two Cloud Functions, triggered by messages published to the Cloud Pub/Sub topics. We can do this with the Google Cloud SDK command line:

gcloud functions deploy process_rss_feed_search --runtime python37 --trigger-topic news_search_to_process
gcloud functions deploy process_rss_feed_article --runtime python37 --trigger-topic news_article_to_process

We are now ready to test our solution by publishing a message to the Cloud Pub/Sub topic news_search_to_process:

Message published to topic: news_search_to_process

Cloud Function process_rss_feed_search() ran successfully, and the 100 articles extracted from the Google News RSS feed were processed:

Cloud Function process_rss_feed_search() logs

Cloud Function process_rss_feed_article() ran successfully 100 times to process the published articles:

Cloud Function process_rss_feed_article() logs

100 news articles were extracted from the web pages and JSON files were created in Cloud Storage:

Cloud Storage — News articles Json files

Here is an example of a resulting JSON file:

News article — Json file example

I hope you enjoyed this introduction to web scraping and to serverless cloud services. In a future post, I will present a solution for extracting entities from news articles using the spaCy Python package, and for securely exposing a web service using Cloud Endpoints.
