Extract RSS News Feeds using Python and Google Cloud Services

An Introduction to web scraping and to Serverless Cloud services.

Izzat Demnati
Aug 29, 2019 · 7 min read
Photo by Roman Kraft on Unsplash

The purpose of this article is to present a systematic approach to reading an RSS news feed and processing its content to web scrape news articles. The challenge is to extract the text of articles published on different websites without any strong premise about a web page's structure.

The overall solution is described in three steps :

  1. A message is published to Cloud Pub/Sub with the URL of a news RSS feed,
  2. a Cloud Function is triggered: it parses the RSS feed and publishes one message per article link to a second topic,
  3. a second Cloud Function is triggered for each article link: it web scrapes the article and stores the result in Cloud Storage.

News RSS feed web scraping — Solution diagram

RSS Feed data processing

Let's say we would like to listen to an RSS news feed and loop over each link to extract the web page articles.

Fortunately, RSS feeds are written in XML and must conform to a specification, so dealing with one or several RSS feeds should be the same. Here is an example of the Google News RSS feed when I searched for articles related to "Apple".

RSS Feed example — XML file
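Each <item> in such a feed looks roughly like this (abridged, with illustrative values):

<item>
  <title>Apple unveils new products</title>
  <link>https://news.google.com/articles/...</link>
  <pubDate>Thu, 29 Aug 2019 10:00:00 GMT</pubDate>
  <description>A short summary of the article...</description>
</item>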

As we can see, each item is an article described by a title, a link to the original web page, a publication date and a summary description. Let's see how to use Python to run a search and how to loop through each item to gather the article links.

First, we need to install the feedparser library to parse RSS feed XML files:

pip install feedparser

We also need to install pandas to use pandas.io.json.json_normalize, which flattens JSON data into a pandas data table:

pip install pandas

The output of the following code is a set of URLs that will be used in the next section to extract articles using a web scraping library.

RSS Feed data parser — Python code
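A minimal sketch of such a parser, assuming the Google News search URL used later in this article (the field names come from feedparser's entry objects):

import feedparser
from pandas.io.json import json_normalize

# Query the Google News RSS feed for articles mentioning "apple"
feed = feedparser.parse('https://news.google.com/rss/search?q=apple')

# Flatten the feed entries (title, link, published, summary) into a table
df = json_normalize(feed.entries)

# Collect the article links for the web scraping step in the next section
links = df['link'].tolist()
print(links)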

Web scraping news articles

The objective is to extract articles from news websites without a strong dependency on any particular web page structure.

If we look at a news website's page structure, we notice that the main article is surrounded by noise such as images, ads and links that we should ignore during the extraction process.

Here is an example of an article from the Guardian website:

Guardian example — Top Page
Guardian example — Bottom Page

If we look at the page source below, we notice that the <article> HTML element is used to self-contain the main text, which is split into multiple <p> paragraphs. This structure is common on news websites. However, there are exceptions, so we don't want to make strong assumptions about the page structure if we want to be able to parse any news web page.

The only assumption we will work with is that an article is a concatenation of one or more paragraph blocks, and that each web page contains one main article that we want to extract.

Guardian example — source code

Here is another example, from the Reuters website, where the <article> element is not used to self-contain the article but the previous assumption still holds:

The article of interest is a concatenation of paragraph blocks and each web page contains one main article.

We notice from the page source that, instead of the <article> element, a <div class="ArticlePage_container"> element is used to group the main article's paragraph blocks.

Reuters example — source code

So we might consider extracting all paragraph elements within a page and concatenating them into a single article. However, not all paragraph blocks on a web page are related to the main article, so we need a way to concatenate only the paragraphs of interest.

The proposed solution is easy to implement and, while not perfect, worked well when tested on different websites:

  1. extract all paragraph elements inside the page body,
  2. group the paragraphs by their parent element,
  3. keep the group with the most text and concatenate its paragraphs into a single article.

The assumption is that the main article should account for most of the web page's content, while the other page components, like ads, images, links and promoted article summaries, should each be individually marginal.

To implement this solution, let's install some useful libraries:

  • The requests library will be used to request HTTP URLs: pip install requests
  • The BeautifulSoup library will be used to parse HTML pages: pip install beautifulsoup4

Here is a Python implementation that processes one article:

web scraping — Python code
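A minimal sketch of this approach, assuming the requests and BeautifulSoup libraries installed above (the extract_article() name is illustrative):

import requests
from bs4 import BeautifulSoup

def extract_article(url):
    """Return the main article text of a news web page."""
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, 'html.parser')
    body = soup.body or soup

    # Group every <p> element by its parent element
    groups = {}
    for p in body.find_all('p'):
        groups.setdefault(id(p.parent), []).append(p.get_text(' ', strip=True))

    if not groups:
        return ''

    # Keep the group with the most text: the main article should
    # dominate the page content, everything else being marginal
    paragraphs = max(groups.values(), key=lambda ps: len(' '.join(ps)))
    return '\n'.join(paragraphs)

Grouping paragraphs by id(p.parent) keeps paragraphs that share the same container together, which covers both the Guardian <article> layout and the Reuters <div> layout shown above.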

So now we can loop over each item in the RSS feed result to extract all the articles.

Automation using Google Cloud Platform

In this section we will use Google Cloud Platform (GCP) to build a scalable solution that searches for news related to a keyword, extracts the news articles and stores them for future processing:

  1. Cloud Pub/Sub will be used to build an event-driven solution that starts the search-and-extract process,
  2. Cloud Functions will be used to run the RSS feed parsing and web scraping code,
  3. Cloud Storage will be used to store the extracted articles as JSON files.

You can start using GCP for free with your Gmail account: you get $300 of free credits that you can use over 12 months.

Cloud Pub/Sub

To start using Cloud Pub/Sub, we should create a topic. A topic is a resource that groups related messages exchanged between publishers and subscribers.

In our case, we will publish messages to three topics:

  1. news_search_to_process: messages published to this topic trigger a Cloud Function that processes an RSS feed,
  2. news_article_to_process: one message per article link found in the feed is published to this topic, to trigger the Cloud Function that web scrapes the article,
  3. news_article_processed: a message is published to this topic once an article has been extracted and stored, for any downstream processing.

We can use the Google Cloud console to create a topic, or run a command line with the Google Cloud SDK, which we can install and configure in our local environment:

gcloud pubsub topics create news_search_to_process
gcloud pubsub topics create news_article_to_process
gcloud pubsub topics create news_article_processed
Cloud Pub/Sub — Create a topic

We also need to define a message format so we can run news searches for different RSS feeds or different keywords.

To keep the example simple, let's publish a single URL as a message attribute that a Cloud Function will use to start the process:

url = 'https://news.google.com/rss/search?q=apple'
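With the Google Cloud SDK, one way to publish this message from the command line is to pass the URL as an attribute (the message body itself can stay empty):

gcloud pubsub topics publish news_search_to_process --attribute url='https://news.google.com/rss/search?q=apple'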

We can also use the console to manually publish a message to a topic:

Cloud Pub/Sub — Topic details
Cloud Pub/Sub — Publish message to a topic

Cloud Function

In order to deploy a Cloud Function for our solution, we need to create these two files:

  1. a main.py file that contains our Python functions,
  2. a requirements.txt file that lists the dependencies:
google-cloud-storage
google-cloud-pubsub
pandas
feedparser
bs4

The main.py file contains:

  1. process_rss_feed_search(): this Cloud Function is triggered when we publish a message to the Cloud Pub/Sub topic news_search_to_process; it parses the RSS feed and publishes one message per article link to news_article_to_process,
  2. process_rss_feed_article(): this Cloud Function is triggered by messages on news_article_to_process; it web scrapes the article and stores the result as a JSON file in Cloud Storage. Both functions are sketched below.
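A minimal sketch of what main.py could look like, reusing the extract_article() helper from the web scraping section; PROJECT_ID and BUCKET_NAME are placeholders for your own project and bucket:

import hashlib
import json

import feedparser
from google.cloud import pubsub_v1, storage

PROJECT_ID = 'your-project-id'   # placeholder: your GCP project id
BUCKET_NAME = 'your-bucket'      # placeholder: your Cloud Storage bucket

publisher = pubsub_v1.PublisherClient()
storage_client = storage.Client()

# extract_article() from the previous section is assumed to be defined
# in this same file, together with its requests/bs4 imports

def process_rss_feed_search(event, context):
    # Triggered by a message published to news_search_to_process
    url = event['attributes']['url']
    feed = feedparser.parse(url)
    topic = publisher.topic_path(PROJECT_ID, 'news_article_to_process')
    # Publish one message per article found in the feed
    for entry in feed.entries:
        publisher.publish(topic, b'', url=entry.link, title=entry.title)

def process_rss_feed_article(event, context):
    # Triggered by a message published to news_article_to_process
    url = event['attributes']['url']
    text = extract_article(url)
    # Store the article as a JSON file named after a hash of its URL
    blob_name = hashlib.md5(url.encode()).hexdigest() + '.json'
    document = {'url': url, 'title': event['attributes'].get('title'), 'text': text}
    storage_client.bucket(BUCKET_NAME).blob(blob_name).upload_from_string(
        json.dumps(document), content_type='application/json')
    # Signal downstream consumers that the article has been processed
    topic = publisher.topic_path(PROJECT_ID, 'news_article_processed')
    publisher.publish(topic, b'', url=url)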

Now we need to deploy the two Cloud Functions, each triggered by messages published to its Cloud Pub/Sub topic. We can do so from the command line with the Google Cloud SDK:

gcloud functions deploy process_rss_feed_search --runtime python37 --trigger-topic news_search_to_process
gcloud functions deploy process_rss_feed_article --runtime python37 --trigger-topic news_article_to_process

We are now ready to test our solution by publishing a message to the Cloud Pub/Sub topic news_search_to_process:

Message published to topic: news_search_to_process

The Cloud Function process_rss_feed_search() ran successfully, and the 100 articles extracted from the Google News RSS feed were published for processing:

Cloud Function process_rss_feed_search() logs

The Cloud Function process_rss_feed_article() ran successfully 100 times to process the published articles:

Cloud Function process_rss_feed_article() logs

100 news articles were extracted from web pages and JSON files were created in Cloud Storage:

Cloud Storage — News articles JSON files

Here is an example of a resulting JSON file:

News article — JSON file example
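Based on the main.py sketch above, such a file would look something like this (values are illustrative):

{
  "url": "https://www.theguardian.com/...",
  "title": "Apple unveils ...",
  "text": "First paragraph of the article... Second paragraph..."
}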

I hope you enjoyed this introduction to web scraping and serverless cloud services. In a future post, I will present how to extract entities from news articles using the spaCy Python package, and how to securely expose a web service using Cloud Endpoints.
