Extract RSS News Feeds using Python and Google Cloud Services
An Introduction to web scraping and to Serverless Cloud services.
The purpose of this article is to present a systematic approach to reading an RSS news feed and processing its content to web scrape news articles. The challenge is to extract text articles published on different websites without any strong assumptions about a web page's structure.
The overall solution is described in three steps:
- A message is published in Cloud Pub/Sub with a URL to a news RSS feed,
- A first Cloud Function is triggered by the previous message. It extracts each article within the RSS feed, stores it in Cloud Storage and publishes a message for each article in Cloud Pub/Sub for further usage,
- A second Cloud Function is triggered by the previous messages. It web scrapes the article page, stores the resulting text in Cloud Storage and publishes a message in Cloud Pub/Sub for further usage.
RSS Feed data processing
Let's say we would like to listen to an RSS news feed and loop over each link to extract the web page articles.
Fortunately, RSS feeds are written in XML and must conform to a specification, so dealing with one or multiple RSS feeds should be the same. Here is an example of the Google News RSS feed when I searched for articles related to “Apple”.
As we can see, each item is an article described by a title, a link to the original web page, a publication date and a description summary. Let's see how to use Python to run a search and how to loop through each item to gather the article links.
First, we need to install the feedparser library to parse RSS feed XML files:
pip install feedparser
We also need to install pandas to use pandas.io.json.json_normalize to flatten JSON data into a pandas data table:
pip install pandas
The output of the following code is a set of URLs that will be used in the next section to extract articles using a web scraping library.
Web scraping news articles
The objective is to extract articles from news website without strong dependency on any web page structure.
If we look at news websites page structure, we notice that the main article is surrounded by some noise like images, ads, links, etc. that we should ignore during the extraction process.
Here is an example of an article from the Guardian website:
Apple ends contracts for hundreds of workers hired to listen to Siri
Hundreds of Apple workers across Europe who were employed to check Siri recordings for errors have lost their jobs…
If we look at the page source below, we notice that the <article> HTML element is used to self-contain the main text, which is split into multiple paragraphs <p>. This structure is common in news websites. However, there are some exceptions, so we don't want to make strong assumptions about the page structure, in order to be able to parse any news web page.
The only assumption that we will work with is that an article is a concatenation of one or many paragraph blocks and each web page contains one main article that we want to extract.
Here is another example from the Reuters website where the <article> element is not used to self-contain the article, but the previous assumption still holds:
The article of interest is a concatenation of paragraph blocks and each web page contains one main article.
We notice from the previous page source that, instead of using the <article> element, the element <div class="ArticlePage_container"> is used to group the main article's paragraph blocks.
GlobalFoundries sues TSMC, wants U.S. import ban on some products
SHANGHAI (Reuters) - Contract chipmaker GlobalFoundries sued larger rival and Apple supplier TSMC for patent…
So, we might consider extracting all paragraph elements within a page and concatenating them into a single article. However, not all paragraph blocks in a web page are related to the main article, so we should find a way to concatenate only the paragraphs of interest.
The proposed solution is easy to implement and, while not perfect, worked well when tested over different websites:
- extract all paragraph elements inside the page body,
- for each paragraph, construct its parent elements hierarchy,
- concatenate paragraphs under the same parent hierarchy,
- select the longest paragraph as the main article.
The assumption is that the main article should represent the largest share of the web page's content, while the other web page components, like ads, images, links, promoted article summaries, etc., should individually be marginal.
To implement this solution, let's install some useful libraries:
- The requests library will be used to request HTTP URLs:
pip install requests
- The beautifulsoup4 library will be used to extract data from HTML content:
pip install beautifulsoup4
- The pandas library will be used to collect articles in DataFrame format:
pip install pandas
Here is a Python implementation to process one article:
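A sketch of such an implementation, following the four steps above. Note that two paragraphs share the same parent hierarchy exactly when they share the same parent element, so grouping by the parent's identity is equivalent and simpler:

```python
from collections import defaultdict

import requests
from bs4 import BeautifulSoup

def extract_main_article(html):
    """Return the longest concatenation of <p> blocks sharing the same parent."""
    soup = BeautifulSoup(html, "html.parser")
    body = soup.body or soup
    groups = defaultdict(list)
    # Step 1 and 2: extract all paragraphs in the body and group them by
    # their parent hierarchy (here, by the identity of the parent element).
    for p in body.find_all("p"):
        groups[id(p.parent)].append(p.get_text(" ", strip=True))
    # Step 3: concatenate the paragraphs under each parent.
    candidates = [" ".join(texts) for texts in groups.values()]
    # Step 4: the main article is assumed to be the longest group.
    return max(candidates, key=len) if candidates else ""

def scrape_article(url):
    """Download a news page and extract its main article text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return extract_main_article(response.text)
```

This heuristic ignores paragraphs scattered across sidebars, ads and promoted summaries, because each of those groups stays short compared to the article body.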
So now we can loop over each item within the RSS Feed result to extract all articles.
Automation using Google Cloud Platform
In this section we will use Google Cloud Platform (GCP) to build a scalable solution for searching any news related to a keyword, extracting the news articles and storing them for future processing:
- Cloud Pub/Sub will be used to build an event-driven solution to start a search-and-extract process,
- Cloud Functions will be used to make our Python code available as a service,
- Cloud Storage will be used to persist the resulting articles as JSON files.
You can start using GCP for free with your Gmail account: you get $300 of free credits that you can use over 12 months.
To start using Cloud Pub/Sub, we should create a topic. A topic is a resource that groups related messages exchanged between publishers and subscribers.
In our case, we will publish messages to three topics:
- news_search_to_process: messages published in this topic will trigger a Cloud Function to process an RSS feed,
- news_article_to_process: messages published in this topic will trigger a Cloud Function to web scrape a news web page,
- news_article_processed: messages published in this topic will notify subscribers for further processing.
gcloud pubsub topics create news_search_to_process
gcloud pubsub topics create news_article_to_process
gcloud pubsub topics create news_article_processed
We also need to define a message format to be able to run news searches for different RSS feeds or different keywords.
To keep the example simple, let's publish a simple URL as a message attribute that will be used by a Cloud Function to start the process:
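For example, with the gcloud CLI (the attribute key url is an assumption of this example; it must match whatever key the Cloud Function reads):

```shell
gcloud pubsub topics publish news_search_to_process \
    --attribute=url="https://news.google.com/rss/search?q=Apple"
```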
We can also use the console to manually publish a message to a topic:
In order to deploy a Cloud Function for our solution, we need to create these two files:
- a main.py file that contains our Python functions,
- a requirements.txt file that lists the Python libraries we need to install. For our purposes, it will contain the following libraries:
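Based on the libraries used in the previous sections plus the GCP client libraries, requirements.txt would look something like this (pinned versions omitted):

```
feedparser
pandas
requests
beautifulsoup4
google-cloud-storage
google-cloud-pubsub
```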
The main.py file contains:
- process_rss_feed_search(): This Cloud Function will be triggered when we publish a message to the Cloud Pub/Sub topic news_search_to_process.
- process_rss_feed_article(): This Cloud Function will be triggered when we publish a message to the Cloud Pub/Sub topic news_article_to_process.
- news_feed_parser: This class is used to parse RSS news feed URLs and to web scrape news web pages.
- gcp_cloud_util: This class is used to connect to Google Cloud Platform and to encapsulate GCP services, Cloud Storage and Cloud Pub/Sub in this case.
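A skeleton of main.py might look like the following. Only the Pub/Sub event handling is implemented here; the news_feed_parser and gcp_cloud_util steps are sketched as comments, and the url attribute key is an assumption of this example:

```python
import base64

def get_url_attribute(event):
    """Read the URL from a Pub/Sub trigger event.

    Background Cloud Functions receive the Pub/Sub message as a dict with
    base64-encoded 'data' and a plain 'attributes' mapping; here the URL
    is expected as an attribute, with the message body as a fallback.
    """
    attributes = event.get("attributes") or {}
    if "url" in attributes:
        return attributes["url"]
    data = event.get("data")
    return base64.b64decode(data).decode("utf-8") if data else None

def process_rss_feed_search(event, context):
    """Triggered by messages on the news_search_to_process topic."""
    url = get_url_attribute(event)
    # 1. news_feed_parser: parse the RSS feed at `url` into articles.
    # 2. gcp_cloud_util: store each article in Cloud Storage and publish
    #    one message per article to the news_article_to_process topic.

def process_rss_feed_article(event, context):
    """Triggered by messages on the news_article_to_process topic."""
    url = get_url_attribute(event)
    # 1. news_feed_parser: web scrape the article page at `url`.
    # 2. gcp_cloud_util: store the resulting text in Cloud Storage and
    #    publish a notification to the news_article_processed topic.
```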
Now we need to deploy two Cloud Functions triggered by messages published to the Cloud Pub/Sub topics. We can deploy them from the command line using the Google Cloud SDK:
gcloud functions deploy process_rss_feed_search --runtime python37 --trigger-topic news_search_to_process
gcloud functions deploy process_rss_feed_article --runtime python37 --trigger-topic news_article_to_process
We are now ready to test our solution by publishing a message to the Cloud Pub/Sub topic news_search_to_process:
The Cloud Function process_rss_feed_search() ran successfully and the 100 articles extracted from the Google News RSS feed were processed:
The Cloud Function process_rss_feed_article() ran successfully 100 times to process the published articles:
100 news articles were extracted from web pages and JSON files were created in Cloud Storage:
Here is an example of a resulting JSON file:
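The exact fields depend on the implementation, but each file pairs the RSS item metadata with the scraped text, along these lines (illustrative values taken from the Guardian example above):

```
{
  "title": "Apple ends contracts for hundreds of workers hired to listen to Siri",
  "link": "https://www.theguardian.com/...",
  "published": "Wed, 28 Aug 2019 16:08:00 GMT",
  "article": "Hundreds of Apple workers across Europe who were employed to check Siri recordings for errors have lost their jobs..."
}
```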
I hope you enjoyed this introduction to web scraping and to serverless cloud services. In a future post, I will present a solution for extracting entities from news articles using the spaCy Python package, and for securely exposing a web service using Cloud Endpoints.