playgrdstar
Published in quaintitative
1 min read, Jul 29, 2018

RSS Ingestion

If you don’t know what an RSS feed is (where have you been?), it’s a link you can subscribe to in order to get data in a structured format. Typically, we use RSS readers to subscribe to RSS feeds.

Getting data from RSS feeds can also be done through Python, and it’s a breeze (compared with scraping HTML), as RSS feeds are typically very structured.
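To make "very structured" a bit more concrete, here is a minimal sketch of what an RSS 2.0 feed looks like and how little code it takes to pull out the post titles. The sample feed, its URLs and the variable names are made up for illustration, and the sketch uses Python's built-in xml.etree.ElementTree as a stand-in rather than the libraries used in the rest of this post.

```python
import xml.etree.ElementTree as ET

# A made-up RSS 2.0 document, for illustration only.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <link>http://example.com/</link>
    <description>A made-up feed</description>
    <item>
      <title>First post</title>
      <link>http://example.com/first</link>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/second</link>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(SAMPLE_RSS)

# Every RSS 2.0 feed has one <channel>, which holds the <item> elements (the posts).
channel = root.find("channel")
item_titles = [item.findtext("title") for item in channel.findall("item")]

print(channel.findtext("title"))
print(item_titles)
```

Because every post sits in the same predictable place, there is no need to hunt through page layout markup the way you would when scraping a web page.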

We first import the following libraries:

  • Beautiful Soup (bs4), to help us tease the good stuff out from HTML and XML
  • feedparser, to help us fetch and parse RSS feeds

import bs4
import feedparser

Next, specify the RSS feed you would like to get data from, say the Kaggle blog’s RSS feed.

feeds = ['http://blog.kaggle.com/feed/']

Now, we fetch and parse the feed from this source.

parsed = feedparser.parse(feeds[0])

Next, we break it up to get the individual posts, which feedparser exposes as entries.

posts = parsed.entries

We can examine the first post.

posts[0]

We can also get the title of the first post.

posts[0].title

As usual, we use the BeautifulSoup library to get easy access to the items in the post’s HTML content by their tags.

html = posts[0].content[0].get('value')
soup = bs4.BeautifulSoup(html, 'html5lib')

Now, we can easily access the items in the post by their tags, using the helper functions in the BeautifulSoup library.

soup.find_all('h1')
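find_all returns the matching tags themselves, so a typical next step is to pull out their text or attributes. Here is a minimal sketch on a made-up HTML snippet; it uses Python's built-in 'html.parser' instead of 'html5lib' only so that it runs without the extra html5lib package installed.

```python
import bs4

# A made-up HTML fragment standing in for the content of a post.
html = "<h1>Main heading</h1><p>Some <a href='http://example.com/a'>linked</a> text.</p>"
soup = bs4.BeautifulSoup(html, "html.parser")

# get_text() strips the tags and keeps only the text inside them.
headings = [h.get_text(strip=True) for h in soup.find_all("h1")]

# href=True keeps only the <a> tags that actually have an href attribute.
links = [a["href"] for a in soup.find_all("a", href=True)]

print(headings)
print(links)
```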

The Jupyter notebook with the code is here


ming // gary ang // illustration, coding, writing // portfolio >> playgrd.com | writings >> quaintitative.com