How to build an RSS feed by scraping with Python

Samuel Viana
Apr 15, 2022


RSS feeds, once used to deliver new and updated content directly to readers without requiring them to visit the hosting website, have been discontinued by the majority of today's news sites. If, like me, you use a feed reader such as Feedly or The Old Reader and you don't have programming knowledge, you'll have to rely on a third-party generator like rss.app .

But you can build an RSS feed yourself with very little programming knowledge. For this tutorial I'll be using Python (version 3, since the 2.x versions are no longer supported), one of the easiest languages and one with a great power-to-effort ratio.

I'm a computer programmer and I like to stay up to date with one major IT book publisher, O'Reilly, which used to provide an RSS feed but, like many others, gave up on hosting it. As the example I'm going to show you demonstrates, it's easy to produce an RSS feed of your own.

The Latest Books, Reports, Videos, and Audiobooks — O’Reilly Media (oreilly.com)

Besides Python, I'll be using the browser's Developer Tools, which are present in every modern browser, from Chromium-based ones like Chrome, Edge, and Vivaldi to Safari and Firefox. Right-click anywhere on any webpage to bring up the context menu; at the bottom of it you'll find 'Inspect', and clicking it opens the web developer panel with the Elements tab active (or, using the keyboard, press <Ctrl>+<Shift>+<i>).

So, let's go to the O'Reilly Recently Published webpage and inspect the grid row of one book:

You'll see an <a> element with the class book on it; this is what we want.

If we expand this HTML node, we'll see more structure, something like:
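The markup looks roughly like the sketch below. The class names (book, searchList-title, searchList-name, searchList-description, searchList-cover) are the ones our code will rely on, but the tag names, URLs, and nesting here are only an approximation; check the live page in your own DevTools, since O'Reilly may have changed the structure:

<!-- approximate structure, for illustration only -->
<a class="book" href="/library/view/some-book/0000000000000/">
  <div class="searchList-cover">
    <img src="https://learning.oreilly.com/covers/some-book.jpg" />
  </div>
  <div class="searchList-title">Some Book Title</div>
  <div class="searchList-name">Author Name</div>
  <div class="searchList-description">A short blurb about the book.</div>
</a>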

Now we have a title, an author, and a description. That gives us something to start working with.

Scraping is like opening a page, but instead of rendering it in a browser, we load it into a memory buffer, where we can perform operations like searching the content and slicing out the parts we want.

To fetch the webpage content, we'll need the requests package, which is not part of Python's standard library, so we have to install it:

pip install requests

To grab the page, do:

import requests

r = requests.get('https://www.oreilly.com/products/books-videos.html')
html = r.text
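As a side note, some sites respond differently to scripted clients, so it can help to set a User-Agent header and fail fast on HTTP errors. A minimal defensive variant (the header value is just an example) might look like:

import requests

headers = {'User-Agent': 'Mozilla/5.0 (compatible; my-rss-builder)'}
r = requests.get('https://www.oreilly.com/products/books-videos.html',
                 headers=headers, timeout=10)
r.raise_for_status()  # raise an exception on 4xx/5xx responses
html = r.text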

Now the html variable contains the webpage contents as a string. Our next step is to walk over the content and extract just the bits of information we want. For that we'll need the BeautifulSoup library, so install it:

pip install beautifulsoup4

which installs version 4 of the library, the one that supports Python 3 (the older BeautifulSoup package on PyPI is Python 2 only).

Now we'll instantiate the parser using the html string:

from bs4 import BeautifulSoup as bs

parser = bs(html, 'html.parser')

The 'html.parser' literal specifies that we will use Python's built-in HTML parser; we could use another parser, for instance if we were parsing XML instead of HTML. See the BeautifulSoup docs for more details.
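For example, if the faster lxml library is installed (pip install lxml), BeautifulSoup can use it instead; this is just an illustration of swapping parsers, not something this tutorial needs:

from bs4 import BeautifulSoup as bs

parser = bs(html, 'lxml')  # faster HTML parsing, backed by lxml
# For XML documents, the 'xml' feature (also provided by lxml) is used instead:
# tree = bs(some_xml_string, 'xml')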

And next we’ll do:

books = parser.find_all(class_='book')

which extracts the list of book elements from the whole page. Our next step is to iterate over this list, grabbing the bits we want from each book:

for book in books:
    link = 'https://oreilly.com' + book['href']
    title = book.find(class_='searchList-title').string
    author = book.find(class_='searchList-name').string
    description = book.find(class_='searchList-description').string
    image = book.find(class_='searchList-cover').find('img')['src']

The BeautifulSoup library makes extracting the data we need very easy. To get an HTML element's attribute we do:

element['attribute_name']

and to get the text inside an element:

element.string

This way it's extremely easy to fetch what we want, searching for elements inside other elements with the find method, using the class names that describe the content, or the element names.
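To make the two access patterns concrete, here is a self-contained toy example (the markup is invented for illustration):

from bs4 import BeautifulSoup as bs

snippet = '<a class="book" href="/view/42"><span class="searchList-title">Learning Python</span></a>'
node = bs(snippet, 'html.parser').find(class_='book')

print(node['href'])                                  # attribute access: /view/42
print(node.find(class_='searchList-title').string)   # text access: Learning Python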

Now that we have just the data we need, we can start to build the RSS. For that, we'll be using the rfeed library:

pip install rfeed

The header of the feed is constructed this way (note that in the final script this runs after the items_ list below has been filled):

import rfeed

OREILLY_URL = 'https://www.oreilly.com/products/books-videos.html'

feed = rfeed.Feed(title="O'Reilly Last Books",
                  description="The last books by O'Reilly",
                  language='en-US',
                  items=items_,   # the list of rfeed.Item objects built below
                  link=OREILLY_URL)

The items_ variable collects the metadata for each book. Initialize it as an empty list before the loop, and build one item per book inside the loop we wrote earlier:

items_ = []
for book in books:
    # ... the link/title/author/description/image extraction shown above ...
    item = rfeed.Item(
        title=title,
        link=link,
        description=description,
        author=author,
        guid=rfeed.Guid(link),
        enclosure=rfeed.Enclosure(url=image, type='image/jpeg', length=0)
    )
    items_.append(item)

And at the end, we just have to do this to obtain the XML:

rss = feed.rss()
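feed.rss() returns the RSS document as a string, so a simple way to try it out locally (the filename is arbitrary) is to write it to a file and point your feed reader at it:

# Write the generated XML to disk for a quick local test
with open('oreilly.xml', 'w', encoding='utf-8') as f:
    f.write(rss)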

The next step would be to deploy the code to a server, using mod_python or plain CGI, generating the RSS on the fly.
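As a rough illustration of the CGI route (paths and server setup vary, so treat this as a sketch), the script would print an HTTP content-type header followed by the XML:

#!/usr/bin/env python3
# Minimal CGI sketch: the server runs this script and sends its
# stdout to the client; everything above builds the `rss` string.
print('Content-Type: application/rss+xml; charset=utf-8')
print()  # blank line ends the HTTP headers
print(rss)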

In a future article, I'll explain how to extract information from heavily JavaScript-driven pages, where the content is dynamic and does not come in the initial HTML document sent by the server.

Until then, see ya!
