Scrape and Feed with Python

How do you scrape information from a page and serve it as an RSS feed?

Enrico Bergamini
mai più senza

--

This Python tutorial explains how to scrape the information published on a webpage and transform it into a clean RSS feed.

You can find a Bash version of the same program, written by my friend Andrea Borruso, here.

Let’s start from this trial page I’ve built. The goal is to turn all its paragraphs, titles and dates into a valid RSS feed.
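One detail worth keeping in mind for the dates: RSS 2.0 expects them in the RFC 822 style (for example inside `pubDate`), so whatever the page shows usually needs converting. Here is a minimal sketch using only the standard library. It assumes the page prints dates as `DD/MM/YYYY` with no time zone; both that format and the helper name `to_rss_date` are assumptions for illustration, so adapt them to the real page.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def to_rss_date(raw, fmt="%d/%m/%Y"):
    """Convert a scraped date string into the RFC 822 form used by RSS."""
    # Assumption: the page gives no time zone, so the date is treated as UTC.
    parsed = datetime.strptime(raw.strip(), fmt).replace(tzinfo=timezone.utc)
    return format_datetime(parsed)


print(to_rss_date("01/03/2017"))
```

If the page uses another layout, only the `fmt` argument should need to change.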

In order to do so I will use two libraries: lxml for the scraping and Yattag for generating the XML code of the feed.
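To give an idea of the scraping half, here is a small sketch with lxml. It runs on an inline sample instead of the real trial page, and the markup (one `div class="post"` per entry with an `h2`, a `span class="date"` and a `p`) is purely an assumption, as is the helper name `scrape_items`; the XPath expressions would have to match the actual page.

```python
from lxml import html

# Stand-in for the downloaded page; the real markup will differ.
SAMPLE_PAGE = """
<html><body>
  <div class="post">
    <h2>First title</h2>
    <span class="date">01/03/2017</span>
    <p>First paragraph.</p>
  </div>
  <div class="post">
    <h2>Second title</h2>
    <span class="date">02/03/2017</span>
    <p>Second paragraph.</p>
  </div>
</body></html>
"""


def scrape_items(page_source):
    """Return one dict per post block found in the HTML."""
    tree = html.fromstring(page_source)
    items = []
    # Hypothetical selectors: one <div class="post"> per entry.
    for post in tree.xpath('//div[@class="post"]'):
        items.append({
            # string(...) returns the text content of the first matching node.
            "title": post.xpath("string(./h2)").strip(),
            "date": post.xpath('string(./span[@class="date"])').strip(),
            "description": post.xpath("string(./p)").strip(),
        })
    return items


items = scrape_items(SAMPLE_PAGE)
print(items)
```

For a live page you would first download the HTML (for instance with `urllib.request.urlopen(url).read()`) and pass the result to `scrape_items`.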

Here’s the full code; I walk through it in the comments. You can also find it in a GitHub gist.

If you want to check the webpage periodically for changes in the HTML and update the feed accordingly, on Linux all you need is cron. This utility works as a time-based scheduler; you can edit its schedule by opening your terminal and typing:

crontab -e

In the editor you can add a single line that makes your machine run the Python script on a given schedule:

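0 */2 * * * /var/projects/my_scrape_and_feed_code.py

This, for example, means “At minute 0 past every 2nd hour.” A crontab opened with crontab -e belongs to the current user, so its lines have no user column; a user name such as root between the schedule and the command is only needed in the system-wide /etc/crontab. For cron to run the script directly, the file must also be executable and start with a shebang line pointing to the Python interpreter. You might find crontab.guru very helpful for learning how to set up crontab properly, as Andrea suggests in his post.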
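Since cron will re-run the script every two hours whether or not the page changed, the script can simply skip rewriting the feed file when nothing is new. A minimal sketch with the standard library follows; `write_if_changed` is a hypothetical helper name, and the demo writes to a throwaway directory instead of a real feed path.

```python
import tempfile
from pathlib import Path


def write_if_changed(path, content):
    """Write content to path only when it differs; return True if written."""
    target = Path(path)
    new_bytes = content.encode("utf-8")
    if target.exists() and target.read_bytes() == new_bytes:
        return False  # nothing new since the last cron run
    target.write_bytes(new_bytes)
    return True


# Demo in a throwaway directory; on a server this would be the public feed path.
demo_path = Path(tempfile.mkdtemp()) / "feed.xml"
first = write_if_changed(demo_path, "<rss/>")
second = write_if_changed(demo_path, "<rss/>")
print(first, second)
```

The return value can also be used to log when the feed was actually updated.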

If you have suggestions or comments, feel free to leave them here or contact me or Andrea!

--

Enrico Bergamini
mai più senza

From Ferrara, living in Brussels, proudly Emilian. I love the Internet in all its forms. I write every now and then. I do things with data.