How to scrape a blog and collect its articles in Python

6 min read · May 5, 2022

An easy and efficient paradigm for creating a corpus from online blog articles

Creating a dataset is one of the first and most important phases of a data analytics or machine learning project. In one of my earlier Medium articles I explained why building a dataset from scratch matters; here I share a simple and effective method for populating a corpus of textual data using online blogs as the data source.

I wrote this piece because text is one of the most prevalent formats of data online, and being able to draw on this pool of data is a valuable skill. In this article I will share how.

It’s worth mentioning that this method is not invasive (as long as you do not modify the code, of course). Remember to always scrape responsibly. If a website states in its terms of service that it doesn’t want to be scraped, then don’t scrape it.

Furthermore, this method is suitable for retrieving articles whose content is present in the page’s HTML. If the content is generated via JavaScript, this process won’t help you; in that case you have to emulate a browser with a tool like Playwright or Selenium.
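To make the distinction between static and JavaScript-generated content concrete, here is a minimal sketch that uses only the Python standard library. It is not the code from this article: the names `ParagraphExtractor`, `extract_paragraphs`, `fetch_html`, `STATIC_PAGE` and `JS_PAGE` are hypothetical, and the two sample pages are made up for illustration. The idea is simply that if the raw HTML already contains the article's paragraphs, a plain HTTP request is enough; if it contains none, the content is probably rendered by JavaScript.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class ParagraphExtractor(HTMLParser):
    """Collects the text found inside <p> tags of a raw HTML document."""

    def __init__(self):
        super().__init__()
        self._in_p = False      # True while the parser is inside a <p> element
        self._buffer = []       # text fragments of the current paragraph
        self.paragraphs = []    # finished paragraphs, in document order

    def _flush(self):
        # Join the fragments, normalise whitespace, keep non-empty text only.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()       # also covers a previous <p> left unclosed
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buffer.append(data)


def extract_paragraphs(html):
    """Return the list of paragraph texts present in the raw HTML."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()             # keep a trailing paragraph that was never closed
    return parser.paragraphs


def fetch_html(url):
    """Download a page as text (hypothetical helper; not called in this sketch)."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Made-up sample pages: one with the article in the HTML, one rendered by JavaScript.
STATIC_PAGE = (
    "<html><body><article><h1>Title</h1>"
    "<p>First <em>paragraph</em>.</p><p>Second paragraph.</p>"
    "</article></body></html>"
)
JS_PAGE = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'

print(extract_paragraphs(STATIC_PAGE))
print(extract_paragraphs(JS_PAGE))
```

On a real blog you would pass the result of `fetch_html(url)` to `extract_paragraphs`. An empty result for a page that clearly shows text in the browser is a hint that you need Playwright or Selenium instead.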

Let’s get started.

How does the software…

Written by Andrea D'Agostino