How to scrape a blog and collect its articles in Python

Andrea D'Agostino
6 min read · May 5, 2022

An easy and efficient paradigm for creating a corpus from online blog articles

Photo by Hal Gatewood on Unsplash

Creating a dataset is one of the first and most important phases of a data analytics or machine learning project. I have already explained the importance of creating a dataset from scratch in one of my Medium articles; here I focus on sharing a simple and effective method for populating a corpus of textual data using online blogs as a data source.

I decided to write this piece because text is one of the most prevalent data formats online, and being able to draw on this data pool is a valuable skill. In this article I will share how.

It’s worth mentioning that this method is not invasive (as long as you do not modify the code, of course). Remember to always scrape responsibly. If a website states in its terms of service that it doesn’t want to be scraped, then don’t.
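Terms of service have to be read by a human, but many sites also publish a machine-readable robots.txt file that a scraper can check before fetching a page. Here is a minimal sketch of such a check using Python's standard library. The function name `is_allowed`, the sample rules and the example.com URLs are my own illustrative assumptions, and a robots.txt check complements reading the terms of service rather than replacing it.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network call is made here
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules, for illustration only
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(EXAMPLE_ROBOTS, "https://example.com/blog/post-1"))
print(is_allowed(EXAMPLE_ROBOTS, "https://example.com/private/draft"))
```

In a real run you would download the site's own robots.txt first (for example with the parser's `set_url()` and `read()` methods) instead of passing a hard-coded string.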

Furthermore, this method is suitable for retrieving articles whose text is present in the page's HTML. If the content is generated via JavaScript, this process won’t help you. In that case you have to drive a real browser with a tool like Playwright or Selenium.
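A quick way to tell which case applies is to look at the raw HTML the server returns and see whether a sentence you can read in the browser is actually there. The sketch below does this with the standard library only. The helper names (`visible_text`, `is_in_static_html`) and the two sample pages are hypothetical and exist just for illustration.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a script or style block
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    """Return the text a reader would see, joined by single spaces."""
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(extractor.chunks)


def is_in_static_html(html: str, snippet: str) -> bool:
    """True if snippet appears in the page text without running any JavaScript."""
    return snippet.lower() in visible_text(html).lower()


# Two made-up pages: one server-rendered, one filled in by JavaScript
STATIC_PAGE = "<html><body><article><p>Hello from the server.</p></article></body></html>"
JS_PAGE = '<html><body><div id="root"></div><script>render("Hello from the server.")</script></body></html>'

print(is_in_static_html(STATIC_PAGE, "Hello from the server"))
print(is_in_static_html(JS_PAGE, "Hello from the server"))
```

If the check fails on the HTML you downloaded for a page whose text you can clearly see in the browser, the content is most likely rendered client-side and you will need a browser automation tool.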

Let’s get started.

How does the software…


Andrea D'Agostino

Data scientist. I write about data science, machine learning and analytics. I also write about career and productivity tips to help you thrive in the field.