Create your own newsletter from any source with Python — Part 1 (Web Scraping)

Newsletters can be annoying sometimes, but they can also be useful: imagine getting updates on everything you really care about in a single email that YOU control.

Fabio Magarelli
Analytics Vidhya
4 min read · Jan 25, 2021


Photo by Kristina Tripkovic on Unsplash

This is exactly the aim of this little tutorial, divided into two parts: in part 1, we’ll learn how to make a web scraper that pulls and filters information from any website. In part 2, we’ll find out how to create a newsletter from the scraped data and send it via email to yourself or your subscribers and friends.

If you’re looking for part 2 of this tutorial, follow this link: https://medium.com/analytics-vidhya/create-your-own-newsletter-from-any-source-with-python-part-2-the-newsletter-ed21cd47c788

The full code is available here: https://github.com/fabiom91/python-newsletter_tutorial

Web Scraper

For the web scraper, we will use two very useful Python libraries: requests and BeautifulSoup.
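Here is a minimal sketch of the setup (the wall URL below is just an example target; both libraries can be installed with pip install requests beautifulsoup4):

```python
import requests
from bs4 import BeautifulSoup

# Example target: a Medium publication wall. Any page you want to
# scrape works here.
WALL_URL = "https://medium.com/analytics-vidhya"

# Download the page and parse it.
response = requests.get(WALL_URL)
soup = BeautifulSoup(response.text, "html.parser")

# Dump the raw HTML so we can see what we are working with.
html = str(soup)
print(html)
```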

Before we proceed, we have to have a look at what we’ve scraped from the target website: in my case, the raw HTML printed above.

Ugh my head’s gonna explode! 🤯

I know, it doesn’t look pretty at all, but keep in mind that once you’ve extracted the data you need, you don’t have to look at it anymore. In my case, I already figured out from the output above that the link to an article contains its title.

Let’s say I want to get all articles about privacy.
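We can try a plain string search over the raw HTML (the 200-character window below is an arbitrary choice):

```python
# Print a window of raw HTML around every occurrence of the keyword,
# so we can see what the surrounding markup looks like.
keyword = "privacy"
pos = html.lower().find(keyword)
while pos != -1:
    print(html[max(0, pos - 200):pos + 200])
    print("-" * 40)
    pos = html.lower().find(keyword, pos + 1)
```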

As you can see, there is still a lot of nonsense, but we can start recognising some URLs. Let’s clean this up by extracting only the strings inside the “href” attributes. To do this, we first have to find the position of each “href” element.
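With plain str.find(), that looks something like this:

```python
# Record the starting index of every 'href="' attribute in the page.
href_positions = []
pos = html.find('href="')
while pos != -1:
    href_positions.append(pos)
    pos = html.find('href="', pos + 1)
```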

Once we know the position of each “href” element, we can slice out its contents.
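For example (keeping, for now, only the links that mention our keyword):

```python
# Slice out the string between the quotes that follow each href=.
links = []
for pos in href_positions:
    start = pos + len('href="')
    end = html.find('"', start)
    links.append(html[start:end])

# Keep only the links that mention "privacy".
privacy_links = [link for link in links if "privacy" in link.lower()]
print(privacy_links)
```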

Still not quite there yet, but we can clearly see that if we discard the query parameters (everything after “?”), we only have one URL in our search, so let’s clean it up.
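Splitting on “?” and deduplicating does the trick:

```python
# Discard the query parameters and deduplicate.
clean_links = {link.split("?")[0] for link in privacy_links}
print(clean_links)
```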

Perfect! You may have noticed that what I’ve been calling a “URL” so far is actually a relative path, which is missing the domain name. To fix this, we can have a look at any of the articles published on the Medium wall and see that the complete URL is made by prefixing the relative path with the wall URL.
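For example (assuming every scraped href is such a relative path):

```python
# Prefix each relative path with the wall URL to get a complete link.
full_urls = [WALL_URL + path for path in clean_links]
print(full_urls)
```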

Assuming that I’m interested in articles about privacy and data analytics, I can get all the articles whose titles contain keywords such as “data” or “privacy”. To search for all the links containing any of the chosen keywords, I can use regular expressions (regex).
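One pattern covering all the keywords might look like this:

```python
import re

# Build a single pattern that matches any of the chosen keywords,
# case-insensitively, and apply it to every link on the page.
keywords = ["data", "privacy"]
pattern = re.compile("|".join(keywords), re.IGNORECASE)

selected = {WALL_URL + link.split("?")[0] for link in links if pattern.search(link)}
print(selected)
```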

Now that we’ve seen how to extract the URLs of the articles, we may want to get some more information from within the articles themselves to populate our newsletter. To do so, we can apply the same process we just used to each of the article URLs.

In this example case, we want to extract:

  • Article title
  • Article subtitle (first paragraph)
  • Article picture (if any)

By analysing the source code of each article, I’ve identified that the title is always in the first “h1” tag, while for the image we can use the first “img” tag after the title. As for the subtitle, I’m going to use the first paragraph “p” of the article.
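Putting it together with BeautifulSoup’s find() and find_next() (a sketch; the real Medium markup may need some tweaking of these selectors):

```python
articles = []
for url in selected:
    page = BeautifulSoup(requests.get(url).text, "html.parser")

    # Title: the first <h1> on the page.
    title_tag = page.find("h1")
    title = title_tag.get_text() if title_tag else ""

    # Picture (if any): the first <img> after the title.
    img_tag = title_tag.find_next("img") if title_tag else None
    image = img_tag.get("src") if img_tag else None

    # Subtitle: the first paragraph of the article.
    p_tag = page.find("p")
    subtitle = p_tag.get_text() if p_tag else ""

    articles.append({
        "url": url,
        "title": title,
        "image": image,
        "subtitle": subtitle,
    })

print(articles)
```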

For this example, we are going to use only this data, but if you’re interested, you can scrape different websites and combine all the data into a single newsletter.

Congratulations on finishing part 1 of this tutorial. To learn how to make a newsletter with the scraped data and how to send it automatically to yourself or your friends, check out part 2 of this tutorial here:

https://medium.com/analytics-vidhya/create-your-own-newsletter-from-any-source-with-python-part-2-the-newsletter-ed21cd47c788


Fabio Magarelli

PhD student at the Centre for Research Training in Artificial Intelligence, University College Cork.