How to scrape articles from data science publications using Python

What data science articles attract more attention (Part 1)

TU · Analytics Vidhya · 7 min read · Jan 17, 2021

Photo by Elena Mozhvilo on Unsplash

Have you ever wondered what makes an article great? Are there specific areas of the data science world that readers are more interested in? I certainly have! I aim to find answers to these questions by analysing articles from data science publications on Medium. This series of articles will cover areas such as web scraping, cleansing text data and topic modelling.

In Part 1 of the series we will obtain historical articles from various data science publications by web scraping them with Python.

Data science publications

There are quite a few publications on Medium that cover topics like data science and programming. We will obtain articles from three of them: Towards Data Science (TDS), Towards AI and Analytics Vidhya. This should provide a large enough data set for our future analysis.

Navigating to archives

A simple Google search gives us the link to the archive of the publication we are interested in. Almost all publications have a URL of the form ‘https://medium.com/publication-name/archive’, with the exception of TDS, which uses its own domain. The layout of the archive page, however, is the same for all of them.
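
For reference, here is a small sketch collecting the three archive URLs in one place; the Analytics Vidhya slug is an assumption based on the publication’s Medium address, while the other two links appear later in this article.

# Archive URLs for the three publications.
# Note: the 'analytics-vidhya' slug is assumed from the publication's Medium
# address; TDS is the exception that lives on its own domain.
archive_urls = {
    'Towards Data Science': 'https://towardsdatascience.com/archive',
    'Towards AI': 'https://medium.com/towards-artificial-intelligence/archive',
    'Analytics Vidhya': 'https://medium.com/analytics-vidhya/archive',
}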

Let’s inspect the archive page of a publication. At the top of the page we find the years in which articles were published; clicking on a year reveals the months, and clicking on a month reveals the days. One thing to note is that not all years have months, and not all months have days.

https://towardsdatascience.com/archive

How do we get links to all the dates?

We start by inspecting the archive page. The links are stored in ‘div’ containers with a class starting with ‘timebucket…’; the only difference between the class names of the three containers (years, months and days) is the width.

Inspecting the Analytics Vidhya archive page.
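
As a side note, since the class names only differ by the width, one could also match every container whose class starts with ‘timebucket’ in a single call; a minimal sketch (not how the function below is written):

import re
import requests
from bs4 import BeautifulSoup

# Sketch: grab every 'timebucket ...' container in one go, whatever its width.
soup = BeautifulSoup(requests.get('https://towardsdatascience.com/archive').text, 'html.parser')
buckets = soup.find_all('div', class_=re.compile(r'^timebucket'))
links = [b.a.get('href') for b in buckets if b.a]  # skip buckets without a link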

Now that we have all the information needed to obtain the links, we can write the code. Let’s start with the imports necessary for this task: we will use the Python libraries requests and BeautifulSoup to make HTTP requests and extract data from the HTML.

from bs4 import BeautifulSoup
import requests

Next we define a function called get_all_links. It takes the URL of a publication archive as input and returns the links for all the dates. It is split into three parts: get years, get months and get days. The code also has two additional variables, years_no_months and all_links_no_days, which collect links for years that don’t have months and for months that don’t have days.

def get_all_links(url):
    '''Obtain all the links to archive pages.

    url - URL of the publication archive
    '''
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')

    # get years
    search = soup.find_all('div', class_='timebucket u-inlineBlock u-width50')
    years = []
    for h in search:
        years.append(h.a.get('href'))

    # get months
    years_months = []
    years_no_months = []  # for the years that don't have months
    for year in years:
        y_soup = BeautifulSoup(requests.get(year).text, 'html.parser')
        search_months = y_soup.find_all('div', class_='timebucket u-inlineBlock u-width80')
        months = []
        if search_months:
            for month in search_months:
                try:
                    months.append(month.a.get('href'))
                except AttributeError:  # month bucket without a link
                    pass
            years_months.append(months)
        else:
            years_no_months.append(year)
    years_months = [item for sublist in years_months for item in sublist]

    # get days
    all_links = []
    all_links_no_days = []  # for the months that don't have days
    for month_url in years_months:
        m_soup = BeautifulSoup(requests.get(month_url).text, 'html.parser')
        all_days = m_soup.find_all('div', class_='timebucket u-inlineBlock u-width35')
        days = []
        if all_days:
            for day in all_days:
                try:
                    days.append(day.a.get('href'))
                except AttributeError:  # day bucket without a link
                    pass
            all_links.append(days)
        else:
            all_links_no_days.append(month_url)
    all_links = [item for sublist in all_links for item in sublist]

    final_links = years_no_months + all_links_no_days + all_links
    return final_links

towards_ai_links = get_all_links('https://medium.com/towards-artificial-intelligence/archive')

The final output is a list of links to all the pages in the publication archive.

['https://medium.com/towards-artificial-intelligence/archive/2015', ...]
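
A quick sanity check on the collected links, for example:

# Illustrative sanity check on the output of get_all_links.
print(len(towards_ai_links))   # number of archive pages found
print(towards_ai_links[:3])    # first few links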

How do we get data for individual articles?

Inspecting the page is a good starting point for this task as well. All article boxes have a standard format, so working out how to extract the information from one box allows us to collect everything we need from all the other articles.

Before we can write the code we need to establish which elements, and which classes, hold the information we need. At this point we are only interested in the title, subtitle, number of claps and number of responses for each article.

Inspecting individual articles

We now have all the information needed to write the code; we start with smaller snippets of the main script. To get all the articles from a single link, we search for ‘div’ containers with the class ‘streamItem …’.

link = 'https://medium.com/towards-artificial-intelligence/archive/2015'
soup_ = BeautifulSoup(requests.get(link).text, 'html.parser')
articles = soup_.find_all('div', class_='streamItem streamItem--postPreview js-streamItem')

Next we obtain the title and subtitle for a single article. In the code below we check two elements for the title, ‘h3’ and ‘h2’, because some authors take a different approach to writing the title. When the title or subtitle is missing we set it to an empty string, although this value can be changed to anything.

article = articles[0]  # get a single article
if article.h3:
    title = article.h3.getText()
elif article.h2:
    title = article.h2.getText()
else:
    title = ''

if article.h4:
    subtitle = article.h4.getText()
else:
    subtitle = ''

To obtain claps and responses we find the corresponding elements by their class. In this case we are not just interested in the text; we need integer values. Claps for individual articles sometimes reach the thousands, in which case they are displayed as ‘1K’, ‘2.2K’ and so on. To deal with this we strip the ‘K’, convert the string to a number and multiply it by 1000. All other values are either converted to an integer or set to zero.

When an article has responses from readers, they come in the format ‘1 response’, ‘2 responses’ and so on. Here we just want the integer, which we obtain by searching for the regular expression ‘\d+’ in the string.

import re

# claps
s_clap = article.find('button', class_='button button--chromeless u-baseColor--buttonNormal js-multirecommendCountButton u-disablePointerEvents')
if s_clap:
    s_clap = s_clap.getText()
    if 'K' in s_clap:
        clap = int(float(s_clap.replace('K', '')) * 1000)
    else:
        clap = int(s_clap)
else:
    clap = 0

# responses
s_response = article.find('a', class_='button button--chromeless u-baseColor--buttonNormal')
if s_response:
    s_response = s_response.getText()
    response = int(re.search(r'\d+', s_response).group())
else:
    response = 0

Now we integrate all the snippets of code into one function called get_data_all_articles. Its main argument is the output of the get_all_links function we covered earlier (the links to all archive pages of a single publication).

from tqdm import tnrange  # notebook flavour of tqdm's progress range, to keep track of the progress
import pandas as pd
import re

def get_data_all_articles(final_links):
    '''Collect title, subtitle, claps and responses for every article.'''
    titles, sub_titles, claps, responses = [], [], [], []

    for link, z in zip(final_links, tnrange(len(final_links))):
        soup_ = BeautifulSoup(requests.get(link).text, 'html.parser')
        articles = soup_.find_all('div', class_='streamItem streamItem--postPreview js-streamItem')
        title, subtitle, clap, response = [], [], [], []
        for article in articles:
            # title and subtitle
            if article.h3:
                title.append(article.h3.getText())
            elif article.h2:
                title.append(article.h2.getText())
            else:
                title.append('')
            if article.h4:
                subtitle.append(article.h4.getText())
            else:
                subtitle.append('')
            # claps
            s_clap = article.find('button', class_='button button--chromeless u-baseColor--buttonNormal js-multirecommendCountButton u-disablePointerEvents')
            if s_clap:
                s_clap = s_clap.getText()
                if 'K' in s_clap:
                    clap.append(int(float(s_clap.replace('K', '')) * 1000))
                else:
                    clap.append(int(s_clap))
            else:
                clap.append(0)
            # responses
            s_response = article.find('a', class_='button button--chromeless u-baseColor--buttonNormal')
            if s_response:
                s_response = s_response.getText()
                response.append(int(re.search(r'\d+', s_response).group()))
            else:
                response.append(0)
        titles.append(title)
        sub_titles.append(subtitle)
        claps.append(clap)
        responses.append(response)

    # flatten the per-page lists into single lists
    titles = [item for sublist in titles for item in sublist]
    sub_titles = [item for sublist in sub_titles for item in sublist]
    claps = [item for sublist in claps for item in sublist]
    responses = [item for sublist in responses for item in sublist]

    frame = pd.DataFrame([titles, sub_titles, claps, responses]).transpose()
    frame.columns = ['Title', 'Subtitle', 'Claps', 'Responses']
    return frame

data_set = get_data_all_articles(towards_ai_links)

Running the script might take some time depending on the number of HTTP requests: each page contains 10 articles and some publications have quite large archives. The script uses the tqdm package to keep track of the progress.

After running the script we get the following data set:

Final data set for the Towards AI publication.
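
Since the scrape involves a lot of HTTP requests, it is worth saving the result so the archive does not have to be scraped again; a minimal sketch (the file name is arbitrary):

import pandas as pd

# Save the scraped data and reload it later without repeating the requests.
data_set.to_csv('towards_ai_articles.csv', index=False)
data_set = pd.read_csv('towards_ai_articles.csv')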

Wrap up

We started our journey of uncovering which areas of data science are more interesting to readers by obtaining a data set from the archives of some of the publications. The information on an article is not limited to the features we covered; there are plenty more metrics to obtain, such as whether the article preview has an image or how long the article is. These can be collected in a similar fashion to the title, subtitle, claps and responses, as sketched below.
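
As a rough sketch of what that could look like, reusing the article variable from the earlier snippets: the image check below is plain BeautifulSoup, while the ‘readingTime’ class name is an assumption about the archive markup and should be verified by inspecting the page.

# Sketch of two extra metrics per article preview.
# 'readingTime' is an assumed class name; check it against the actual markup.
has_image = article.find('img') is not None
reading_time_tag = article.find('span', class_='readingTime')
reading_time = reading_time_tag.get('title') if reading_time_tag else ''  # e.g. '5 min read'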

Keep an eye out for further articles in this series, where we continue with data cleansing and topic modelling.
