Web Scraping 101 (Part 1)
A beginner’s guide to gathering data on the web!
With the rapid expansion of data science, there are now millions of datasets available for download online. As a beginner, you may have encountered platforms that host these kinds of data, such as Kaggle or Google’s Dataset Search.
While these sources are extensive, you may be looking for a specific kind of data for a personal project that is not readily available yet. If you’ve run into this roadblock, Web Scraping may be the answer to your problems!
A very quick introduction to Web Scraping
If you want to find out more about what web scraping exactly is, feel free to read this Wikipedia article for a (not-so) concise definition of Web Scraping.
To demonstrate how to build a simple scraper, I will show you how to build one that gathers Medium articles into a .csv file.
I highly recommend being familiar with HTML, as it will make the process of building a scraper much faster.
Exploring the Medium Site
Before you can start writing code, you first need to find where the data is actually located. After navigating through the Medium homepage, you can find the Medium Archives page.
This is an example of what the page looks like. For this demonstration, I’ve decided to use the tag Covid-19 to find articles related to the pandemic. Feel free to check out the archives page: Archive of stories about Covid-19 — Medium.
We will be using two main libraries to help us gather the data: BeautifulSoup4 and Requests.
Identifying a few patterns is essential in making sure we get the desired data. For this project, the target is to get the text content of the articles themselves and to store other prominent data such as the title, tag, and the date the article was written.
Finding Article Titles and Links
You can do this by right-clicking the webpage and selecting Inspect Element or Inspect, depending on what browser you are using. Alternatively, you can open this panel by pressing Ctrl+Shift+I on Windows.
To search for specific text within the HTML markup, you can use the Select Element button at the upper-left corner, or toggle it by pressing Ctrl+Shift+C on Windows.
Using these tools, you can see that each article is stored in a container (a <div> tag), and that these containers hold information like the link to the article itself and the title of each article.
More details on this will be discussed along with the Python code.
Finding Article Publish Date
Clicking on dates under the Archive tab also changes the URL of the webpage to match those dates. We can extract these dates and associate them with specific articles.
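As a rough illustration (the exact slugs are assumptions based on how the archive URLs looked at the time of writing), the pattern looks like this:

https://medium.com/tag/covid19/archive/2020 — stories from 2020
https://medium.com/tag/covid19/archive/2020/03 — stories from March 2020
https://medium.com/tag/covid19/archive/2020/03/01 — stories from March 1, 2020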
The General Plan
Knowing what we learned from exploring the Medium website, one way to retrieve the article text data is to first scrape the links of the articles within each archive page.
After that, we can go over each link and scrape the actual text content from it.
For this example, we will be scraping Covid-19 related articles from the months of March and April.
The Link Scraper
Before attempting to scrape all the data you may need, I highly recommend testing your code on one specific test case so it is easier to debug. (As an example of a test case: scraping the archive for a specific date, March 1, 2020.)
import requests
from bs4 import BeautifulSoup
import pandas as pd

URL = 'https://medium.com/tag/covid19/archive/2020/03/01'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'lxml')
This code snippet downloads and parses the HTML markup of the archive page for March 1, 2020. Now that we have written the code for getting the content of the archive, we need to filter the HTML to find the relevant data discussed in the previous section.
To do this, we need to create a filter that looks for the <div> containers that hold each article. This is where familiarity with HTML proves to be an essential skill. We need to find a unique identifier, be it a set of tags or a class, that distinguishes the article containers from other parts of the archive page.
For this case, I was able to identify “cardChromeless” as one of the classes that marks an article container. I iterated over each article to find the other relevant data: the link through the <a> tags, the title through the <h3> tags, and the “post_id”, which acts as a unique identifier for each post (useful for removing duplicate articles later if you plan on scraping articles that cover more than one tag), through the “data-post-id” attribute inside the <a> tag.
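Below is a minimal sketch of what that filtering step could look like, assuming the “cardChromeless” class and the tag layout described above (Medium’s markup changes over time, so verify these selectors in your own browser’s inspector):

# Assumes `soup` was created as in the previous snippet.
articles = soup.find_all('div', class_='cardChromeless')

titles, links, post_ids = [], [], []
for article in articles:
    link_tag = article.find('a', attrs={'data-post-id': True})  # link + post id
    title_tag = article.find('h3')                               # article title
    if link_tag is None or title_tag is None:
        continue  # skip containers that are not article cards
    links.append(link_tag.get('href'))
    titles.append(title_tag.get_text(strip=True))
    post_ids.append(link_tag.get('data-post-id'))

print(len(links), 'articles found on this archive page')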
Feel free to copy this code into your editor and explore the different outputs by printing the variables present in the scraper.
Now that we can scrape the archive for a specific date, we can generalize this code to work with different tags and dates by slightly modifying it.
The changes include writing the data we retrieve to a .csv file and iterating over the possible dates to cover the entire archive, whether we select multiple years, months, or specific days. The date iterator creates string-formatted days that fit the archive URL.
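Here is a hedged sketch of how that generalization might look; the date range, tag name, output filename, and CSV columns are my own choices for illustration, not the exact code from the original scraper:

from datetime import date, timedelta

import pandas as pd
import requests
from bs4 import BeautifulSoup

TAG = 'covid19'
START, END = date(2020, 3, 1), date(2020, 4, 30)

rows = []
day = START
while day <= END:
    # Build a string-formatted date that fits the archive URL.
    url = f'https://medium.com/tag/{TAG}/archive/{day:%Y/%m/%d}'
    soup = BeautifulSoup(requests.get(url).content, 'lxml')

    for article in soup.find_all('div', class_='cardChromeless'):
        link_tag = article.find('a', attrs={'data-post-id': True})
        title_tag = article.find('h3')
        if link_tag is None or title_tag is None:
            continue
        rows.append({
            'post_id': link_tag.get('data-post-id'),
            'title': title_tag.get_text(strip=True),
            'link': link_tag.get('href'),
            'date': day.isoformat(),
        })
    day += timedelta(days=1)

# Save everything we gathered to a .csv file.
pd.DataFrame(rows).to_csv('medium_articles.csv', index=False)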
This guide only covers the basics of scraping; there are ways to make the code run faster on your local machine through concepts such as multiprocessing or asynchronous functions.
Feel free to check out this guide on asynchronous functions, or watch this video that explains multiprocessing and related concepts.
Stay tuned for Part 2 of this Web Scraping series, where we will extract the article data itself by going over the links.
Thanks for reading through this article! If you have any comments regarding the code or explanations throughout this guide, feel free to share your thoughts and feedback in the comments below!