Data Scraping: A Quick, Basic Tutorial in Python

Christos Chen
Published in Analytics Vidhya
5 min read · May 11, 2020

[Cover image. Source: SSA-Data]

What is Data Scraping?

If you’re like me, you might have heard this term briefly thrown around in computer science or business contexts. So… what exactly IS data scraping and why has everybody been talking about it?

As every curious individual living in 2020 might do, I googled what it meant. Dictionary.com defines it as:

[Definition image. Source: Dictionary.com]

In short, data scraping is used to easily make vast amounts of data:

  • Accessible
  • Usable
  • Aggregable

Why does this matter to Data Scientists?

A data scientist’s role is to tell stories through data. Much like how words fuel the ability of writers to articulate, data better equips data scientists with the ability to tell relevant, meaningful, yet holistic narratives about the past and the present to leverage a better future.

As the building blocks of data science, data extracted from web scraping can be utilized in, to name a few areas:

  • Natural Language Processing
  • Machine Learning
  • Data Visualization & Analysis

Industry Prominence & Usage

In a wider context, data scraping is not a new practice. However, recent automation has enabled companies to truly harness its power. Data scraping can provide valuable insight into the customer experience, better inform business decisions and performance, and drive innovation at previously unattainable rates. It has found use in data analysis and visualization, research and development, and market analysis, among other contexts.

[Image source: VIRTUALlytics]

The Basics

To perform data scraping, you must have a basic understanding of HTML structures. For those of you who aren’t very experienced with computer science — don’t fret! You just need enough to identify some simple HTML structures. Here are a few of the most commonly seen ones:

  • Headings: Defined by the <h1> through <h6> tags, from most important to least important.
    Example: <h1> This is the Heading shown! </h1>
  • Paragraphs: Defined by the <p> tag.
    Example: <p> This is the Paragraph shown! </p>
  • Divisions: Defined by the <div> tag.
    Example: <div> This is the container shown! </div>
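To see how these three structures fit together, here is a minimal sketch that hand-writes a tiny page using all three tags and parses it with BeautifulSoup (the same library we use below); the page content is made up for illustration:

```python
from bs4 import BeautifulSoup

# A tiny hand-written page using the three structures above.
html = """
<div id="main">
  <h1>This is the Heading shown!</h1>
  <p>This is the Paragraph shown!</p>
  <p>A second paragraph inside the same division.</p>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
print(soup.h1.text)                         # the heading text
print(len(soup.find('div').find_all('p')))  # paragraphs inside the div
```

Notice that the division acts as a container: once you find the right <div>, everything inside it is reachable from that one object.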

The Process

Web scraping can be broken down into four general steps:

1. Finding the Desired URL to be Scraped

2. Inspecting the Page

3. Identifying Elements for Extraction

4. Extracting & Storing the Data

Getting Started!

First, make sure that you have Python and BeautifulSoup installed on your computer! Python can be downloaded from python.org, and BeautifulSoup can be installed with pip (pip install beautifulsoup4).

We will import BeautifulSoup for navigating the data and urlopen to extract the HTML.

from urllib.request import urlopen
from bs4 import BeautifulSoup

1. Finding the Desired URL to be Scraped

First, identify the URL of the page you want to scrape. Here, we will be scraping data on a BBC article that was recently posted.

2. Inspecting the Page

Now, we’ll inspect the page to find the desired data. Do this by right-clicking anywhere on the page and selecting “Inspect Element”.

You should see a bunch of HTML code flood your screen under the elements tag.

Overwhelmed? DON’T PANIC!

3. Identifying elements for extraction

Now, we want to identify and navigate to the smallest division, shown as <div>, that contains all of the desired data to be scraped. Recall the basic HTML structures mentioned above! Since we want to scrape the body of the article, we are looking for the <p> tags, which represent paragraphs.

We’ve found the article body, the <p>s!

After finding all of the desired data, we look for the most specific division, or <div>, that contains it. Because the elements are indented by hierarchy, we can see that all of the desired paragraphs fall under a single division: the <div> with class "story-body__inner". This division contains the entire article body we want!

4. Extracting & Storing the Data!

First, we want to connect to the website and retrieve the HTML data.

link = "https://www.bbc.com/news/world-52603017"
try:
    page = urlopen(link)
except:
    print("Error connecting to the URL")

A try/except is utilized here to catch an error thrown if the URL was not valid.
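A bare except hides the reason a request failed. As an optional refinement, the standard library's urllib.error exceptions can be caught separately so the message says what actually went wrong; the fetch helper below is a hypothetical name introduced for this sketch:

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

def fetch(link):
    """Open a URL, reporting what went wrong on failure."""
    try:
        return urlopen(link)
    except HTTPError as e:
        # The server responded, but with an error status (404, 500, ...).
        print("Server returned an error:", e.code)
    except URLError as e:
        # The request never reached a server (bad hostname, no network, ...).
        print("Error connecting to the URL:", e.reason)
    return None
```

The order matters: HTTPError is a subclass of URLError, so it must be caught first.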

Now, in order to extract the desired data, we will create a BeautifulSoup object to parse the HTML data we just retrieved.

soup = BeautifulSoup(page, 'html.parser')

Earlier, we identified the division, shown by the <div> tag with class "story-body__inner", that contains our data.

We will now parse and identify that specific division using the BeautifulSoup object’s find() function:

content = soup.find('div', {"class": "story-body__inner"})
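One caveat worth knowing: find() returns None when no matching tag exists, and class names like this one are specific to a site's current template, so a layout change would silently break the scraper. A small defensive sketch, using a stand-in HTML string for illustration:

```python
from bs4 import BeautifulSoup

# Stand-in for the HTML retrieved from the page.
html = '<div class="story-body__inner"><p>Body text</p></div>'
soup = BeautifulSoup(html, 'html.parser')

content = soup.find('div', {"class": "story-body__inner"})
if content is None:
    # The page layout may have changed since the scraper was written.
    print("Division not found")
else:
    print("Found division with", len(content.find_all('p')), "paragraph(s)")
```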

Now we will iterate through the division we just isolated, collecting all of the paragraph text (each <p> within that division) using BeautifulSoup’s find_all() function:

news = ''
for x in content.find_all('p'):
    news += ' ' + x.text

The above code stores the entire body of the article in the news variable, which can later be placed into a data frame alongside other extracted data!
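As an aside, repeated string concatenation copies the growing string on every pass. A more idiomatic sketch collects the paragraph texts in a list and joins them once at the end (essentially the same result, without the leading space; the HTML here is a stand-in):

```python
from bs4 import BeautifulSoup

# Stand-in for the division isolated above.
html = '<div><p>First paragraph.</p><p>Second paragraph.</p></div>'
content = BeautifulSoup(html, 'html.parser').find('div')

# Collect each paragraph's text, then join once at the end.
news = ' '.join(p.text for p in content.find_all('p'))
print(news)
```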

That scraped data can be stored within a CSV. In this example, we will write the contents of the news variable to a file called “article.csv”.

with open('article.csv', 'w') as file:
    file.write(news)
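Note that file.write(news) saves the text as one plain line. If the goal is a true CSV row (for example, one row per article, ready to load into a data frame), the standard library's csv module handles commas and quotes inside the text. A sketch with a hypothetical title column:

```python
import csv

# Hypothetical scraped values, stand-ins for the variables built above.
title = "Example article title"
news = 'Body text with, commas and "quotes" handled safely.'

with open('article.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(["title", "body"])  # header row
    writer.writerow([title, news])      # one row per scraped article

# Reading it back confirms the quoting round-trips.
with open('article.csv', newline='') as file:
    rows = list(csv.reader(file))
```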

And … just like that, you’ve performed some basic data scraping!

Ethics of Data Scraping

As with anything that has the potential to do great good, data scraping can also be utilized unethically, maliciously, and illegally. Data scraping has been used to plagiarize, spam, and even commit identity theft and fraud.

We are working with some powerful stuff here!

While data scraping is, in itself, a legitimate practice, novice and expert data scientists alike need to be aware of the implications of their actions, especially where the safety and privacy of personal data are concerned. A few guidelines:

  • Use Public APIs, if available, instead of scraping
  • Do not request large amounts of data that may overload servers or be perceived as a DDoS attack
  • Respect other people’s work & do not steal content

With that in mind, good luck and happy scraping!


Christos Chen

An individual who aspires to utilize data to tell relevant and important narratives within the numbers to catalyze meaningful change in the world.