Data Scraping: A Quick, Basic Tutorial in Python
What is Data Scraping?
If you’re like me, you might have heard this term briefly thrown around in computer science or business contexts. So… what exactly IS data scraping and why has everybody been talking about it?
As every curious individual living in 2020 might do, I googled what it meant. Dictionary.com offers a formal definition, but in short, data scraping is the automated extraction of information from websites, used to easily make vast amounts of data:
- Accessible
- Usable
- Aggregable
Why does this matter to Data Scientists?
A data scientist’s role is to tell stories through data. Much like words equip writers to articulate their ideas, data equips data scientists to tell relevant, meaningful, and holistic narratives about the past and the present that can be leveraged toward a better future.
As the building blocks of data science, data extracted through web scraping can be utilized in, to name a few areas:
- Natural Language Processing
- Machine Learning
- Data Visualization & Analysis
Industry Prominence & Usage
In a wider context, data scraping is not a new practice. However, recent automation has enabled companies to truly harness its power. Data scraping can provide valuable insight into the customer experience, better inform business decisions and performance, and drive innovation at previously unattainable rates. It has found use in data analysis & visualization, research & development, and market analysis, among other contexts.
The Basics
To perform data scraping, you must have a basic understanding of HTML structures. For those of you who aren’t very experienced with computer science — don’t fret! You just need enough to identify some simple HTML structures. Here are a few of the most commonly seen ones:
- Headings: Defined with the <h1> to <h6> tags, most important to least important respectively.
Example: <h1> This is the Heading shown! </h1>
- Paragraphs: Defined with the <p> tag.
Example: <p> This is the Paragraph shown! </p>
- Divisions: Defined with the <div> tag.
Example: <div> This is the container shown! </div>
The Process
Web Scraping can be Broken Down into 4 General Steps:
1. Finding the Desired URL to be Scraped
2. Inspecting the Page
3. Identifying Elements for Extraction
4. Extracting & Storing the Data
Getting Started!
First, make sure that you have Python and BeautifulSoup installed on your computer! Both are freely available online; BeautifulSoup, for example, can be installed with pip (pip install beautifulsoup4).
We will import BeautifulSoup for navigating the data and urlopen to extract the HTML.
from urllib.request import urlopen
from bs4 import BeautifulSoup
1. Finding the Desired URL to be Scraped
First, identify the URL of the page you want to scrape. Here, we will be scraping data on a BBC article that was recently posted.
2. Inspecting the Page
Now, open the article in your browser and inspect the page. Do this by right-clicking anywhere on the page and selecting “Inspect Element”.
You should see a bunch of HTML code flood your screen under the Elements tab.
Overwhelmed? DON’T PANIC!
3. Identifying elements for extraction
Now, we want to identify and navigate to the smallest division, shown as <div>, that contains all of the desired data to be scraped. Recall the basic HTML structures mentioned above! Since we want to scrape the body of the article, we are looking for <p> tags, which represent paragraphs.
After finding all of the desired data, we look for the most specific division, or <div>, that contains it. Because the elements are indented based on hierarchy, we can see that all the desired data falls under a single <div>: here, the one with the class story-body__inner.
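To see what the “smallest enclosing division” means in practice, consider a toy page (the markup below is invented for illustration; only the class name mirrors the real article):

```python
from bs4 import BeautifulSoup

# Invented markup: the article paragraphs live inside an inner <div>,
# which itself sits inside a larger page-level <div>.
html = """
<div class="page">
  <div class="story-body__inner">
    <p>Body paragraph.</p>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find() locates the innermost division that holds the paragraphs.
inner = soup.find("div", {"class": "story-body__inner"})
print(inner.p.text)  # → Body paragraph.
```

Targeting the innermost division keeps unrelated page content (menus, ads, sidebars) out of the scrape.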
4. Extracting & Storing the Data!
First, we want to connect to the website and retrieve the HTML data.
link = "https://www.bbc.com/news/world-52603017"
try:
    page = urlopen(link)
except:
    print("Error connecting to the URL")
A try/except is utilized here to catch the error thrown if the URL is invalid or the connection fails.
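A bare except also swallows unrelated bugs, so a slightly more careful version might catch only the exceptions urlopen actually raises; one possible sketch (the fetch name is our own, not part of any library):

```python
from urllib.error import URLError
from urllib.request import urlopen


def fetch(link):
    """Return the opened page, or None if the request fails.

    URLError covers connection problems; ValueError covers
    malformed links such as a missing http:// scheme.
    """
    try:
        return urlopen(link)
    except (URLError, ValueError):
        print("Error connecting to the URL")
        return None
```

For example, fetch("not-a-url") prints the error message and returns None instead of crashing the script.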
Now, in order to extract the desired data, we will create a BeautifulSoup object to parse the HTML data we just retrieved.
soup = BeautifulSoup(page, 'html.parser')
We identified the division, shown by the <div> tag, that contains our data earlier.
We will now parse and identify that specific division using the BeautifulSoup object’s find() function:
content = soup.find('div', {"class": "story-body__inner"})
Now we will iterate through the specific division we just isolated above, identifying all of the paragraph text, indicated by <p> tags within that division, using BeautifulSoup’s find_all() function:
news = ''
for x in content.find_all('p'):
    news += ' ' + x.text
The above code stores the entire body of the article in the news variable, which can later be placed into a data frame alongside other extracted data!
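One way that “alongside other extracted data” can look in practice is a list of records, one per scraped article; the structure below is a sketch, with sample text standing in for the real scraped body:

```python
# Stands in for the article body held in the `news` variable above.
news = "Sample article body text."

# Each scraped article becomes one record; a list of such records maps
# directly onto a tabular structure, e.g. pandas' pd.DataFrame(records).
records = [
    {"url": "https://www.bbc.com/news/world-52603017", "body": news},
]
```

Scraping several articles in a loop and appending one record each keeps the data ready for analysis in a single table.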
That scraped data can be stored within a CSV. In this example, we will store the text held in the news variable in a CSV called “article.csv”.
with open('article.csv', 'w') as file:
    file.write(news)
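Writing the raw text works for a single article, but for multiple columns the standard csv module produces a properly quoted file; a sketch, with sample text standing in for the scraped body:

```python
import csv

# Stands in for the article body scraped above.
news = "Sample article body text."

# A header row plus one data row per article gives a well-formed CSV
# that spreadsheet tools and pandas can read back reliably.
with open("article.csv", "w", newline="") as file:
    writer = csv.writer(file)
    writer.writerow(["url", "body"])
    writer.writerow(["https://www.bbc.com/news/world-52603017", news])
```

The csv module handles quoting automatically, so commas and quotes inside the article text will not break the file.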
And … just like that, you’ve performed some basic data scraping!
Ethics of Data Scraping
As with anything that has the potential to do great good, data scraping can also be utilized unethically, maliciously, and illegally. It has been used to plagiarize, spam, and even commit identity theft and fraud.
We are working with some powerful stuff here!
While the practice of data scraping is not inherently unethical, novice and expert data scientists alike need to be aware of the implications of our actions, especially when it comes to the safety and privacy of personal data. A few guidelines:
- Use Public APIs, if available, instead of scraping
- Do not request large amounts of data that may overload servers or be perceived as a DDoS attack
- Respect other people’s work & do not steal content
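One concrete courtesy the standard library supports is checking a site’s robots.txt before scraping. The sketch below parses a made-up rule set inline to stay offline; against a live site you would instead call set_url() with the site’s robots.txt address and then read():

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt rules, supplied inline for illustration.
parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# can_fetch() reports whether a given user agent may scrape a URL.
print(parser.can_fetch("*", "https://example.com/news/article"))   # → True
print(parser.can_fetch("*", "https://example.com/private/page"))   # → False

# Between consecutive requests, a short pause (e.g. time.sleep(1))
# keeps the load on the server low.
```

Respecting these rules, and pacing your requests, goes a long way toward scraping responsibly.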
With that in mind, good luck and happy scraping!