How to Legally Scrape the Web for Your Next Data Science Project

A quick introduction to web scraping and how to do it legally.

Madison Hunter
Modern Programmer
5 min read · Apr 6, 2021


Photo by Christin Hume on Unsplash

With exabytes (that’s one quintillion bytes, by the way) of data being produced each day, web scraping is becoming a popular and essential tool for any current or future data scientist.

However, web scraping isn’t as simple as it looks.

The biggest issue surrounding web scraping is the legality of accessing a website’s information. Lawsuits surrounding the illegal scraping of a website’s content have been around since the early 2000s, with big names such as Facebook and LinkedIn being at the forefront of well-publicized legal battles.

That said, web scraping is not inherently illegal if done correctly. It does mean data scientists need to apply an extra layer of attentiveness and judgment to their projects to avoid any legal faux pas.

Let’s take a look at what web scraping is, when it becomes an illegal practice, and how best to carry out legal web scraping practices.

*A disclaimer: I am not a lawyer and this article should in no way be used as legal advice. This article is for information and entertainment purposes only and should not be considered a perfect account of how to legally scrape the web. The information in this article was curated from several sources. If in doubt about the legality of a web scraping project, contact a lawyer or the owner of the website you plan on scraping.*

What is web scraping?

In short, web scraping is the process of extracting structured data from third-party websites. Web scraping is generally used to gather data from a website without also gathering unwanted or unrelated information. While the process can be completed by humans, it’s most often done by creating software composed of two parts: the crawler and the scraper.

The crawler component of the software is similar to most web crawlers in that it scours the website of choice and picks out the content of interest. This content then gets passed on to the scraper.

The scraper component is programmed to find relevant information and to “bookmark” it using data locators. Once complete, the scraper extracts all of the data it has bookmarked and transfers it to a spreadsheet or database for later analysis.

Here is what the process looks like:

  1. The crawler identifies target websites.
  2. The crawler collects URLs of the web pages you want to use for data extraction.
  3. The scraper requests each URL to retrieve the page’s HTML code.
  4. The scraper “bookmarks” relevant data using data locators in the HTML code.
  5. The scraper saves the relevant data in JSON, CSV, or some other type of format for later use.

A simple way to look at the web scraping process is to compare it to times you’ve copied and pasted information from a website. Essentially, you’re doing the same task as a web scraper, just in a manual fashion.
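To make the five steps above concrete, here is a minimal sketch in Python using the requests and BeautifulSoup libraries. The URL, CSS selector, and output file are hypothetical placeholders for illustration only; swap in a page you actually have permission to scrape.

```python
# A minimal sketch of steps 3-5 above, assuming the `requests` and
# `beautifulsoup4` packages are installed. The URL and CSS selector are
# hypothetical placeholders, not a real project.
import csv

import requests
from bs4 import BeautifulSoup

url = "https://example.com/listings"      # hypothetical target page
response = requests.get(url, timeout=10)  # step 3: fetch the page's HTML
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Step 4: "bookmark" the relevant data using locators (here, a CSS selector).
rows = [
    {"title": item.get_text(strip=True), "link": item.get("href")}
    for item in soup.select("a.listing-title")  # hypothetical locator
]

# Step 5: save the extracted data in CSV format for later analysis.
with open("listings.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "link"])
    writer.writeheader()
    writer.writerows(rows)
```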

Web scraping is a common practice in data science and has many practical applications including stock price analysis, market research, housing price monitoring, and many more.

When does web scraping become illegal?

No matter where you look, you’ll find instances of web scraping gone wrong.

One example comes from Canada, where a 2019 article in Canadian Lawyer describes the legal battle between Mongohouse and the Toronto Real Estate Board (TREB). The TREB filed a successful lawsuit against Mongohouse, alleging that Mongohouse’s entire business was based on its web scraping of the TREB MLS system and the unauthorized distribution of TREB MLS information for commercial purposes. In short, the Federal Court made it clear in its ruling that the unauthorized web scraping of third-party content without explicit consent is illegal.

This is an extreme example, but it gives insight into how the world of web scraping is much more complicated than it appears.

While reading a website’s Terms and Conditions can give you some information on whether or not they allow web scraping, or even which parts of their website are allowed to be scraped, it’s nearly impossible to truly know if you’re on the right side of the law without a lawyer. Furthermore, there is a whole slew of legal and ethical considerations to take into account that could unknowingly make your scraping project illegal.

Therefore, it’s important to tread lightly.

How to legally scrape the web.

Web scraping is legal as long as you play by the rules.

So, it’s important to follow the golden rule of web scraping: when in doubt, ask.

I can’t stress this enough. If there is any doubt in your mind about the legality of scraping a particular website, contact the owner of the website and ask for permission.

Keep these key points in mind when scraping the web:

  • Check robots.txt before scraping a website. It tells you which parts of the site you can scrape and which you need to avoid (see the sketch after this list).
  • Don’t harm the website or its server: limit the rate at which you send requests so you don’t overload it.
  • Read the Terms and Conditions to determine if you are violating copyright.
  • Determine if the type of data you wish to collect will breach GDPR (General Data Protection Regulation, a regulation in EU law that protects the information of EU citizens).
  • Make yourself visible by identifying yourself in your requests (for example, in the User-Agent header). This allows website owners to contact you if you make a faux pas or if they need to send you a cease and desist letter.
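As a starting point, here is a rough sketch of how the robots.txt check, request throttling, and self-identification from the list above might look in Python. The website, paths, delay, and contact details are hypothetical assumptions, not a prescription, and they only cover the technical points; the Terms and Conditions and GDPR checks still have to happen outside your code.

```python
# A rough sketch of the robots.txt check, rate limiting, and self-identification
# points above, using the Python standard library plus `requests`.
# The site, paths, contact e-mail, and delay are hypothetical placeholders.
import time
from urllib.robotparser import RobotFileParser

import requests

BASE_URL = "https://example.com"  # hypothetical website
USER_AGENT = "my-research-bot/0.1 (contact: you@example.com)"  # identify yourself

# Check robots.txt before scraping anything.
robots = RobotFileParser()
robots.set_url(f"{BASE_URL}/robots.txt")
robots.read()

pages = ["/listings?page=1", "/listings?page=2"]  # hypothetical paths

for path in pages:
    if not robots.can_fetch(USER_AGENT, BASE_URL + path):
        print(f"robots.txt disallows {path}; skipping.")
        continue

    response = requests.get(
        BASE_URL + path,
        headers={"User-Agent": USER_AGENT},  # make yourself visible
        timeout=10,
    )
    # ... parse response.text here ...

    time.sleep(2)  # throttle requests so you don't overload the server
```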

Here are a couple of articles that outline how to legally scrape the web, some caveats to keep in mind, web scraping best practices, and how to set up your first web scraping project.

Final thoughts.

Web scraping is a common tool used in data science that will only become more popular as data becomes available in even greater quantities. Businesses and individuals looking to leverage data to complete any task or to gain any insight will be looking for data scientists who can efficiently, and more importantly, legally scrape the web.

Therefore, it only makes sense that data scientists be aware of the legal pitfalls surrounding web scraping. With exabytes of data being produced each day, the laws surrounding web scraping are likely to change in the future. However, by being aware of current and future laws and trends, you can rest assured that your web scraping projects will be insightful for all the right reasons for years to come.
