How to Legally Scrape the Web for Your Next Data Science Project

A quick introduction to web scraping and how to do it legally.

Madison Hunter
Modern Programmer
5 min read · Apr 6, 2021


Photo by Christin Hume on Unsplash

With exabytes (that’s one quintillion bytes, by the way) of data being produced each day, web scraping is becoming a popular and essential tool for any current or future data scientist.

However, web scraping isn’t as simple as it looks.

The biggest issue surrounding web scraping is the legality of accessing a website’s information. Lawsuits surrounding the illegal scraping of a website’s content have been around since the early 2000s, with big names such as Facebook and LinkedIn being at the forefront of well-publicized legal battles.

That said, web scraping is not inherently illegal if done correctly. It does mean data scientists need to apply an extra layer of attentiveness and judgment to their projects to avoid any legal faux pas.

Let’s take a look at what web scraping is, when it becomes an illegal practice, and how best to carry out legal web scraping practices.

*A disclaimer: I am not a lawyer and this article should in no way be used as legal advice. This article is for information and entertainment purposes only and should not be considered a perfect account of how to legally scrape the web. The information in this article was curated from several sources. If in doubt about the legality of a web scraping project, contact a lawyer or the owner of the website you plan on scraping.*

What is web scraping?

In short, web scraping is the process of extracting structured data from third-party websites. Web scraping is generally used to gather data from a website without also gathering unwanted or unrelated information. While the process can be completed by humans, it’s most often done by creating software composed of two parts: the crawler and the scraper.

The crawler component of the software is similar to most web crawlers in that it scours the website of choice and picks out the content of interest. This content then gets passed on to the scraper.

The scraper component is programmed to find relevant information and to “bookmark” it using data locators. Once complete, the scraper extracts all of the data it has bookmarked and transfers it to a spreadsheet or database for later analysis.

Here is what the process looks like:

  1. The crawler identifies target websites.
  2. The crawler collects URLs of the web pages you want to use for data extraction.
  3. The scraper requests each URL to retrieve the page’s HTML code.
  4. The scraper “bookmarks” relevant data using data locators in the HTML code.
  5. The scraper saves the relevant data in JSON, CSV, or some other type of format for later use.

A simple way to look at the web scraping process is to compare it to times you’ve copied and pasted information from a website. Essentially, you’re doing the same task as a web scraper, just in a manual fashion.
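To make the five steps above concrete, here is a minimal sketch in Python using the requests and BeautifulSoup libraries. The URL, CSS selector, and output file are hypothetical placeholders for illustration only; swap in a page you actually have permission to scrape.

```python
# A minimal sketch of steps 3-5 above, assuming the `requests` and
# `beautifulsoup4` packages are installed. The URL and CSS selector are
# hypothetical placeholders, not a real project.
import csv

import requests
from bs4 import BeautifulSoup

url = "https://example.com/listings"      # hypothetical target page
response = requests.get(url, timeout=10)  # step 3: fetch the page's HTML
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Step 4: "bookmark" the relevant data using locators (here, a CSS selector).
rows = [
    {"title": item.get_text(strip=True), "link": item.get("href")}
    for item in soup.select("a.listing-title")  # hypothetical locator
]

# Step 5: save the extracted data in CSV format for later analysis.
with open("listings.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "link"])
    writer.writeheader()
    writer.writerows(rows)
```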

Web scraping is a common practice in data science and has many practical applications including stock price analysis, market research, housing price monitoring, and many more.

When does web scraping become illegal?

No matter where you look, you’ll find instances of web scraping gone wrong.

One example comes from Canada, where a 2019 article in Canadian Lawyer describes the legal battle between Mongohouse and the Toronto Real Estate Board (TREB). The TREB filed a successful lawsuit against Mongohouse, alleging that Mongohouse’s entire business was based on its web scraping of the TREB MLS system and the unauthorized distribution of TREB MLS information for commercial purposes. In short, the Federal Court made it clear in its ruling that the unauthorized web scraping of third-party content without explicit consent is illegal.

This is an extreme example, but it gives insight into how the world of web scraping is much more complicated than it appears.

While reading a website’s Terms and Conditions can give you some information on whether or not they allow web scraping, or even which parts of their website are allowed to be scraped, it’s nearly impossible to truly know if you’re on the right side of the law without a lawyer. Furthermore, there is a whole slew of legal and ethical considerations to take into account that could unknowingly make your scraping project illegal.

Therefore, it’s important to tread lightly.

How to legally scrape the web.

Web scraping is legal as long as you play by the rules.

So, it’s important to follow the golden rule of web scraping: when in doubt, ask.

I can’t stress this enough. If there is any doubt in your mind about the legality of scraping a particular website, contact the owner of the website and ask for permission.

Keep these key points in mind when scraping the web:

  • Check robots.txt before scraping a website. It tells you which parts of the site you can scrape and which you need to avoid (see the sketch after this list).
  • Don’t harm the website or its server: limit the rate at which you send requests so you don’t overload it.
  • Read the Terms and Conditions to determine if you are violating copyright.
  • Determine if the type of data you wish to collect will breach GDPR (General Data Protection Regulation, a regulation in EU law that protects the information of EU citizens).
  • Make yourself visible by identifying yourself in your requests (for example, in the User-Agent header). This allows website owners to contact you if you make a faux pas or if they need to send you a cease and desist letter.
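As a starting point, here is a rough sketch of how the robots.txt check, request throttling, and self-identification from the list above might look in Python. The website, paths, delay, and contact details are hypothetical assumptions, not a prescription, and they only cover the technical points; the Terms and Conditions and GDPR checks still have to happen outside your code.

```python
# A rough sketch of the robots.txt check, rate limiting, and self-identification
# points above, using the Python standard library plus `requests`.
# The site, paths, contact e-mail, and delay are hypothetical placeholders.
import time
from urllib.robotparser import RobotFileParser

import requests

BASE_URL = "https://example.com"  # hypothetical website
USER_AGENT = "my-research-bot/0.1 (contact: you@example.com)"  # identify yourself

# Check robots.txt before scraping anything.
robots = RobotFileParser()
robots.set_url(f"{BASE_URL}/robots.txt")
robots.read()

pages = ["/listings?page=1", "/listings?page=2"]  # hypothetical paths

for path in pages:
    if not robots.can_fetch(USER_AGENT, BASE_URL + path):
        print(f"robots.txt disallows {path}; skipping.")
        continue

    response = requests.get(
        BASE_URL + path,
        headers={"User-Agent": USER_AGENT},  # make yourself visible
        timeout=10,
    )
    # ... parse response.text here ...

    time.sleep(2)  # throttle requests so you don't overload the server
```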

Here are a couple of articles that outline how to legally scrape the web, some caveats to keep in mind, web scraping best practices, and how to set up your first web scraping project.

Final thoughts.

Web scraping is a common tool used in data science that will only become more popular as data becomes available in even greater quantities. Businesses and individuals looking to leverage data to complete any task or to gain any insight will be looking for data scientists who can efficiently, and more importantly, legally scrape the web.

Therefore, it only makes sense that data scientists be aware of the legal pitfalls surrounding web scraping. With exabytes of data being produced each day, the laws surrounding web scraping are likely to change in the future. However, by being aware of current and future laws and trends, you can rest assured that your web scraping projects will be insightful for all the right reasons for years to come.
