Web Scraping. So Easy, It Could Be Illegal.

Published in Geek Culture · 10 min read · Aug 19, 2021

Scraping — not just for cocktails. Photo by Kike Salazar N on Unsplash

Web scraping — extracting data from websites — is popular. Articles on how to scrape efficiently are numerous. And there is certainly an appeal since scraping gives access to a lot of interesting data. Data people can analyze, process, and potentially sell. With the increased popularity of Data Science, the need for raw material (data) and information on how to get it has similarly grown.

Discussing the legality of scraping, however, is less common. And when it happens, multiple aspects are sometimes mixed: laws, court rulings, technical barriers, scraping frameworks, etc. Presumably, the excitement to work with data is too great to consider all consequences. And the impact of a single person extracting data for a personal project, e.g. when learning new technologies, is too small to feel relevant.

This article aims to untangle the thread and sort the potential issues, giving you a step-by-step checklist to determine whether your scraping project is in a legal gray area.

First, we look at whether your technological approach is acceptable. Once we have established that, we turn to how you plan to use the scraped data. And last, but certainly not least, we look at concerns regarding personal data.

Disclaimer: I am not a lawyer. I took part in and chaperoned several discussions between legal, technical, data science, data engineering, and consultancy teams, though. The legal situation absolutely varies between countries. Bring in a legal consultant with technical and local knowledge for your specific scenario. The following points apply to Germany, and they give you an initial starting point. If you consider them to be harsh: err on the side of caution. It should keep you safe. And, last but not least, the legal situation never settles. Governments pass new laws. Courts pass rulings. What is accepted today may no longer be acceptable tomorrow.

Fake O’Reilly book cover with cat — If it was real, it would sell (created with O RLY Parody Cover Generator https://dev.to/rly)

Also, note that this is a take on legality established by laws and courts—not on how it should be, whether laws need to change, or whether you support the EFF. There is a case to be made for free data. There is also a strong case to be made to protect your personal data. And there is a case to be made to protect your business and investment. Let’s keep it focused and as concise as possible.

Scraping Fundamentals

First the basics. “Scraping”, web scraping, is extracting data from a website. Regardless of the tool. You can cURL the source code, you can use Selenium, PhantomJS, Scrapy… there are many tools that make scraping easier. There are also many ways to increase your throughput: proxies that hide who is accessing a site, and frameworks to orchestrate multiple concurrent queries.

If you are a developer or data scientist looking for raw material (data) to hone your skills, and you tap into one of the many prepared datasets, then that is not web scraping.

Scraping also deals with mostly unstructured data that you need to extract. To access data in a scalable, stable way, APIs are always the better option, as they offer structured responses with a (relatively) simple structure. In the source code of a website, by contrast, you first have to find the relevant data. Depending on how well built this source code is, your scraping mileage may vary. Some elements might be easier to access than others, depending on the amount of work and diligence the developer and maintainer have poured into the website in question.

Below is an example of how web scraping works. A product page on, e.g., Amazon shows the total average rating and number of reviews. In the source code on the right, I highlighted the elements with this data. To scrape it, find an anchor point that describes the position inside this HTML document. That could be an element ID, a CSS class, or the n-th child of another element. And this is the crux of web scraping. If the developers decide to restructure the code or rename CSS classes, your scraper breaks and you have to change it. The maintenance effort for web scrapers is usually high and ongoing.
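To make the anchor idea concrete, here is a minimal sketch using only Python's standard library. The HTML snippet, the class name rating-value, and the ID review-count are made up for illustration; they are not the markup of any real site. The last lines show the crux described above: once a class is renamed, the anchor no longer matches.

```python
from html.parser import HTMLParser

# Hypothetical markup for illustration; real product pages use different
# (and frequently changing) IDs and class names.
SAMPLE_HTML = """
<div id="reviews-summary">
  <span class="rating-value">4.6 out of 5</span>
  <span id="review-count">1,284 ratings</span>
</div>
"""

# Anchor points: (attribute, value) -> name of the field we want.
ANCHORS = {
    ("class", "rating-value"): "rating",
    ("id", "review-count"): "count",
}


class AnchorScraper(HTMLParser):
    """Collect the text found inside elements matching an anchor."""

    def __init__(self, anchors):
        super().__init__()
        self.anchors = anchors
        self.current = None  # field currently being captured, if any
        self.found = {}

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        for (attribute, value), field in self.anchors.items():
            # A class attribute may hold several space-separated names.
            if value in (attributes.get(attribute) or "").split():
                self.current = field

    def handle_data(self, data):
        if self.current and data.strip():
            self.found[self.current] = data.strip()
            self.current = None

    def handle_endtag(self, tag):
        self.current = None


def scrape(html):
    scraper = AnchorScraper(ANCHORS)
    scraper.feed(html)
    return scraper.found


found = scrape(SAMPLE_HTML)
rating = float(found["rating"].split()[0])               # "4.6 out of 5" -> 4.6
count = int(found["count"].split()[0].replace(",", ""))  # "1,284 ratings" -> 1284

# If the site renames a class, the anchor silently stops matching.
after_redesign = scrape(SAMPLE_HTML.replace("rating-value", "stars"))
print(rating, count, "rating" in after_redesign)
```

Dedicated libraries offer more convenient selectors, but the dependency on the page's structure stays the same.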

But, maybe there is no API. Scraping may be your only option to access this data source that you really want.

Example: scraping reviews from Amazon product page. (Image by author)

Technical Aspects

This highlights the first potential conflict: “data that you really want”. Someone else has created a dataset that would be beneficial for your purposes. And you want to extract something that belongs to someone else. Is that legal?

First: did you accept terms of use in order to access the data, either when you created an account or when you logged in? Do these terms prohibit scraping or data extraction? If so, your scraping is not legal.

Sometimes data is publicly available: no login or account is needed, and no terms need to be accepted. Then you have cleared the first hurdle.

Fence with blurred landscape in background — A fence is usually there for a reason. Photo by Simon Maage on Unsplash

Second: are there any mechanisms that prevent scraping? Captchas or IP blacklisting after too many queries? Or the good old robots.txt? Multi-factor authentication? If so, do you implement a clever countermeasure to circumvent them? In that case, your scraping is not legal. Sometimes the operator of a service does not want you to access their data automatically. If they build a fence, you need to abide by it. Tearing down the fence to get to the green grass is not legal.

Robots.txt is, arguably, not a powerful defense, and you can probably get away with ignoring it. It is a simple text file telling, e.g., search engines which pages are OK to access and which should be ignored. However, it states the hosting website's desire not to be accessed by scrapers and crawlers, either in full or in part. Ignoring it will not be the deciding factor for your scraper's lawfulness, and US courts have ruled that ignoring it does not constitute “hacking” a website. In combination with other points (see below), it might just tip the scale, though.
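Respecting robots.txt takes only a few lines. The sketch below uses Python's built-in urllib.robotparser; the file content, the example.com URLs, and the user agent name my-learning-scraper are assumptions for illustration. A real scraper would load the file from the target host instead of a string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the real file lives at /robots.txt
# on the host you want to scrape.
ROBOTS_TXT = """\
User-agent: *
Disallow: /account/
Crawl-delay: 5
"""

USER_AGENT = "my-learning-scraper"  # made-up name for this example

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(url):
    """Return True if the parsed rules allow our user agent to fetch url."""
    return parser.can_fetch(USER_AGENT, url)


allowed = may_fetch("https://example.com/products/42")
blocked = may_fetch("https://example.com/account/orders")
delay = parser.crawl_delay(USER_AGENT)  # seconds requested between queries

print(allowed, blocked, delay)
```

Checking may_fetch before every request, and waiting at least the requested delay, is a simple way to honor the operator's stated wishes.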

In the United States, the Computer Fraud and Abuse Act (CFAA) addresses this. Because one of its key targets is penalizing trespassing on or hacking into someone else’s computer system, courts have applied it to scraping. In 2012, Craigslist accused 3Taps and PadMapper of bypassing an IP block through the use of proxies and scraping classified ads on Craigslist without consent. Conversely, when no technical barriers are in place and no terms of use must be accepted to retrieve public data, the ruling in hiQ v. LinkedIn suggests that the CFAA does not apply to prevent scraping.

In Germany, a ruling by the Bundesgerichtshof in 2014 stipulates a similar requirement. A price comparison website scraped data from an airline website to offer booking services. The airline had no technical barriers in place to prevent scraping, nor did it require acceptance of its terms of use in order to see flight information.

If the data is publicly accessible, then all is good so far and you have cleared the second hurdle.

Scrape responsibly, take only what you need. Photo by Florida Guidebook on Unsplash

Third: do you scrape responsibly? Scraping can be a lengthy process: fetch the web source code, process it, and throw away 95% of it only to get to the 5% that is relevant. Not very efficient. To speed up this process, scraping usually leverages multiple simultaneous queries through various proxies. The result is more queries to the web server, more load on their systems, and a slower experience for other, regular users. To compensate, the website operator would have to increase server capacity, which incurs a cost to them. Or, if they do not, you might run an unintentional denial-of-service attack against them when you overload their servers. That definitely hurts their business.

If you do that, your scraping activities are illegal. Your scraping activities must not have a significant negative impact on your target's system performance. What does “significant” mean in measurable terms? There are no hard numbers for that, and you can’t inspect your impact on the target system. Larger corporations usually handle a lot of traffic already, while smaller sites may feel the impact of your activities. If you, for example, reduce the number of concurrent queries, cache already retrieved data, and do not stress their systems, you have cleared the third hurdle. Note that these are general suggestions, not legal requirements. Not caching data, for whatever reason, does not automatically make your scraping activities illegal, just as caching data does not automatically make them legal.
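The two suggestions, fewer queries and caching, can be combined in a small wrapper. This is a sketch under assumptions: PoliteFetcher is a hypothetical helper, the two-second default delay is an arbitrary value rather than a legal threshold, and fake_fetch stands in for whatever function actually downloads a page.

```python
import time


class PoliteFetcher:
    """Wrap a fetch function with a cache and a minimum delay between requests."""

    def __init__(self, fetch, min_delay_seconds=2.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.fetch = fetch
        self.min_delay = min_delay_seconds
        self.clock = clock
        self.sleep = sleep
        self.cache = {}
        self.last_request = None  # time of the last real request, if any

    def get(self, url):
        # Never ask the server twice for something we already have.
        if url in self.cache:
            return self.cache[url]
        # Sequential requests only, spaced at least min_delay apart.
        if self.last_request is not None:
            wait = self.min_delay - (self.clock() - self.last_request)
            if wait > 0:
                self.sleep(wait)
        self.last_request = self.clock()
        self.cache[url] = self.fetch(url)
        return self.cache[url]


# Usage with a stand-in fetch function that records what reaches the "server".
requests_sent = []


def fake_fetch(url):
    requests_sent.append(url)
    return f"<html>{url}</html>"


fetcher = PoliteFetcher(fake_fetch, min_delay_seconds=0.05)
fetcher.get("https://example.com/a")
fetcher.get("https://example.com/a")  # served from the cache, no request
fetcher.get("https://example.com/b")  # waits before hitting the server
print(requests_sent)
```

Three calls result in only two requests, and the second one is delayed. A longer delay costs you time but spares the target's servers.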

Copyright Aspects

Fourth: how much do you scrape from their system? We are now leaving the technical realm and venturing into copyright protection. The owner of the website is the owner of the database that contains the data you want. That grants them rights regarding the copying and distribution of data stored in their system.

If that data is user-generated content, like reviews, blog posts, or images, then the creating user still keeps all rights to their content. After uploading a picture to a social network, the user retains other rights (e.g. to publish or sell the picture elsewhere), while the social network may store, back up, delete, and display a digital copy of the picture.

Why is this relevant? To show that the web service you are scraping has rights regarding the digital content you are trying to extract. Even if they did not produce it originally. Again, debating whether this is good or bad is outside the scope of this article.

And based on current rulings, if you extract more than an “immaterial” amount of content, your scraping is not legal. Note that even though the German Bundesgerichtshof established 10% as “immaterial” in 2011, other courts in Germany have ruled that 10% or less can, in specific cases, already be substantial.

If you only extract a small part of the entire database, e.g. a single country, a subset of products, or a subset of product information, you have cleared the fourth hurdle, even though you will probably never know the exact percentage of data you actually extract.

Fifth: what do you intend to do with the data you retrieved? Are you planning on offering a comparable service? Or worse, offering the same service based on scraped data? Are you selling the extracted data without modifications? Then you are violating copyright (and likely competition) laws, and your scraping is not legal.

The fourth and fifth hurdles relate to each other. If you only scrape a small subset of data from a site, it will be hard to argue that you are trying to set up a competing business. If you extract all the data, then the question will be: why?

If you process, analyze, aggregate, or combine the data with data from other sources to offer additional value, then you have cleared the fifth hurdle.

Respect privacy. Photo by Matthew Henry on Unsplash

Data Privacy Aspects

Sixth: what type of data are you extracting? Leaving copyright law, we are now entering privacy law. And while the GDPR of the European Union may not apply in all countries, it can still offer guidance even if you do not fall under it. In fact, countries like China have drafted privacy laws resembling the GDPR, using it as a guideline: the Personal Information Protection Law. It stipulates that personal data from minors is worthy of protection and that you need consent from parents to handle their kids' data. If you scrape a website, getting consent from every user is impractical.

If the extracted data contains personal data, your scraping is not legal. What exactly makes up personal data is a separate discussion. Usually, any data point sufficient to identify a specific person (either by itself or in combination with other data points) is personal data. Email addresses are surely personal data. In combination, name, zip code, and birth year can be, too. Chances are, you do not need this data. Do not extract it. Or, if you really need it, apply a hashing algorithm.
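If you do need a stable identifier, for example to count reviews per reviewer, one possible way to apply hashing is sketched below. It uses a keyed hash (HMAC with SHA-256) from the standard library rather than a plain hash, because plain hashes of email addresses can be reversed by simply hashing guessed addresses. The key, the record layout, and the pseudonymize helper are assumptions for illustration. Whether the result still counts as personal data in your jurisdiction is a question for your legal counsel.

```python
import hashlib
import hmac

# Hypothetical secret for this example; generate a random one and keep it
# out of source control and out of the stored dataset.
SECRET_KEY = b"replace-with-a-random-secret"


def pseudonymize(value, key=SECRET_KEY):
    """Return a keyed SHA-256 digest of a normalized identifier."""
    normalized = value.strip().lower()
    return hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


# A scraped record (made up) that contains an email address.
record = {"email": "Jane.Doe@example.com", "rating": 5}

# Store only the digest; the raw address never reaches the dataset.
safe_record = {
    "reviewer": pseudonymize(record["email"]),
    "rating": record["rating"],
}
print(safe_record)
```

The same address always maps to the same digest, so you can still group by reviewer without keeping the address itself.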

Summing it all up. Photo by JESHOOTS.COM on Unsplash

Summary

In summary: scraping may be the only way to get interesting data. And sometimes, nobody might realistically even know if you violated laws. If you scrape data on a larger scale, or for profit, checking the legality of your actions against local law is advisable.

To start, consider the condensed checklist below for your specific case. Do you feel confident that you can check all boxes? No matter how large, small, individual, or established your project is, the generalities are the same.

Checklist for data scraping, summarizing the points from the article. — Do not wander off into dark or gray areas. Check your compass. (Image by author)

As each case is unique, use your best judgment and consider the hurdles above with a rational mind. Turn the tables mentally — consider how you would feel if you were the target of your scraping.

Remember that scraping can touch a variety of legal fields, from data security to competition and potentially personal information. Each comes with its own set of challenges. Only if you clear them all and get legal counsel can you be truly sure.