Is web scraping legal in 2020?

Jurgenstojku
Aug 20, 2020

Although web scraping has become an increasingly common practice, the term still comes with some baggage. Maybe it’s due to the automated nature of web scraping or the general lack of available information… or maybe it’s because “scraping” just sounds so unpleasant. Whatever the reason, web scraping is often perceived as a shady, ‘black-hat’ practice, a perception that’s reinforced when people extract data irresponsibly. In reality, as long as you’re mindful and follow a few intuitive guidelines, web scraping is an ethical and perfectly legal way to collect useful information from the web.
Web scraping in the USA: Computer Fraud and Abuse Act (CFAA)
Passed in 1986, the CFAA prohibits intentionally accessing a computer without authorization or in excess of authorization. The trouble is, the act in its original form failed to define what “without authorization” actually means. Luckily, subsequent amendments, along with a few recent cases, have outlined the legal boundaries that apply to web scraping and data extraction.
In 2017, Craigslist sued Instamotor for scraping user data and using it to send ads to Craigslist users. Craigslist won the suit and was awarded a $31 million judgment. It’s worth noting that the collection of the data was not at the heart of the judgment; rather, Craigslist won because of what Instamotor did with the scraped data: namely, thousands of spammy and borderline fraudulent emails sent en masse to Craigslist users.
In September 2019, the US Court of Appeals for the Ninth Circuit ruled against LinkedIn’s attempt to stop hiQ Labs, an analytics company, from scraping publicly available profile data from its site. This decision set a further precedent for the legality of extracting public web data.
Web scraping law in Europe: GDPR
Under the EU’s General Data Protection Regulation (GDPR), restrictions apply to web scraping only when a person or company extracts personal data of people within the European Economic Area. Scraping data that cannot be tied to an identifiable individual, such as product prices, falls outside its scope.
Web scraping legislation varies by location and industry, but following the tips below will help you in most situations.
Best practices

  1. Learn about good web citizenship
    Like any other online activity, responsible web scraping starts with a commitment to good web citizenship. Familiarize yourself with the CFAA, GDPR, CAN-SPAM, the Robots Exclusion Protocol (REP), and other legislation and web standards to make sure you’re in the right.
  2. Don’t violate copyright
    Whether or not it’s collected via web scraping, copyrighted information is off-limits without the written authorization of the copyright holder. Copyright protection comes into play once data is extracted. You could use a web scraping tool to search YouTube for video titles, for example, but you could not re-post the YouTube videos on your own site, since the content is copyrighted. Copyright infringement is a quick and easy way to send your organization into a legal minefield, so tread carefully.
  3. Pay attention to robots.txt
    Most websites use a file called robots.txt to set limits on the content that can be accessed by web crawlers. This is used mainly to avoid overloading the site with automated traffic. Although robots.txt is targeted at web crawlers rather than web scrapers, it’s a good practice to review a site’s robots.txt for any pertinent restrictions.
  4. Limit your crawl rate and request frequency
    Web scraping allows you to automate research tasks that would take a human user hours or even days to complete. This makes it great for gathering data quickly, but it can also cause problems by flooding sites with requests. Limiting the crawl rate and request frequency of your web scraping projects lets you gather data without degrading the sites you are collecting it from.
  5. Use an API
    Many sites and products provide data access via an Application Programming Interface (API). Depending on the type of data you need, APIs can be a good alternative to web scraping.
  6. Don’t violate Terms of Service
    You know all that tiny text you scroll past when signing up for something? It’s called the Terms of Service, and it actually matters. Review a site’s ToS before beginning a web scraping project to ensure you’re not doing anything prohibited.
  7. Stick with public information
    When an organization puts information on their website, they’re making it publicly available. In most cases, public information also refers to information available to logged-in users of a site or service. If you have authorized access to non-sensitive data, you can likely scrape it.
If all of this seems a bit overwhelming, don’t worry; there are a number of web scraping tools that take the guesswork out of data extraction. Below are a few providers that can help you stay legally compliant while collecting actionable data.
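
Before turning to those tools, guidelines 3 and 4 above (respecting robots.txt and limiting request frequency) can be sketched in a few lines of Python. This is a minimal illustration rather than a complete scraper: it uses the standard library's `urllib.robotparser`, while the `parse_robots`, `Throttle`, and `crawl` helpers, the sample robots.txt content, the `example-research-bot` name, and the 1-second fallback delay are all assumptions made up for this example. The `fetch` argument is a placeholder for whatever download function you already use.

```python
import time
from urllib.robotparser import RobotFileParser


def parse_robots(robots_txt):
    """Build a parser from robots.txt text that was already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_delay_seconds, clock=time.monotonic, sleep=time.sleep):
        self.min_delay_seconds = min_delay_seconds
        self._clock = clock
        self._sleep = sleep
        self._last_request = None

    def wait(self):
        """Sleep just long enough to respect the delay; return seconds slept."""
        slept = 0.0
        if self._last_request is not None:
            elapsed = self._clock() - self._last_request
            remaining = self.min_delay_seconds - elapsed
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        self._last_request = self._clock()
        return slept


def crawl(urls, fetch, parser, throttle, user_agent):
    """Fetch only the URLs robots.txt allows, pausing between requests."""
    results = {}
    for url in urls:
        if not parser.can_fetch(user_agent, url):
            continue  # skip anything the site asks crawlers to avoid
        throttle.wait()
        results[url] = fetch(url)
    return results


# Hypothetical robots.txt content and bot name, for illustration only.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""
USER_AGENT = "example-research-bot"

robots = parse_robots(EXAMPLE_ROBOTS_TXT)
# Use the site's declared Crawl-delay, or fall back to 1 second
# (an arbitrary choice) when none is declared.
delay = robots.crawl_delay(USER_AGENT) or 1.0
candidates = [
    "https://example.com/listings",
    "https://example.com/private/data",
]
allowed = [url for url in candidates if robots.can_fetch(USER_AGENT, url)]
print(delay, allowed)
```

In a real project you would download the site's own robots.txt first and pass a real download function as `fetch` to `crawl`, together with a `Throttle(delay)`. Keep in mind that robots.txt and a polite crawl rate cover only guidelines 3 and 4; they do not replace reading the Terms of Service or checking whether the data is personal or copyrighted.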

Mozenda
Mozenda was the first company to bring an interface-based web scraping tool to the market, and they remain a big name in the web scraping industry. They offer 24/7 live service to help you build your web scraping project.
Dexi
Based in the EU, Dexi offers a web scraping and business intelligence platform for enterprise businesses. Dexi specializes in data integrations and is a great choice for complex and large-scale projects.
Octoparse
Relatively new to the web scraping market, Octoparse is a simple cloud-based scraping tool popular in Asia.
Scrapinghub
Scrapinghub is a hassle-free cloud-based data extraction tool that helps companies fetch valuable data and store it in a robust database.
Scraping-Bot.io
Scraping-Bot.io is an efficient web scraping tool with a fully-featured API. You can test it out for free.

Jurgenstojku

Marketing Specialist, Visual Designer, Thinker, Explorer, Curious, Cat Person and Data Lover.