Tom Bescherer
ACLU Tech & Analytics
3 min read · Nov 18, 2021


The Golden Rule of Data Access: Scrape Others as You Would Want to be Scraped

Photo by Ilya Pavlov on Unsplash

The ACLU Analytics team regularly needs to scrape government websites for data to support our work. For example, in 2020 we scraped certain prison websites to gather data on inmate populations so we could monitor whether those prisons were complying with court orders to reduce their populations and slow the spread of Covid-19.

[Quick aside: Our team is hiring a data engineer! https://www.aclu.org/careers/apply/?job=5706046002&type=fulltime]

We believe this sort of scraping constitutes legal access to public data and that we are well within our rights to collect it. However, as data professionals ourselves, we are quite sympathetic to the needs of the administrators on the other side of the web servers we are scraping. That administrator might, in principle, support free access to public data; but someone simultaneously submitting 100,000 requests to the 15-year-old ASP.NET server they have tucked away in a municipal building basement might not be quite what they had in mind.

And so we at the ACLU have had a number of internal discussions about “ethical scraping” and spent some time familiarizing ourselves with the many great blog posts that already exist on this topic.

After taking in a lot of the online discussion, we had an abstract sense of what constituted ethical scraping in general, but we still found ourselves wondering which parameters, specifically, we should be plugging into our scraping code. Should the scraper have a concurrency of 1, a concurrency of 16, or a concurrency of 4112?

The standard we settled on for our projects is that ethical scraping visits a website in a manner that matches what a determined human being could achieve through a web browser. This is still a fuzzy standard, but it is intuitive enough to let us make general technical decisions. A human could not achieve a concurrency of 4112. Most humans would not pull an all-nighter refreshing the same error page over and over unless sneakers or concert tickets were involved.

So with that in mind, here are a few recommendations for how to tune your scraper:

Concurrency: Set your concurrency near a level a human could achieve: 1 or 2.
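
As a minimal sketch of what that looks like in Python (the URLs, delay, and use of the requests library here are illustrative assumptions, not a prescription):

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

# Hypothetical pages to scrape; substitute the site you actually need.
PAGE_URLS = [
    "https://example.gov/roster?page=1",
    "https://example.gov/roster?page=2",
]

def fetch(url):
    # Pause briefly so each worker paces itself like a patient human reader.
    time.sleep(1)
    return requests.get(url, timeout=30)

# max_workers=2 keeps concurrency at a level someone with two browser tabs could match.
with ThreadPoolExecutor(max_workers=2) as pool:
    responses = list(pool.map(fetch, PAGE_URLS))
```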

Robots.txt: Always honor a robots.txt by default. If you choose to disobey a robots.txt, it should be because you made an informed decision to commit an act of civil disobedience, not because you simply didn’t check.
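
Checking takes only a few lines, since Python’s standard library ships a robots.txt parser. A minimal sketch (the site URL and user agent string are hypothetical):

```python
from urllib import robotparser

USER_AGENT = "aclu-analytics-scraper"  # hypothetical user agent string
TARGET = "https://example.gov/roster?page=1"  # hypothetical page we want

rp = robotparser.RobotFileParser()
rp.set_url("https://example.gov/robots.txt")
rp.read()  # fetch and parse the site's robots.txt

if rp.can_fetch(USER_AGENT, TARGET):
    print("robots.txt allows this path; proceed with the request.")
else:
    print("robots.txt disallows this path; skip it unless you have deliberately decided otherwise.")
```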

Failure threshold: Implement an error threshold for your scraper and halt your run if it is exceeded. Neither you nor the web admin benefits if you send 100,000 failed requests to their server.
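
One way to do that is to count consecutive failures and stop once they pass a limit. Here is a minimal sketch; the threshold value and the use of requests are assumptions you should tune for your own project:

```python
import requests

MAX_CONSECUTIVE_FAILURES = 10  # assumed threshold; pick what fits your run size

def scrape_all(urls):
    failures = 0
    results = []
    for url in urls:
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
        except requests.RequestException:
            failures += 1
            # If the site keeps erroring, stop rather than hammer it with doomed requests.
            if failures >= MAX_CONSECUTIVE_FAILURES:
                raise RuntimeError("Too many consecutive failures; halting the run.")
            continue
        failures = 0  # reset on success so only consecutive errors count
        results.append(response.text)
    return results
```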

Hopefully someday a standard will be implemented for the web that facilitates access to public data that is easy on both requestor and requestee. But until then, happy scraping and remember the golden rule: Scrape others as you would want to be scraped!


Director of Data Infrastructure at ACLU. Opinions are my own.