“There is always an element of judgement”

Published in DataKind UK · 7 min read · Oct 26, 2021

DataKind UK’s Conversation on Web Scraping

by Laura Carter, DataKind UK Ethics Committee member

Image by Pete Linforth from Pixabay

At the DataKind UK Ethics Committee, we are asked a lot of questions about data ethics by volunteers, charity partners, and even our friends and family. The practical, ethical, and legal aspects of scraping the web — extracting data from websites, often using automated software — come up time and time again. So to help us understand more, we hosted ‘A Conversation on Web Scraping’: a panel discussion moderated by Ethics Committee Chapter Lead Stef Garasto.
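To make concrete what “extracting data from websites” typically involves, here is a minimal sketch using only Python’s standard library. The HTML snippet and tag choices are hypothetical stand-ins for a downloaded page; real scrapers would fetch the page over HTTP first.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href attribute of every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

# Hypothetical HTML standing in for a fetched page.
page = '<html><body><a href="/about">About</a> <a href="/policy">Policy</a></body></html>'
parser = LinkExtractor()
parser.feed(page)
print(parser.links)  # → ['/about', '/policy']
```

Real projects usually reach for dedicated libraries, but the principle is the same: parse the page’s structure and pull out the fields of interest.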

We were thrilled to be joined by Eric Barrett, Data Desk Manager at the Organized Crime and Corruption Reporting Project (OCCRP); Tom Higgens, Head of Data and Innovation at STOP THE TRAFFIK (STT); and Janis Wong, PhD Researcher at the School of Computer Science, University of St Andrews.

Applications for scraping

More and more crime is cross-border, Eric Barrett at the OCCRP began, and this includes crime-as-a-service. OCCRP is a network supporting investigative reporting in this area, and for them, web scraping is a means to an end: it enables them to answer questions and make connections that would otherwise be very difficult to see. And they do a lot of it: “Sometimes I feel like I’m a professional scraper!” Eric told us.

Janis Wong uses web scraping in her research on company privacy policies online: in particular, she looks at what has (or has not) changed since the introduction of the GDPR. These policies can be long and convoluted: web scraping allows her to examine whether companies are doing what they say they are doing and what the law requires when it comes to privacy. It also helps her detect whether a privacy policy may have been generated automatically, or written by a third party.

STOP THE TRAFFIK identifies human trafficking not only as a crime, but as a profitable one. Tom Higgens told the audience that tracking money can make this hidden crime visible, and scraping publicly available information is one way to do this tracking. Recently, web scraping has enabled STT to gather information more efficiently, and helped them to collect information in French and Spanish as well as English, to create a fuller picture of what is actually happening across networks they monitor. As Eric commented, “It takes a network to find a network.”

The complex legality of web scraping

One of the most common questions the Ethics Committee is asked is about the legality of scraping data. All panellists agreed that this is a complicated area, not least because of different laws in different countries. Both Tom and Eric emphasised the importance of getting legal advice as early on in the process as possible: as charities, STT and OCCRP are able to get pro bono legal support. OCCRP as a network can help support journalists who are worried about legal consequences. “We need to sit down at the start of the project and think through what are the different legal frameworks, what are the different rights frameworks — and take that to a steering or ethics committee if we need to,” said Tom.

Janis noted that there were several areas of law to consider: first, whether the scraping breaches intellectual property rights, such as copyright and licensing agreements. Second, whether scraping is in breach of terms and conditions guiding the use of publicly available websites. And finally, data protection law, particularly if personal data is being scraped. She told the audience that the law varies across different countries (even within the EU): in some countries, for example France, data that is publicly available but which pertains to an individual may still be considered personal data. She added, “There’s never a clearly defined line.”
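One widely used signal of a site owner’s wishes — not legally binding, but relevant to the terms-and-conditions question Janis raised — is the site’s robots.txt file. Python’s standard library can check it before any scraping begins; the rules and URLs below are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; in practice this is fetched
# from the site (e.g. https://example.com/robots.txt) first.
rules = """
User-agent: *
Disallow: /private/
Allow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("MyScraper", "https://example.com/public-page"))   # → True
print(rp.can_fetch("MyScraper", "https://example.com/private/data"))  # → False
```

Honouring robots.txt doesn’t settle the legal questions above, but it is a cheap, standard courtesy that demonstrates good faith towards the site owner.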

Janis continued that some scraping findings can be very powerful — for example, mapping environmental or community destruction — but go against a site’s terms of service or even against the law. Eric gave the example of situations in which legal and ethical considerations do not align, especially in countries that do not have strong rights-protecting frameworks. On the OCCRP site, information that comes from scraped data has to be traced back to its source, fact-checked, and challenged from several perspectives before it can be used by journalists. However, they are seeing an increase in geofencing and other methods that make it harder to do cross-border scraping.

None of the panellists could point to specific examples of non-profit organisations that had been taken to court and/or fined when scraping for social good reasons, but Tom noted that some have been fined for gathering personally identifiable information, which can happen when scraping. Janis mentioned a case where scraping of Ordnance Survey data by a private company was found not to be allowed. Eric told the audience that OCCRP makes some of the scraped data available for others to use. In response to takedown requests and other potential legal challenges, they have made the decision to make certain datasets no longer publicly available (but still available to journalists, as OCCRP is covered by the journalism exemption under GDPR).

The panellists also talked about scraping social media sites, and the difference between collecting information about public figures and the public in general. Janis pointed out that there are already legal tests for whether collecting information is in the public interest, while Eric noted that in general, it is not a good idea to allow those in positions of power to dictate the terms by which they are held to account.

All panellists emphasised that when scraping, it is important to think about where the scraped data is stored, and how it will be shared. Eric commented that even if the data that OCCRP is scraping is publicly available, they treat personal, identifying information carefully. Tom described STT’s practice of usually storing anonymised information: they don’t need identifying details, so it’s better for the organisation as well as for individual data rights.
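Tom’s point about storing anonymised rather than identifying information can be sketched in code. The example below pseudonymises an identifying field with a salted hash before storage — a simplification, since true anonymisation under GDPR requires more than hashing alone, and the field names, record, and salt here are all hypothetical.

```python
import hashlib

def pseudonymise(record, fields, salt):
    """Replace identifying fields with salted SHA-256 digests
    so raw identifiers are never written to storage."""
    cleaned = dict(record)
    for field in fields:
        if field in cleaned:
            digest = hashlib.sha256((salt + str(cleaned[field])).encode()).hexdigest()
            cleaned[field] = digest[:16]  # truncated digest for readability
    return cleaned

# Hypothetical scraped record: the phone number is identifying;
# the advert text is what the analysis actually needs.
record = {"phone": "+44 7700 900123", "advert_text": "Job offer abroad..."}
safe = pseudonymise(record, fields=["phone"], salt="per-project-secret")
print(safe["advert_text"] == record["advert_text"])  # → True (analysis data kept)
print(safe["phone"] == record["phone"])              # → False (identifier replaced)
```

Salted hashing still allows records to be linked across a dataset (the same number maps to the same digest) without the organisation holding the raw identifier — a middle ground between keeping everything and keeping nothing.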

Scraping for good

Tom talked about using scraping as a force for good, to help STT make better decisions and understand what is happening, but recognised that web scraping is just part of this work, and that there’s always a risk that it will only surface things that STT already knows about, rather than challenging perceptions and understanding. “There is always a risk that we are just looking for our keys under the lamppost,” he added. Janis noted that web scraping helps researchers: you don’t necessarily have to be a big tech company to analyse scraped data at scale.

For Eric, the point of investigative journalism is to uncover things that people would rather keep hidden because they are harmful or even criminal. Web scraping can help: there are areas that OCCRP knows about, but sometimes it takes an obscure dataset to link other pieces of information. He pointed to a recent example: the OpenLux investigation into Luxembourg’s recently opened company register. Following research into its contents, Luxembourg pledged to improve its due diligence to prevent money-launderers from exploiting its systems.

Scraping risks

Eric explained that OCCRP asks questions where they believe the answers would serve the public good: this guides their ethical approach. As well as considering how to safely and legally store personal information, they consider how the act of scraping — which is noticeable to site owners — might put journalists at risk, and what measures they need to take to protect journalists. OCCRP also considers the resilience of servers: taking them down by using scraping techniques that ask for too much data in a short period of time might cross a legal line, so they do their best to avoid this. Tom noted that even though STT’s scraping work focuses on big news outlets, there is always an element of judgement to be used when deciding if and what to scrape. STT is a human rights organisation, he said, so it can’t ignore other human rights issues when challenging trafficking.
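Eric’s point about server resilience is usually addressed in code with rate limiting: spacing requests out so a scraper cannot overwhelm a site. A minimal sketch follows — the interval value is an arbitrary assumption, and real projects should honour any crawl-delay or API limits the site states.

```python
import time

class PoliteFetcher:
    """Enforces a minimum interval between successive requests
    so the scraper never hammers a server."""
    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self._last = 0.0

    def fetch(self, url):
        wait = self.min_interval - (time.monotonic() - self._last)
        if wait > 0:
            time.sleep(wait)  # back off before the next request
        self._last = time.monotonic()
        # A real implementation would perform the HTTP request here;
        # this sketch just records when the request would have fired.
        return self._last

fetcher = PoliteFetcher(min_interval=0.1)
t1 = fetcher.fetch("https://example.com/a")
t2 = fetcher.fetch("https://example.com/b")
# t2 - t1 is at least ~0.1 seconds: requests are spaced out.
```

More sophisticated schemes (exponential backoff on errors, per-domain queues) build on the same idea: the scraper, not the server, absorbs the cost of politeness.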

Both Eric and Tom talked about the potential wider impact of scraping for their work. Eric noted that the owner of a website is entitled to block access, even if it’s publicly facing. Tom mentioned the risk of being cut off from social media, which STT uses in their awareness-raising work. Janis cited two high-profile cases of scraping related to Facebook. One was the use of scraping by Cambridge Analytica, which may have been allowed at the time. The other was NYU’s team of disinformation researchers, who were planning to scrape in order to find out what Facebook was doing behind the scenes: this was deemed to breach the company’s terms and conditions.

Both Janis and Tom also warned about gathering potentially biased information from scraping news sites. Tom commented that some things are reported and others not, and that news reports can be wrong: scraping may not pick up retractions or clarifications that emerge later. Janis emphasised that while it might be legally acceptable to scrape news sites (as long as you’re not monetising the information), it’s important to be careful about the extent to which this data is taken at face value.

Where to get help and advice

All three panellists agreed that different applications of scraping needed to be considered on a case-by-case basis. For UK-based organisations, Janis recommended the most up-to-date version of the ONS’s Web Scraping Policy as a resource.

They also all recognised the importance of web scraping to their work, but emphasised that it’s a murky area, both ethically and legally. They encouraged charities and other organisations who are thinking of starting web scraping projects to seek legal support as early as possible and make sure that they are fully aware of legal, ethical, and contextual challenges, especially when that scraping may happen in different jurisdictions. Scraping can provide real benefits to charities, especially by providing information that isn’t available elsewhere, but it’s important to think through all the ramifications when embarking on a project.
