Law & Ethics of Scraping: What HiQ v LinkedIn Could Mean for Researchers Violating TOS

could you use this pen to sign away your soul to a LinkedIn TOS?! (via Flickr: LinkedIn pen | Lisa Scarborough | CC-BY 2.0)

“Is it legal for me to violate Terms of Service in order to collect data for a research project?” As the token legal expert among computer scientists, I get this question a lot. It’s also a huge point of contention right now, particularly in the social computing research community. Because the problem is, some people abide by TOS and others don’t. Some people (like me) showed up to their research labs as grad students and said “um guys, I hate to be the buzzkill lawyer in the group, but Yik Yak’s TOS say we can’t scrape,” while others wrote awesome papers using Yik Yak data. Some reviewers will ding you for being unethical for breaking TOS, while others don’t know or don’t care.

My PhD advisor Amy Bruckman wrote a blog post last year about this unfairness, laying out the sticky ethical landscape. It got a lot of attention, partly because a lot of researchers don’t like the suggestion that they shouldn’t be doing the kind of work they’re doing. It’s also a very gray issue, especially because legal and ethical are two very different things. But right now, let’s talk about legal. Because thanks to a court decision yesterday, researchers might be in better positions than we previously thought.

The problem with violating TOS is that one could, technically, be violating the Computer Fraud and Abuse Act, which makes it a federal crime. The CFAA has not been frequently used in this context, but the case against Aaron Swartz made it clear that this kind of enforcement is not only possible but potentially devastating. Moreover, we have seen companies like Zillow threaten users with the CFAA for violating TOS.

To be clear, the CFAA criminalizes hacking, or “accessing a computer without authorization.” It was also enacted in 1986 — i.e., before public Internet data and scraping were even a glimmer in the eye of lawmakers. However, arguably, if a site’s TOS says “if you are on this site you cannot do this thing” and you do that thing, then you are accessing that site without authorization. Almost a decade ago, there was an attempted prosecution under the CFAA of the instigator in a cyberbullying case that resulted in a teen suicide; she had violated MySpace’s TOS by creating a fake account. And though we may have thought she should be guilty of something, this would have set terrible precedent. After all, think of how many times you’ve broken TOS. Come to think of it, do you even know what you click-to-agree-to for all the sites you visit? (I’ve conducted research about TOS and can tell you, probably not.)

The most recent court case to cover this issue concerns hiQ, a company built around a “talent management algorithm”; it scrapes public data from LinkedIn and sells employers reports about employees who may be job searching. LinkedIn sent hiQ a cease-and-desist to stop scraping, and threatened them with the CFAA. Though scraping is against LinkedIn’s TOS, that wasn’t actually relevant to the case; because of the C&D, LinkedIn had told hiQ even more directly not to scrape.

And since this case is specifically about scraping public data, which is what many researchers do, it’s particularly relevant for us. Yesterday, a district court judge in California ruled against LinkedIn.

One argument the judge made is pure common sense. The law just wasn’t intended to do this:

The CFAA must be interpreted in its historical context, mindful of Congress’ purpose. The CFAA was not intended to police traffic to publicly available websites on the Internet — the Internet did not exist in 1984. The CFAA was intended instead to deal with “hacking” or “trespass” onto private, often password-protected mainframe computers.

The judge goes on to point out that publishing a website implicitly gives the public permission to access it. Revoking that access on a case-by-case basis could have serious consequences that Congress certainly didn’t intend.

The CFAA as interpreted by LinkedIn would not leave any room for the consideration of either a website owner’s reasons for denying authorization or an individual’s possible justification for ignoring such a denial. Website owners could, for example, block access by individuals or groups on the basis of race or gender discrimination. Political campaigns could block selected news media, or supporters of rival candidates, from accessing their websites. Companies could prevent competitors or consumer groups from visiting their websites to learn about their products or analyze pricing.

If an argument about discrimination sounds familiar, you might have heard about the ACLU suing the government on behalf of researchers because the CFAA “unconstitutionally criminalizes research aimed at uncovering whether online algorithms result in racial, gender, or other illegal discrimination in areas such as employment and real estate.” In other words, some kinds of research into algorithmic discrimination require violating TOS — for example, spoofing multiple accounts in order to see if algorithms provide different results based on race. It is unclear how relevant the LinkedIn case might be, since it is specifically about public data, but it can only add to the precedent against this use of the CFAA.

Much of the decision is based on an argument put forth in an essay by law professor Orin Kerr, arguing for shared norms around access control: the way that websites distinguish between public and private access is password control. This point is what distinguishes HiQ v LinkedIn from an earlier CFAA case involving Facebook — because that case involved revoking access to data that was not public. Instead, the conclusion here, quoting the wording of the CFAA, is that:

A user does not “access” a computer “without authorization” by using bots, even in the face of technical countermeasures, when the data it accesses is otherwise open to the public.

It is also important to note that this does not preclude a site from protecting itself against malicious attacks. If LinkedIn argued that hiQ had overburdened their servers, they might still have had recourse. According to the decision:

This is not to say that a website like LinkedIn cannot employ, e.g., anti-bot measures to prevent, e.g., harmful intrusions or attacks on its server. Finding the CFAA inapplicable to hiQ’s actions does not remove all arrows from LinkedIn’s legal quiver against malicious attacks.
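Since sites remain free to publish their wishes about bots, researchers who scrape have a practical interest in honoring the signals a site does put out. One such signal is a robots.txt file. As a purely illustrative sketch (the file contents and URLs below are made up, and robots.txt is a courtesy convention, not the legal test the court applied), Python’s standard library can check whether a given path is open to crawlers:

```python
# Hypothetical illustration: checking a site's robots.txt before scraping.
# This is a courtesy norm, not the legal standard from HiQ v LinkedIn.
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content; a real site publishes its own at /robots.txt.
robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Public profile pages are allowed; the /private/ tree is not.
print(parser.can_fetch("research-bot", "https://example.com/profiles/jane"))  # True
print(parser.can_fetch("research-bot", "https://example.com/private/data"))   # False
```

In practice you would fetch the real file with `parser.set_url(...)` and `parser.read()`, but parsing a string keeps the sketch self-contained.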

Barring whatever might happen on appeal, this is good news for researchers who scrape data, even against TOS. However, this decision is only about the legality of scraping — not about the ethics of scraping.

As Amy pointed out in her blog post, there are times when scraping could be illegal but still ethical, because “scholars cannot cede control of how we understand our world to corporate interests.” What if a company only allowed public data scraping from researchers who agreed that they would only publish their findings if they are positive about the company? (Granted, that is already an issue re: data that only corporate researchers have access to, but that is a whole other conversation!)

But now that we have this ruling, it is also important to keep in mind that even if violating TOS is legal it might not always be ethical. Remember the example of Yik Yak? At the time, I was pretty bummed — particularly after we asked the company if we could have permission to scrape, and they said no. But consider why this particular platform might not have wanted researchers to scrape. Yik Yak was ephemeral by design. Its users had an expectation that their data would not be archived or available beyond its appearance on the platform. Though researchers might not have intended to make the data available, they could have — particularly since in some disciplines it is customary to publish datasets along with analysis.

For any ethical decision making process, it is important to consider the specific context, including issues of contextual privacy. Consider the case of the OKCupid data scrape a couple of years ago. A researcher used a bot posing as a logged-in user to scrape the entire site — and then released the dataset publicly with no anonymization, usernames included. Even within the research community, there seemed to be a strong consensus that this was not okay. Sure, an OKCupid profile might be technically “public” in that anyone can create an account and see it — but that doesn’t mean that the users expect there to be a searchable database of information about their sexual preferences made public and connected to their usernames that they might also use elsewhere.

Additionally, the distinction that HiQ v LinkedIn makes with respect to passwords and access control could be a useful rule of thumb for researchers deciding what data is ethical to scrape. I have seen arguments, for example, about whether it is okay to gather data from an online community where only logged-in users can access posts. Even if literally anyone can make an account and log in, does that extra layer of access control provide a higher expectation of privacy for users?

There are a lot of sticky ethical questions regarding researchers’ uses of public data, and how to collect it is just one of them. This is one of my major research areas right now, and I can tell you that it’s complicated. But I think that of all the things we should be considering when making ethical decisions about our work, whether we could be committing a crime (or encouraging our students to commit a crime) probably shouldn’t be one of them. Though I also hope that researchers continue to consider contextual ethical issues when making decisions about scraping — remember that just because you can do something doesn’t always mean you should.