Neil Turkewitz
May 21, 2019
Photo © 2019 Neil Turkewitz

Sustainable Text & Data Mining, Part II: US and Fair (and Unfair) Uses

by Neil Turkewitz

As I noted in Part I, action by the EU on text and data mining (TDM) raises an obvious question about the treatment of TDM under US law. Last week I looked at the exceptions for TDM adopted by the EU in the Copyright Directive. Here I take a look at the intersection of fair use and TDM under US law. Many observers mistakenly assert that TDM is categorically excused by fair use principles, citing the decisions in Google Books and HathiTrust for the notion that data and text extraction is definitionally transformative and therefore squarely within fundamental fair use principles. This is a misreading of those decisions and of US law, as perhaps best illustrated by the TVEyes decision. While it is tempting to read Google Books and HathiTrust as holding that extraction/mining without distribution is fair use, that reading oversimplifies those decisions and draws too bright a line between extraction and post-extraction expressive use. It also relies too heavily on transformativeness to justify a use. The finding that the post-extraction expressive uses at issue in Google Books and HathiTrust were covered by fair use was an integral part of finding that the extraction itself was protected by fair use. Where the expressive use is infringing, courts are likely to find the extraction infringing as well. This complicated relationship is nicely explored in this excellent piece by Benjamin Sobel, in which he adds a further complexity: with AI, the distinction between extraction and expressive use becomes wholly illusory. He writes: “protectable input data are commonly used to train models to generate similar output. If that similarity is ‘substantial,’ then that output may infringe copyright in the pre-existing work or works to which it is similar — or, at least, it could be found infringing if it were rendered by a human.”

Under US law, data and text mining may or may not be fair, depending on a number of factors, including whether there are, or are likely to be, mechanisms for licensing it (Texaco) and the use to which the data/text will be put. It is instructive to note that Judge Leval, author of the Texaco and Google Books decisions, views these decisions as entirely consistent, each based on the most fundamental factor in the fair use inquiry: “whether there was harm to the author’s economic interest in the copyright. The Supreme Court said in the Nation case that this is the most important thing.” Even the father of transformativeness thus warns against over-reliance upon it as an indication of whether a use is “fair.”

The fact is that analysis of transformativeness is largely circular: if a use seems fair and non-prejudicial to the interests of the author, there is a significant chance that the Court will find the use transformative. “Transformative” frequently ends up being shorthand for a finding that the use is consistent with the principles underlying fair use and the dictates of international law on the scope of exceptions. A critical factor to be considered is whether the use interferes not just with the exploitation of an existing work, but with the potential licensing opportunities for the author in a rapidly evolving landscape. This inquiry is not static, and while the existence of licensing mechanisms is an important factor (see Texaco), it is not determinative. It is also instructive to note that Leval’s warning that a determination of fair use will ultimately turn on harm rather than transformativeness was borne out in TVEyes, in which the Court found that the use was transformative, but not fair.

TVEyes provides an essential backdrop for understanding and properly contextualizing the Google Books decision. The Court held:

“It is indisputable that, as a general matter, a copyright holder is entitled to demand a royalty for licensing others to use its copyrighted work, and that the impact on potential licensing revenues is a proper subject for consideration in assessing the fourth factor.” Bill Graham Archives v. Dorling Kindersley Ltd., 448 F.3d 605, 614 (2d Cir. 2006) (quoting Texaco, 60 F.3d at 929). However, “not every effect on potential licensing revenues enters the analysis under the fourth factor.” Texaco, 60 F.3d at 929. A copyright owner has no right to demand that users take a license unless the use that would be made is one that would otherwise infringe an exclusive right. See Bill Graham Archives, 448 F.3d at 615. Even if a use does infringe an exclusive right, “[o]nly an impact on potential licensing revenues for traditional, reasonable, or likely to be developed markets should be legally cognizable when evaluating a secondary use’s effect upon the potential market for or value of the copyrighted work.” Texaco, 60 F.3d at 930 (internal quotation marks omitted).

That limitation does not restrict our analysis here. The success of the TVEyes business model demonstrates that deep-pocketed consumers are willing to pay well for a service that allows them to search for and view selected television clips, and that this market is worth millions of dollars in the aggregate. Consequently, there is a plausibly exploitable market for such access to televised content, and it is proper to consider whether TVEyes displaces potential Fox revenues when TVEyes allows its clients to watch Fox’s copyrighted content without Fox’s permission.

By providing Fox’s content to TVEyes clients without payment to Fox, TVEyes is in effect depriving Fox of licensing revenues from TVEyes or from similar entities. And Fox itself might wish to exploit the market for such a service rather than license it to others. TVEyes has thus “usurp[ed] a market that properly belongs to the copyright holder.” Kirkwood, 150 F.3d at 110.”

Judge Kaplan’s concurrence adds further texture:

“I am inclined to reject the idea that enhancing the efficiency with which copies of copyrighted material are delivered to secondary users, in the context in which the Watch function does so, is transformative. …

These cases support my inclination to conclude that a technological means that delivers copies of copyrighted material to a secondary user more quickly, efficiently or conveniently does not render the distribution of those copies transformative, at least standing alone.

Nor does Google Books support the conclusion that efficiency-enhancing delivery technology is transformative in the circumstances of this case. Google Books, like this case, involved two features: a searchable database and the display of “snippets” from the books containing the search term. We held that copying the books to enable the search function had the transformative purpose of “identifying books of interest to the searcher.” That purpose was different than the purpose of the books themselves, which served to convey their content to the reader, and it constituted fair use. We held also that the snippets — “horizontal segment[s] comprising ordinarily an eighth of a page” — “add[ed] importantly to the highly transformative purpose of identifying books of interest to the searcher.” But Google Books does not resolve this case.”

So, given the complexity of distinguishing extraction from post-extraction expressive uses in the AI environment, would a broad exception for TDM have “an impact on potential licensing revenues for traditional, reasonable, or likely to be developed markets”? The answer would clearly appear to be yes. Again, this was nicely explored by Sobel in his 2017 piece, Artificial Intelligence’s Fair Use Crisis:

“Does training data for machine learning constitute a market that is traditional, reasonable, or likely to develop? Surprisingly, it often does. It is tempting to view machine learning as an alchemical process that spins value out of valueless data and creates a market where none previously existed. Considered individually, the bits of expression on which a machine learning model is trained are of infinitesimal value in comparison to the resulting model.”

He further observes:

“Whether one agreed or disagreed with how it operated, fair use was characterized as a redistributive mechanism that subsidized public pursuits at major content owners’ expense. Today’s digital economy upends this narrative. Today’s ordinary end users are not passive consumers of others’ intellectual property. Rather, they create troves of text, images, video, and other data that they license to large companies in exchange for gratis services. Powerful technology companies are now users of copyrighted material, and the companies’ end users are the rightsholders. This pivot in market dynamics should prompt a corresponding shift in attitudes towards fair use. The doctrine no longer redistributes wealth from incumbents to the public; it shifts wealth in the other direction, from the public to powerful companies.

Fair use redistributes economic and expressive power. It curtails an otherwise outsize legal and economic entitlement so that “the public” can undertake certain socially beneficial activities. If the doctrine develops to give carte blanche to expressive machine learning, it will redistribute in the opposite direction: it will serve the economic interests of incumbent firms at the expense of disempowered rightsholders.

The historical narrative of copyright and technology is one of powerful rightsholders and marginal users. Today’s tech business turns this structure on its head. Accordingly, scholars and jurists ought to recalibrate their intuitions about what fair use is and does. A progressive interpretation of copyright does not, in this circumstance, entail a broad construction of fair use. Indeed, upholding copyright’s redistributive roots may require a return to the market-based reasoning that, at the time, seemed to move against redistribution.”

Ironically, while I agree with so many of Sobel’s observations on the mechanics and equities of analyzing TDM in the AI environment, given the increasing difficulty of differentiating between expressive and non-expressive uses, I fundamentally disagree with his proposal to consider sui generis levies to compensate authors for the use of their works in the development of AI. If his thesis is correct that the expressive/non-expressive distinction is becoming illusory, and I believe it is, then we risk placing a levy on a very fundamental aspect of copyright, and relegating copyright to protecting the past rather than empowering the future. Levies are too cumbersome and inflexible to respond to changes in the marketplace and, without an opt-out mechanism, would be inconsistent with international law. Levies are best designed to mitigate prejudice, not to invite conflict with a normal exploitation of the work. It would be much better to consider a voluntary collective license or, failing that, an extended collective license (ECL) with opt-out provisions, creating a licensing regime that is still voluntary (albeit encouraged) and that can be easily modified as circumstances require and markets evolve. And of course, it remains to be seen whether we will have any market failure requiring intervention.

I close with a reference to a competing view presented by Amanda Levendowski in her article, “How Copyright Law Can Fix Artificial Intelligence’s Implicit Bias Problem.” She writes: “Using copyrighted works as training data for AI systems is not a substitute for the original expressive use of the works…. The normative values embedded in the tradition of fair use align ultimately with the goal of mitigating bias. Fair use can, quite literally, promote creation of fairer AI systems.” I encourage everyone to read her piece, but believe that Levendowski’s legal conclusions are fundamentally unsound, both in resting on a strained and pinched view of the meaning of the fourth factor of fair use (the effect of the use upon the potential market for, or value of, the copyrighted work) and in framing “fair use” as a referendum on the broader benefits of the use as opposed to an examination of the statutory principles. As discussed above, even the father of transformativeness, Judge Leval, warns against over-reliance on transformativeness for determining whether a use is fair, noting that the effect on the potential market is ultimately the most important factor. Levendowski’s approach would virtually extinguish the right of authors to license derivative works (which are, by definition, transformative) and would lock copyright into the technology of the past. Films transform books. Recorded music transforms compositions. Our world is transformed daily. Copyright must reflect that reality rather than ignore it.