Unknown Artist, Havana, Cuba. Photo ©2022 Neil Turkewitz

Searching for global copyright laws in all the wrong places: An examination of the legality of OpenAI’s data scraping

Dec 13, 2022

by Neil Turkewitz

I was reading the complaint recently filed against GitHub, OpenAI & Microsoft in relation to the unauthorized use of computer code to train AI, as one is wont to do, and came across this “explanation” by GitHub (paragraph 84 of the Complaint):

“Training machine learning models on publicly available data is considered fair use across the machine learning community . . . OpenAI’s training of Codex is done in accordance with global copyright laws which permit the use of publicly accessible materials for computational analysis and training of machine learning models, and do not require consent of the owner of such materials. Such laws are intended to benefit society by enabling machines to learn and understand using copyrighted works, much as humans have done throughout history, and to ensure public benefit, these rights cannot generally be restricted by owners who have chosen to make their materials publicly accessible.”

I’m not sure who is advising GitHub, but the suggestion that the unauthorized use of “publicly available data is consistent with global copyright laws” is a fantastical claim, for any number of reasons, and that’s even before addressing the ridiculous notion that machines learn “much as humans have done throughout history.”

Oh, where to begin? So many choices, so little time. I suppose the most logical point is to note that there is no such thing as “global copyright laws.” There is a fair amount of harmonization of national copyright legislation through treaties negotiated in fora such as the World Intellectual Property Organization (WIPO) and the World Trade Organization (WTO), but copyright legislation as such remains within the jurisdiction of sovereign states. Is GitHub suggesting that they have reviewed the national legislation of every state and determined their practices are consistent with all requirements? Surely not. I assume GitHub is here referring to international treaties rather than copyright legislation, but that claim doesn’t fare any better. You see, copyright treaties have, to a fair degree, harmonized the minimum rights that countries must provide. But what they haven’t harmonized, much to the chagrin of many copyright minimalists, are the exceptions to be applied to such required rights. Indeed, the area of exceptions is left to the national legislator, subject to the overriding principles set out in the Berne Convention, and recited in the WTO TRIPS Agreement, the WIPO Copyright Treaty (WCT) and the WIPO Performances and Phonograms Treaty (WPPT). Under these binding treaties, Parties may provide for certain exceptions under delineated general conditions, generally known as the three-part test. To be compatible with international law, exceptions must:

1-apply only in certain special cases;

2-not conflict with a normal exploitation of a work; and

3-not unreasonably prejudice the legitimate interests of the author.

Thus, in order to determine whether any national legislation providing exceptions, including exceptions like fair use, to rights otherwise guaranteed under international treaties is consistent with international obligations, one would need to analyze such exceptions with regard to these three conditions. A thorough analysis of each of these provisions in relation to a putative exception permitting unauthorized use for the purpose of “machine learning” would require at least three separate essays (or one standard length law review article I suppose), and that is not my present intention. There is, however, one point I would like to stress: one should be exceedingly wary about claims that machine learning is a “special case,” or that it doesn’t conflict with a normal exploitation of the work. Dr. Mihaly Ficsor, former Assistant Director General of WIPO and head of its Copyright Division, leaves no doubt about the meaning of the conditions imposed by international treaties, including this passage, which is highly relevant to our present discussion:

“The second criterion is that an exception or limitation must not conflict with a normal exploitation of (rights in) works. There is no dispute on that “exploitation” means extraction of the economic value of rights. As the documents of the negotiation history confirm, “normal exploitation” is both an empirical and a normative concept. It means “all forms of exploiting a work which [has], or [is] likely to acquire, considerable economic or practical importance.”

So as we proceed, please bear in mind Dr. Ficsor’s guidance: exceptions must not conflict with any “form of exploiting a work which [has], or [is] likely to acquire, considerable economic or practical importance.” So I ask you, dear reader, whether you think that use of copyright-protected materials to train AI is limited in ways contemplated by international law. I submit that it is not. Indeed, to my mind, allowing unauthorized use of copyright works to train AI fails all three conditions: it is not a special case; it conflicts with a normal exploitation of the work by facilitating the preparation of derivative works which will compete against the original works from which they are derived; and it prejudices the legitimate interests of the author.

Furthermore, the claim that “OpenAI’s training of Codex is done in accordance with global copyright laws” straight out ignores not only the international framework for exceptions outlined above, but the contours of specific legislation currently in place. The EU acquis under Articles 3 & 4 of the 2019 Directive on Copyright and Related Rights in the Digital Single Market (DSM) is particularly on point. I analyzed these provisions in some depth here, but in short, the EU prohibits general text & data mining for training AI except in very limited circumstances (scientific research) or only when certain conditions are met — i.e. a mechanism for opt-outs.

The present legislation of the UK is even more lethal to OpenAI’s claim of legality. Section 29A of the Copyright, Designs and Patents Act 1988 (CDPA), concerning ‘copies for text and data analysis for non-commercial research’, reads:

(1) The making of a copy of a work by a person who has lawful access to the work does not infringe copyright in the work provided that —

(a) the copy is made in order that a person who has lawful access to the work may carry out a computational analysis of anything recorded in the work for the sole purpose of research for a non-commercial purpose, and

(b) the copy is accompanied by a sufficient acknowledgement (unless this would be impossible for reasons of practicality or otherwise).

(2) Where a copy of a work has been made under this section, copyright in the work is infringed if —

(a) the copy is transferred to any other person, except where the transfer is authorised by the copyright owner, or

(b) the copy is used for any purpose other than that mentioned in subsection (1).

As such, not only are OpenAI’s practices not “done in accordance with global copyright laws,” but they directly conflict with laws in the EU and UK (and arguably in the US too, see below).

In addition, it is worth observing that even those parties that would like to legalize text and data mining (TDM) of copyright-protected materials understand that such mining would require an exception, and would therefore violate existing legislation where such an exception is lacking. Interested parties are lobbying governments to create exceptions to allow TDM, something that this particular author strongly opposes, but which confirms the restrictions imposed by the status quo.

The present GitHub claim of non-infringement with respect to the unauthorized use of copyright works for the purpose of training AI closely tracks a submission made by the Software Alliance (BSA) to the US Patent & Trademark Office, so it seems logical to recite my reaction to them at the time of their submission:

“While I tend to see the world in a somewhat different way than they [BSA] do, I respect their views which I generally find to be nuanced, well-considered and expressed, even when I disagree. They have submitted various briefs on the issue of fair use, including in the Google v. Oracle case headed for the Supreme Court, setting forth a measured position on the issue of fair use, simultaneously embracing its importance, while warning against overbreadth. Their 2017 amicus brief before the Court of Appeals for the Federal Circuit captured it perfectly:

“The Court should also ensure that courts applying fair use defenses to infringement in software cases do so correctly. Fair use may be important in various circumstances, but it should not be interpreted so broadly as to swallow the commercial value of an infringed underlying work by failing to fully and carefully weigh all four of the factors set out in 17 U.S.C. § 107. . . . Courts recognize uniformly that evaluating the fair use defense is case-by-case and fact-driven. It is not to be simplified with bright-line rules. Harper & Row Publ’rs v. Nation Enters., 471 U.S. 539, 560 (1985).”

It was thus with great consternation that I confronted the lack of restraint manifested in their recent submission on AI in which they essentially argue that the national imperatives in the race for AI dominance justify an expansive view of fair use that would only limit use of preexisting materials where the output was perceptively infringing. In short, while they nominally eschew any rigid test for fair use in all instances, their position emerges quite clearly — fair use will allow any use of copyright works as long as the expression of the resulting work produced by AI isn’t substantially similar to the works on which it was “trained.” I quote: “creating a database of lawfully accessed works for use as training data for machine learning will almost always be considered non-infringing in circumstances where the output of that process does not compete with the works used to train the AI system.” I think this is fundamentally wrong — both as a matter of law and as a matter of justice (with the latter being infinitely more important).

If current narratives are to be believed, the future of writing, singing, composing etc. will increasingly be in the hands of machines. Now of course, machines don’t have hands, but nor do they have creativity. The works ingested by the machine are the raw data by which the machine becomes capable of reconfiguring words, symbols, notes etc. into new works. They are not “reading” as such — a point I highlight here because it has copyright implications as well as moral ones. BSA likens machine “learning” to how a human might ingest a book, combing through the protected expression while retaining the unprotected ideas. But while a human might very well operate in that manner, it’s a terrible stand-in for the operation of machines which by their very nature “learn” through reproduction, with such reproductions forming the basis of any new output. Those reproductions of expression, however temporary, are the raw materials used for the development of new forms of expression. In other words, AI isn’t just inspired by the works it ingests — it owes its very existence to them. As such, the notion of ingested works lacking economic or cultural significance as proposed by BSA couldn’t, in my view, be more incorrect. AI is the distillation of that which went before, and as such, depends on the past for all of the potential value it may create.”

My modest proposal? How about we allow creators to determine how, or whether, their works are used in establishing the conditions of a digital world in which they are the suppliers of the raw materials? In what universe do we believe it’s fair to exclude them from the fruits of their labor? To my mind, it’s clear: there is no such universe, or at least I hope there isn’t. Safeguarding consent is the right thing to do, both morally and legally.
