GitHub’s Copilot Lawsuit Raises Concerns About Generative AI and Fair Use

Rameses Neale
Dec 4, 2022

A class-action lawsuit against GitHub’s Copilot may shape the future of how generative AI references human works.

source: Midjourney

As generative AI platforms increasingly offer new playgrounds for creativity, ethical concerns about how these tools train on and incorporate existing human work are mounting. This November saw the first class-action lawsuit challenging the training and output of such AI systems, potentially disrupting the fun of a booming era of AI capable of generating remarkable images and videos, coherent text, and working computer code.

Copilot is an AI-powered coding assistant that auto-suggests snippets of functioning code to programmers in real time, helping software developers “spend less time creating boilerplate and repetitive code patterns, and more time on what matters: building great software.”

The GitHub-owned tool is trained on open-source code from public repositories on GitHub. Reproduction of that code is governed by open-source licenses, which typically impose attribution requirements.

In early November, programmer and lawyer Matthew Butterick filed a class-action lawsuit in a San Francisco federal court against Microsoft, its subsidiary GitHub, and its business partner OpenAI, alleging that the AI-powered coding assistant “ignores, violates, and removes the licenses offered by thousands — possibly millions — of software developers, thereby accomplishing software piracy on an unprecedented scale.”

Though in its early stages, the lawsuit is expected to have a lasting impact on the generative AI world. As other generative AI systems proliferate, courts will have to decide whether it is legal for these systems to train on and reproduce copyrighted material in similar ways. Butterick’s and similar litigation should prove instructive.

The intersection of copyright law and artificial intelligence is largely unexplored, with the latter evolving much faster than the former can keep up with. While the firms behind these programs maintain that their use of data comports with US fair use doctrine, legal experts say the law is far from settled and will only be determined as novel technologies face legal examination.

A central factor in the Copilot lawsuit is whether the AI’s reproduction of unattributed open-source code constitutes fair use.

source: Midjourney

Fair Use?

There’s no question that humans can legally build upon the works of other humans, within the bounds of fair use. But is it legal for AI to do the same?

The fair use doctrine is intended to better define how one can use and build upon copyrighted works without permission of the copyright owner and without unfairly depriving them of the right to control and benefit from their copyrighted works.

The test for fair use traditionally considers four factors:

  1. the purpose (e.g., commercial or nonprofit educational) and character (i.e., transformative or non-transformative) of use;
  2. the nature of the copyrighted work;
  3. the amount and substantiality of work reproduced; and
  4. how the use affects the potential market for or value of the copyrighted work.

It’s doubtful whether Copilot’s reproduction of open-source code constitutes fair use.

Purpose and Character of Use

The court will need to determine whether the purpose and character of Copilot’s use of copyrighted open-source code is fair. The lawsuit alleges that GitHub and OpenAI intentionally designed Copilot to profit from license-protected open-source code at the expense of a global open-source community they claim to foster and protect.

GitHub was founded in 2008 to host open-source code, protecting users with license requirements. Copilot was launched by GitHub and OpenAI in 2021 to assist software developers by automatically suggesting code.

GitHub charges Copilot users $10 per month or $100 per year to use the service. But the code Copilot guards behind its paywall and reproduces can be traced back to GitHub’s free open-source repositories. Thus, the lawsuit alleges, GitHub is violating its own open-source licensing and attribution requirements while profiting from the code it claims to protect.

The court may also consider whether Copilot’s use of open-source code is transformative, that is, whether it adds something new. Applying open-source code to an AI assistant may use the code in a new way and for a further purpose, making coding more accessible. This could be a sufficiently transformative application of the copyrighted code to constitute fair use, but it will likely be overshadowed by the commercial purpose of the use.

Nature of Copyrighted Work

A less important factor is the nature of the copyrighted work. The open-source code published to GitHub is protected by standard license and attribution requirements that circumscribe its use. Though publication widens the scope of permissible use, the clear license requirements narrow the bounds of fair use.

Writing working code takes some creativity, and there are numerous ways to write code that performs the same function. Both points limit the possibility of fair use. Copilot eliminates the need for this creativity by copying and auto-suggesting code, which likely doesn’t constitute fair use.

Amount and Substantiality

The court will also have to consider the amount and substantiality of copied open-source code. This will require considering factors such as what percentage of the copyrighted code is present in a total output and whether Copilot ordinarily reproduces passages of code.

GitHub concedes that an output “may contain some code snippets longer than ~150 characters” of code from the training data about 1% of the time. Though the 150-character threshold is likely too limited for copyright infringement consideration, the lawsuit alleges that, on a conservative estimate, Copilot has violated the DMCA 36,000 times. The scale of minor breaches is compelling.
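To make the 150-character figure concrete, here is a minimal sketch of how one might check whether a suggestion shares a long verbatim run with a known source file. It is only an illustration: the function names and the strict "longer than" comparison are assumptions, and nothing here reflects how GitHub actually measures overlap or how a court would assess substantiality.

```python
from difflib import SequenceMatcher

# The "~150 characters" figure quoted above; treating it as a hard
# cutoff is an assumption made for illustration only.
THRESHOLD = 150


def longest_shared_run(output: str, source: str) -> int:
    """Return the length of the longest contiguous substring shared by both strings."""
    # autojunk=False stops difflib from ignoring very common characters
    # in long inputs, which would otherwise shorten the reported match.
    matcher = SequenceMatcher(None, output, source, autojunk=False)
    match = matcher.find_longest_match(0, len(output), 0, len(source))
    return match.size


def exceeds_threshold(output: str, source: str, threshold: int = THRESHOLD) -> bool:
    """True when the longest verbatim overlap is longer than the threshold."""
    return longest_shared_run(output, source) > threshold
```

A check like this only catches character-for-character copying. A suggestion with renamed variables or altered whitespace would pass it, which is one reason a raw character count is a weak proxy for the legal question.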

Effect on the Market for or Value of the Work

Lastly, the court will have to consider whether Copilot’s use of the open-source code has already adversely affected, or likely will adversely affect, the market for the copyrighted works. The answer will likely be yes.

Though owners of open-source code don’t necessarily or directly financially benefit from open-source licenses and attribution, Copilot’s use of their copyrighted work supplants the copyrighted works in the market, while GitHub profits from violating license agreements. By reproducing their works without proper licensing and attribution, Copilot renders the owners of copyrighted code unable to identify and control the reproduction and distribution of their copyrighted works, have the terms of their open-source licenses followed, and pursue copyright-infringement remedies. This disruption of prospective contractual relations will inevitably result in monetary damages.

Closing Thoughts

Apart from fair use considerations, the federal court will have a wealth of novel questions to consider, among them how and to what extent traditional copyright law applies to non-humans. As this lawsuit progresses and similar ones arise, we will start to get a clearer picture of how copyright law will define the future of generative AI.

Written by Rameses Neale

Exploring the impacts of Artificial Intelligence & Technology on Humanity | BA Philosophy, Politics, & Law, UC Berkeley