ChatGPT: Learning or Stealing?

An exploration of the ethics of ChatGPT’s training practices

Ismail Habibi
Design Ethics
7 min read · Jun 3, 2024


Is the way AI “learns” really ethical? (Source: Adobe Stock)

Since its introduction in late 2022, ChatGPT has revolutionized many aspects of our lives. In schools, students use it to help write papers and summarize articles. In the workplace, programmers use it to draft code and automate workflows. At the writer’s desk, authors use it to brainstorm premises and generate settings. Whatever the context, ChatGPT has proven to be a useful tool that helps people become more efficient, productive, and creative. This utility is largely due to the language model’s ability to respond to text prompts with detailed, coherent, and readable passages, which typically (though not always!) contain relevant and accurate information.

However, this capability did not manifest out of thin air: artificial intelligence models like ChatGPT must first be trained before they can do any of this. In the context of AI, training refers to a process in which a machine learning model is fed data, which it uses to make connections and recognize patterns.¹

Below is one depiction of such a process. The datasets that these models are trained on contain large numbers of works, which often include copyrighted materials. The ethics of training AI on such materials is hotly contested, with arguments invoking various ethical frameworks and legal precedents. In this article, I use the ethical lenses of consequentialism and deontology to argue that it is ethical for ChatGPT to train on copyrighted materials because doing so is a transformative use of the works that contributes towards innovation and progress. I also propose the creation of an independent AI oversight board to ensure that AI continues to be developed ethically.

An illustration of ChatGPT’s training (Source: OpenAI)

How is ChatGPT trained?

First, it might be useful to go a little deeper into the specifics of how ChatGPT and other similar programs are trained. Essentially, ChatGPT is a more advanced version of the autocomplete features that are often found in mobile phone keyboards, email apps, and search engines. In order to string words together in ways that are coherent and relevant, ChatGPT must learn the patterns that dictate which words tend to come after others. It does this by taking in millions of text records and making associations between the words that it finds within them.² These text records can include books, news articles, Wikipedia entries, forum posts, and social media profiles. ChatGPT’s original training set comprised 570 GB of data, and later iterations of the model are trained on even more.³ The original authors of the information that is used as training data are typically not credited, compensated, or asked for permission.
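To make this pattern-learning idea concrete, here is a minimal sketch in Python. It uses a toy bigram model that simply counts which word follows which; real systems like ChatGPT instead use neural networks with billions of parameters trained on vastly more text, so treat this as an illustration of the intuition rather than a depiction of how ChatGPT actually works.

```python
from collections import Counter, defaultdict
import random

# Toy illustration only: real models like ChatGPT use neural networks
# trained on hundreds of gigabytes of text, not simple word counts.
corpus = (
    "the cat sat on the mat . the dog sat on the rug . "
    "the cat chased the dog ."
)

# Count how often each word follows each other word (bigram counts).
follows = defaultdict(Counter)
words = corpus.split()
for current, nxt in zip(words, words[1:]):
    follows[current][nxt] += 1

def predict_next(word):
    """Pick a likely next word, weighted by how often it followed `word`."""
    candidates = follows.get(word)
    if not candidates:
        return None
    choices, weights = zip(*candidates.items())
    return random.choices(choices, weights=weights)[0]

# Generate a short continuation, autocomplete-style.
word = "the"
output = [word]
for _ in range(8):
    word = predict_next(word)
    if word is None:
        break
    output.append(word)
print(" ".join(output))
```

Even this toy version shows the key point: everything the model “knows” about which words go together comes directly from the text it was fed.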

The New York Times v. OpenAI

The New York Times is suing OpenAI for using the newspaper’s work in training ChatGPT (Source: The Economic Times)

“This action seeks to hold them responsible for the billions of dollars in statutory and actual damages that they owe for the unlawful copying and use of The Times’s uniquely valuable works.” — From the NYT lawsuit against OpenAI

There has been much debate on the topic of artificial intelligence training as it relates to copyright law. At the end of 2023, The New York Times filed a lawsuit against OpenAI on the grounds that the artificial intelligence company used copyrighted NYT articles to train its chatbot. Through such training, the Times alleges, OpenAI was able to steal readers away from the newspaper and redirect them to ChatGPT, where they could access the Times’s articles directly without paying a subscription fee. The Times’s complaint included examples in which ChatGPT responded to prompts with passages that were verbatim copies of NYT articles.

In response to this lawsuit, OpenAI published a blog post on its website defending its use of articles from The New York Times in training AI models. The company claims that although it does not directly pay news sources for the use of their work, AI used in conjunction with journalism can create greater opportunities to support news reporting by facilitating connections with readers and making workflows more efficient. Additionally, although OpenAI maintains that the use of copyrighted material in training falls under fair use (defined as “any copying of copyrighted material done for a limited and ‘transformative’ purpose”⁴), the company offers an opt-out option for organizations that want to prevent their materials from being used for such purposes. OpenAI also argues that ChatGPT regurgitating verbatim passages from its training data is an uncommon bug that the company is working to eliminate. The fact that The New York Times even encountered such responses from the chatbot, OpenAI says, may indicate that the Times deliberately constructed prompts to produce these specific outputs in support of its copyright infringement case.

“Training AI models using publicly available internet materials is fair use, as supported by long-standing and widely accepted precedents.” — From OpenAI’s response to the lawsuit

Important ethical frameworks to know

To better understand and compare these arguments and others, it might be helpful to apply specific ethical frameworks. The frameworks we will use for our purposes are consequentialism and deontology, as these are two of the most common ethical lenses used to evaluate actions and are especially applicable to our present discussion. Consequentialism is an ethical theory that determines whether or not an action is right based on the consequences that result from that action.⁵ Under consequentialism, it would be morally acceptable to lie if telling that lie resulted in a definitively positive outcome, such as saving someone’s life. Deontology, in contrast, judges an action’s moral rightness by whether it conforms to a moral duty or rule, considered in isolation from its consequences.⁶ Under deontology, lying would always be seen as morally bad, regardless of any positive consequences of doing so.

Applying these ethical frameworks to ChatGPT

Let us first take a look at the arguments against AI training on copyrighted works. The first objection to this practice is the claim that it is wrong to use other people’s work without permission, credit, or compensation. This is a deontological argument: it appeals to the inherent morality of the act of training itself, treating it as analogous to the act of stealing. Regardless of how the works are being used, the mere fact that no input was sought from the original authors is enough to deem such an act morally wrong under this view. There are also consequentialist arguments against such training. One of the primary arguments that The New York Times makes in its lawsuit against OpenAI is that ChatGPT’s use of NYT articles caused damages to the newspaper in the form of lost readership. This argument appeals to the outcome of OpenAI’s training. Whether or not the act of training is inherently immoral does not matter here; what is relevant is the consequences of that training, which the NYT alleges are decidedly negative.

Consequentialism can also be used to defend ChatGPT’s training. OpenAI argues that the language models that result from training on copyrighted material are sufficiently transformative to constitute fair use. Even though the training may have used copyrighted material, the end result, ChatGPT itself, is different enough from the original training materials that it does not pose an ethical issue. In a similar consequentialist vein is the argument that not limiting the data that ChatGPT trains on will lead to more innovation in the field of AI. If language models are free to train on anything and everything, they will be able to learn and progress at a much faster rate, which could result in more prosperity and advancement for humanity as a whole.

Of these arguments, the consequentialist approach taken by proponents of AI seems more convincing. The purpose of copyright, as written in the Constitution, is “to promote the Progress of Science and useful Arts.”⁷ This purpose appears to support the consequentialist idea that training on copyrighted material for the sake of technological advancement is morally acceptable. Taking the deontological approach and examining the training practices in isolation misses the full picture of where such training leads. Similarly, OpenAI’s assurance that verbatim regurgitation of training materials is simply a glitch that is actively being patched suggests that the consequentialist worry of directly stealing users away from the original materials is misguided.

How might we ensure that ChatGPT remains ethical?

The benefits of ChatGPT’s training practices outweigh the drawbacks, especially from a consequentialist perspective. Although there are relevant deontological points that are important to consider, they are unconvincing when compared to the consequentialist arguments that point to the very real and beneficial outcomes of this kind of training. As the development of AI continues, similar ethical questions will inevitably persist, especially considering the recent advancements in AI image and video generation. Therefore, it is important to actively and continually investigate the ethical implications of this technology moving forward.

One policy that could be adopted to this end is transparency. If AI is not being limited in what it learns from, the companies developing these technologies should be open about what specific sources and documents that training includes. Such a practice would make it far more likely that harmful or prejudiced content is caught and removed while also promoting accountability on the part of AI corporations. This could be done through the creation of an independent ethics oversight board, which would be responsible for ensuring that AI is developed ethically and used for ethical purposes. Such a board would be instrumental in helping develop ChatGPT and other AI models in equitable and beneficial ways.
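As one hypothetical illustration of what such transparency could look like in practice, consider a machine-readable manifest that lists every training source along with its license status and whether the rights holder opted out. The field names, sources, and structure below are my own assumptions for the sake of the sketch, not an existing standard or anything OpenAI currently publishes.

```python
# Hypothetical sketch of a training-data transparency manifest.
# The fields and sources below are illustrative assumptions,
# not an existing standard or OpenAI's actual practice.
from dataclasses import dataclass

@dataclass
class TrainingSource:
    name: str           # human-readable name of the dataset or collection
    license: str        # e.g., "public domain", "CC-BY", "copyrighted"
    rights_holder: str  # who owns or administers the material
    opted_out: bool     # whether the rights holder used an opt-out mechanism

manifest = [
    TrainingSource("Example News Archive", "copyrighted",
                   "Example News Co.", opted_out=True),
    TrainingSource("Public Domain Book Corpus", "public domain",
                   "n/a", opted_out=False),
]

# An oversight board could audit such a manifest automatically,
# flagging any source whose rights holder opted out but whose
# material was nonetheless included in training.
for source in manifest:
    if source.opted_out:
        print(f"Flag for review: {source.name} ({source.rights_holder})")
```

Even a simple audit like this would give an oversight board a concrete artifact to review, rather than having to take a company’s claims about its training data on faith.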

ChatGPT’s training methods are ethical as of right now, but it is up to us to ensure that it stays that way as this new and innovative technology continues to progress.

Sources

  1. https://www.oracle.com/artificial-intelligence/ai-model-training/#what-is-ai-model-training
  2. https://help.openai.com/en/articles/7842364-how-chatgpt-and-our-language-models-are-developed
  3. https://arxiv.org/pdf/2005.14165.pdf
  4. https://fairuse.stanford.edu/overview/fair-use/what-is-fair-use/
  5. https://iep.utm.edu/consequentialism-utilitarianism/
  6. https://plato.stanford.edu/entries/ethics-deontological/
  7. https://constitution.congress.gov/browse/essay/artI-S8-C8-1/ALDE_00013060/
