The Copyright Problem with Generative AI

Manav Gupta
Aug 31, 2023

Images generated with Midjourney

I had multiple clients recently ask me questions along the following lines:

  1. Who owns the generated IP?
  2. What protections are available to shield ourselves from liability for content generated by LLMs?

Generative AI systems like image and text generators are trained on massive datasets, raising legal concerns around copyright and trademark infringement.

There are open questions around whether AI-generated content constitutes derivative works, and who owns the copyright on outputs.

Recent legal developments have highlighted this complex relationship between IP and generative AI. In January 2023, artists initiated a class action suit, Andersen et al v. Stability AI, Midjourney, DeviantArt, alleging that generative AI art tools infringed on copyright by scraping artists’ work from the internet without permission. Getty Images filed a lawsuit against Stability AI for the alleged unauthorized use of over 12 million images from Getty’s database for AI training. These cases represent just a glimpse of the growing number of legal disputes emerging in this area.

Risks for businesses include potential infringement liability if systems use unlicensed data, and uncertainty around ownership of AI-generated content. Unfortunately, the legal landscape is still evolving and continued court cases will shape how generative AI fits into intellectual property rules.

Let’s break down some of the key aspects of the copyright problem in generative AI:

  1. Source Material Ownership:
    — LLMs are trained on vast amounts of text sourced from the internet. While efforts are made to use datasets that permit such use, the sheer scale means that copyrighted material might inadvertently be included. For example, LLaMA and LLaMA-2 from Meta use a mix of publicly available datasets, including Common Crawl and C4.
    — If the AI produces text that closely mirrors the copyrighted text it was trained on, it raises questions about infringement and originality.
  2. Generation of Copyrighted Content:
    — If the AI generates content that is similar or identical to copyrighted works (e.g., generating a line from a copyrighted song or book), it could potentially infringe on the copyright of the original work.
  3. Ownership of Generated Content:
    — It’s not always clear who owns the rights to content generated by an AI. Is it the developers of the AI? The user who prompts the AI? Or is it not copyrighted at all given its machine-generated nature?
  4. Potential for Unfair Use:
    — Generative AI can be used to produce articles, stories, music, or artwork. If someone sells or profits from this content without attribution or licensing, it can be seen as unfair use, especially if it closely resembles existing copyrighted work.
  5. Replacement of Creative Labor:
    — If AI can generate content that replaces the need for human creators, it can undercut the economic model that supports creative professions. While this is more of an economic issue than a direct copyright problem, the two are interlinked, as copyright laws are designed to protect and incentivize human creativity.
  6. Detection and Enforcement:
    — As AI-generated content becomes more sophisticated, it can be challenging to distinguish it from human-created content, making it difficult to detect potential infringements or ensure originality.
  7. Moral Rights:
    — In some jurisdictions, copyright isn’t just about economic rights but also moral rights, which include the right to attribution and the right to the integrity of the work. Generative AI can blur these aspects, as the outputs aren’t the result of a single human’s creative vision.
  8. Fair Use and Transformative Works:
    — AI can generate content that might be seen as transformative, which in some cases could be protected under the doctrine of fair use. However, where the line is drawn between infringement and transformative work in the context of AI is still a gray area.

In conclusion, the intersection of generative AI and copyright is an emerging area of legal and ethical concern. As technology advances, it will be imperative for copyright law and practices to evolve in tandem to address these challenges while ensuring that creativity and innovation continue to flourish.

So what can enterprises do?

  1. Develop contracts that address the use of LLMs: ask vendors about data sources and add confidentiality measures. This includes obtaining clarity on source-material ownership, and using LLMs only where the provider can document the lineage of its datasets. Even better is a provider that indemnifies clients against third-party IP claims arising from the vendor’s breach of copyright or other intellectual property rights.
  2. Carefully review the model card: Model cards are Markdown files that accompany models and provide handy information, such as the license and the datasets used for training. Model cards are essential for discoverability, reproducibility, and sharing.
  3. Favour models that use open source or public domain datasets: Leveraging datasets that are explicitly open-sourced or in the public domain ensures that there are no copyright restrictions on their use. However, it’s crucial to understand the specific terms of the license. Creative Commons licenses provide a range of options for open content.
  4. Obtain Explicit Permission: If using proprietary data or datasets with ambiguous licenses, always seek explicit permission from the data owner or creator. This is not always feasible, since LLM providers may not disclose all their sources and there may be too many sources to approach. However, industry- or task-specific LLMs may rely on only a handful of data source owners, which makes permission practical.
  5. Vendor Due Diligence: Before entering into a contract with a vendor, conduct a thorough due diligence process. Inquire about the origin of their datasets, any permissions or licenses they have obtained, and their processes for ensuring copyright compliance.
  6. Anonymize and Aggregate Data: Where possible, use anonymized and aggregated data, which might reduce copyright concerns, especially when data consists of potentially identifiable or proprietary information.
  7. Engage in Community Standards: Participate in industry forums, and contribute to setting standards on the ethical and legal use of data. Engaging at this level can ensure a company is at the forefront of best practices.
  8. Transparency Reports and Audits: Request transparency reports or audits from the vendor detailing the sources of their training data and any copyright checks they have performed. This practice is in its infancy, but I suspect auditing and consulting firms may start acting as independent auditors who evaluate a vendor’s security and provenance posture in the training and fine-tuning of LLMs.
  9. Purchase Insurance: Consider purchasing insurance that covers intellectual property infringements, including copyright claims, which arise from the use of third-party technology or data.
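
The model card review in item 2 can be partly automated as a first-pass screen. The snippet below is a minimal sketch, assuming the card starts with a YAML-style metadata block that carries `license` and `datasets` keys, as many publicly hosted model cards do. The sample card, the function names, and the list of required fields are all hypothetical, and the tiny parser only handles simple `key: value` pairs and `- item` lists, so a real pipeline should use a proper YAML library.

```python
import re

# Hypothetical model card text; real cards vary in which metadata keys they include.
SAMPLE_CARD = """---
license: apache-2.0
datasets:
- c4
- wikipedia
language:
- en
---
# Example model

Trained on a mix of public datasets.
"""


def read_front_matter(card_text):
    """Parse a simple YAML-style front matter block into a dict.

    Simplification: supports only `key: value` scalars and `- item` lists.
    """
    match = re.match(r"^---\s*\n(.*?)\n---\s*(?:\n|$)", card_text, re.DOTALL)
    if not match:
        return {}
    metadata = {}
    current_key = None
    for raw_line in match.group(1).splitlines():
        line = raw_line.rstrip()
        if not line.strip():
            continue
        if line.lstrip().startswith("- ") and current_key is not None:
            # List item belonging to the most recent key.
            if not isinstance(metadata[current_key], list):
                metadata[current_key] = []
            metadata[current_key].append(line.lstrip()[2:].strip())
        elif ":" in line:
            key, _, value = line.partition(":")
            current_key = key.strip()
            # An empty value means a list is expected on the following lines.
            metadata[current_key] = value.strip() or []
    return metadata


def provenance_gaps(metadata, required=("license", "datasets")):
    """Return the required provenance fields that are missing or empty."""
    return [field for field in required if not metadata.get(field)]


metadata = read_front_matter(SAMPLE_CARD)
gaps = provenance_gaps(metadata)
print("license:", metadata.get("license"))
print("datasets:", metadata.get("datasets"))
print("missing fields:", gaps)
```

A card that passes this check is not thereby cleared for use; the check only flags cards that do not even declare a license or training datasets, which is a reasonable trigger for the vendor questions in items 1 and 5.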
