How To Navigate The Uncharted Water(marks) of Copyright Law and AI

A deep dive into watermarking technology and its relevance to the AI/Copyright debate.

A.G. Elrod
Predict
7 min read · Sep 30, 2023

--

A zoomed-in view of a futuristic $20 US currency note featuring the face of an android rather than Andrew Jackson. Image created with DALL-E 3.

On August 30, the Copyright Office of the Library of Congress initiated a public call for commentary regarding the intersection of copyright law and artificial intelligence (AI). This discussion includes (1) the utilization of copyrighted works to train AI models, (2) the degree of transparency and disclosure relating to the use of copyrighted materials, and (3) the rights associated with AI-generated content. Stakeholders have until October 18, 2023, to submit written comments, while reply comments are due by November 15, 2023.

The discussion around the first two issues concerning the use of copyrighted materials in training AI systems is well-documented. More intriguing, however, is the relatively uncharted territory of the third issue: Do AI-generated materials possess any rights? In practical terms, can content generated by AI systems qualify for copyright protection? There have been multiple instances where the Copyright Office has received applications to register works featuring AI-generated content, sometimes even naming AI systems as authors or co-authors. With current law mandating human authorship for copyright eligibility, the boundary between human creation and AI-generated content appears increasingly blurred.

Undoubtedly, when the first US copyright law was enacted in 1790, under the authority the Constitution grants Congress, the concept of machines emulating human creativity was unimaginable. New technologies often challenge the limitations of established laws. This could be the most consequential technological revolution since the Constitution's inception, and the two-month period given for addressing this unprecedented issue may seem insufficient.

Despite the time constraints, it’s vital to recognize that copyright is fundamentally tied to authorship. If authorship cannot be substantiated, copyright protection becomes irrelevant. As per the Constitution, copyright exists to

“…promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries” (Article I, Section 8, Clause 8).

Therefore, if the author remains completely anonymous, there’s arguably no right to protect.

While this may be a challenging legal problem, addressing it directly is putting the legal cart before the technological horse. Copyright can’t be effectively established without definitive authorship. Currently, a significant portion of AI-generated content is virtually indistinguishable from human-created work. This leads to a dilemma where individuals might claim AI-produced (or partially AI-produced) content as their own, often reaping financial benefits from such actions.

Take, for instance, the surge in AI-generated books published via Amazon's self-publishing platform, or the recent Writers Guild of America strike, driven in large part by the perceived existential threat AI poses to the industry.

Until we can incontrovertibly determine that a piece of content is AI-generated, debates about its associated rights may be premature and inconclusive.

Watermarking

As long as there has been a need for verifiable authenticity, there has been the concept of watermarking. Essentially, watermarking involves embedding a unique, hard-to-replicate “mark” into a work. This strategy is most commonly observed in national currencies, which feature various types of watermarks to thwart counterfeiting attempts. This is crucial because when counterfeits proliferate, people lose faith in the objects they imitate. With that loss of faith comes a loss of value.

Drawing from this understanding, if we assert the importance, value, and need for protection and compensation of content creators, our focus should first be on technological means to authenticate the creator’s identity. Many contend that watermarking AI-generated content is a crucial solution to the problem of imitation or “counterfeit” work. The technology required to watermark AI-generated content already exists; we merely need a consensus among stakeholders for widespread implementation.

Several methods for watermarking have been proposed, each suited to a different type of content. The task becomes particularly challenging for generative text from systems like OpenAI’s ChatGPT, Google’s Bard, or Anthropic’s Claude. However, two promising methods with published research stand out: the Unicode Method and the Token Selection Method. What follows is an in-depth investigation of each.

Unicode Method

Recently suggested by Alistair Croll, key concepts behind the Unicode Method can be traced back to a 2002 scientific paper on the potential risk of a “homograph attack.” Simply put, this method involves replacing random letters in AI-generated text with visually identical counterparts that a computer interprets differently. To understand this strategy, it’s essential to grasp the fundamentals of the Unicode standard.

The Unicode standard was established to ensure consistent representation and handling of text across diverse languages and platforms. It assigns a unique code point to each character from virtually all known writing systems, be it letters, numbers, symbols, or emojis.

In common encodings such as UTF-8, each character is stored as one to four bytes. When you type a letter, the computer perceives it not as a graphic symbol but as a unique code point. Unicode, in essence, is an expansive set of addresses (more than 1.1 million) that computers interpret and display as characters on our screens.
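In Python, for instance, these code points can be inspected directly. This is a small illustration of the addressing scheme itself, not a watermarking technique:

```python
# Visually similar characters can live at entirely different Unicode addresses.
latin_i = "i"          # LATIN SMALL LETTER I
cyrillic_i = "\u0456"  # CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I

print(hex(ord(latin_i)))      # 0x69  (U+0069)
print(hex(ord(cyrillic_i)))   # 0x456 (U+0456)
print(latin_i == cyrillic_i)  # False: same glyph to the eye, different code point
```

To a renderer they look alike; to any string comparison they are entirely different characters.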

Due to its global adoption, Unicode ensures that text can be transferred, manipulated, or displayed across different devices or operating systems without loss or corruption. This universality is what makes the Unicode Method of watermarking appealing.

Here’s a practіcal example of such a homograph substіtutіon. When you type the letter “і”, the computer іnterprets іt as U+0069, the Unicode address for the Latіn letter. However, іf the computer dіsplays the Cyrіllіc letter “і”, іndіstіnguіshable to a human eye, іt uses a dіfferent address, U+0456. By sporadіcally replacіng characters wіth vіsually іdentіcal alternatіves, the text could be watermarked as AI-generated. Chances are, you dіdn’t even notіce that every “і” іn thіs paragraph was replaced wіth the Cyrіllіc character.
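A minimal sketch of how such a substitution watermark could be embedded in Python. The substitution table, replacement rate, and seed here are illustrative assumptions, not part of any published scheme:

```python
import random

# Illustrative subset of Latin characters mapped to visually
# identical (or near-identical) Cyrillic homographs.
HOMOGRAPHS = {"i": "\u0456", "a": "\u0430", "e": "\u0435", "o": "\u043e"}

def watermark(text: str, rate: float = 0.3, seed: int = 42) -> str:
    """Sporadically replace characters with look-alike code points."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch in HOMOGRAPHS and rng.random() < rate:
            out.append(HOMOGRAPHS[ch])
        else:
            out.append(ch)
    return "".join(out)

marked = watermark("the rain in spain", rate=1.0)
print(marked == "the rain in spain")  # False: mapped letters were swapped
```

On screen the marked string looks unchanged, but a byte-level comparison immediately reveals the embedded mark.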

Nevertheless, the solution isn’t foolproof. For instance, spell and grammar checkers might flag words as misspelled due to the inclusion of alternate characters. This issue could potentially be addressed in a similar fashion to how scanners and copiers refuse to duplicate currency. Manufacturers were required to adopt a standard that made counterfeiting more challenging. Similarly, software companies could be required to integrate the use of alternate characters seamlessly into their algorithms.

Another concern is software that identifies and replaces these alternate characters, which even a beginner programmer could create. An imperfect solution might involve modifying compilers to detect this code. In the end, this may only amount to an inconvenience for anyone willing to employ automated character re-addressing software to bypass this watermark.
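The countermeasure really is trivial to write. A sketch of a detector and stripper, assuming the table of homograph pairs is known (the table here is an illustrative subset):

```python
# Map look-alike code points back to their ASCII counterparts.
REVERSE = {"\u0456": "i", "\u0430": "a", "\u0435": "e", "\u043e": "o"}

def strip_watermark(text: str) -> str:
    """Undo a homograph watermark with a single translation table."""
    return text.translate(str.maketrans(REVERSE))

def is_suspicious(text: str) -> bool:
    """Flag text containing any known homograph code point."""
    return any(ch in REVERSE for ch in text)

print(strip_watermark("ra\u0456n"))  # "rain"
print(is_suspicious("ra\u0456n"))    # True
print(is_suspicious("rain"))         # False
```

A handful of lines is enough to erase the mark entirely, which is exactly why this watermark amounts to an inconvenience rather than a barrier.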

Another, arguably more robust, solution was proposed by Kirchenbauer et al. in their 2023 paper “A Watermark for Large Language Models,” involving the pseudorandom, selective biasing of token choices in AI-generated text output.

The Token Selection Method

Understanding this method requires a basic knowledge of how Large Language Models (LLMs) like ChatGPT operate. Essentially, LLMs are highly advanced prediction models. They evaluate hidden patterns and apply certain rules to text selection, generating text that anticipates what’s likely to follow. To some extent, that is what we as humans do. We take in patterns and guess at what is the next likely outcome. If I were to write, “We the people of…,” your mind would probably automatically jump to “the United States.” You know the pattern; you have seen it before. LLMs have analyzed billions of textual patterns, making them exceptionally proficient predictors.
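The intuition of pattern-based prediction can be sketched with a toy bigram counter. This is a drastic simplification of an LLM, for illustration only:

```python
from collections import Counter, defaultdict

# A toy "language model": count which word follows which in a corpus,
# then predict the most frequent continuation.
corpus = "we the people of the united states of america we the people".split()

follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict(word: str) -> str:
    """Return the most frequently observed next word."""
    return follows[word].most_common(1)[0][0]

print(predict("the"))  # "people": seen most often after "the"
```

An LLM does something conceptually similar, but over billions of examples and with far richer context than a single preceding word.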

The token selection method leverages this predictive capacity for watermarking purposes. Since an LLM can anticipate several potential words to follow in a text, it has multiple options to choose from. For instance, consider the LLM completing the sentence, “Every morning, regardless of the weather, John puts on his sneakers, opens his front door, and starts to…”. The model might propose several options (run, jog, exercise, walk, or stretch). For illustrative purposes, let’s assume these are listed from most to least probable. If “run” and “jog” are virtually indistinguishable in terms of probability, the LLM could select either without significantly impacting the text’s tone or quality. In any AI-generated text, numerous such opportunities arise. By pseudorandomly dividing the vocabulary (or, more accurately, the “tokens”) into permitted and prohibited lists at each step, and nudging generation toward the permitted list, the output acquires a statistical skew toward permitted words that is detectable even in a small text sample.
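A minimal sketch of this partitioning idea in Python. The hash-based split and the greedy pick are simplified assumptions; the published method biases the model’s logits rather than filtering a ranked candidate list:

```python
import hashlib

# Toy vocabulary; a real model has tens of thousands of tokens.
VOCAB = ["run", "jog", "exercise", "walk", "stretch"]

def green_list(prev_token: str, vocab=VOCAB) -> set:
    """Pseudorandomly split the vocabulary into a 'permitted' (green)
    subset, seeded by the preceding token so the split is reproducible."""
    green = set()
    for tok in vocab:
        digest = hashlib.sha256((prev_token + "|" + tok).encode()).digest()
        if digest[0] % 2 == 0:  # roughly a 50/50 split
            green.add(tok)
    return green

def pick(candidates: list, prev_token: str) -> str:
    """Choose the most probable candidate on the green list, falling
    back to the overall top candidate if none qualifies."""
    green = green_list(prev_token)
    for tok in candidates:  # candidates assumed ordered by probability
        if tok in green:
            return tok
    return candidates[0]
```

Because the split is seeded by the preceding token, a detector that knows the seeding scheme can reconstruct each position’s green list without access to the model itself.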

Consequently, a human-written text would contain a far higher percentage of prohibited words than watermarked output. As explained in the paper, a detection algorithm simply counts the permitted words and asks how unlikely that count would be in unwatermarked text.

As the paper demonstrates, even with a small text sample, the algorithm can identify watermarked, AI-generated text with an “extreme” level of certainty. This method might be simpler to implement than the Unicode method while providing a watermark that is significantly more challenging to circumvent.
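In spirit, the detector counts how many tokens fall on the permitted list and asks how improbable that count would be by chance, a one-proportion z-test. A minimal sketch, assuming a 50/50 permitted/prohibited split:

```python
import math

def watermark_z_score(green_count: int, total_tokens: int,
                      gamma: float = 0.5) -> float:
    """z-score for observing `green_count` permitted tokens out of
    `total_tokens`, when unwatermarked text would hit the permitted
    list with probability `gamma`."""
    expected = gamma * total_tokens
    stddev = math.sqrt(total_tokens * gamma * (1 - gamma))
    return (green_count - expected) / stddev

# Human-like text: about half the tokens are permitted -> z near 0.
print(round(watermark_z_score(26, 50), 2))  # 0.28
# Watermarked text: nearly all tokens are permitted -> very high z.
print(round(watermark_z_score(48, 50), 2))  # 6.51
```

A z-score above 4 already corresponds to a false-positive probability of a few in a hundred thousand, which is why even short samples can be classified with high confidence.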

Conclusion

As we navigate through this new era characterized by the rapid advancement of AI, we are confronted with new and complex challenges. Among the most urgent is the issue of copyright law, originally drafted over two centuries ago, with no provision for non-human authorship. The application of this law rests on the accurate attribution of authorship, a line that becomes blurred when we introduce advanced generative AI. Amidst fervent debates and looming deadlines, we find ourselves at a crossroads, where discussions on AI rights become redundant without a reliable method of determining origin.

As we’ve explored, a viable solution lies in the centuries-old practice of watermarking, adapted to suit our digital era. While these proposed watermarking techniques aren’t perfect, they offer a guiding light in navigating this unexplored territory. They could, with a high degree of certainty, ensure the correct designation of AI-generated content, paving the way for meaningful dialogue on copyright law.

With the Copyright Office’s deadline fast approaching, the need to implement these watermarking methods is more pressing than ever. Without them, we risk entering an era where our knowledge economy is undermined by counterfeit creations, devaluing authorship and jeopardizing the integrity of knowledge. The framers of the Constitution understood that copyright protections were not just for economic benefit but pivotal for the progression of science and the arts. As we wrestle with the profound implications of AI, we should heed their wisdom and ensure our responses maintain the spirit of progress they intended to nurture.


A.G. Elrod
Predict

International educator and researcher of AI Ethics and Digital Humanities: I believe in looking at today's innovations through the lens of ancient wisdom.