How does AI get “taught”? Even small authors like me are caught up in it

Marie Myung-Ok Lee
Asian American Book Club
Sep 25, 2023

These 183,000 Books Are Fueling the Biggest Fight in Publishing and Tech

The Atlantic published this terrific piece about how AI systems are trained, and to my surprise, I found my book in it! My husband found his, too.

AI is already a problem for me as a college professor, and now we authors and writers (who used to worry about Amazon) have to worry about this, too. This is the problem with “disruptive” technology: its makers break things and don’t care. We should be compensated for this theft, but do they care? Do they care about deepfakes, or the corruption of language?

I don’t know what to do about it, besides not buying books that are obviously done by AI. Anyone have any better ideas?

Editor’s note: This searchable database is part of The Atlantic’s series on Books3. You can read about the origins of the database, and an analysis of what’s in it, here.

This summer, I acquired a data set of more than 191,000 books that were used without permission to train generative-AI systems by Meta, Bloomberg, and others. I wrote in The Atlantic about how the data set, known as “Books3,” was based on a collection of pirated ebooks, most of them published in the past 20 years. Since then, I’ve done a deep analysis of what’s actually in the data set, which is now at the center of several lawsuits brought against Meta by writers such as Sarah Silverman, Michael Chabon, and Paul Tremblay, who claim that its use in training generative AI amounts to copyright infringement.

Since my article appeared, I’ve heard from several authors wanting to know if their work is in Books3. In almost all cases, the answer has been yes. These authors spent years thinking, researching, imagining, and writing, and had no idea that their books were being used to train machines that could one day replace them. Meanwhile, the people building and training these machines stand to profit enormously.

Reached for comment, a spokesperson for Meta did not directly answer questions about the use of pirated books to train LLaMA, the company’s generative-AI product. Instead, she pointed me to a court filing from last week related to the Silverman lawsuit, in which lawyers for Meta argue that the case should be dismissed in part because neither the LLaMA model nor its outputs are “substantially similar” to the authors’ books.

It may be beyond the scope of copyright law to address the harms being done to authors by generative AI, and the point remains that AI-training practices are secretive and fundamentally nonconsensual. Very few people understand exactly how these programs are developed, even as such initiatives threaten to upend the world as we know it.

Books are stored in Books3 as large, unlabeled blocks of text. To identify their authors and titles, I extracted ISBNs from these blocks of text and looked them up in a book database. Of the 191,000 titles I identified, 183,000 have associated author information. You can use the search tool below to look up authors in this subset and see which of their titles are included.
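For readers curious what “extracting ISBNs from blocks of text” might involve, here is a minimal Python sketch. It is not the code used for the Atlantic analysis, which the article does not show. The regular expression, the function names, and the `catalog` dictionary (a stand-in for a real book database, filled with placeholder values) are all assumptions made for illustration.

```python
import re

# Assumed pattern: the label "ISBN" followed by 10 or 13 digits that may be
# split up by hyphens or spaces. Real ebooks are messier than this.
ISBN_PATTERN = re.compile(
    r"ISBN(?:-1[03])?:?\s*((?:97[89][-\s]?)?(?:\d[-\s]?){9}[\dXx])"
)


def is_valid_isbn10(digits: str) -> bool:
    """ISBN-10 check: digits weighted 10 down to 1 must sum to a multiple of 11."""
    if len(digits) != 10:
        return False
    total = 0
    for position, char in enumerate(digits):
        if char == "X" and position == 9:
            value = 10  # "X" stands for 10, allowed only as the last character
        elif char.isdigit():
            value = int(char)
        else:
            return False
        total += (10 - position) * value
    return total % 11 == 0


def is_valid_isbn13(digits: str) -> bool:
    """ISBN-13 check: digits weighted 1, 3, 1, 3, ... must sum to a multiple of 10."""
    if len(digits) != 13 or not digits.isdigit():
        return False
    total = sum(int(c) * (1 if i % 2 == 0 else 3) for i, c in enumerate(digits))
    return total % 10 == 0


def extract_isbns(text: str) -> list:
    """Return the distinct, checksum-valid ISBNs found in a block of text."""
    found = []
    for match in ISBN_PATTERN.finditer(text):
        digits = re.sub(r"[-\s]", "", match.group(1)).upper()
        if (is_valid_isbn10(digits) or is_valid_isbn13(digits)) and digits not in found:
            found.append(digits)
    return found


def identify_books(text: str, catalog: dict) -> list:
    """Look up each extracted ISBN in a catalog; unknown ISBNs are skipped."""
    return [catalog[isbn] for isbn in extract_isbns(text) if isbn in catalog]


# Hypothetical stand-in for a book database (placeholder title and author).
catalog = {"9780306406157": {"title": "Example Title", "author": "Example Author"}}
sample_text = "Copyright page ... ISBN 978-0-306-40615-7 ... Chapter One ..."
print(identify_books(sample_text, catalog))
```

The checksum step is there because a run of digits after the letters “ISBN” is not always a real ISBN. Rejecting numbers that fail the check is one plausible way to keep false positives low.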

Before you begin, please note several caveats: Some books appear multiple times, reflecting different editions, translations, abridgments, or annotations. Because of inconsistencies in the spelling of author names, the search may not return books that are, in fact, in Books3. It may also deliver a jumble of odd formatting: A query for Agatha Christie will also return books labeled Agatha Christie and Christie Agatha, for example. And because of possible errors in the book-identification process, which involves detecting an ISBN within the text of the books and using a book database to find their author and title, there is a very small chance of false positives.
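The author-name caveat can be illustrated with a short Python sketch. This is not how the Atlantic search tool works, which the article does not describe. It only shows, under assumptions, why the same name written in a different order can still be matched while a misspelled name cannot. The `author_key` and `search_authors` functions are hypothetical names.

```python
import re


def author_key(name: str) -> tuple:
    """Lowercase the name, drop punctuation, and sort its parts so order is ignored."""
    parts = re.findall(r"[^\W\d_]+", name.casefold())
    return tuple(sorted(parts))


def search_authors(query: str, labels: list) -> list:
    """Return every label whose name parts match the query, in any order."""
    target = author_key(query)
    return [label for label in labels if author_key(label) == target]


labels = ["Agatha Christie", "Christie Agatha", "Christie, Agatha", "Agatha Christe"]
print(search_authors("Agatha Christie", labels))
```

In this sketch the first three labels match the query, but the misspelled “Agatha Christe” does not, which is the kind of miss the caveat warns about.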

Originally published at https://www.theatlantic.com on September 25, 2023.

Marie Myung-Ok Lee
Asian American Book Club

Editor of Asian American Book Club. Novelist, essayist, Columbia prof. Personal writing can be found at @MarieMyungOkLee; Twitter and Instagram are also @MarieMyungOkLee.