OpenAI releases “Our approach to content and data in the age of AI.” Let’s look under the hood

Neil Turkewitz
5 min read · May 7, 2024


OpenAI released a document earlier today entitled “Our approach to content and data in the age of AI.” In it, OpenAI sets out a new initiative intended, in theory, to align its products with the wishes of creators and copyright owners regarding the use of their works to train AI models. While I am, of course, hopeful that this initiative will be more than merely an exercise in public relations, I must admit that I am skeptical. Even in announcing this putatively new direction, OpenAI’s own words suggest that no major change is in the offing. I hope I am wrong…but the signs are not promising. Let’s dive in.

“AI tools like DALL·E and Sora (currently in research preview) are empowering creatives from aspiring artists to filmmakers.”

That’s the description of the status quo? If you talk to the creative community, I can assure you that’s not how they would describe these products. Right off the bat, OpenAI is spinning a narrative, which in this case is particularly odd: why defend one’s record while announcing what is dubbed a major change? It makes little sense. This detachment from reality continues throughout the document, including in the following:

“We’re continually improving our industry-leading systems to reflect content owner preferences.”

“Continually improving” isn’t an acknowledgement of present injustice. Indeed, elsewhere in the document OpenAI alleges that it is under no legal obligation to refrain from using creative works without consent, suggesting that it is only magnanimity leading it to adopt a new path. This is not the language of a company dedicated to meaningful change.

OpenAI continues: “We are not professional writers, artists, or journalists, nor are we in those lines of business. We focus on building tools to help these professions create and achieve more. To accomplish this, we listen to and work closely with members of these communities, and look forward to our continued dialogues.”

Note that this is all framed in the present tense, suggesting that the company’s present practices already reflect its kinship with cultural workers. This is insulting…or at least, I am insulted, and I’m not even directly affected.

This continues in text bolded in their document:

“We respect the choices of creators and content owners on AI.”

How exactly is that respect currently manifested? If they can characterize present practices in this way, why would creators — or anyone else — have any belief that OpenAI means to introduce major changes in its operating practices?

This all leads to OpenAI’s announcement:

“We need an efficient, scalable solution for content owners to express their preferences about the use of their content in AI systems.”

Okay, but do we really need a new solution? What about existing ones? You know, like a commercial operator obtaining the consent of creators prior to use? That’s not complicated. Independent creators interested in licensing their works for inclusion in training sets could easily set up voluntary collective licensing organizations, or could vest authority in existing ones. OpenAI could make direct deals with large corporate owners as they have begun to do. The truth is, we don’t need any new solution — we just need a change in the underlying business paradigm. And this is where we get to the huge disconnect. Sam Altman has previously commented that it would be prohibitively expensive to license all of the creative works used for training. Altman/OpenAI don’t want to align their practices with the contours of creators’ consent — they fear that would bankrupt them. They merely want to create an illusion of consent. That’s infinitely cheaper.

By the time they unveil the basic framework, they are undoubtedly hoping that no one is paying attention to details anymore. They write:

“OpenAI is developing Media Manager, a tool that will enable creators and content owners to tell us what they own and specify how they want their works to be included or excluded from machine learning research and training.”

So here it is. The brilliant, game-changing plan: develop a tool that allows creators to identify what they own. But here’s the thing: that’s not how copyright works. It is not up to copyright owners to tell potential users what they own. It is for users to clear the relevant rights prior to use, especially commercial use. The very predicate for Media Manager is legally and morally suspect. We don’t really need novel solutions; we need traditional business principles and the application of law. The alleged complexity evaporates.

In the interest of time and attention, I will end here, but I will highlight the following excerpts, in which OpenAI continues to try to complicate that which is not so complicated.

“After the training process is complete, the AI model does not retain access to data analyzed in training.”

Me: And? What difference does that make? The model extracts what it needs from the training data and discards the rest. It stores what it needs to generate output; whether it retains “access” to the original materials afterwards is irrelevant.

“ChatGPT is like a teacher who has learned from lots of prior study and can explain things because she has learned the relationships between concepts, but doesn’t store the materials in her head.”

Me: Wild anthropomorphic formulations are always a red flag that what comes next is a baseless analogy, and this one doesn’t disappoint. It’s mid-2024, and OpenAI is still trotting out the “AI learns like a human” line?

“Our models are designed to help us generate new content and ideas — not to repeat or ‘regurgitate’ content.”

Me: There are countless examples of models regurgitating content. What matters more: what the models were designed to do, or what they actually do? In addition, this comment reflects a complete misunderstanding of the harms associated with derivative AI output. The danger isn’t limited to specific output that closely resembles specific training data; it’s that the entire output of models trained on creative works without consent unfairly competes in the marketplace with the creators of the original works.

“We want our AI models to learn from as many languages, cultures, subjects, and industries as possible so they can benefit as many people as possible.”

Me: Okay, I seriously have to end with this one. Isn’t OpenAI so noble in its commitment to cultural diversity and fair representation? This is evil. Is the lack of diverse representation in data problematic? It sure is, but merely expanding the exploitation of data subjects is not the answer. Let’s truly address data and personal dignity before expanding the reach of corporations holding themselves out as ungovernable, but benevolent, sovereigns.
