AI attribution: a piecemeal approach to data morality in the age of AI

Jeremy Liu
Published in The Pointy End
Jul 13, 2024
[Header image: an artist’s illustration of artificial intelligence by Twistedpoly, part of the Visualising AI project launched by Google DeepMind.]

Much has been said about the friction between content producers and AI training. Large Language Models (LLMs) need to be trained at such a scale that they inherently rely on data that is freely available on the web. Content producers who publish on the web have argued that if their content is used to train AI and deliver commercial benefit to AI companies, then they should be duly compensated, particularly if the AI poses a threat to the original content producers.

Recently, the New York Times sued OpenAI and Microsoft, alleging that New York Times articles were used to train chatbots like ChatGPT that now compete with the Times as a source of reliable information. No resolution has yet been reached. The New York Times had originally sought to negotiate commercial agreements with the companies whose chatbots were trained on its content, but could not reach an arrangement.

Although generative AI is relatively new, bursting onto the scene with the launch of ChatGPT in late 2022, we have many existing mental models to help us reason about the business model impacts.

AI as a centralising, sustaining force

It’s funny. You could argue that the two biggest technology paradigms of the last decade have been crypto and AI. These are diametrically opposed in their moral philosophy around data ownership.

The crypto (née Web 3) movement attempts to distribute data ownership to the edges, enabling individuals to have full control over their own data, where it’s used, and how it’s used. This reduces the market power of centralised entities that currently custody and use personal data without compensating users, often against their interests. Crypto is deeply disruptive to the status quo, which is why we haven’t seen any mainstream use cases from big tech: it doesn’t conform to incumbent business models. Crypto apps are being built at the edges, from the outside in.

AI on the other hand is a centralising innovation, which sustains the dominance of big tech. AI aggregates data from the edges (i.e. individuals and content producers), usually without attribution, and centralises the computation and output.

The AI aggregator model

Generative AI has many similarities to the aggregator business models we’ve seen across so many industries before.

Spotify is an aggregator for the music industry, Netflix is an aggregator for the film industry, Apple News is an aggregator for news, and Google is an aggregator for the public web. Aggregators provide a single access point for content and drive traffic to it. The monetisation models differ. Netflix pays film studios to license content and profits on the delta between the licence fees and subscription fees. Google, as an aggregator for the public web, does not pay to index content; it simply helps drive traffic to websites, which content producers then monetise themselves. In either case, aggregators either pay for content directly or provide traffic that allows content producers to monetise directly.

LLMs, on the other hand, are far more extractive. Although they train on content, they do not license it. Nor do they offer content producers the ability to monetise their own content. The LLMs simply take it for themselves.

The outputs of LLMs are non-deterministic and vary from the aggregated content used for training, which makes them a legal grey area from a licensing and attribution standpoint. When you watch a movie on Netflix, that movie is the one the studio produced, meaning a linear attribution and commercial arrangement is necessary (and possible). When an LLM chatbot generates an output, it creates a non-linear regurgitation of its source training data (or multiple sources) that differs in structure and context from what it was trained on. It is generative, after all. The value chain is a bit murkier.

Localised context content attributions

LLMs have already been trained on unlicensed public data; that ship has sailed. The content producers will sue, some settlements may be paid, and the legal powers that be may inscribe some guardrails for the future. It is highly doubtful, though, that LLMs already trained will be forced out of the market by legal decree, and it certainly wouldn’t apply to all markets. They’re out there now, and there is too much economic value being built around these primitives for it to be feasibly stopped.

That being said, there is still an opportunity to build better attribution standards at a localised context level.

A global context LLM, like ChatGPT, is an open-ended LLM that has been trained on public data. It is essentially an engine that can consume any prompt and produce any response.

A localised context LLM leverages LLM capabilities for specific, guard-railed use cases. For example, you might have an LLM that traverses an organisation’s wiki or knowledge base and provides summarised insights. An LLM can read an email you’ve drafted and suggest more succinct wording. You could have an LLM that analyses a customer’s history in a CRM and crafts custom AI-generated email marketing comms. These local context LLMs consume specific data and perform AI computations against that data. In these situations, it is possible to attribute the specific data sources the LLM is ingesting. Retrieval-augmented generation (RAG) can identify the sources from which data is fetched and enable attribution mechanisms, as the sketch below illustrates.
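To make that concrete, here is a minimal, self-contained sketch of retrieval with source attribution. The corpus, the source IDs, and the keyword-overlap scoring are all illustrative assumptions; a production system would use embeddings and a vector store, but the attribution idea, returning the source IDs alongside the prompt, is the same.

```python
# A minimal sketch of retrieval with source attribution (RAG-style).
# The corpus, source IDs, and keyword-overlap scoring are illustrative
# assumptions; a real system would use embeddings and a vector store.
from dataclasses import dataclass

@dataclass
class Document:
    source_id: str  # e.g. a wiki page path or CRM record ID
    text: str

def retrieve(query: str, corpus: list[Document], top_k: int = 2) -> list[Document]:
    """Rank documents by naive keyword overlap with the query."""
    query_terms = set(query.lower().split())
    scored = [(len(query_terms & set(doc.text.lower().split())), doc) for doc in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:top_k] if score > 0]

def prompt_with_attribution(query: str, corpus: list[Document]) -> tuple[str, list[str]]:
    """Build an LLM prompt from retrieved context and return the source IDs it draws on."""
    retrieved = retrieve(query, corpus)
    context = "\n".join(f"[{doc.source_id}] {doc.text}" for doc in retrieved)
    prompt = f"Answer using only the sources below and cite their IDs.\n\n{context}\n\nQuestion: {query}"
    return prompt, [doc.source_id for doc in retrieved]

corpus = [
    Document("wiki/onboarding", "New hires complete onboarding within the first week."),
    Document("wiki/expenses", "Expense claims are reimbursed at the end of each month."),
]
prompt, sources = prompt_with_attribution("When are expense claims reimbursed?", corpus)
print(sources)  # ['wiki/expenses'] (sources can be logged, surfaced, or compensated)
```

Because the attributed source IDs travel alongside the generated answer, they can be logged, shown to the user, or fed into whatever compensation scheme sits on top.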

I think the LLM use cases that find product-market fit in the short term will be localised context AIs. These will be AI features tuned to very specific use cases that users already engage with. If we think about the recent announcement of Apple Intelligence, the majority of announced features were localised context features, such as smart text replies and voice memo summaries.

While global context LLM chatbots can theoretically do anything, they are prone to hallucinating without the safeguards of a local context, and they do not fit existing mental models and use cases. When OpenAI first introduced ChatGPT, it was fun to experiment with for a little while, but few people have converted into consistent, active users.

I mentioned crypto at the start of this blog because some of the ideals of the crypto/Web 3 movement are things we should continue to strive for in AI. Web 3 has long advocated for users to be compensated for their own data. If your own personal data is being used to target ads at you, why shouldn’t you be compensated for it?

A similar principle could apply to AI-generated content that leverages personal data. For example, if someone is sending you AI-generated email marketing that is specifically attuned to your customer profile, we could attribute the output to the customer’s personal data and perhaps compensate them. If someone is building an AI news summariser product that provides an AI-generated summary of the day’s news, then the news sources should be attributed and compensated for providing the source data. Of course, if a news item is widely covered, there may be multiple attributable sources that share the compensation, whereas an exclusive or hyperlocal news item would be attributable to a single source. This makes sense: the more commoditised news items are, the less they are worth. A rough sketch of how such a split might work follows below.
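To make the sharing idea concrete, here is a toy sketch of splitting a per-item compensation pool across attributed sources. The pool value and the simple proportional weighting are illustrative assumptions, not a proposed standard.

```python
# A toy sketch of splitting a compensation pool across attributed sources.
# The pool value and the proportional weighting are illustrative assumptions:
# each source's share is proportional to how often it was attributed.
from collections import Counter

def split_compensation(attributed_sources: list[str], pool: float) -> dict[str, float]:
    """Divide `pool` across sources in proportion to their attribution counts."""
    counts = Counter(attributed_sources)
    total = sum(counts.values())
    return {source: pool * count / total for source, count in counts.items()}

# A widely covered story: three outlets share the pool.
print(split_compensation(["nyt", "reuters", "ap"], pool=3.0))
# {'nyt': 1.0, 'reuters': 1.0, 'ap': 1.0}

# An exclusive or hyperlocal story: a single source captures the whole pool.
print(split_compensation(["localpaper"], pool=3.0))
# {'localpaper': 3.0}
```

In practice, the attribution counts could come straight from the retrieval step sketched earlier.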

If the compensation from an AI aggregator for using original content is greater than the revenue from direct traffic, then it becomes a defensible business model.

Conclusion

It’s one for another article, but you have to wonder how much this could all change the architecture of the internet. Localised context AIs have lower error rates than hallucinating global context AIs, so it is possible that an AI that summarises online content, and attributes and compensates the content producers, could become the primary interface for many web-browsing use cases.

The AI industry needs to find ways to achieve content attribution that is mutually beneficial for both AI and content producers. Ultimately, we don’t want to stifle innovation in AI, but we do want to ensure that content producers can sustainably continue to produce original content. There is a moral obligation to protect content creators. Attribution in local context AIs is a good starting point.


I write about digital economics, technology, new media, and competitive strategies.