What is the deal?

John Crabtree
7 min read · Jul 12, 2024


Large Language AI models and their training data

What do you know about Artificial Intelligence?

How are you using AI?

Is a robot or computer going to take my job?

When does this all lead to the Terminator and Judgment Day?

Artificial Intelligence became a very hot topic last year as ChatGPT went mainstream

While I have yet to write specifically on the topic, today I have something worth sharing

To provide some brief background, these AI chatbots are classed as Large Language Models

These are “a specialized type of artificial intelligence (AI) that has been trained on vast amounts of text to understand existing content and generate original content” — Gartner

(Let’s stick with a simplified explanation)

These are referred to as LLMs

You ask it a text-based question, and it responds with a text-based answer

The experience feels something like a cross between instant-messaging with the world’s fastest typist, and having a conversation directly with Google-Search

There are a handful of leaders in the space, and it’s up for debate whose model is “best”

To give some background on my own experience, I have been using the paid tiers of Google’s Gemini, OpenAI’s ChatGPT, and Anthropic’s Claude as personal assistants for the past year

(I’ve spent time experimenting with Meta’s Llama, which is publicly available)

I have actively been using at least 2 of these models for the last 17 months

I will be honest, for the first year I was enamored with the technology

Yet recently, I find myself quite conflicted about their business practices…

Why did this take so long to become an issue for me?

Maybe it was when I started publishing my own content

One perspective:

I started asking for sources when I used an LLM to brainstorm

When I go down a rabbit-hole, I want to make sure it is based on facts

If I reference anyone else’s work, they deserve credit

Another view:

I began to think about the possibility of my own IP being pulled into training data

Without my permission or consent…

The main question I found myself asking:

Where is all of this data coming from?

Blinded by the speed and ability of these models, this wasn’t something I took the time to consider

“These responses sound right, don’t they? They must have given the original creator some compensation”

Viewing these LLMs through a different paradigm, I find myself more and more disturbed by the lack of sourcing

It is the primary reason why I feel uncomfortable accepting anything these models respond with as actual truth or fact

(Yes, I realize each company lists a disclaimer to fact-check — more on that later)

In my experience, these models rarely (if ever) cite sources without a direct request for citations

And often, the sources produced do not actually match the response itself

This is one of the main reasons I refuse to use anything that is directly generated by an LLM in my writing

It can provide a degree of idea inception, but the writing needs to be my own

While you can produce articles or essays in minutes (which I am sure many teachers are now well aware of) they don’t feel human

Personally, I feel it would be disingenuous to claim this as my own work

Any time you read something in my writing that is directly pulled from an LLM, there will be a clear disclaimer

In a previous article, I used a response from Claude (Anthropic’s model) as a contrast

Now, I wonder: where did the data that generated that response come from?

What were the sources?

Why were they not provided?

It all sounds good, but is it really true?

And even if it is, is this LLM just taking credit for someone else’s work?

Much of the training-data that was used to build these models was taken without a licensing agreement

The leading companies scraped the internet and took as much publicly available data as they could get their hands on

It seems their theory was that “if it is on the public internet, it is fair game”

They then use this data to train their models, which in turn produce responses

Perhaps they believe because it was “reproduced by their own method” there is no reason to give the original source any credit

Data is the new Oil

And the internet is filled with it

Right now, ethical and legal considerations are racing to catch up with the technology

Here is something you might want to know

Remember how I mentioned using Meta’s Llama model?

Guess what: it might be using your personal data (via MIT Technology Review):

If you post or interact with chatbots on Facebook, Instagram, Threads, or WhatsApp, Meta can use your data to train its generative AI models beginning June 26, according to its recently updated privacy policy. Even if you don’t use any of Meta’s platforms, it can still scrape data such as photos of you if someone else posts them.

Perhaps, this is a wake-up call

Companies aren’t asking for your permission; they are waiting to see if anyone forces them to ask for your forgiveness

We are now starting to see formal licensing deals come to fruition

Perhaps, this is capitulation by content providers to the prospect of licensing fees

Or the response from AI companies to finally being called out on their practices:

A few months ago, some of the leading newspapers, such as The New York Times and the Chicago Tribune, sued OpenAI for copyright infringement — Reuters

News nonprofit sues ChatGPT maker OpenAI and Microsoft for ‘exploitative’ copyright infringement — Associated Press (ironic sourcing given the licensing deal below)

Licensing Deals:

The Atlantic and Vox recently signed licensing deals with OpenAI — Axios

OpenAI will use Associated Press news stories to train its models — The Verge

OpenAI inks deal to train AI on Reddit data — TechCrunch

I can’t help but wonder: how much data was already used prior to these deals being announced?

Where does this leave us?

Here is my take….

What complicates things is that the majority of what these models produce is accurate, factual, and easy to believe

It makes the issues of sourcing and fact-checking even more problematic, as they can seem unnecessary

As a result, we are lulled into not questioning the source, the training-data, or the validity of the responses that are generated

It almost feels like sourcing is an unneeded complication (the false paradigm I now see)

I am torn….

My brain says one thing, my heart another

These models are so useful

Yet the more I learn, the more of a moral issue this becomes

I don’t believe it is right to use the work of others without giving them their due credit….

Publishing something on the open internet is not the equivalent of signing off in agreement that it is fair game to be used as training-data

This practice is going to push many publishers, authors, and creators to use paywalls to protect their intellectual property, which ends up robbing everyday people of the access they previously had…

The right solution is to move towards direct citation

And there are signs of hope:

In my experience, ChatGPT will include citations to external data when that data is referenced in a response to a user query

The response attempts to include a link directly to the source

Yet, it is important to remember that this only happens when the model searches the web to provide that information

When responding with its own training-data, there are no sources

GPT uses the web to find data that it can actually source
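The distinction above can be sketched in a few lines of code. This is a minimal illustration, not any vendor’s actual implementation, and every function and URL in it is hypothetical: when an answer is assembled from retrieved web documents, each document carries a URL that can be attached as a citation; when the answer comes purely from the model’s training data, there is simply no provenance record to cite.

```python
# Hypothetical sketch: why citations appear only for web-retrieved answers.

def summarize(docs):
    """Stand-in for the generation step over retrieved documents."""
    return " ".join(d["text"] for d in docs)

def answer(question, search_web=False):
    if search_web:
        # Retrieval step: each fetched document keeps its source URL,
        # so citations can be attached to the final response.
        docs = [{"text": "Socrates was an Athenian philosopher.",
                 "url": "https://example.org/socrates"}]
        text = summarize(docs)
        citations = [d["url"] for d in docs]
    else:
        # Parametric answer: the knowledge is baked into the model's
        # weights during training, with no record of where it came from.
        text = "Socrates was an Athenian philosopher."
        citations = []  # nothing to cite
    return {"text": text, "citations": citations}
```

Under this (simplified) view, the missing citations are not an oversight in any single response; the provenance was discarded at training time and cannot be recovered afterward.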

Interestingly enough, when I asked Claude to provide citations for its responses, it refused to give me answers…

Look at how it responds to a question about a Socrates quote I referenced yesterday:

At least it seems to point the user in the right direction

But when I asked if Claude could answer any question, I got an interesting reply:

ChatGPT was much more willing to use citations, but often the links did not work

Here is an example:

While there is a disclaimer at the bottom, “ChatGPT can make mistakes. Check important info”

It does not provide any additional context

Aren’t all responses important?

How are users supposed to filter out hallucinations (incorrect responses that the LLM presents with full confidence)?

My goal here is not to try and demonize AI

It is going to continue to make waves in how we live and work, giving humans the capacity to do more with their time

That said, I do think this issue is one worth discussing

There is no rulebook to follow

As I mentioned, I find myself torn

I want these companies to do the right thing

Am I willing to forgo the advantages their technology provides if they don’t?

That is a question we will each have to answer for ourselves

For now, we shall see…..

What do you think?

Should Large Language Models have to cite sources for every response?

Is that something you as a user think is necessary?

Are you currently using any of these models?

If so, what for?

I’d love to hear your opinion on this topic

Thought of the Day: 07–12–2024
