What is the deal?
Large Language AI models and their training data
What do you know about Artificial Intelligence?
How are you using AI?
Is a robot or computer going to take my job?
When does this all lead to the Terminator and Judgment Day?
Artificial Intelligence became a very hot topic last year as ChatGPT went mainstream
While I have yet to write specifically on the topic, today I have something worth sharing
To provide some brief background, these AI chatbots are classed as Large Language Models
These are “a specialized type of artificial intelligence (AI) that has been trained on vast amounts of text to understand existing content and generate original content” — Gartner
(Let’s stick with a simplified explanation)
These are referred to as LLMs
You ask it a text-based question, and it responds with a text-based answer
The experience feels something like a cross between instant-messaging with the world’s fastest typist, and having a conversation directly with Google-Search
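That text-in, text-out exchange can be sketched as a simple request payload. This is a minimal illustration only; the function, model name, and message roles are hypothetical, modeled loosely on the shape of common chat-completion APIs rather than any one provider’s:

```python
# A minimal sketch of the text-in, text-out loop described above.
# Nothing here is tied to a specific provider; the payload shape is
# modeled loosely on common chat-completion APIs.

def build_chat_request(question: str, model: str = "example-model") -> dict:
    """Package a user's text question into a chat-style request payload."""
    return {
        "model": model,
        "messages": [
            {"role": "user", "content": question},
        ],
    }

request = build_chat_request("What is a Large Language Model?")
# A provider's server would return a text reply in the same shape, e.g.
# a message with role "assistant" and the generated answer as content
```

The whole interaction reduces to strings going in and strings coming out, which is why it feels like instant-messaging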
There are a handful of leaders in the space, and it is up for debate whose model is “best”
To give some background on my own experience, I have been using the paid tiers of Google’s Gemini, OpenAI’s ChatGPT, and Anthropic’s Claude as personal assistants for the past year
(I’ve spent time experimenting with Meta’s Llama, which is publicly available)
I have actively been using at least 2 of these models for the last 17 months
I will be honest, for the first year I was enamored with the technology
Yet recently, I find myself quite conflicted with the business practices….
Why did this take so long to become an issue for me?
Maybe it was when I started publishing my own content
One perspective:
I started asking for sources when I used an LLM to brainstorm
When I go down a rabbit-hole, I want to make sure it is based on facts
If I reference anyone else’s work, they deserve credit
Another view:
I began to think about the possibility of my own IP being pulled into training data
Without my permission or consent…
The main question I found myself asking:
Where is all of this data coming from?
Blinded by the speed and ability of these models, this wasn’t something I took the time to consider
“These responses sound right, don’t they? They must have given the original creator some compensation”
Viewing these LLMs through a different paradigm, I find myself more and more disturbed by the lack of sourcing
It is the primary reason why I feel uncomfortable accepting anything these models respond with as actual truth or fact
(Yes, I realize each company lists a disclaimer to fact-check — more on that later)
In my experience, these models rarely (if ever) cite sources without a direct request that citations be provided
And often, the sources produced do not actually match the response itself
This is one of the main reasons I refuse to use anything that is directly generated by an LLM in my writing
It can provide a degree of idea inception, but the writing needs to be my own
While you can produce articles or essays in minutes (which I am sure many teachers are now well aware of), they don’t feel human
Personally, I feel it would be disingenuous to claim this as my own work
Any time you read something here that is directly pulled from an LLM, there will be a clear disclaimer
In a previous article, I used a response from Claude (Anthropic’s model) as a contrast
Now, I wonder where did the data that generated that response come from?
What were the sources?
Why were they not provided?
It all sounds good, but is it really true?
And even if it is, is this LLM just taking credit for someone else’s work?
Much of the training-data that was used to build these models was taken without a licensing agreement
The leading companies scraped the internet and took as much publicly available data as they could get their hands on
It seems their theory was that “if it is on the public internet, it is fair game”
Then they can use this data to train their models, which produce responses
Perhaps they believe because it was “reproduced by their own method” there is no reason to give the original source any credit
Data is the new Oil
And the internet is filled with it
Right now, ethical and legal considerations are racing to catch up with the technology
Here is something you might want to know
Remember how I mentioned using Meta’s Llama model?
Guess what, it might be using your personal data (via MIT Technology Review):
If you post or interact with chatbots on Facebook, Instagram, Threads, or WhatsApp, Meta can use your data to train its generative AI models beginning June 26, according to its recently updated privacy policy. Even if you don’t use any of Meta’s platforms, it can still scrape data such as photos of you if someone else posts them.
Perhaps, this is a wake-up call
Companies aren’t asking for your permission, they are waiting to see if anyone forces them to ask for your forgiveness
We are now starting to see formal licensing deals come to fruition
Perhaps this is capitulation by content providers, settling for licensing fees
Or the response from AI companies to finally being called out on their practices:
A few months ago, leading newspapers such as the New York Times and the Chicago Tribune sued OpenAI for copyright infringement — Reuters
News nonprofit sues ChatGPT maker OpenAI and Microsoft for ‘exploitative’ copyright infringement — Associated Press (ironic sourcing given the licensing deal below)
Licensing Deals:
The Atlantic and Vox recently signed licensing deals with OpenAI — Axios
OpenAI will use Associated Press news stories to train its models — The Verge
OpenAI inks deal to train AI on Reddit data — TechCrunch
I can’t help but wonder, how much data was already used prior to these deals being announced?
Where does this leave us?
Here is my take….
The reality that the majority of what these models produce is accurate, factual, and easy to believe complicates things
It makes the issues of sourcing and fact-checking even more problematic, as they can seem unnecessary
As a result, we are lulled into not questioning the source, the training-data, or the validity of the responses that are generated
It almost feels like sourcing is an unneeded complication (the false paradigm I now see)
I am torn….
My brain says one thing, my heart another
These models are so useful
Yet the more I learn, the more of a moral issue this becomes
I don’t believe it is right to use the work of others without giving them their due credit….
Publishing something on the open internet is not the equivalent of signing off in agreement that it is fair game to be used as training-data
This practice is going to push many publishers, authors, and creators to use paywalls to protect their intellectual property, which ends up robbing everyday people of the access they previously had…
The right solution is to move towards direct citation
And there are signs of hope:
In my experience, ChatGPT will include citations to external data when it is referenced in a response to a user query
The response attempts to include a link directly to the source
Yet, it is important to remember that this only happens when the model searches the web to provide that information
When responding with its own training-data, there are no sources
GPT uses the web to find data that it can actually source
Interestingly enough, when I asked Claude to provide citations for its responses, it refused to give me answers….
Look at how it responds to a question about a Socrates quote I referenced yesterday:
At least it seems to point the user in the right direction
But when I asked if Claude could answer any question, I got an interesting reply:
ChatGPT was much more willing to use citations, but often the links did not work
Here is an example:
While there is a disclaimer at the bottom (“ChatGPT can make mistakes. Check important info”), it does not provide any additional context
Aren’t all responses important?
How are users supposed to filter out hallucinations (incorrect responses that the model presents with complete confidence)?
My goal here is not to demonize AI
It is going to continue to make waves in how we live and work, giving humans the capacity to do more with their time
That said, I do think this issue is one worth discussing
There is no rulebook to follow
As I mentioned, I find myself torn
I want these companies to do the right thing
Am I willing to forgo the advantages their technology provides if they don’t?
That is a question we will each have to answer for ourselves
For now, we shall see…..
What do you think?
Should Large Language Models have to cite sources for every response?
Is that something you as a user think is necessary?
Are you currently using any of these models?
If so, what for?
I’d love to hear your opinion on this topic
Thought of the Day: 07–12–2024